How to go from a short-read alignment file to a SNP table for population genetic analysis
Mao Jianfeng ▴ 290
@mao-jianfeng-3598
Last seen 9.6 years ago
Dear Bioconductor listers,

I am new to genomics and bioinformatics. In my current study we have sequenced the genomes of tens of accessions of a plant using an Illumina next-generation sequencer. The short reads of each accession have been aligned to the reference, and SNPs and short indels have been predicted for each accession genome relative to the reference. We have the SNP data sets as text files in the following format (the column names are listed below; the accession name does not change within the data set of one accession):

<accession name> <chromosome> <position> <reference base> <cons base> <quality> <support> <concordance> <avg_hits>

For classical population genetic analysis, however, we usually need all the accessions aligned in the following format:

<accessions> <snp_1> <snp_2> <snp_3> <snp_...>
accession_1, a,t,g,,,
accession_2, a,t,c,,,
accession_3, t,a,c,,,
accession_,,,,,,,,,,,,,

I would appreciate help or suggestions on how to do this format conversion with R and Bioconductor, or on any alternative approaches. If it requires database operations, how would I do that?

Thanks in advance.

-- Jian-Feng, Mao
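The conversion described above amounts to pivoting per-accession SNP records into an accession-by-site table. The following is a minimal sketch in Python (not R), using only the standard library. The column keys `accession`, `chromosome`, `position`, `ref` and `cons` are hypothetical names chosen for illustration, and the sketch makes one strong assumption: a site that is not reported for an accession is filled with the reference base, although in real data it could equally be a position with no coverage.

```python
import csv
import io


def pivot_snps(rows):
    """Pivot long-format SNP records into an accession-by-site table.

    rows: iterable of dicts with the (hypothetical) keys
          accession, chromosome, position, ref, cons.
    Returns (sites, table): sites is a sorted list of (chromosome, position),
    table maps each accession to a list of bases, one per site.
    """
    ref_base = {}  # (chromosome, position) -> reference base
    calls = {}     # accession -> {(chromosome, position): consensus base}
    for r in rows:
        site = (r["chromosome"], int(r["position"]))
        ref_base[site] = r["ref"]
        calls.setdefault(r["accession"], {})[site] = r["cons"]

    sites = sorted(ref_base)
    # ASSUMPTION: a site with no record for an accession carries the
    # reference base; missing coverage is not distinguished here.
    table = {
        acc: [by_site.get(s, ref_base[s]) for s in sites]
        for acc, by_site in calls.items()
    }
    return sites, table


# Made-up demo data, tab-separated, one row per predicted SNP.
demo = (
    "accession\tchromosome\tposition\tref\tcons\n"
    "acc1\tchr1\t100\tA\tT\n"
    "acc1\tchr1\t250\tG\tC\n"
    "acc2\tchr1\t100\tA\tT\n"
    "acc3\tchr1\t300\tC\tG\n"
)
sites, table = pivot_snps(csv.DictReader(io.StringIO(demo), delimiter="\t"))
for acc in sorted(table):
    print(acc, ",".join(table[acc]))
```

With real files, the rows from every accession would be concatenated (or read in turn) before the pivot, and a coverage mask would be needed to replace the reference-fill assumption.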
@sean-davis-490
Last seen 3 months ago
United States
On Mon, Dec 6, 2010 at 9:54 AM, Mao Jianfeng <jianfeng.mao@gmail.com> wrote:
> I need help or suggestions on how to do this format conversion, or any
> alternative choices, using R and Bioconductor. If it needs database
> operations, how do I do that?

Hi, Jianfeng. You might save yourself some trouble by using a format such as VCF, which is approaching a standard for reporting and databasing variants. If you write a script to convert your variant format to VCF, then the files can be combined with vcftools or potentially other tools that deal with VCF.

Sean
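As an illustration of the conversion script suggested here, the sketch below (Python, standard library only) writes the SNP records of one accession as minimal single-sample VCF lines. The function name `to_vcf_lines` and the input tuple layout are hypothetical. It also assumes that every reported SNP is homozygous for the consensus base (genotype `1/1`), which may be reasonable for inbred accessions but is not stated in the thread; records whose consensus base is not a plain A, C, G or T (for example an IUPAC ambiguity code) are simply skipped.

```python
def to_vcf_lines(accession, records):
    """Build minimal VCF text lines for one accession.

    records: iterable of (chromosome, position, ref, cons, quality) tuples
             (a hypothetical layout chosen for this sketch).
    """
    lines = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                   "FILTER", "INFO", "FORMAT", accession]),
    ]
    plain_bases = {"A", "C", "G", "T"}
    # VCF records are expected in coordinate order within a chromosome.
    for chrom, pos, ref, cons, qual in sorted(records, key=lambda r: (r[0], r[1])):
        ref, cons = ref.upper(), cons.upper()
        if cons == ref or cons not in plain_bases:
            continue  # not a simple SNP call; skipped in this sketch
        # ASSUMPTION: the call is homozygous for the consensus base.
        lines.append("\t".join(
            [chrom, str(pos), ".", ref, cons, str(qual), ".", ".", "GT", "1/1"]
        ))
    return lines


# Made-up demo records for one accession.
vcf_lines = to_vcf_lines(
    "acc1",
    [("chr1", 250, "g", "c", 40), ("chr1", 100, "A", "T", 30), ("chr1", 300, "C", "N", 10)],
)
print("\n".join(vcf_lines))
```

One file per accession written this way could then be handed to a VCF-aware merging tool, as the reply proposes.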
On Mon, Dec 6, 2010 at 11:28 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> Hi, Jianfeng. You might save yourself some trouble by using a format such
> as VCF, something that is approaching a standard for reporting and
> databasing variants. If you write a script to convert your variant format
> to a VCF, then combining them can be done with vcftools or potentially other
> tools dealing with VCF.

I will add here that there is very rudimentary code for transforming VCF to SnpMatrix instances in the devel branch of GGtools, called vcf2sm. The intention is to speed the path from variant representations for multiple subjects, as given in the 1000 Genomes files, to structures analyzable with the snpMatrix2 facilities. However, the specific implementation in vcf2sm requires that system("tabix") works. Rsamtools facilities for working with BCF are also relevant but have not been connected to the SnpMatrix representation yet.
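To make the idea of going from VCF to a genotype matrix concrete, here is a small Python sketch (standard library only) that reads multi-sample VCF text and returns, for each sample, the number of non-reference alleles at each site (0, 1 or 2 for diploid calls, `None` when the genotype is missing). This is not the vcf2sm implementation and does not reproduce the internal coding of SnpMatrix objects; the function name `vcf_to_counts` is hypothetical, and multi-allelic sites are handled crudely by counting any non-reference allele.

```python
def vcf_to_counts(vcf_text):
    """Turn multi-sample VCF text into per-sample non-reference allele counts.

    Returns (sites, matrix): sites is a list of (chromosome, position),
    matrix maps each sample name to a list of counts, one per site.
    """
    samples, sites, columns = [], [], []
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # blank line or meta-information line
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        gt_index = fields[8].split(":").index("GT")
        sites.append((fields[0], int(fields[1])))
        column = []
        for entry in fields[9:]:
            alleles = entry.split(":")[gt_index].replace("|", "/").split("/")
            if "." in alleles:
                column.append(None)  # missing genotype
            else:
                column.append(sum(a != "0" for a in alleles))
        columns.append(column)
    matrix = {s: [col[i] for col in columns] for i, s in enumerate(samples)}
    return sites, matrix


# Made-up two-sample demo.
demo_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tacc1\tacc2",
    "chr1\t100\t.\tA\tT\t50\t.\t.\tGT\t1/1\t0/0",
    "chr1\t250\t.\tG\tC\t50\t.\t.\tGT\t0/1\t./.",
])
vcf_sites, counts = vcf_to_counts(demo_vcf)
print(vcf_sites, counts)
```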
On Mon, Dec 6, 2010 at 9:40 AM, Vincent Carey <stvjc@channing.harvard.edu> wrote:
> I will add here that there is very rudimentary code for transforming VCF to
> SnpMatrix instances in the devel branch of GGtools: called vcf2sm
>
> The intention is to speed the path from variant representations for
> multiple subjects as given in the 1000 genomes files to structures
> analyzable with the snpMatrix2 facilities.

The snpMatrix(2) package looks interesting. It would be great if it were better integrated with IRanges/GenomicRanges. For example, the snp.support object could be a RangedData or GRanges.