How to go from a short read alignment file to a SNP table for population genetic analysis

Mao Jianfeng ▴ 290

@mao-jianfeng-3598

Last seen 11.5 years ago

Dear Bioconductor listers,

I am new to genomics and bioinformatics. In my current study, we have sequenced the genomes of tens of accessions of a plant using an Illumina next-generation sequencer. The short reads of each accession have been aligned to the reference, and SNPs and short indels have been predicted for each accession genome relative to the reference. We got the SNP data sets in the following format (a text file with the column names listed; the accession name does not change for a given accession):

<accession name> <chromosome> <position> <reference base> <cons base> <quality> <support> <concordance> <avg_hits>

But usually, for classical population genetic analysis, we need to align all the accessions in the following format:

<accessions> <snp_1> <snp_2> <snp_3> <snp_...>
accession_1, a,t,g,,,
accession_2, a,t,c,,,
accession_3, t,a,c,,,
accession_,,,,,,,,,,,,,

I need help and suggestions on how to do this format conversion with R and Bioconductor, or on any alternative approaches. If it needs database operations, how should I do that?

Thanks in advance.

--
Jian-Feng, Mao
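The conversion described in the question is essentially a long-to-wide pivot: many rows of (accession, chromosome, position, base) become one row per accession and one column per SNP site. Below is a minimal sketch of that idea in Python rather than R; the same logic ports to either. The function name `build_snp_table`, the tuple layout, and the toy records are hypothetical and not part of the original data. It also assumes that an accession with no call at a site carries the reference base, which is only safe if that site was actually covered in that accession.

```python
from collections import defaultdict


def build_snp_table(records, accessions=None):
    """Pivot per-accession SNP calls into an accession x site table.

    records: iterable of (accession, chromosome, position, ref_base, cons_base).
    Assumption: an accession with no call at a site carries the reference
    base. Uncovered sites would need a missing-data code instead.
    """
    ref_at = {}                # (chromosome, position) -> reference base
    calls = defaultdict(dict)  # accession -> {(chromosome, position): base}
    for acc, chrom, pos, ref, cons in records:
        site = (chrom, int(pos))
        ref_at[site] = ref
        calls[acc][site] = cons

    sites = sorted(ref_at)     # order by chromosome name, then position
    if accessions is None:
        accessions = sorted(calls)

    header = ["accession"] + [f"{chrom}:{pos}" for chrom, pos in sites]
    rows = [
        [acc] + [calls[acc].get(site, ref_at[site]) for site in sites]
        for acc in accessions
    ]
    return header, rows


# Toy input with made-up values, only to show the shape of the result.
records = [
    ("accession_1", "chr1", 100, "T", "A"),
    ("accession_2", "chr1", 100, "T", "A"),
    ("accession_1", "chr1", 250, "A", "T"),
    ("accession_3", "chr2", 40, "G", "C"),
]
header, rows = build_snp_table(records)
for line in [header] + rows:
    print(",".join(line))
```

With the toy records, the row for accession_3 comes out as `accession_3,T,A,C`: the first two sites are filled with the reference base because that accession has no call there. For real data it may be better to fill such cells with a missing-data symbol unless coverage at the site has been checked.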

genomes genomes • 1.4k views

ADD COMMENT • link updated 15.2 years ago by Sean Davis 21k • written 15.2 years ago by Mao Jianfeng ▴ 290

Sean Davis 21k

@sean-davis-490

Last seen 1 day ago

United States

On Mon, Dec 6, 2010 at 9:54 AM, Mao Jianfeng <jianfeng.mao@gmail.com> wrote:
> [...]

Hi, Jianfeng. You might save yourself some trouble by using a format such as VCF, which is approaching a standard for reporting and databasing variants. If you write a script to convert your variant format to VCF, the per-accession files can then be combined with vcftools or potentially other tools that handle VCF.

Sean
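A conversion script of the kind suggested here could look roughly like the following Python sketch. It is not a validated converter. The function name `snp_rows_to_vcf` and the tuple layout are hypothetical, and it makes several assumptions about the input: the consensus base is a single unambiguous base, every call is homozygous (written as genotype `1/1`), and the caller's quality value can be placed in the VCF QUAL column as-is. Rows whose consensus base is an IUPAC ambiguity code, or equals the reference, are skipped.

```python
def snp_rows_to_vcf(records, sample):
    """Render SNP calls for one accession as minimal single-sample VCF text.

    records: iterable of (chromosome, position, ref_base, cons_base, quality).
    Assumptions: cons_base is one unambiguous base and the call is
    homozygous, so each kept record gets genotype 1/1.
    """
    bases = {"A", "C", "G", "T"}
    lines = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                   "FILTER", "INFO", "FORMAT", sample]),
    ]
    # VCF records are expected in order of position within a chromosome.
    ordered = sorted(records, key=lambda r: (r[0], int(r[1])))
    for chrom, pos, ref, cons, qual in ordered:
        ref, cons = ref.upper(), cons.upper()
        if cons not in bases or cons == ref:
            continue  # ambiguity code or no difference from the reference
        lines.append("\t".join([chrom, str(int(pos)), ".", ref, cons,
                                str(qual), ".", ".", "GT", "1/1"]))
    return "\n".join(lines) + "\n"


# Toy calls with made-up values for one accession.
vcf_text = snp_rows_to_vcf(
    [("chr1", 250, "a", "t", 40),
     ("chr1", 100, "T", "A", 33),
     ("chr1", 300, "G", "R", 20)],  # "R" is an ambiguity code, so it is skipped
    sample="accession_1",
)
print(vcf_text, end="")
```

Writing one such file per accession gives inputs that VCF-aware tools can merge into a single multi-sample file. Heterozygous calls and indels would need extra handling that this sketch leaves out.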

ADD COMMENT • link 15.2 years ago Sean Davis 21k

On Mon, Dec 6, 2010 at 11:28 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Mon, Dec 6, 2010 at 9:54 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
>> [...]
>
> [...]

I will add here that there is very rudimentary code, called vcf2sm, for transforming VCF to SnpMatrix instances in the devel branch of GGtools.

The intention is to speed the path from variant representations for multiple subjects, as given in the 1000 genomes files, to structures analyzable with the snpMatrix2 facilities. However, the specific implementation in vcf2sm requires that system("tabix") works. Rsamtools facilities for working with bcf are also relevant but have not been connected to the SnpMatrix representation yet.

> Sean
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 15.2 years ago Vincent J. Carey, Jr. 6.7k

On Mon, Dec 6, 2010 at 9:40 AM, Vincent Carey <stvjc@channing.harvard.edu> wrote:
> On Mon, Dec 6, 2010 at 11:28 AM, Sean Davis <sdavis2@mail.nih.gov> wrote:
> > [...]
>
> [...]

The snpMatrix(2) package looks interesting. It would be great if it were better integrated with IRanges/GenomicRanges. For example, the snp.support object could be a RangedData or GRanges.

ADD REPLY • link 15.2 years ago Michael Lawrence ★ 11k
