Bioc Interface for 1,000 Genomes SNP Frequencies
1
0
Entering edit mode
Timothy Duff ▴ 40
@timothy-duff-5396
Last seen 10.4 years ago
Hi. I have a set of rsIDs for which I'm interested in obtaining allele frequencies for (in a particular population surveyed by 1kGP.) The corresponding .vcf files are pretty memory consumptive. Do there currently exist basic BC tools that could help with this sort of thing? Thank you all for your consideration. -- Tim [[alternative HTML version deleted]]
• 2.1k views
ADD COMMENT
0
Entering edit mode
@vincent-j-carey-jr-4
Last seen 3 days ago
United States
Val Obenchain can reply more definitively, but, briefly, VariantAnnotation package has a scanVcf capability that allows survey of vcf archives with managed memory consumption. If you do not see enough details in the vignette to solve this use case efficiently, let us know. On Tue, Jul 17, 2012 at 10:53 AM, Timothy Duff <timothy.duff@ncf.edu> wrote: > Hi. I have a set of rsIDs for which I'm interested in obtaining allele > frequencies for (in a particular population surveyed by 1kGP.) The > corresponding .vcf files are pretty memory consumptive. Do there currently > exist basic BC tools that could help with this sort of thing? Thank you all > for your consideration. > > -- > Tim > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Hi Tim, 1000 Genomes data can be read in with readVcf(). The 'param' argument allows you to specify chromosomes and/or genome positions you are interested in Using a sample file from 1000 genomes, library(VariantAnnotation) fl <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/ AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz" genomes <- readVcf(TabixFile(fl), "hg19", param=GRanges("12", IRanges(101266000, 101422000))) The data are read into a VCF obejct. See ?VCF for class details. If a snp has an rsid, it will appear as the rowname. If not, the rowname is a concatenation of chromosome:position. > genomes class: VCF dim: 1325 174 genome: hg19 exptData(1): header fixed(4): REF ALT QUAL FILTER info(6): AF DP ... AFR_R2 ASN_R2 geno(7): AD DP ... GD OG rownames(1325): rs75462666 rs78597949 ... 12:101421722 rs67962959 rowData values names(1): paramRangeID colnames(174): HG00553 HG00554 ... NA19982 NA20414 colData names(1): Samples We don't have a function that computes the frequencies but that can be accomplished fairly easily. The genotype data can be accessed with the 'geno' function. > geno(genomes) SimpleList of length 7 names(7): AD DP GL GQ GT GD OG > geno(genomes)$GT[1:5,1:3] HG00553 HG00554 HG00637 rs75462666 "0|0" "0|0" "0|0" rs78597949 "0|0" "0|0" "0|0" rs78693793 "0|0" "0|0" "0|0" 12:101266725 "0|0" "0|0" "0|0" 12:101266729 "0|0" "0|0" "0|0" The frequency of the alternate allele is computed as [ 0|1 + 2*(1|1)] / 2*(total number of individuals). Maybe something like, heterozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% c("0|1", "1|0")) homozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% "1|1") nsub <- ncol(geno(genomes)$GT) freq <- (heterozyg + 2*homozyg) / 2*nsub Valerie On 07/17/2012 08:09 AM, Vincent Carey wrote: > Val Obenchain can reply more definitively, but, briefly, VariantAnnotation > package has a scanVcf capability that > allows survey of vcf archives with managed memory consumption. If you do > not see enough details in the vignette > to solve this use case efficiently, let us know. > > On Tue, Jul 17, 2012 at 10:53 AM, Timothy Duff<timothy.duff at="" ncf.edu=""> wrote: > >> Hi. I have a set of rsIDs for which I'm interested in obtaining allele >> frequencies for (in a particular population surveyed by 1kGP.) The >> corresponding .vcf files are pretty memory consumptive. Do there currently >> exist basic BC tools that could help with this sort of thing? Thank you all >> for your consideration. >> >> -- >> Tim >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
On 07/17/2012 08:36 AM, Valerie Obenchain wrote: > Hi Tim, > > 1000 Genomes data can be read in with readVcf(). The 'param' argument > allows you to specify chromosomes and/or genome positions you are > interested in > Using a sample file from 1000 genomes, > > library(VariantAnnotation) > fl <- > "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supportin g/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz" > genomes <- readVcf(TabixFile(fl), "hg19", param=GRanges("12", > IRanges(101266000, 101422000))) > > The data are read into a VCF obejct. See ?VCF for class details. If a > snp has an rsid, it will appear as the rowname. If not, the rowname is > a concatenation of chromosome:position. > > > genomes > class: VCF > dim: 1325 174 > genome: hg19 > exptData(1): header > fixed(4): REF ALT QUAL FILTER > info(6): AF DP ... AFR_R2 ASN_R2 > geno(7): AD DP ... GD OG > rownames(1325): rs75462666 rs78597949 ... 12:101421722 rs67962959 > rowData values names(1): paramRangeID > colnames(174): HG00553 HG00554 ... NA19982 NA20414 > colData names(1): Samples > > We don't have a function that computes the frequencies but that can be > accomplished fairly easily. The genotype data can be accessed with the > 'geno' function. > > > geno(genomes) > SimpleList of length 7 > names(7): AD DP GL GQ GT GD OG > > > geno(genomes)$GT[1:5,1:3] > HG00553 HG00554 HG00637 > rs75462666 "0|0" "0|0" "0|0" > rs78597949 "0|0" "0|0" "0|0" > rs78693793 "0|0" "0|0" "0|0" > 12:101266725 "0|0" "0|0" "0|0" > 12:101266729 "0|0" "0|0" "0|0" > > The frequency of the alternate allele is computed as [ 0|1 + 2*(1|1)] > / 2*(total number of individuals). Maybe something like, > > heterozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% c("0|1", > "1|0")) > homozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% "1|1") > nsub <- ncol(geno(genomes)$GT) > freq <- (heterozyg + 2*homozyg) / 2*nsub > or this more elegant solution from Martin, > m = geno(vcf)$GT > freq = rowSums((m == "0|1" | m == "1|0") / 2 + (m == "1|1")) / ncol(m) which saves a couple of apply's. Valerie > Valerie > > On 07/17/2012 08:09 AM, Vincent Carey wrote: >> Val Obenchain can reply more definitively, but, briefly, >> VariantAnnotation >> package has a scanVcf capability that >> allows survey of vcf archives with managed memory consumption. If >> you do >> not see enough details in the vignette >> to solve this use case efficiently, let us know. >> >> On Tue, Jul 17, 2012 at 10:53 AM, Timothy Duff<timothy.duff at="" ncf.edu=""> >> wrote: >> >>> Hi. I have a set of rsIDs for which I'm interested in obtaining allele >>> frequencies for (in a particular population surveyed by 1kGP.) The >>> corresponding .vcf files are pretty memory consumptive. Do there >>> currently >>> exist basic BC tools that could help with this sort of thing? Thank >>> you all >>> for your consideration. >>> >>> -- >>> Tim >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
This is just what I needed - thank you both so much! On Tue, Jul 17, 2012 at 11:09 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > On 07/17/2012 08:36 AM, Valerie Obenchain wrote: > >> Hi Tim, >> >> 1000 Genomes data can be read in with readVcf(). The 'param' argument >> allows you to specify chromosomes and/or genome positions you are >> interested in >> Using a sample file from 1000 genomes, >> >> library(VariantAnnotation) >> fl <- "ftp://ftp.1000genomes.ebi.ac.**uk/vol1/ftp/release/20100804/** >> supporting/AFR.**2of4intersection_allele_freq.**20100804.genotypes. vcf.gz<ftp: ftp.1000genomes.ebi.ac.uk="" vol1="" ftp="" release="" 20100804="" suppo="" rting="" afr.2of4intersection_allele_freq.20100804.genotypes.vcf.gz=""> >> " >> genomes <- readVcf(TabixFile(fl), "hg19", param=GRanges("12", >> IRanges(101266000, 101422000))) >> >> The data are read into a VCF obejct. See ?VCF for class details. If a snp >> has an rsid, it will appear as the rowname. If not, the rowname is a >> concatenation of chromosome:position. >> >> > genomes >> class: VCF >> dim: 1325 174 >> genome: hg19 >> exptData(1): header >> fixed(4): REF ALT QUAL FILTER >> info(6): AF DP ... AFR_R2 ASN_R2 >> geno(7): AD DP ... GD OG >> rownames(1325): rs75462666 rs78597949 ... 12:101421722 rs67962959 >> rowData values names(1): paramRangeID >> colnames(174): HG00553 HG00554 ... NA19982 NA20414 >> colData names(1): Samples >> >> We don't have a function that computes the frequencies but that can be >> accomplished fairly easily. The genotype data can be accessed with the >> 'geno' function. >> >> > geno(genomes) >> SimpleList of length 7 >> names(7): AD DP GL GQ GT GD OG >> >> > geno(genomes)$GT[1:5,1:3] >> HG00553 HG00554 HG00637 >> rs75462666 "0|0" "0|0" "0|0" >> rs78597949 "0|0" "0|0" "0|0" >> rs78693793 "0|0" "0|0" "0|0" >> 12:101266725 "0|0" "0|0" "0|0" >> 12:101266729 "0|0" "0|0" "0|0" >> >> The frequency of the alternate allele is computed as [ 0|1 + 2*(1|1)] / >> 2*(total number of individuals). Maybe something like, >> >> heterozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% c("0|1", >> "1|0")) >> homozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% "1|1") >> nsub <- ncol(geno(genomes)$GT) >> freq <- (heterozyg + 2*homozyg) / 2*nsub >> >> > or this more elegant solution from Martin, > > > m = geno(vcf)$GT > > freq = rowSums((m == "0|1" | m == "1|0") / 2 + (m == "1|1")) / ncol(m) > > which saves a couple of apply's. > > > Valerie > > > Valerie >> >> On 07/17/2012 08:09 AM, Vincent Carey wrote: >> >>> Val Obenchain can reply more definitively, but, briefly, >>> VariantAnnotation >>> package has a scanVcf capability that >>> allows survey of vcf archives with managed memory consumption. If you do >>> not see enough details in the vignette >>> to solve this use case efficiently, let us know. >>> >>> On Tue, Jul 17, 2012 at 10:53 AM, Timothy Duff<timothy.duff@ncf.edu> >>> wrote: >>> >>> Hi. I have a set of rsIDs for which I'm interested in obtaining allele >>>> frequencies for (in a particular population surveyed by 1kGP.) The >>>> corresponding .vcf files are pretty memory consumptive. Do there >>>> currently >>>> exist basic BC tools that could help with this sort of thing? Thank you >>>> all >>>> for your consideration. >>>> >>>> -- >>>> Tim >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________**_________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> Search the archives: >>>> http://news.gmane.org/gmane.**science.biology.informatics.**condu ctor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>>> >>>> [[alternative HTML version deleted]] >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > -- Tim [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 496 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6