VCF as custom variant annotation database

0

Entering edit mode

Lescai, Francesco ▴ 380

@lescai-francesco-5078

Last seen 5.6 years ago

Denmark

Hi there, following this suggestion Ive converted the results of a number of analyses into VCF files. I believe however I never exploited the full potential of the VCF object and Im not sure which would be the most efficient way for a targeted query, i.e. check if a variant from a VCF object is present in any of the others, like I would to with an SQL database. Ideally, Id like to add a metadata field to one of my objects, that could be reported in the INFO field like any other annotation when I then write the VCF on a file. To do this Id have to add the field both in the info(header(vcf)) and in the info(vcf), with/after the result of my query. Two questions therefore: 1) how do you add an INFO field in a way that doesnt generate error from within R 2) whats the most computationally efficient way to query for presence/absence of a variant (I thought about a match on the names, a countOverlaps on the GRanges etc, but that doesnt exploit the tabix index, and still seems to me a bit clumsy). thanks for any advice on how to best use this class, best Francesco On 3 Oct 2013, at 15:53, Vincent Carey <stvjc@channing.harvard.edu<mailto:stvjc@channing.harvard.edu>> wrote: On Thu, Oct 3, 2013 at 7:18 AM, Francesco Lescai <francesco.lescai @hum-gen.au.dk<mailto:francesco.lescai@hum-gen.au.dk="">> wrote: Hi guys, I would like to store lists of variants (SNPs and INDELs) in a convenient format to be used frequently with other bioconductor packages, and I was thinking to store them as Annotation databases. In this way I could store and query my own list of variants with annotations coming from my work and/or previous experiments. Most of the examples/vignettes I could find like SQLForge are however gene-centric. Can I create a custom one, maybe using SNPlocs as template? Have you considered tabix-indexed VCF? This can be stored in a package and Rsamtools/VariantAnnotation facilities can be used for targeted querying. The SNPlocs containers are likely to be redesigned in the near future. thanks for any suggestions, Francesco [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

Annotation SNPlocs Annotation SNPlocs • 1.5k views

ADD COMMENT • link updated 10.3 years ago by Valerie Obenchain ★ 6.8k • written 10.3 years ago by Lescai, Francesco ▴ 380

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Francesco, On 01/08/2014 05:59 AM, Francesco Lescai wrote: > Hi there, > following this suggestion I?ve converted the results of a number of > analyses into VCF files. > I believe however I never exploited the full potential of the VCF object > and I?m not sure which would be the most efficient way for a targeted > query, i.e. check if a variant from a VCF object is present in any of > the others, like I would to with an SQL database. > > Ideally, I?d like to add a metadata field to one of my objects, that > could be reported in the INFO field like any other annotation when I > then write the VCF on a file. > To do this I?d have to add the field both in the info(header(vcf)) and > in the info(vcf), with/after the result of my query. > > Two questions therefore: > 1) how do you add an INFO field in a way that doesn?t generate error > from within R > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") Original info variables. > vcf <- readVcf(fl, "hg19") > names(info(vcf)) [1] "NS" "DP" "AF" "AA" "DB" "H2" > dim(info(vcf)) [1] 5 6 Add new variables with the info<- setter. > newInfo <- info(vcf) > newInfo$foo <- 1:5 > info(vcf) <- newInfo > names(info(vcf)) [1] "NS" "DP" "AF" "AA" "DB" "H2" "foo" > info(vcf)$foo [1] 1 2 3 4 5 > > 2) what?s the most computationally efficient way to query for > presence/absence of a variant (I thought about a match on the names, a > countOverlaps on the GRanges etc, but that doesn?t exploit the tabix > index, and still seems to me a bit clumsy). You mention the tabix index which is only applicable to the file on disk and not the VCF class in R. I'm not sure if you are interested in querying the file on disk or class in R so below I've addressed both. Query vcf file on disk: To match variant position in the file using a ScanVcfParam with GRanges is best if you want the VCF class as the return object. Alternatively you could use filterVcf() where the output is a reduced vcf file on disk subset by the filter criteria. filterVcf() can subset by position, character string or specific field values and has options of 'prefilter' and 'filter'. A 'prefilter' searches unparsed lines from the file for a specific character string. 'filters' operate on the data once it has been parsed into the VCF object (this allows 'filters' to be defined in VCF-class syntax). Query VCF class in R: To match on position I'd use subsetByOverlaps() on the VCF GRanges if you want a GRanges back. If you only want a count of the hits (i.e., location if present) then use countOverlaps(). This is essentially what you're doing now but if chromosome and/or strand are important then this is the best way to go. Others free to chime in. You could gain a little performance if the data are large and chromosome and strand are not important. Using countOverlaps() on the ranges() extracted from the GRanges will save a some time on big data. To match on variant name I would match on the colnames(vcf) as you are doing. Are any of these operations taking prohibitively long? Please let me know if they are. Valerie > > thanks for any advice on how to best use this class, > best > Francesco > > > > On 3 Oct 2013, at 15:53, Vincent Carey <stvjc at="" channing.harvard.edu=""> <mailto:stvjc at="" channing.harvard.edu="">> wrote: > >> >> >> >> On Thu, Oct 3, 2013 at 7:18 AM, Francesco Lescai >> <francesco.lescai at="" hum-gen.au.dk="">> <mailto:francesco.lescai at="" hum-gen.au.dk="">> wrote: >> >> Hi guys, >> I would like to store lists of variants (SNPs and INDELs) in a >> convenient format to be used frequently with other bioconductor >> packages, and I was thinking to store them as Annotation databases. >> In this way I could store and query my own list of variants with >> annotations coming from my work and/or previous experiments. >> >> >> Most of the examples/vignettes I could find like SQLForge are >> however gene-centric. >> Can I create a custom one, maybe using SNPlocs as template? >> >> >> Have you considered tabix-indexed VCF? This can be stored in a >> package and Rsamtools/VariantAnnotation facilities can be used for >> targeted querying. The SNPlocs containers are likely to be redesigned >> in the near future. >> >> thanks for any suggestions, >> Francesco >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319

ADD COMMENT • link 10.3 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks Valerie and Michael, Ill implement your suggestions and let you know how it goes. As for the info section, last time I tried to modify the vcf object directly, that might have been quite different from setting again the whole info section. As for the overlaps, following your indication I could also use overlapsAny(), which seems to work with two vcf objects as well. thanks again, Francesco On 8 Jan 2014, at 19:09, Valerie Obenchain <vobencha@fhcrc.org<mailto:vobencha@fhcrc.org>> wrote: Hi Francesco, On 01/08/2014 05:59 AM, Francesco Lescai wrote: Hi there, following this suggestion Ive converted the results of a number of analyses into VCF files. I believe however I never exploited the full potential of the VCF object and Im not sure which would be the most efficient way for a targeted query, i.e. check if a variant from a VCF object is present in any of the others, like I would to with an SQL database. Ideally, Id like to add a metadata field to one of my objects, that could be reported in the INFO field like any other annotation when I then write the VCF on a file. To do this Id have to add the field both in the info(header(vcf)) and in the info(vcf), with/after the result of my query. Two questions therefore: 1) how do you add an INFO field in a way that doesnt generate error from within R > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") Original info variables. > vcf <- readVcf(fl, "hg19") > names(info(vcf)) [1] "NS" "DP" "AF" "AA" "DB" "H2" > dim(info(vcf)) [1] 5 6 Add new variables with the info<- setter. > newInfo <- info(vcf) > newInfo$foo <- 1:5 > info(vcf) <- newInfo > names(info(vcf)) [1] "NS" "DP" "AF" "AA" "DB" "H2" "foo" > info(vcf)$foo [1] 1 2 3 4 5 2) whats the most computationally efficient way to query for presence/absence of a variant (I thought about a match on the names, a countOverlaps on the GRanges etc, but that doesnt exploit the tabix index, and still seems to me a bit clumsy). You mention the tabix index which is only applicable to the file on disk and not the VCF class in R. I'm not sure if you are interested in querying the file on disk or class in R so below I've addressed both. Query vcf file on disk: To match variant position in the file using a ScanVcfParam with GRanges is best if you want the VCF class as the return object. Alternatively you could use filterVcf() where the output is a reduced vcf file on disk subset by the filter criteria. filterVcf() can subset by position, character string or specific field values and has options of 'prefilter' and 'filter'. A 'prefilter' searches unparsed lines from the file for a specific character string. 'filters' operate on the data once it has been parsed into the VCF object (this allows 'filters' to be defined in VCF-class syntax). Query VCF class in R: To match on position I'd use subsetByOverlaps() on the VCF GRanges if you want a GRanges back. If you only want a count of the hits (i.e., location if present) then use countOverlaps(). This is essentially what you're doing now but if chromosome and/or strand are important then this is the best way to go. Others free to chime in. You could gain a little performance if the data are large and chromosome and strand are not important. Using countOverlaps() on the ranges() extracted from the GRanges will save a some time on big data. To match on variant name I would match on the colnames(vcf) as you are doing. Are any of these operations taking prohibitively long? Please let me know if they are. Valerie Begin forwarded message: From: Michael Lawrence <lawrence.michael@gene.com<mailto:lawrence.michael@gene.com>> Subject: Re: [BioC] VCF as custom variant annotation database Date: 8 January 2014 16:56:09 CET To: Francesco Lescai <francesco.lescai@hum- gen.au.dk<mailto:francesco.lescai@hum-gen.au.dk="">> On Wed, Jan 8, 2014 at 5:59 AM, Francesco Lescai <francesco.lescai @hum-gen.au.dk<mailto:francesco.lescai@hum-gen.au.dk="">> wrote: Hi there, following this suggestion Ive converted the results of a number of analyses into VCF files. I believe however I never exploited the full potential of the VCF object and Im not sure which would be the most efficient way for a targeted query, i.e. check if a variant from a VCF object is present in any of the others, like I would to with an SQL database. Ideally, Id like to add a metadata field to one of my objects, that could be reported in the INFO field like any other annotation when I then write the VCF on a file. To do this Id have to add the field both in the info(header(vcf)) and in the info(vcf), with/after the result of my query. Two questions therefore: 1) how do you add an INFO field in a way that doesnt generate error from within R You might want to look into the VRanges class. It's just a GRanges for variants. Add an mcol like with GRanges. By default, the mcols end up in the 'geno' field, but you can change this during coercion to VCF: vcf <- asVCF(vr, info="myColumn") writeVcf(vcf, "my.vcf") 2) whats the most computationally efficient way to query for presence/absence of a variant (I thought about a match on the names, a countOverlaps on the GRanges etc, but that doesnt exploit the tabix index, and still seems to me a bit clumsy). Directly from a file? No explicit support for that, but it's an interesting idea. Right now, say you have a VRanges subset to some variants of interest, called "variants", then: vr <- readVcfAsVRanges("my.vcf", genome=genome(variants)[1], param=variants) variants %in% vr I'll make a method for %in%,VRanges,TabixFile to make this easier. thanks for any advice on how to best use this class, best Francesco thanks for any advice on how to best use this class, best Francesco On 3 Oct 2013, at 15:53, Vincent Carey <stvjc@channing.harvard.edu<mailto:stvjc@channing.harvard.edu> <mailto:stvjc@channing.harvard.edu>> wrote: On Thu, Oct 3, 2013 at 7:18 AM, Francesco Lescai <francesco.lescai@hum-gen.au.dk<mailto:francesco.lescai@hum-gen.au.dk> <mailto:francesco.lescai@hum-gen.au.dk>> wrote: Hi guys, I would like to store lists of variants (SNPs and INDELs) in a convenient format to be used frequently with other bioconductor packages, and I was thinking to store them as Annotation databases. In this way I could store and query my own list of variants with annotations coming from my work and/or previous experiments. Most of the examples/vignettes I could find like SQLForge are however gene-centric. Can I create a custom one, maybe using SNPlocs as template? Have you considered tabix-indexed VCF? This can be stored in a package and Rsamtools/VariantAnnotation facilities can be used for targeted querying. The SNPlocs containers are likely to be redesigned in the near future. thanks for any suggestions, Francesco [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> <mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha@fhcrc.org<mailto:vobencha@fhcrc.org> Phone: (206) 667-3158 Fax: (206) 667-1319 ________________________________ [X] Francesco Lescai, PhD, EDBT Associate Professor Department of Biomedicine, Human Genetics Building 1243 - Bioinformatics Wilhelm Meyers Alle 4 8000 Aarhus C, Denmark Tel: +45 871 68496 francesco.lescai@hum-gen.au.dk<mailto:francesco.lescai@hum-gen.au.dk> iSEQ Centre for iSequencing iPSYCH Initiative for Integrative Psychiatric Research www.ipsych.dk<http: www.ipsych.dk=""> [[alternative HTML version deleted]]

ADD REPLY • link 10.3 years ago Lescai, Francesco ▴ 380

Login before adding your answer.