ensemblVEP: passing on coverage / frequency information

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 11.4 years ago

Dear Valerie, thanks a lot for making your great ensemblVEP package available. I have been using it to assess the consequences of variants detected by the VariantTools package (version 1.6.1). ensemblVEP retrieves the variantEffectPredictor output, but triggers a number of warnings (see below). library(ensemblVEP) ## example.vcf is available at http://dl.dropbox.com/u/126180/example.vcf vcf <- readVcf( "example.vcf", genome="hg19") ## the vcf object contains coverage information vcf geno(vcf)$DP ## running VEP triggers warnings vep.param <- VEPParam() output( vep.param )$vcf <- TRUE vep <- ensemblVEP( "example.vcf", genome="hg19", param=vep.param ) warnings() 1: In doTryCatch(return(expr), name, parentenv, handler) : record 1 (and others?) INFO 'AD:DP:AP' not found 2: In doTryCatch(return(expr), name, parentenv, handler) record 1 (and others?) FORMAT '0,2' not found 3: In doTryCatch(return(expr), name, parentenv, handler) : record 1 (and others?) FORMAT '2' not found ... I think these warnings refer to the "geno" slot of the vcf file. When I request a VCF object as output from ensemblVEP, the object contains the same elements in its geno slot as the original vcf input file, but they only contain NAs. Is this expected or should the geno slot be passed on to the VCF object generated by ensemblVEP ? My final objective is to obtain a GRanges or data.frame containing both the predicted consequences and the coverage / frequencies from the vcf input file for each variant. I have seen that the parseCSQToGRanges returns a "VCFRowID" column associating each row with the original entry in the VCF object. Would you recommend to use this column to extract the corresponding rows from the vectors / arrays stored in the "geno" slot ? Or is there a simpler, more elegant solution you could point me to ? Thanks a lot ! Thomas -- output of sessionInfo(): R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] ensemblVEP_1.0.0 VariantAnnotation_1.6.1 Rsamtools_1.12.0 [4] Biostrings_2.28.0 GenomicRanges_1.12.1 IRanges_1.18.0 [7] BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.22.1 Biobase_2.20.0 biomaRt_2.16.0 [4] bitops_1.0-5 BSgenome_1.28.0 compiler_3.0.0 [7] DBI_0.2-5 GenomicFeatures_1.12.0 RCurl_1.95-4.1 [10] RSQLite_0.11.2 rtracklayer_1.20.0 stats4_3.0.0 [13] tools_3.0.0 XML_3.96-1.1 zlibbioc_1.6.0 -- Sent via the guest posting facility at bioconductor.org.

Coverage VariantTools ensemblVEP Coverage VariantTools ensemblVEP • 2.8k views

ADD COMMENT • link updated 12.8 years ago by Valerie Obenchain ★ 6.8k • written 12.8 years ago by Guest User ★ 13k

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 4.1 years ago

United States

Hi Thomas, You've hit an interesting bug. The problem is with how CSQ is returned from Ensembl. The documentation states that CSQ will be returned as an INFO column. However your file doesn't have INFO data and CSQ is being appended to the FORMAT variables instead. This is a problem when parsing the vcf with scanVcf/readVcf because now the number of FORMAT variables doesn't match the number of FORMAT data fields for each sample. This is why you see NA for all geno variables, they aren't being parsed. For example, the FORMAT field is coming back like this, AD:DP:AP;CSQ=C|ENSG00000128602|ENST00000462420|... but the samples only have data for the 3 original fields, 0,2:2:0,1 This seems like a bug and not intended behavior. Ensembl has even created a new INFO header line in the vcf returned to indicate the data would be in the INFO column, ##INFO=<id=csq,number=.,type=string,description="consequence type="" as="" predicted="" by="" vep.="" ....="" if="" you="" want="" to="" see="" this="" yourself="" you="" can="" write="" the="" vcf="" to="" disk="" and="" open="" it="" in="" an="" editor.="" param="" <-="" vepparam(output="c(vcf=TRUE)," input="c(output_file=" path="" myfile.vcf"))"="" ensemblvep("example.vcf",="" param="param)" i="" will="" report="" this="" bug="" to="" ensembl.="" as="" for="" putting="" together="" the="" summary="" granges,="" yes,="" i="" would="" use="" the="" 'vcfrowid'="" column.="" using="" ex2.vcf="" as="" an="" example,=""> file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > myparam <- VEPParam(output=c(vcf=TRUE)) > vep <- ensemblVEP(file, param=myparam) > vcf <- readVcf(file, "hg19") > gr <- parseCSQToGRanges(vep) Subset the full vcf on 'VCFRowID': > vcf_sub <- vcf[gr$VCFRowID] Add info or geno data as mcols(): > df <- DataFrame(mcols(gr), AF=info(vcf_sub)$AF, DP=geno(vcf_sub)$DP) > mcols(gr) <- df I've checked in version 1.1.1 to devel. 'VCFRowID' is now returned as an integer so you can do the subsetting shown above (previously returned as factor). I also added the 'database' field to the VEPParam database options. This was a recent addition from Ensembl. Testing was done with the most recent VEP API version 71. Valerie On 04/10/2013 09:05 AM, Thomas Sandmann [guest] wrote: > Dear Valerie, > > thanks a lot for making your great ensemblVEP package available. > > I have been using it to assess the consequences of variants detected by the VariantTools package (version 1.6.1). > ensemblVEP retrieves the variantEffectPredictor output, but triggers a number of warnings (see below). > > library(ensemblVEP) > > ## example.vcf is available at http://dl.dropbox.com/u/126180/example.vcf > vcf <- readVcf( "example.vcf", genome="hg19") > > ## the vcf object contains coverage information > vcf > geno(vcf)$DP > > ## running VEP triggers warnings > vep.param <- VEPParam() > output( vep.param )$vcf <- TRUE > > vep <- ensemblVEP( "example.vcf", > genome="hg19", > param=vep.param > ) > > warnings() > > 1: In doTryCatch(return(expr), name, parentenv, handler) : > record 1 (and others?) INFO 'AD:DP:AP' not found > 2: In doTryCatch(return(expr), name, parentenv, handler) > record 1 (and others?) FORMAT '0,2' not found > 3: In doTryCatch(return(expr), name, parentenv, handler) : > record 1 (and others?) FORMAT '2' not found > ... > > I think these warnings refer to the "geno" slot of the vcf file. When I request a VCF object as output from ensemblVEP, the object contains the same elements in its geno slot as the original vcf input file, but they only contain NAs. Is this expected or should the geno slot be passed on to the VCF object generated by ensemblVEP ? > > My final objective is to obtain a GRanges or data.frame containing both the predicted consequences and the coverage / frequencies from the vcf input file for each variant. I have seen that the parseCSQToGRanges returns a "VCFRowID" column associating each row with the original entry in the VCF object. > > Would you recommend to use this column to extract the corresponding rows from the vectors / arrays stored in the "geno" slot ? Or is there a simpler, more elegant solution you could point me to ? > > Thanks a lot ! > Thomas > > -- output of sessionInfo(): > > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] ensemblVEP_1.0.0 VariantAnnotation_1.6.1 Rsamtools_1.12.0 > [4] Biostrings_2.28.0 GenomicRanges_1.12.1 IRanges_1.18.0 > [7] BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.22.1 Biobase_2.20.0 biomaRt_2.16.0 > [4] bitops_1.0-5 BSgenome_1.28.0 compiler_3.0.0 > [7] DBI_0.2-5 GenomicFeatures_1.12.0 RCurl_1.95-4.1 > [10] RSQLite_0.11.2 rtracklayer_1.20.0 stats4_3.0.0 > [13] tools_3.0.0 XML_3.96-1.1 zlibbioc_1.6.0 > > > -- > Sent via the guest posting facility at bioconductor.org. >

ADD COMMENT • link 12.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thomas, As I was preparing to write Ensembl about this 'bug' I realized that your vcf file does not have the '.' space holder for missing data in INFO. Because this column is skipped, FORMAT is now the 8th field in the row (should be INFO). This explains why CSQ is being appended to FORMAT. You said the vcf was created with VariantTools. I'm assuming it was written out with writeVcf()? The case of no INFO should be handled gracefully. Can you show me how the INFO data are represented in the GRanges or VCF before writing out? (e.g., are they '.', NA, NULL etc.) Thanks, Valerie On 04/10/2013 02:51 PM, Valerie Obenchain wrote: > Hi Thomas, > > You've hit an interesting bug. The problem is with how CSQ is returned > from Ensembl. > > The documentation states that CSQ will be returned as an INFO column. > However your file doesn't have INFO data and CSQ is being appended to > the FORMAT variables instead. > > This is a problem when parsing the vcf with scanVcf/readVcf because now > the number of FORMAT variables doesn't match the number of FORMAT data > fields for each sample. This is why you see NA for all geno variables, > they aren't being parsed. > > For example, the FORMAT field is coming back like this, > AD:DP:AP;CSQ=C|ENSG00000128602|ENST00000462420|... > > but the samples only have data for the 3 original fields, > 0,2:2:0,1 > > This seems like a bug and not intended behavior. Ensembl has even > created a new INFO header line in the vcf returned to indicate the data > would be in the INFO column, > > ##INFO=<id=csq,number=.,type=string,description="consequence type="" as=""> predicted by VEP. .... > > If you want to see this yourself you can write the vcf to disk and open > it in an editor. > > param <- VEPParam(output=c(vcf=TRUE), > input=c(output_file="/path/myfile.vcf")) > ensemblVEP("example.vcf", param=param) > > I will report this bug to Ensembl. > > > As for putting together the summary GRanges, yes, I would use the > 'VCFRowID' column. Using ex2.vcf as an example, > > > file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > > myparam <- VEPParam(output=c(vcf=TRUE)) > > vep <- ensemblVEP(file, param=myparam) > > vcf <- readVcf(file, "hg19") > > gr <- parseCSQToGRanges(vep) > > Subset the full vcf on 'VCFRowID': > > vcf_sub <- vcf[gr$VCFRowID] > > Add info or geno data as mcols(): > > df <- DataFrame(mcols(gr), AF=info(vcf_sub)$AF, DP=geno(vcf_sub)$DP) > > mcols(gr) <- df > > I've checked in version 1.1.1 to devel. 'VCFRowID' is now returned as an > integer so you can do the subsetting shown above (previously returned as > factor). I also added the 'database' field to the VEPParam database > options. This was a recent addition from Ensembl. Testing was done with > the most recent VEP API version 71. > > Valerie > > > > > On 04/10/2013 09:05 AM, Thomas Sandmann [guest] wrote: >> Dear Valerie, >> >> thanks a lot for making your great ensemblVEP package available. >> >> I have been using it to assess the consequences of variants detected >> by the VariantTools package (version 1.6.1). >> ensemblVEP retrieves the variantEffectPredictor output, but triggers a >> number of warnings (see below). >> >> library(ensemblVEP) >> >> ## example.vcf is available at http://dl.dropbox.com/u/126180/example.vcf >> vcf <- readVcf( "example.vcf", genome="hg19") >> >> ## the vcf object contains coverage information >> vcf >> geno(vcf)$DP >> >> ## running VEP triggers warnings >> vep.param <- VEPParam() >> output( vep.param )$vcf <- TRUE >> >> vep <- ensemblVEP( "example.vcf", >> genome="hg19", >> param=vep.param >> ) >> >> warnings() >> >> 1: In doTryCatch(return(expr), name, parentenv, handler) : >> record 1 (and others?) INFO 'AD:DP:AP' not found >> 2: In doTryCatch(return(expr), name, parentenv, handler) >> record 1 (and others?) FORMAT '0,2' not found >> 3: In doTryCatch(return(expr), name, parentenv, handler) : >> record 1 (and others?) FORMAT '2' not found >> ... >> >> I think these warnings refer to the "geno" slot of the vcf file. When >> I request a VCF object as output from ensemblVEP, the object contains >> the same elements in its geno slot as the original vcf input file, but >> they only contain NAs. Is this expected or should the geno slot be >> passed on to the VCF object generated by ensemblVEP ? >> >> My final objective is to obtain a GRanges or data.frame containing >> both the predicted consequences and the coverage / frequencies from >> the vcf input file for each variant. I have seen that the >> parseCSQToGRanges returns a "VCFRowID" column associating each row >> with the original entry in the VCF object. >> >> Would you recommend to use this column to extract the corresponding >> rows from the vectors / arrays stored in the "geno" slot ? Or is there >> a simpler, more elegant solution you could point me to ? >> >> Thanks a lot ! >> Thomas >> >> -- output of sessionInfo(): >> >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] ensemblVEP_1.0.0 VariantAnnotation_1.6.1 Rsamtools_1.12.0 >> [4] Biostrings_2.28.0 GenomicRanges_1.12.1 IRanges_1.18.0 >> [7] BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.22.1 Biobase_2.20.0 biomaRt_2.16.0 >> [4] bitops_1.0-5 BSgenome_1.28.0 compiler_3.0.0 >> [7] DBI_0.2-5 GenomicFeatures_1.12.0 RCurl_1.95-4.1 >> [10] RSQLite_0.11.2 rtracklayer_1.20.0 stats4_3.0.0 >> [13] tools_3.0.0 XML_3.96-1.1 zlibbioc_1.6.0 >> >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, thanks a lot for looking following up on this issue. Yes, I generated the input for ensemblVEP with a call to writeVcf. Here's my workflow: 1. Variant calling with the '*callVariants*' function from the '* VariantTools*' package: returns the* 'variants.gr' GRanges object* 2. Generate a Vcf object from the 'variant' GRanges with a call to '* variantGR2Vcf'*: returns *variants.vcf VCF object* 3. write Vcf object to disk as a vcf file with '*writeVcf*': generates variants.vcf file You can find the variants.gr and variants.vcf R objects here: https://dl.dropboxusercontent.com/u/126180/variants.object.Rdata and the exported vcf file from the same data here: https://dl.dropboxusercontent.com/u/126180/variants.vcf Thanks a lot ! Thomas [[alternative HTML version deleted]]

ADD REPLY • link 12.8 years ago Thomas Sandmann ▴ 30

0

Entering edit mode

Perfect. Thanks for sending. There were a couple of things going on. In the case of no INFO, readVcf() was storing a DataFrame of 0 rows and 0 columns and writeVcf() was writing out a vector of '.' the length of ncol(info(vcf)). This resulted in no '.' space holders in the newly written vcf. readVcf() now creates a DataFrame with N rows (number of records) and 0 columns when no INFO are present. This new behavior would allow writeVcf() to get the correct length from ncol(info(vcf)). Instead, to ensure backwards compatibility, I'm using the number of records to determine the length of '.' to write out. Patched versions are 1.7.6 in devel and 1.6.2 in release. Let me know if you run into problems. Sorry I didn't identify this last week. Valerie On 04/15/2013 09:00 AM, Thomas Sandmann wrote: > Hi Valerie, > > thanks a lot for looking following up on this issue. > Yes, I generated the input for ensemblVEP with a call to writeVcf. > > Here's my workflow: > > 1. Variant calling with the '/callVariants/' function from the > '/VariantTools/' package: returns the*'variants.gr <http: variants.gr="">' > GRanges object* > 2. Generate a Vcf object from the 'variant' GRanges with a call to > '/variantGR2Vcf'/: returns *variants.vcf VCF object* > 3. write Vcf object to disk as a vcf file with '/writeVcf/': generates > variants.vcf file > > You can find the variants.gr <http: variants.gr=""> and variants.vcf R > objects here: > https://dl.dropboxusercontent.com/u/126180/variants.object.Rdata > and the exported vcf file from the same data here: > https://dl.dropboxusercontent.com/u/126180/variants.vcf > > Thanks a lot ! > Thomas

ADD REPLY • link 12.8 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks for fixing the functions so quickly. That helps a lot ! Thomas [[alternative HTML version deleted]]

ADD REPLY • link 12.8 years ago Thomas Sandmann ▴ 30

Login before adding your answer.