Question

Error in VCF parsing with VariantAnnotation

2

mumichae ▴ 20

@mumichae-20187

Last seen 6.0 years ago

Hi,

I have been annotating VCF files with VEP.

utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_no_anno.vcf.gz", "example_no_anno.vcf.gz")
utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_vep_anno.vcf.gz", "example_vep.vcf.gz")

The VEP command, run on the command line:

vep -i example_no_anno.vcf.gz --vcf TRUE --output_file example_vep.vcf.gz --compress_output bgzip --minimal TRUE --allele_number TRUE --everything TRUE --assembly GRCh37 --db_version 94 --merged TRUE --user anonymous --port 3337 --host ensembldb.ensembl.org --cache TRUE --dir dir_cache/ensembl-vep/94/cachedir --sift s --polyphen s --total_length TRUE --numbers TRUE --symbol TRUE --hgvs TRUE --ccds TRUE --uniprot TRUE --xref_refseq TRUE --af TRUE --max_af TRUE --af_exac TRUE --af_gnomad TRUE --pubmed TRUE --canonical TRUE --biotype TRUE

However, after reading the annotated VCF file into R, some lines appear to be split at arbitrary positions, with the remainder parsed as a new record. In a minimal example with one variant, I end up with two entries in R, and the second one has half of the INFO column as its chromosome name. Could this be a bug?

library(VariantAnnotation)

# plain vcf file
vcf <- readVcf("example_no_anno.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf))

# annotated with VEP 
# contains a very long line, but no errors in the format
vcf <- readVcf("example_vep.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf)[1])
str(seqlevels(vcf)[2])
str(seqlevels(vcf)[3])
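
To narrow down whether the split is already present in the file or only appears during parsing, the raw data lines can be inspected without VariantAnnotation. The following is a minimal sketch in base R, not part of the original report: the helper name vcf_line_summary is hypothetical, and it assumes that gzfile() can read the bgzip-compressed file as an ordinary gzip stream.

```r
# Hypothetical helper (not from VariantAnnotation): summarise each data line
# of a VCF by its first field, number of tab-separated fields, and length.
# Assumption: gzfile() reads bgzip output as a regular gzip stream.
vcf_line_summary <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)

  # Drop meta-information (##...) and the header line (#CHROM...)
  body <- lines[!startsWith(lines, "#")]

  # Count tab characters per line; the number of fields is tabs + 1
  tabs <- lengths(regmatches(body, gregexpr("\t", body, fixed = TRUE)))

  data.frame(
    chrom    = sub("\t.*$", "", body),  # text before the first tab
    n_fields = tabs + 1,
    n_chars  = nchar(body),
    stringsAsFactors = FALSE
  )
}
```

Calling vcf_line_summary("example_vep.vcf.gz") should return one row per data line. A VCF without sample columns has 8 fields per data line, so a row with fewer fields or an unexpected chrom value would point to a problem in the file itself, while a single well-formed row would suggest that the split happens while the file is being read.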

Please let me know if you need more input to replicate this error.

Best, Michaela Müller

software error VariantAnnotation ensemblVEP • 1.9k views

updated 6.0 years ago by Valerie Obenchain ★ 6.8k • written 6.1 years ago by mumichae ▴ 20

0

Could you post your sessionInfo(), please? Thanks.

6.1 years ago • Hervé Pagès 16k

0

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.6 (Nitrogen)

Matrix products: default
BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] plotly_4.8.0                ggpubr_0.2                  scales_1.0.0                cowplot_0.9.4               tidyr_0.8.2                 ggplot2_3.1.0              
 [7] magrittr_1.5                ensemblVEP_1.24.0           AnnotationHub_2.14.2        dplyr_0.7.8                 data.table_1.12.0           VariantAnnotation_1.28.7   
[13] Rsamtools_1.34.0            Biostrings_2.50.2           XVector_0.22.0              SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
[19] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1           
[25] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.0                    viridisLite_0.3.0             jsonlite_1.6                  bit64_0.9-7                   shiny_1.2.0                   assertthat_0.2.0             
 [7] interactiveDisplayBase_1.20.0 BiocManager_1.30.4            blob_1.1.1                    BSgenome_1.50.0               GenomeInfoDbData_1.2.0        yaml_2.2.0                   
[13] progress_1.2.0                pillar_1.3.1                  RSQLite_2.1.1                 lattice_0.20-38               glue_1.3.0                    digest_0.6.18                
[19] promises_1.0.1                colorspace_1.3-2              htmltools_0.3.6               httpuv_1.4.5.1                Matrix_1.2-15                 plyr_1.8.4                   
[25] XML_3.98-1.16                 pkgconfig_2.0.2               biomaRt_2.38.0                zlibbioc_1.28.0               purrr_0.2.5                   xtable_1.8-3                 
[31] later_0.7.5                   tibble_2.0.0                  DT_0.5                        withr_2.1.2                   GenomicFeatures_1.34.1        lazyeval_0.2.1               
[37] crayon_1.3.4                  mime_0.6                      memoise_1.1.0                 tools_3.5.1                   prettyunits_1.0.2             hms_0.4.2                    
[43] stringr_1.3.1                 munsell_0.5.0                 AnnotationDbi_1.44.0          bindrcpp_0.2.2                compiler_3.5.1                rlang_0.3.1                  
[49] grid_3.5.1                    RCurl_1.95-4.11               rstudioapi_0.9.0              htmlwidgets_1.3               labeling_0.3                  bitops_1.0-6                 
[55] gtable_0.2.0                  DBI_1.0.0                     R6_2.3.0                      GenomicAlignments_1.18.1      rtracklayer_1.42.1            bit_1.1-14                   
[61] bindr_0.1.1                   stringi_1.2.4                 Rcpp_1.0.0                    tidyselect_0.2.5

6.1 years ago • mumichae ▴ 20

score 0 · Answer 1 · 2019-03-17

0

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.2 years ago

United States

This question was also posted on GitHub. The conversation has moved to https://github.com/Bioconductor/VariantAnnotation/issues/25.

6.0 years ago • Valerie Obenchain ★ 6.8k