Error in VCF parsing with VariantAnnotation
1
2
Entering edit mode
mumichae ▴ 20
@mumichae-20187
Last seen 5.0 years ago

Hi,

I have been annotating VCF files with VEP.

utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_no_anno.vcf.gz", "example_no_anno.vcf.gz")
    utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_vep_anno.vcf.gz", "example_vep.vcf.gz")

VEP command on the command line

vep -i example_no_anno.vcf.gz --vcf TRUE --output_file example_vep.vcf.gz --compress_output bgzip --minimal TRUE --allele_number TRUE --everything TRUE --assembly GRCh37 --db_version 94 --merged TRUE --user anonymous --port 3337 --host ensembldb.ensembl.org --cache TRUE --dir dir_cache/ensembl-vep/94/cachedir --sift s --polyphen s --total_length TRUE --numbers TRUE --symbol TRUE --hgvs TRUE --ccds TRUE --uniprot TRUE --xref_refseq TRUE --af TRUE --max_af TRUE --af_exac TRUE --af_gnomad TRUE --pubmed TRUE --canonical TRUE --biotype TRUE

However after reading the annotated VCF file, some lines seem to be randomly split and parsed as a new line. In a minimal example with 1 variant, I end up with 2 entries in R, where the second one has half of the info column as chromosome names. Could this be a bug?

library(VariantAnnotation)

# plain vcf file
vcf <- readVcf("example_no_anno.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf))

# annotated with VEP 
# contains very long line but no errors in the format
vcf <- readVcf("example_vep.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf)[1])
str(seqlevels(vcf)[2])
str(seqlevels(vcf)[3])

Please let me know, if you need more input to replicate this error.

Best, Michaela Müller

software error VariantAnnotation ensemblVEP • 1.4k views
ADD COMMENT
0
Entering edit mode

sessionInfo() please? Thanks

ADD REPLY
0
Entering edit mode
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.6 (Nitrogen)

Matrix products: default
BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] plotly_4.8.0                ggpubr_0.2                  scales_1.0.0                cowplot_0.9.4               tidyr_0.8.2                 ggplot2_3.1.0              
 [7] magrittr_1.5                ensemblVEP_1.24.0           AnnotationHub_2.14.2        dplyr_0.7.8                 data.table_1.12.0           VariantAnnotation_1.28.7   
[13] Rsamtools_1.34.0            Biostrings_2.50.2           XVector_0.22.0              SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
[19] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1           
[25] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.0                    viridisLite_0.3.0             jsonlite_1.6                  bit64_0.9-7                   shiny_1.2.0                   assertthat_0.2.0             
 [7] interactiveDisplayBase_1.20.0 BiocManager_1.30.4            blob_1.1.1                    BSgenome_1.50.0               GenomeInfoDbData_1.2.0        yaml_2.2.0                   
[13] progress_1.2.0                pillar_1.3.1                  RSQLite_2.1.1                 lattice_0.20-38               glue_1.3.0                    digest_0.6.18                
[19] promises_1.0.1                colorspace_1.3-2              htmltools_0.3.6               httpuv_1.4.5.1                Matrix_1.2-15                 plyr_1.8.4                   
[25] XML_3.98-1.16                 pkgconfig_2.0.2               biomaRt_2.38.0                zlibbioc_1.28.0               purrr_0.2.5                   xtable_1.8-3                 
[31] later_0.7.5                   tibble_2.0.0                  DT_0.5                        withr_2.1.2                   GenomicFeatures_1.34.1        lazyeval_0.2.1               
[37] crayon_1.3.4                  mime_0.6                      memoise_1.1.0                 tools_3.5.1                   prettyunits_1.0.2             hms_0.4.2                    
[43] stringr_1.3.1                 munsell_0.5.0                 AnnotationDbi_1.44.0          bindrcpp_0.2.2                compiler_3.5.1                rlang_0.3.1                  
[49] grid_3.5.1                    RCurl_1.95-4.11               rstudioapi_0.9.0              htmlwidgets_1.3               labeling_0.3                  bitops_1.0-6                 
[55] gtable_0.2.0                  DBI_1.0.0                     R6_2.3.0                      GenomicAlignments_1.18.1      rtracklayer_1.42.1            bit_1.1-14                   
[61] bindr_0.1.1                   stringi_1.2.4                 Rcpp_1.0.0                    tidyselect_0.2.5
ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.2 years ago
United States

This question was also posted on github. Conversation has moved to https://github.com/Bioconductor/VariantAnnotation/issues/25.

ADD COMMENT

Login before adding your answer.

Traffic: 741 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6