Question: Error in VCF parsing with VariantAnnotation
2
gravatar for mumichae
6 months ago by
mumichae20
mumichae20 wrote:

Hi,

I have been annotating VCF files with VEP.

utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_no_anno.vcf.gz", "example_no_anno.vcf.gz")
    utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_vep_anno.vcf.gz", "example_vep.vcf.gz")

VEP command on the command line

vep -i example_no_anno.vcf.gz --vcf TRUE --output_file example_vep.vcf.gz --compress_output bgzip --minimal TRUE --allele_number TRUE --everything TRUE --assembly GRCh37 --db_version 94 --merged TRUE --user anonymous --port 3337 --host ensembldb.ensembl.org --cache TRUE --dir dir_cache/ensembl-vep/94/cachedir --sift s --polyphen s --total_length TRUE --numbers TRUE --symbol TRUE --hgvs TRUE --ccds TRUE --uniprot TRUE --xref_refseq TRUE --af TRUE --max_af TRUE --af_exac TRUE --af_gnomad TRUE --pubmed TRUE --canonical TRUE --biotype TRUE

However after reading the annotated VCF file, some lines seem to be randomly split and parsed as a new line. In a minimal example with 1 variant, I end up with 2 entries in R, where the second one has half of the info column as chromosome names. Could this be a bug?

library(VariantAnnotation)

# plain vcf file
vcf <- readVcf("example_no_anno.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf))

# annotated with VEP 
# contains very long line but no errors in the format
vcf <- readVcf("example_vep.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf)[1])
str(seqlevels(vcf)[2])
str(seqlevels(vcf)[3])

Please let me know, if you need more input to replicate this error.

Best, Michaela Müller

ADD COMMENTlink modified 6 months ago by Valerie Obenchain6.7k • written 6 months ago by mumichae20

sessionInfo() please? Thanks

ADD REPLYlink written 6 months ago by Hervé Pagès ♦♦ 14k
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.6 (Nitrogen)

Matrix products: default
BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] plotly_4.8.0                ggpubr_0.2                  scales_1.0.0                cowplot_0.9.4               tidyr_0.8.2                 ggplot2_3.1.0              
 [7] magrittr_1.5                ensemblVEP_1.24.0           AnnotationHub_2.14.2        dplyr_0.7.8                 data.table_1.12.0           VariantAnnotation_1.28.7   
[13] Rsamtools_1.34.0            Biostrings_2.50.2           XVector_0.22.0              SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
[19] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1           
[25] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.0                    viridisLite_0.3.0             jsonlite_1.6                  bit64_0.9-7                   shiny_1.2.0                   assertthat_0.2.0             
 [7] interactiveDisplayBase_1.20.0 BiocManager_1.30.4            blob_1.1.1                    BSgenome_1.50.0               GenomeInfoDbData_1.2.0        yaml_2.2.0                   
[13] progress_1.2.0                pillar_1.3.1                  RSQLite_2.1.1                 lattice_0.20-38               glue_1.3.0                    digest_0.6.18                
[19] promises_1.0.1                colorspace_1.3-2              htmltools_0.3.6               httpuv_1.4.5.1                Matrix_1.2-15                 plyr_1.8.4                   
[25] XML_3.98-1.16                 pkgconfig_2.0.2               biomaRt_2.38.0                zlibbioc_1.28.0               purrr_0.2.5                   xtable_1.8-3                 
[31] later_0.7.5                   tibble_2.0.0                  DT_0.5                        withr_2.1.2                   GenomicFeatures_1.34.1        lazyeval_0.2.1               
[37] crayon_1.3.4                  mime_0.6                      memoise_1.1.0                 tools_3.5.1                   prettyunits_1.0.2             hms_0.4.2                    
[43] stringr_1.3.1                 munsell_0.5.0                 AnnotationDbi_1.44.0          bindrcpp_0.2.2                compiler_3.5.1                rlang_0.3.1                  
[49] grid_3.5.1                    RCurl_1.95-4.11               rstudioapi_0.9.0              htmlwidgets_1.3               labeling_0.3                  bitops_1.0-6                 
[55] gtable_0.2.0                  DBI_1.0.0                     R6_2.3.0                      GenomicAlignments_1.18.1      rtracklayer_1.42.1            bit_1.1-14                   
[61] bindr_0.1.1                   stringi_1.2.4                 Rcpp_1.0.0                    tidyselect_0.2.5
ADD REPLYlink written 6 months ago by mumichae20
Answer: Error in VCF parsing with VariantAnnotation
0
gravatar for Valerie Obenchain
6 months ago by
United States
Valerie Obenchain6.7k wrote:

This question was also posted on github. Conversation has moved to https://github.com/Bioconductor/VariantAnnotation/issues/25.

ADD COMMENTlink written 6 months ago by Valerie Obenchain6.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 195 users visited in the last hour