Question: readVcf: duplicated records produced importing 1kG vcf
gravatar for sskimb
4 months ago by
sskimb10 wrote:

I would like to report a bug in readVcf (VariantAnnotation). This bug produces duplicated records when reading 1000 Genomes Phase3 vcf files.

Here is a minimal code that reproduces the bug:

> library(VariantAnnotation)
> ikGFile <- "/data/1kG/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
> param <- GRanges("21", IRanges(start=c(14879074,14879077), width=1))
> ikGvcf <- readVcf(ikGFile, 'hg19', param = param )
> rowRanges(ikGvcf[duplicated(names(ikGvcf))])

GRanges object with 3 ranges and 5 metadata columns:
                        seqnames            ranges strand | paramRangeID            REF
                           <Rle>         <IRanges>  <Rle> |     <factor> <DNAStringSet>
  esv3646383;esv3646384       21          14626476      * |         <NA>              T
             esv3646396       21          14878948      * |         <NA>              A
            rs549025226       21 14879070-14879079      * |         <NA>     GTATATGTAT
                                    ALT      QUAL      FILTER
                        <CharacterList> <numeric> <character>
  esv3646383;esv3646384     <CN0>,<CN2>       100        PASS
             esv3646396           <CN0>       100        PASS
            rs549025226               G       100        PASS
  seqinfo: 86 sequences from hg19 genome

The first two rows are CNVs, which have become duplicated many many times, while the last one only twice. It appears that this bug is related to CNVs and their overlapping multi-base SNVs. That is, the ranges of those CNVs are 21:14626476-14956108 and 21:14878948-14883343, respectively. This bug is not restricted to this region, but occurs throughout the chromosome (omitted for brevity).

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] VariantAnnotation_1.30.1    Rsamtools_2.0.0             Biostrings_2.52.0          
 [4] XVector_0.24.0              SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
 [7] BiocParallel_1.18.0         matrixStats_0.54.0          Biobase_2.44.0             
[10] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0         IRanges_2.18.1             
[13] S4Vectors_0.22.0            BiocGenerics_0.30.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1               compiler_3.6.0           prettyunits_1.0.2       
 [4] GenomicFeatures_1.36.3   bitops_1.0-6             tools_3.6.0             
 [7] zlibbioc_1.30.0          progress_1.2.2           biomaRt_2.40.1          
[10] digest_0.6.20            bit_1.1-14               BSgenome_1.52.0         
[13] RSQLite_2.1.1            memoise_1.1.0            lattice_0.20-38         
[16] pkgconfig_2.0.2          rlang_0.4.0              Matrix_1.2-17           
[19] DBI_1.0.0                rstudioapi_0.10          GenomeInfoDbData_1.2.1  
[22] rtracklayer_1.44.0       httr_1.4.0               stringr_1.4.0           
[25] hms_0.4.2                bit64_0.9-7              grid_3.6.0              
[28] R6_2.4.0                 AnnotationDbi_1.46.0     XML_3.98-1.20           
[31] magrittr_1.5             blob_1.1.1               GenomicAlignments_1.20.1
[34] assertthat_0.2.1         stringi_1.4.3            RCurl_1.95-4.12         
[37] crayon_1.3.4            
software error • 92 views
ADD COMMENTlink written 4 months ago by sskimb10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 236 users visited in the last hour