Entering edit mode
sskimb
▴
10
@sskimb-10162
Last seen 5.4 years ago
I would like to report a bug in readVcf (VariantAnnotation). This bug produces duplicated records when reading 1000 Genomes Phase3 vcf files.
Here is a minimal code that reproduces the bug:
> library(VariantAnnotation)
> ikGFile <- "/data/1kG/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
> param <- GRanges("21", IRanges(start=c(14879074,14879077), width=1))
> ikGvcf <- readVcf(ikGFile, 'hg19', param = param )
> rowRanges(ikGvcf[duplicated(names(ikGvcf))])
GRanges object with 3 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID REF
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
esv3646383;esv3646384 21 14626476 * | <NA> T
esv3646396 21 14878948 * | <NA> A
rs549025226 21 14879070-14879079 * | <NA> GTATATGTAT
ALT QUAL FILTER
<CharacterList> <numeric> <character>
esv3646383;esv3646384 <CN0>,<CN2> 100 PASS
esv3646396 <CN0> 100 PASS
rs549025226 G 100 PASS
-------
seqinfo: 86 sequences from hg19 genome
The first two rows are CNVs, which have become duplicated many many times, while the last one only twice. It appears that this bug is related to CNVs and their overlapping multi-base SNVs. That is, the ranges of those CNVs are 21:14626476-14956108 and 21:14878948-14883343, respectively. This bug is not restricted to this region, but occurs throughout the chromosome (omitted for brevity).
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] VariantAnnotation_1.30.1 Rsamtools_2.0.0 Biostrings_2.52.0
[4] XVector_0.24.0 SummarizedExperiment_1.14.0 DelayedArray_0.10.0
[7] BiocParallel_1.18.0 matrixStats_0.54.0 Biobase_2.44.0
[10] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 IRanges_2.18.1
[13] S4Vectors_0.22.0 BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 compiler_3.6.0 prettyunits_1.0.2
[4] GenomicFeatures_1.36.3 bitops_1.0-6 tools_3.6.0
[7] zlibbioc_1.30.0 progress_1.2.2 biomaRt_2.40.1
[10] digest_0.6.20 bit_1.1-14 BSgenome_1.52.0
[13] RSQLite_2.1.1 memoise_1.1.0 lattice_0.20-38
[16] pkgconfig_2.0.2 rlang_0.4.0 Matrix_1.2-17
[19] DBI_1.0.0 rstudioapi_0.10 GenomeInfoDbData_1.2.1
[22] rtracklayer_1.44.0 httr_1.4.0 stringr_1.4.0
[25] hms_0.4.2 bit64_0.9-7 grid_3.6.0
[28] R6_2.4.0 AnnotationDbi_1.46.0 XML_3.98-1.20
[31] magrittr_1.5 blob_1.1.1 GenomicAlignments_1.20.1
[34] assertthat_0.2.1 stringi_1.4.3 RCurl_1.95-4.12
[37] crayon_1.3.4