To whom it may concern,
I am a statistician at the Fred Hutchinson Cancer Research Center and am trying to use the Bioconductor package "VariantAnnotation" to read in data from the 1000 Genomes Project (http://www.1000genomes.org/). I am getting an error that I cannot decipher and would appreciate any help in solving the problem. In case it is relevant, I am using a shared server on our campus with almost 400 GB of total memory.
Below are the commands I am using:
library(VariantAnnotation)
chr21 <- readVcf(file = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz", genome = "hg19")
Below is the error output:
Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
path: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
index: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz.tbi
path: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
Below is the session info:
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] VariantAnnotation_1.10.5 Rsamtools_1.16.1 Biostrings_2.32.1
[4] XVector_0.4.0 GenomicRanges_1.16.4 GenomeInfoDb_1.0.2
[7] IRanges_1.22.10 BiocGenerics_0.10.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.1 base64enc_0.1-2 BatchJobs_1.4
[4] BBmisc_1.7 Biobase_2.24.0 BiocParallel_0.6.1
[7] biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6
[10] BSgenome_1.32.0 checkmate_1.5.0 codetools_0.2-9
[13] DBI_0.3.1 digest_0.6.4 fail_1.2
[16] foreach_1.4.2 GenomicAlignments_1.0.6 GenomicFeatures_1.16.3
[19] iterators_1.0.7 RCurl_1.95-4.3 RSQLite_0.11.4
[22] rtracklayer_1.24.2 sendmailR_1.2-1 stats4_3.1.1
[25] stringr_0.6.2 tools_3.1.1 XML_3.98-1.1
[28] zlibbioc_1.10.0
Thank you for your time and consideration.
Wade
Have you figured this out? I was trying to read in a single chromosome (22) today (the file is 205 MB) and am getting the same memory error, also on a large server.
Use the strategies outlined above: read only what you need (using ScanVcfParam), and iterate through large files in chunks (using yieldSize).
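For anyone landing here later, a minimal sketch of both strategies. To keep it runnable without downloading anything, it uses the small chr22.vcf.gz example file that ships with VariantAnnotation rather than the 1000 Genomes file from the question. The genomic range, the choice of fields (GT only, no INFO) and the yieldSize of 4000 are illustrative assumptions, not recommendations. The two approaches are shown separately because, as far as I know, yieldSize applies only when no ranges are requested.

```r
library(VariantAnnotation)

## Small example file bundled with the package (stand-in for the remote file).
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

## Strategy 1: read only what you want.
## Restrict to one region (illustrative coordinates) and to the GT
## genotype field; info = NA means "import no INFO fields".
rng <- GRanges("22", IRanges(start = 50301422, end = 50312106))
param_region <- ScanVcfParam(info = NA, geno = "GT", which = rng)
vcf_region <- readVcf(fl, "hg19", param = param_region)
dim(vcf_region)

## Strategy 2: iterate through the whole file in chunks.
## yieldSize sets how many records each readVcf() call returns;
## no 'which' is given here.
param_chunk <- ScanVcfParam(info = NA, geno = "GT")
tab <- TabixFile(fl, yieldSize = 4000)
open(tab)
total <- 0L
while (nrow(chunk <- readVcf(tab, "hg19", param = param_chunk)) > 0) {
    ## Process each chunk here; this sketch only counts records.
    total <- total + nrow(chunk)
}
close(tab)
total
```

For the file in the original question, the same pattern should apply with the FTP path (or, probably better, a local copy of the .vcf.gz and its .tbi index) in place of fl, and with seqnames matching that file (likely "21").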