I am working on a Linux system with 250 GB of memory. Currently another program is running on it which uses less than 20 GB.
If I try to load a VCF file (<16 GB) with the readVcf function in the VariantAnnotation package, I get this error message:
> germ.mut = readVcf("/home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/Lock/GermlineMutation/GermlineMutations.vcf", "hg19")
Error: scanVcf: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
  path: /home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/Lock/GermlineMutation/GermlineMutations.vcf
I tried gc() before running the line, and got the same error message.
I could follow the solution on: Error in reading 1000 genomes data
But since I have so much memory, is there any way to just load the whole VCF at once?
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: openSUSE 13.1 (Bottle) (x86_64)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=en_GB.UTF-8
 [9] LC_ADDRESS=en_GB.UTF-8     LC_TELEPHONE=en_GB.UTF-8
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] cgdv17_0.12.0              VariantAnnotation_1.20.3
 [3] Rsamtools_1.26.2           Biostrings_2.42.1
 [5] XVector_0.14.1             SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             GenomicRanges_1.26.4
 [9] GenomeInfoDb_1.10.3        IRanges_2.8.2
[11] S4Vectors_0.12.2           BiocGenerics_0.20.0
[13] BiocInstaller_1.24.0       xlsx_0.5.7
[15] xlsxjars_0.6.1             rJava_0.9-8

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10             AnnotationDbi_1.36.2     GenomicAlignments_1.10.1
 [4] zlibbioc_1.20.0          BiocParallel_1.8.2       BSgenome_1.42.0
 [7] lattice_0.20-35          tools_3.3.3              grid_3.3.3
[10] DBI_0.6-1                digest_0.6.12            Matrix_1.2-8
[13] rtracklayer_1.34.2       bitops_1.0-6             biomaRt_2.30.0
[16] RCurl_1.95-4.8           memoise_1.1.0            RSQLite_1.1-2
[19] GenomicFeatures_1.26.4   XML_3.98-1.6
Thank you very much for your reply.
How is it decided that "the data are still too large"? I should have more than 200 GB of memory available, and the VCF file I am trying to load is less than 16 GB.
By 'still too large' I meant that you receive the same error about memory allocation.
When I look at the code, it seems the error message could be moderately misleading. The error occurs if, for one of the components of the VCF file (e.g., the GT field), the product of the dimensions of the resulting matrix would be larger than the maximum integer size (about 2.14 billion). With 1000 samples and a field taking on three values per sample, the maximum number of variants would be about 715,000 (2.14 billion / 1000 / 3). Making the code work with much larger data sets isn't a priority for me -- instead, process the file in chunks to manage memory, to allow other processes and users to access the computer's resources, and to facilitate parallel evaluation.
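One way to do the chunk-wise processing is to iterate over the file with a VcfFile object and a yieldSize, following the iteration pattern described in the VariantAnnotation documentation. This is only a minimal sketch, assuming the plain-text VCF is first bgzip-compressed and tabix-indexed; the chunk size and file path are placeholders to adapt:

library(VariantAnnotation)

## bgzip-compress and index the plain-text VCF so it can be read incrementally
fl  <- Rsamtools::bgzip("GermlineMutations.vcf")
idx <- Rsamtools::indexTabix(fl, format = "vcf")

## 'yieldSize' controls how many records each readVcf() call returns
vcf <- VcfFile(fl, index = idx, yieldSize = 100000)
open(vcf)
repeat {
    chunk <- readVcf(vcf, "hg19")
    if (nrow(chunk) == 0L)
        break
    ## ... process 'chunk' here: filter, summarize, write results out, ...
}
close(vcf)

Restricting the imported fields with the 'param' argument, e.g. readVcf(vcf, "hg19", param = ScanVcfParam(info = NA, geno = "GT")), further reduces the in-memory footprint of each chunk.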
There are a number of additional possibilities. Remember that a VCF file is a text file, but in R one operates on different data types, so the character value '1' is a single byte in the VCF file but is represented by a double in R and therefore takes up 8 bytes. Depending on memory allocation patterns, the memory available to the operating system may also become fragmented, so that although many more than x bytes are free in total, every contiguous block is smaller than the amount required. Etc.
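To make the expansion concrete, here is a rough illustration (the exact byte counts include a small per-object overhead and vary a little between platforms):

## one million values stored as different R types
object.size(double(1e6))          # about 8 MB: 8 bytes per double
object.size(integer(1e6))         # about 4 MB: 4 bytes per integer
object.size(as.character(1:1e6))  # larger still: one string object per element

So a field that looks like a few characters per sample in the text file can occupy several times that much once parsed into R matrices.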
Thank you very much for the explanation.