cannot allocate memory?
Haiying.Kong ▴ 110
@haiyingkong-9254
Last seen 3.1 years ago
Germany

I am working on a Linux system with 250 GB of memory. Currently another program is running that uses less than 20 GB.

If I try to load a VCF file (<16 GB) with the readVcf() function in the VariantAnnotation package, I get this error message:

> germ.mut = readVcf("/home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/Lock/GermlineMutation/GermlineMutations.vcf", "hg19")
Error: scanVcf: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
path: /home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/Lock/GermlineMutation/GermlineMutations.vcf

I tried gc() before running the line and got the same error message.

I could follow the solution in: Error in reading 1000 genomes data

But since I have so much memory, is there any way to just load the whole VCF at once?

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: openSUSE 13.1 (Bottle) (x86_64)

locale:
[1] LC_CTYPE=en_GB.UTF-8          LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8           LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8       LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8          LC_NAME=en_GB.UTF-8
[11] LC_MEASUREMENT=en_GB.UTF-8    LC_IDENTIFICATION=en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] cgdv17_0.12.0              VariantAnnotation_1.20.3
[3] Rsamtools_1.26.2           Biostrings_2.42.1
[5] XVector_0.14.1             SummarizedExperiment_1.4.0
[7] Biobase_2.34.0             GenomicRanges_1.26.4
[9] GenomeInfoDb_1.10.3        IRanges_2.8.2
[11] S4Vectors_0.12.2           BiocGenerics_0.20.0
[13] BiocInstaller_1.24.0       xlsx_0.5.7
[15] xlsxjars_0.6.1             rJava_0.9-8

loaded via a namespace (and not attached):
[1] Rcpp_0.12.10             AnnotationDbi_1.36.2     GenomicAlignments_1.10.1
[4] zlibbioc_1.20.0          BiocParallel_1.8.2       BSgenome_1.42.0
[7] lattice_0.20-35          tools_3.3.3              grid_3.3.3
[10] DBI_0.6-1                digest_0.6.12            Matrix_1.2-8
[13] rtracklayer_1.34.2       bitops_1.0-6             biomaRt_2.30.0
[16] RCurl_1.95-4.8           memoise_1.1.0            RSQLite_1.1-2
[19] GenomicFeatures_1.26.4   XML_3.98-1.6

@martin-morgan-1513
Last seen 9 days ago
United States

Input only the information you're interested in, using the specialized functions readInfo() and readGeno(), or more generally readVcf() with ScanVcfParam(). If the data are still too large, iterate through the file using VcfFile() with a yieldSize argument and GenomicFiles::reduceByYield(). The relevant help pages and package vignettes (e.g., on the landing page https://bioconductor.org/packages/VariantAnnotation) have examples that might help you to pose additional, more specific questions if you run into problems.
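
A minimal sketch of both approaches, assuming VariantAnnotation and GenomicFiles are installed; the file name is a placeholder, and the "DP" and "GT" fields are examples -- check which fields your file actually has with scanVcfHeader():

```r
library(VariantAnnotation)
library(GenomicFiles)

fl <- "GermlineMutations.vcf"   # placeholder path

## Option 1: restrict the import to the fields you actually need.
param <- ScanVcfParam(info = "DP", geno = "GT")
vcf <- readVcf(fl, "hg19", param = param)

## Option 2: iterate in chunks of 100,000 records instead of
## loading everything at once, reducing the chunk results as you go.
vf <- VcfFile(fl, yieldSize = 100000)
n <- reduceByYield(vf,
    YIELD  = function(x) readVcf(x, "hg19", param = param),
    MAP    = function(chunk) nrow(chunk),   # e.g., count records per chunk
    REDUCE = `+`)
```

The MAP step here just counts records; in practice it would hold whatever per-chunk computation you need, so only one chunk is in memory at a time.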


How is it decided that "the data are still too large"? I should have more than 200 GB of memory, and the VCF file I am trying to load is less than 16 GB.


By 'still too large' I meant that you receive the same error about memory allocation.

When I look at the code, it seems like the error message could be moderately misleading. The error occurs if, for one of the components of the VCF file (e.g., the GT field), the product of the dimensions of the resulting matrix is larger than the maximum integer size (about 2.14 billion). With 1000 samples and a field taking on three values per sample, the maximum number of variants would be about 715,000 (2.14 billion / 1000 / 3). Making the code work with much larger data sets isn't a priority for me -- process the data in chunks to manage memory, allow other processes and users to access the computer's resources, and facilitate parallel evaluation.
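
The arithmetic above can be checked directly in R; the 1000-sample, three-values-per-sample figures are just the example numbers from the paragraph:

```r
## The largest value an ordinary R integer (and hence the product of
## matrix dimensions in this code path) can hold:
.Machine$integer.max
## [1] 2147483647   # about 2.14 billion

## With 1000 samples and 3 values per sample per variant, the largest
## number of variants before the product overflows:
floor(.Machine$integer.max / (1000 * 3))
## [1] 715827
```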

There are a number of additional possibilities. Remember that a VCF file is a text file, but one operates in R on different data types: the character value '1' in a VCF file is a single byte, but is represented by a double in R, taking up 8 bytes. Depending on allocation patterns, the memory available to the operating system may also become fragmented, so that although far more bytes are available in total than are required, no single contiguous block is large enough. Etc.
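
A rough illustration of the text-versus-in-memory difference; the exact sizes reported are platform-dependent, and the multiplier for a real VCF also depends on which fields are imported:

```r
## A single-digit value is one byte in the VCF text file...
nchar("1", type = "bytes")
## [1] 1

## ...but once parsed into a numeric value it occupies 8 bytes of data,
## plus fixed R object overhead (reported sizes vary by platform):
object.size(1)

## So a 16 GB text file can need a multiple of that once parsed,
## before counting row names, column names, and other metadata.
```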


Thank you very much for the explanation.