Question: cannot allocate memory?

Asked 22 months ago by Haiying.Kong (Germany):

I am working on a Linux system with 250 GB of memory. Currently another program is running that uses less than 20 GB.

If I try to load a VCF file (<16 GB) with the readVcf function from the VariantAnnotation package, I get this error message:

> germ.mut = readVcf("/home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/Lock/GermlineMutation/GermlineMutations.vcf", "hg19")
Error: scanVcf: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
  path: /home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/Lock/GermlineMutation/GermlineMutations.vcf

I tried gc() before running the line and got the same error message.

I could follow the solution in: Error in reading 1000 genomes data

But since I have so much memory, is there a way to just load the whole VCF at once?

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: openSUSE 13.1 (Bottle) (x86_64)

locale:
 [1] LC_CTYPE=en_GB.UTF-8          LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8           LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8       LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8          LC_NAME=en_GB.UTF-8
 [9] LC_ADDRESS=en_GB.UTF-8        LC_TELEPHONE=en_GB.UTF-8
[11] LC_MEASUREMENT=en_GB.UTF-8    LC_IDENTIFICATION=en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] cgdv17_0.12.0              VariantAnnotation_1.20.3
 [3] Rsamtools_1.26.2           Biostrings_2.42.1
 [5] XVector_0.14.1             SummarizedExperiment_1.4.0
 [7] Biobase_2.34.0             GenomicRanges_1.26.4
 [9] GenomeInfoDb_1.10.3        IRanges_2.8.2
[11] S4Vectors_0.12.2           BiocGenerics_0.20.0
[13] BiocInstaller_1.24.0       xlsx_0.5.7
[15] xlsxjars_0.6.1             rJava_0.9-8

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10             AnnotationDbi_1.36.2     GenomicAlignments_1.10.1
 [4] zlibbioc_1.20.0          BiocParallel_1.8.2       BSgenome_1.42.0
 [7] lattice_0.20-35          tools_3.3.3              grid_3.3.3
[10] DBI_0.6-1                digest_0.6.12            Matrix_1.2-8
[13] rtracklayer_1.34.2       bitops_1.0-6             biomaRt_2.30.0
[16] RCurl_1.95-4.8           memoise_1.1.0            RSQLite_1.1-2
[19] GenomicFeatures_1.26.4   XML_3.98-1.6

Answer: cannot allocate memory?

Answered 22 months ago by Martin Morgan (United States):

Input only the information you're interested in, using the specialized functions readInfo() and readGeno(), or more generally readVcf() with ScanVcfParam(). If the data are still too large, iterate through the file using VcfFile() with a yieldSize argument and GenomicFiles::reduceByYield(). The relevant help pages and package vignettes (e.g., on the landing page https://bioconductor.org/packages/VariantAnnotation) have examples that might help you pose additional, more specific questions if you run into problems.
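A minimal sketch of both approaches (the file names, chunk size, and the choice of the GT field are placeholders for illustration; streaming requires a bgzip-compressed, tabix-indexed file):

```r
library(VariantAnnotation)

## Read only the GT genotype field, skipping INFO entirely
param <- ScanVcfParam(info = NA, geno = "GT")
vcf <- readVcf("GermlineMutations.vcf", "hg19", param = param)

## Or stream through the file in chunks of 100,000 records,
## here just counting variants as an example reduction
fl <- VcfFile("GermlineMutations.vcf.gz", yieldSize = 100000)
n <- GenomicFiles::reduceByYield(
    fl,
    YIELD = function(x) readVcf(x, "hg19"),
    MAP = nrow,
    REDUCE = `+`
)
```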


Thank you very much for your reply.

How is it decided that "the data are still too large"? I have more than 200 GB of memory, and the VCF file I am trying to load is less than 16 GB.

— Haiying.Kong

By 'still too large' I meant that you receive the same error about memory allocation.

When I look at the code, it seems the error message could be moderately misleading. The error occurs if, for one of the components of the VCF file (e.g., the GT field), the product of the dimensions of the resulting matrix is larger than the maximum integer size (about 2.14 billion). With 1000 samples and a field taking on three values per sample, the maximum number of variants would be about 715,000 (2.14 billion / 1000 / 3). Making the code work with much larger data sets isn't a priority for me -- process it in chunks to manage memory, allow other processes and users to access the computer's resources, and facilitate parallel evaluation.
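The back-of-the-envelope limit can be checked directly in R (the 1000 samples and 3 values per sample are the illustrative figures from above):

```r
## Largest representable integer index in R
.Machine$integer.max                 # 2147483647

## Maximum variants before the GT matrix overflows that limit,
## assuming 1000 samples and 3 values per sample
.Machine$integer.max / (1000 * 3)    # ~715828
```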

There are a number of additional possibilities. Remember that a VCF file is a text file, but one operates in R on different data types, so the character value '1' in a VCF file is a single byte but is represented by a double in R, taking up 8 bytes. Depending on memory allocation patterns, memory available to the operating system may become fragmented, so that while far more bytes than required are available in total, the contiguous blocks are all smaller than the amount required. Etc.
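The expansion from text to numeric storage is easy to see; for example, a million parsed values occupy 8 bytes each regardless of how few characters they took in the file:

```r
## One million values, roughly 2 MB as single digits plus separators
## in the text file, but 8 bytes each once parsed into a numeric vector
object.size(numeric(1e6))   # about 8 MB
```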

— Martin Morgan

Thank you very much for the explanation.

— Haiying.Kong
Powered by Biostar version 16.09