Memory usage in readVcf from VariantAnnotation
@kasper-daniel-hansen-2979

This post is just a comment / observation.

As part of a research question, we are interested in looking at genomic variants in 16 samples from 1000 Genomes, selected to span different populations, so we expect the number of variants to be substantially larger than if we had 16 Europeans. These samples are present in different 1000 Genomes releases, so we extracted and joined the data to create a single 1.7G gzip-compressed VCF file with data only on these 16 individuals.

In an empty R session, where I only load VariantAnnotation and read in this VCF file, a call to gc() immediately after reading the file gives me this:

             used    (Mb)  gc trigger    (Mb)    max used    (Mb)
Ncells  166178096  8874.9  1006166790 53735.2  1572135611 83961.1
Vcells 6129253908 46762.5 11233293048 85703.3 10698202890 81620.9

In other words, the resulting object must be around 55G in size and took roughly 160G to create. Reading it also took a long time (several hours, I think, but I could be wrong). The Rda created by saving this object (which I did after running gc()) is 4G. Given the size of the object, using ~3x more memory to create it does not seem surprising. Since this is public data, I am happy to post the vcf.gz file somewhere.
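For anyone checking the arithmetic, here is a minimal sketch of how those figures follow from the gc() output above. It assumes that the "used" columns approximate the size of the object (the session was otherwise empty) and that the "max used" columns approximate the peak while reading; the variable names are only illustrative.

```r
# Mb values copied from the gc() output above (Ncells row and Vcells row).
used_mb     <- c(Ncells = 8874.9,  Vcells = 46762.5)   # after gc(), i.e. the live object
max_used_mb <- c(Ncells = 83961.1, Vcells = 81620.9)   # high-water mark during the read

# Convert to gigabytes (1024 Mb per Gb).
used_gb <- sum(used_mb) / 1024       # about 54, the "around 55G" object
peak_gb <- sum(max_used_mb) / 1024   # about 162, the "roughly 160G" peak

# Ratio of peak memory to final object size, the "~3x" mentioned above.
peak_ratio <- sum(max_used_mb) / sum(used_mb)

print(round(c(used_gb = used_gb, peak_gb = peak_gb, peak_ratio = peak_ratio), 2))
```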

I am not too familiar with VCF files (that is an understatement), but this memory usage (the size of the final object) just seems a bit insane to me. I was kind of surprised by this. It is entirely possible that the vcf.gz file we use is structured in a weird way (if that is possible); all we did was extract and join files using bcftools, and we have not done any qc on the result.

Best,
Kasper

VariantAnnotation • 2.0k views

VCF files are complicated beasts; can you augment your post with the command you used to read in the data, in particular noting the fixed, info, and geno arguments to ScanVcfParam(), in relation to the parts of the file relevant to your research question?
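As a rough illustration of what restricting those arguments could look like, the sketch below keeps only the ALT column and the GT genotype field and drops all INFO fields. Which fields matter depends on the research question, so "ALT" and "GT" are assumptions here, and the function name read_genotypes_only is hypothetical.

```r
# Hypothetical helper: read a VCF while parsing only the parts that are needed.
# Assumes the VariantAnnotation package is installed; the field choices
# (ALT, GT, no INFO) are illustrative, not a recommendation for this data set.
read_genotypes_only <- function(path, genome = "hg19") {
  param <- VariantAnnotation::ScanVcfParam(
    fixed = "ALT",  # fixed fields to keep (ALT, QUAL, FILTER are the options)
    info  = NA,     # NA means no INFO fields are returned
    geno  = "GT"    # keep only the genotype calls from the per-sample fields
  )
  VariantAnnotation::readVcf(path, genome = genome, param = param)
}

# Usage (not run here): vcf <- read_genotypes_only(PATH_TO_FILE)
```

Fields that are never parsed are never allocated, so this is usually the first thing to try before concluding that the object itself is unavoidably large.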

I just did readVcf(PATH_TO_FILE, genome = "hg19") ... I think (of course I did not save this ... doh).

Kasper

On Mon, Dec 8, 2014 at 10:20 PM, Martin Morgan wrote:
> VCF files are complicated beasts; can you augment your post with the command you used to read in the data, in particular noting the fixed, info, and geno arguments to ScanVcfParam(), in relation to the parts of the file relevant to your research question?

This post describes a similar problem with memory usage:

Error in reading 1000 genomes data


You could try the SeqArray package.  The first step is to convert the VCF into a GDS file, which does take a while, but then it is much faster to access the subsets of the data you want, or apply a function over all genotypes, since parsing the text format has been done ahead of time.  There is also an asVCF method to return VCF-class objects, which you can do on a subset of the data.
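A minimal sketch of that workflow, assuming the SeqArray package is installed. The helper name vcf_to_gds_subset and its arguments are hypothetical, and the conversion back to a VCF object uses seqAsVCF(), which to my knowledge is the name under which current SeqArray releases export the asVCF method mentioned above; check the documentation of your installed version.

```r
# Hypothetical helper: convert a VCF to GDS once, then pull a subset back out
# as a VCF-class object. Assumes SeqArray (and VariantAnnotation, which the
# conversion back to VCF relies on) are installed.
vcf_to_gds_subset <- function(vcf_path, gds_path, variant_index) {
  # One-off conversion; this is the slow step that parses the text format.
  SeqArray::seqVCF2GDS(vcf_path, gds_path)

  gds <- SeqArray::seqOpen(gds_path)
  on.exit(SeqArray::seqClose(gds))

  # Restrict subsequent reads to the selected variants (index or logical vector).
  SeqArray::seqSetFilter(gds, variant.sel = variant_index)

  # Only the filtered subset is materialised in memory.
  SeqArray::seqAsVCF(gds)
}

# Usage (not run here):
# vcf_subset <- vcf_to_gds_subset("samples.vcf.gz", "samples.gds", 1:10000)
```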
