This post is just a comment / observation.
As part of a research question, we are interested in looking at genomic variants in 16 samples from the 1000 Genomes Project, selected to span different populations, so we expect the number of variants to be substantially larger than if we had 16 Europeans. These samples are present in different 1000 Genomes releases, so we extracted and joined the data to create a single 1.7 GB gzip-compressed VCF file containing data on only these 16 individuals.
In an empty R session, where I only load VariantAnnotation and read in this VCF file, a call to gc() immediately after reading gives me this:
              used    (Mb)  gc trigger    (Mb)    max used    (Mb)
Ncells   166178096  8874.9  1006166790 53735.2  1572135611 83961.1
Vcells  6129253908 46762.5 11233293048 85703.3 10698202890 81620.9
In other words, the resulting object must be around 55 GB in size, took roughly 160 GB of memory to create, and took a long time (I think several hours, but I could be wrong). The .Rda created by save()'ing this object (after running gc()) is 4 GB. Given the size of the object, using ~3x more memory to create it does not seem surprising. Since this is public data, I am happy to post the vcf.gz file somewhere.
I am not too familiar with VCF files (that is an understatement), but this memory usage (the size of the final object) surprised me. It is entirely possible that our vcf.gz file is structured in a weird way (if that is possible); all we did was extract and join files using bcftools, and we have not done any QC on the result.
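For concreteness, the extract-and-join step was along these lines (a hypothetical reconstruction, not the exact commands we ran; file names and the sample list are placeholders):

```shell
# Keep only the 16 individuals of interest from each release's VCF
# (samples.txt lists one sample ID per line).
bcftools view -S samples.txt -Oz -o release1.subset.vcf.gz release1.vcf.gz
bcftools view -S samples.txt -Oz -o release2.subset.vcf.gz release2.vcf.gz

# Index the subsets, then merge them into a single compressed VCF.
bcftools index release1.subset.vcf.gz
bcftools index release2.subset.vcf.gz
bcftools merge -Oz -o combined.vcf.gz \
    release1.subset.vcf.gz release2.subset.vcf.gz
```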
Best,
Kasper
VCF files are complicated beasts. Can you augment your post with the command you used to read in the data, in particular the fixed, info, and geno arguments to ScanVcfParam(), in relation to the parts of the file relevant to your research question?
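For reference, restricting the read to just the fields you need can shrink the object dramatically. A sketch (the file name and genome build are assumptions, and this supposes only genotypes matter for the analysis):

```r
library(VariantAnnotation)

## Read only the ALT fixed field and the GT genotype field;
## NA for 'info' excludes all INFO fields.
param <- ScanVcfParam(fixed = "ALT", info = NA, geno = "GT")
vcf <- readVcf("combined.vcf.gz", genome = "hg19", param = param)
```

Dropping unused INFO and FORMAT fields is usually where most of the savings come from, since each field is expanded into its own in-memory structure across all variants and samples.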
This post describes a similar problem with memory usage:
Error in reading 1000 genomes data
You could try the SeqArray package. The first step is to convert the VCF into a GDS file, which takes a while, but after that it is much faster to access the subsets of the data you want, or apply a function over all genotypes, since the text format has already been parsed. There is also an asVCF method to return VCF-class objects, which you can call on a subset of the data.
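The workflow looks roughly like this (a sketch; file names and sample IDs are placeholders, and the exact name of the VCF-class conversion function may differ by SeqArray version):

```r
library(SeqArray)

## One-time conversion from VCF to GDS; slow, but only done once.
seqVCF2GDS("combined.vcf.gz", "combined.gds")

## Open the GDS file and restrict to a subset of interest.
gds <- seqOpen("combined.gds")
seqSetFilter(gds, sample.id = c("HG00096", "NA19238"))  # hypothetical IDs

## Pull genotypes for the filtered subset only.
geno <- seqGetData(gds, "genotype")

seqClose(gds)
```

Because the GDS file stores genotypes in a compressed binary array, only the filtered slice is decompressed into memory, rather than the whole file.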