I am trying to use the Bioconductor package "VariantAnnotation" to read in data from UK Biobank, which has around half a million samples. It gives me an error message even when I try to read only 50 variants:
"Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
path: chr22.vcf.50.header.gz
index: chr22.vcf.50.header.gz.tbi
path: chr22.vcf.50.header.gz"
param <- ScanVcfParam(geno="GT")
vcf_rng <- readVcf("chr22.vcf.50.header.gz", "hg19", param=param)
If I specify a subset of samples to extract in the ScanVcfParam() step, then it works fine and I can get the variants.
param <- ScanVcfParam(geno="GT",samples = samples(scanVcfHeader("chr22.vcf.50.header.gz"))[1:65000])
vcf_rng <- readVcf("chr22.vcf.50.header.gz", "hg19", param=param)
I can specify up to 65,000 samples this way; beyond that, extracting even 50 variants fails.
Any suggestions for reading the UK Biobank dataset into R? Any help would be appreciated. Thanks in advance!
You don't say how much RAM you have, or which operating system you are on. If you are constrained (by having little RAM or by being on Windows), then you probably need to add RAM or move to a more capable machine. In addition, you need to include the output from
sessionInfo()
Thank you for the response! I am afraid memory is not the problem. We ran the package on a server with 256GB RAM. We only tried to read 100 markers from 100,000 individuals. Do you have other suggestions?
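Since reads restricted to 65,000 samples or fewer are reported to succeed, one thing that might be worth trying is to read the genotypes in sample chunks and bind the pieces together. The following is only a minimal sketch of that idea and has not been tested on UK Biobank data. The helper names chunk_samples and read_gt_in_chunks and the default chunk size of 50,000 are made up for illustration; the VariantAnnotation calls are the same ones used in the question, plus geno() to pull out the GT matrix.

```r
# Hypothetical helper: split sample IDs into consecutive chunks of at most
# `chunk_size` IDs each, preserving the original order.
chunk_samples <- function(ids, chunk_size = 50000L) {
  stopifnot(chunk_size >= 1L)
  if (length(ids) == 0L) return(list())
  groups <- ceiling(seq_along(ids) / chunk_size)
  unname(split(ids, groups))
}

# Hypothetical helper: read the GT field one sample chunk at a time and
# column-bind the results into a single variants x samples matrix.
# Assumes VariantAnnotation is installed and that each chunk is small
# enough to avoid the error (the default stays below the 65,000 reported
# to work in the question).
read_gt_in_chunks <- function(path, genome = "hg19", chunk_size = 50000L) {
  header  <- VariantAnnotation::scanVcfHeader(path)
  all_ids <- VariantAnnotation::samples(header)
  gt_list <- lapply(chunk_samples(all_ids, chunk_size), function(ids) {
    param <- VariantAnnotation::ScanVcfParam(geno = "GT", samples = ids)
    vcf   <- VariantAnnotation::readVcf(path, genome, param = param)
    VariantAnnotation::geno(vcf)$GT
  })
  do.call(cbind, gt_list)
}
```

With the file from the question, the call would be gt <- read_gt_in_chunks("chr22.vcf.50.header.gz"). This returns only the genotype matrix rather than a full VCF object, and each chunk re-reads the same variants, so it trades extra I/O for smaller individual reads.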