I am trying to use the Bioconductor package "VariantAnnotation" to read data from UK Biobank, which has around half a million samples. I get an error message even when I try to read only 50 variants:
"Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory? path: chr22.vcf.50.header.gz index: chr22.vcf.50.header.gz.tbi path: chr22.vcf.50.header.gz"
library(VariantAnnotation)
param <- ScanVcfParam(geno="GT")
vcf_rng <- readVcf("chr22.vcf.50.header.gz", "hg19", param=param)
If I specify a few samples to extract in the ScanVcfParam() step, then it works fine and I can get the variants.
param <- ScanVcfParam(geno="GT", samples = samples(scanVcfHeader("chr22.vcf.50.header.gz"))[1:65000])
vcf_rng <- readVcf("chr22.vcf.50.header.gz", "hg19", param=param)
I can specify up to 65,000 samples this way; beyond that, reading seems to fail even when extracting only about 50 variants.
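For reference, here is a minimal sketch of the batching workaround this behaviour suggests. It assumes that reading at most 65,000 samples per call keeps working for every batch, not just the first, and that the per-batch genotype matrices can simply be column-bound. The helper names `split_into_batches` and `read_gt_in_batches` are hypothetical and not part of VariantAnnotation; only `scanVcfHeader()`, `samples()`, `ScanVcfParam()`, `readVcf()` and `geno()` are used from the package.

```r
# Assumes library(VariantAnnotation) is already attached, as in the code above.

# Hypothetical helper: split a vector of sample IDs into consecutive batches
# of at most `batch_size` elements.
split_into_batches <- function(ids, batch_size) {
  unname(split(ids, ceiling(seq_along(ids) / batch_size)))
}

# Hypothetical helper: read the GT field one sample batch at a time and
# combine the results into a single variants x samples matrix.
read_gt_in_batches <- function(vcf_file, genome, batch_size = 65000) {
  all_samples <- samples(scanVcfHeader(vcf_file))
  batches <- split_into_batches(all_samples, batch_size)

  gt_list <- lapply(batches, function(ids) {
    # Same call as above, restricted to one batch of samples.
    param <- ScanVcfParam(geno = "GT", samples = ids)
    vcf <- readVcf(vcf_file, genome, param = param)
    geno(vcf)$GT
  })

  # Each batch covers the same variants, so the matrices share their rows.
  do.call(cbind, gt_list)
}
```

It would be called as `gt <- read_gt_in_batches("chr22.vcf.50.header.gz", "hg19")`. I have not verified that this avoids the error on the full file, and it only recovers the genotype matrix, not a single VCF object.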
Any suggestions for reading the UK Biobank dataset into R? Any help would be appreciated. Thanks in advance!