Question: How to read variants into R from large dataset using VariantAnnotation
0
9 weeks ago by
PhiPhi0
PhiPhi0 wrote:

I am trying to use the Bioconductor library "VariantAnnotation" to read in data from UK BioBank which has around half a million samples. It gives me error message even when I want to read only 50 variants:

"Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?

param <- ScanVcfParam(geno="GT")


If I specify a few samples to extract in the ScanVcfParam() step, then it works fine and I can get the variants.

param <- ScanVcfParam(geno="GT",samples = samples(scanVcfHeader("chr22.vcf.50.header.gz"))[1:65000])


The number of samples I can specify can be up to 65,000, beyond that seems to be problematic for extracting say 50 variants.

Any suggestions to read BioBank dataset into R? Any help would be appreciated. Thanks in advance!!

software error • 189 views
modified 9 weeks ago by Martin Morgan ♦♦ 23k • written 9 weeks ago by PhiPhi0

You don't say how much RAM you have, or your operating system. If you are constrained (by having not much RAM or being on Windows), then you probably need to add RAM or get a better computer. In addition you need to include the output from sessionInfo

Thank you for the response! I am afraid memory is not a problem. We ran the package in a server with 256GB RAM. We only tried to read 100 markers from 100,000 individuals. Do you have other suggestions?

sessionInfo(package="VariantAnnotation")
R version 3.5.1 (2018-07-02)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.10 (Santiago)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
character(0)

other attached packages:
[1] VariantAnnotation_1.28.8

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0                  compiler_3.5.1
[3] GenomeInfoDb_1.18.1         XVector_0.22.0
[5] prettyunits_1.0.2           GenomicFeatures_1.34.1
[7] methods_3.5.1               bitops_1.0-6
[9] utils_3.5.1                 tools_3.5.1
[11] grDevices_3.5.1             zlibbioc_1.28.0
[13] progress_1.2.0              biomaRt_2.38.0
[15] digest_0.6.18               bit_1.1-14
[17] BSgenome_1.50.0             RSQLite_2.1.1
[19] memoise_1.1.0               lattice_0.20-35
[21] pkgconfig_2.0.2             rlang_0.3.1
[23] Matrix_1.2-15               DelayedArray_0.8.0
[25] DBI_1.0.0                   parallel_3.5.1
[27] GenomeInfoDbData_1.2.0      rtracklayer_1.42.1
[29] httr_1.4.0                  stringr_1.3.1
[31] Biostrings_2.50.2           S4Vectors_0.20.1
[33] graphics_3.5.1              datasets_3.5.1
[35] stats_3.5.1                 IRanges_2.16.0
[37] hms_0.4.2                   stats4_3.5.1
[39] bit64_0.9-7                 grid_3.5.1
[41] base_3.5.1                  Biobase_2.42.0
[43] R6_2.3.0                    AnnotationDbi_1.44.0
[45] XML_3.98-1.16               BiocParallel_1.16.5
[47] magrittr_1.5                blob_1.1.1
[49] GenomicAlignments_1.18.1    Rsamtools_1.34.0
[51] matrixStats_0.54.0          BiocGenerics_0.28.0
[53] GenomicRanges_1.34.0        assertthat_0.2.0
[55] SummarizedExperiment_1.12.0 stringi_1.2.4
[57] RCurl_1.95-4.11             crayon_1.3.4

Answer: How to read variants into R from large dataset using VariantAnnotation
1
9 weeks ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

Two suggestions.

Use the VariantAnnotation::readGT() function for a more efficient input.

Use as the path to the VCF file a VcfFile() with yieldSize= some number of variant to read in a chunk; process the file in an iterative fashion, maybe using GenomicFiles::reduceByYield(), which implements the basic idiom

open(vcf)
repeat {
if (nrow(chunk) == 0)
break
## process the chunk
}


Thank you for your suggestions! I try VariantAnnotation::readGT() function which works if I do not specify the variant range and can read in all half million population correctly:

vcf.ref.file <- "chr22.vcf.50.header.gz"


However, if I specify the variant range, it would still give me memory allocation error message.

vcf.ref.file <- "chr22.vcf.50.header.gz"
rng <- GRanges(seqnames="22", ranges=IRanges(start=16050075,end=16051874))
param <- ScanVcfParam(which=rng,geno="GT")


"Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory? path: chr22.vcf.50.header.gz index: chr22.vcf.50.header.gz.tbi path: chr22.vcf.50.header.gz" 

It would only work if I specify the variant range together with a sample range:

vcf.ref.file <- "chr22.vcf.50.header.gz"
rng <- GRanges(seqnames="22", ranges=IRanges(start=16050075,end=16051874))


It is strange that it can read in all variants from all samples but cannot read in a subset of variants from all samples. Since this toy data file "chr22.vcf.50.header.gz" only contains 50 variants, it is fine to read all variants. It would be nonrealistic if I try to feed in a data file with whole genome without specifying the variant range.

BTW, the VcfFile() with yieldSize= would only work if no variant range is specified. So it is not applicable here.

Do you have any other suggestions? Thank you in advance!

I will try to look at the code; as a second work-around I wonder if you specify all samples it will also work?

rng <- GRanges(seqnames="22", ranges=IRanges(start=16050075,end=16051874))


Is this file or a mock of it available publicly?

Thank you for your timely reply! No, it won't work if I specify all samples. This file is from UK-Biobank.

Thanks for your patience; VariantAnnotation had expected typical queries not to have as many samples. Please try version 1.28.11, which is available now via

BiocManager::install("Bioconductor/VariantAnnotation", ref="RELEASE_3_8")


or in 48 hours or so via

BiocManager::install("VariantAnnotation")