How to read variants into R from large dataset using VariantAnnotation
1
0
Entering edit mode
PhiPhi • 0
@phiphi-19798
Last seen 5.2 years ago

I am trying to use the Bioconductor library "VariantAnnotation" to read in data from UK BioBank which has around half a million samples. It gives me error message even when I want to read only 50 variants:

"Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
 path: chr22.vcf.50.header.gz
 index: chr22.vcf.50.header.gz.tbi
  path: chr22.vcf.50.header.gz"
param <- ScanVcfParam(geno="GT")
vcf_rng <- readVcf("chr22.vcf.50.header.gz", "hg19", param=param)

If I specify a few samples to extract in the ScanVcfParam() step, then it works fine and I can get the variants.

param <- ScanVcfParam(geno="GT",samples = samples(scanVcfHeader("chr22.vcf.50.header.gz"))[1:65000])
vcf_rng <- readVcf("chr22.vcf.50.header.gz", "hg19", param=param)

The number of samples I can specify can be up to 65,000, beyond that seems to be problematic for extracting say 50 variants.

Any suggestions to read BioBank dataset into R? Any help would be appreciated. Thanks in advance!!

software error • 1.6k views
ADD COMMENT
0
Entering edit mode

You don't say how much RAM you have, or your operating system. If you are constrained (by having not much RAM or being on Windows), then you probably need to add RAM or get a better computer. In addition you need to include the output from sessionInfo

ADD REPLY
0
Entering edit mode

Thank you for the response! I am afraid memory is not a problem. We ran the package in a server with 256GB RAM. We only tried to read 100 markers from 100,000 individuals. Do you have other suggestions?

sessionInfo(package="VariantAnnotation")
R version 3.5.1 (2018-07-02)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.10 (Santiago)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
character(0)

other attached packages:
[1] VariantAnnotation_1.28.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0                  compiler_3.5.1             
 [3] GenomeInfoDb_1.18.1         XVector_0.22.0             
 [5] prettyunits_1.0.2           GenomicFeatures_1.34.1     
 [7] methods_3.5.1               bitops_1.0-6               
 [9] utils_3.5.1                 tools_3.5.1                
[11] grDevices_3.5.1             zlibbioc_1.28.0            
[13] progress_1.2.0              biomaRt_2.38.0             
[15] digest_0.6.18               bit_1.1-14                 
[17] BSgenome_1.50.0             RSQLite_2.1.1              
[19] memoise_1.1.0               lattice_0.20-35            
[21] pkgconfig_2.0.2             rlang_0.3.1                
[23] Matrix_1.2-15               DelayedArray_0.8.0         
[25] DBI_1.0.0                   parallel_3.5.1             
[27] GenomeInfoDbData_1.2.0      rtracklayer_1.42.1         
[29] httr_1.4.0                  stringr_1.3.1              
[31] Biostrings_2.50.2           S4Vectors_0.20.1           
[33] graphics_3.5.1              datasets_3.5.1             
[35] stats_3.5.1                 IRanges_2.16.0             
[37] hms_0.4.2                   stats4_3.5.1               
[39] bit64_0.9-7                 grid_3.5.1                 
[41] base_3.5.1                  Biobase_2.42.0             
[43] R6_2.3.0                    AnnotationDbi_1.44.0       
[45] XML_3.98-1.16               BiocParallel_1.16.5        
[47] magrittr_1.5                blob_1.1.1                 
[49] GenomicAlignments_1.18.1    Rsamtools_1.34.0           
[51] matrixStats_0.54.0          BiocGenerics_0.28.0        
[53] GenomicRanges_1.34.0        assertthat_0.2.0           
[55] SummarizedExperiment_1.12.0 stringi_1.2.4              
[57] RCurl_1.95-4.11             crayon_1.3.4
ADD REPLY
1
Entering edit mode
@martin-morgan-1513
Last seen 5 days ago
United States

Two suggestions.

Use the VariantAnnotation::readGT() function for a more efficient input.

Use as the path to the VCF file a VcfFile() with yieldSize= some number of variant to read in a chunk; process the file in an iterative fashion, maybe using GenomicFiles::reduceByYield(), which implements the basic idiom

open(vcf)
repeat {
    chunk <- readGT(vcf, ...)
    if (nrow(chunk) == 0)
        break
    ## process the chunk
}
ADD COMMENT
0
Entering edit mode

Thank you for your suggestions! I try VariantAnnotation::readGT() function which works if I do not specify the variant range and can read in all half million population correctly:

vcf.ref.file <- "chr22.vcf.50.header.gz"
vcf_rng<-readGT(vcf.ref.file, nucleotides=TRUE, row.names=TRUE)

However, if I specify the variant range, it would still give me memory allocation error message.

vcf.ref.file <- "chr22.vcf.50.header.gz"
rng <- GRanges(seqnames="22", ranges=IRanges(start=16050075,end=16051874))
param <- ScanVcfParam(which=rng,geno="GT")
vcf_rng<-readGT(vcf.ref.file, nucleotides=TRUE, row.names=TRUE, param=param)

"Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory? path: chr22.vcf.50.header.gz index: chr22.vcf.50.header.gz.tbi path: chr22.vcf.50.header.gz"

It would only work if I specify the variant range together with a sample range:

vcf.ref.file <- "chr22.vcf.50.header.gz"
rng <- GRanges(seqnames="22", ranges=IRanges(start=16050075,end=16051874)) 
param <- ScanVcfParam(which=rng,geno="GT",samples = samples(scanVcfHeader(vcf.ref.file))[1:65000])
vcf_rng<-readGT(vcf.ref.file, nucleotides=TRUE, row.names=TRUE, param=param)

It is strange that it can read in all variants from all samples but cannot read in a subset of variants from all samples. Since this toy data file "chr22.vcf.50.header.gz" only contains 50 variants, it is fine to read all variants. It would be nonrealistic if I try to feed in a data file with whole genome without specifying the variant range.

BTW, the VcfFile() with yieldSize= would only work if no variant range is specified. So it is not applicable here.

Do you have any other suggestions? Thank you in advance!

ADD REPLY
0
Entering edit mode

I will try to look at the code; as a second work-around I wonder if you specify all samples it will also work?

rng <- GRanges(seqnames="22", ranges=IRanges(start=16050075,end=16051874)) 
param <- ScanVcfParam(which=rng,geno="GT",samples = samples(scanVcfHeader(vcf.ref.file)))

Is this file or a mock of it available publicly?

ADD REPLY
0
Entering edit mode

Thank you for your timely reply! No, it won't work if I specify all samples. This file is from UK-Biobank.

ADD REPLY
0
Entering edit mode

Thanks for your patience; VariantAnnotation had expected typical queries not to have as many samples. Please try version 1.28.11, which is available now via

BiocManager::install("Bioconductor/VariantAnnotation", ref="RELEASE_3_8")

or in 48 hours or so via

BiocManager::install("VariantAnnotation")
ADD REPLY
0
Entering edit mode

Version 1.28.11 works much better! Thanks a lot!!

ADD REPLY

Login before adding your answer.

Traffic: 996 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6