Error in reading 1000 genomes data
4
2
Entering edit mode
@wade-copeland-7044
Last seen 7.5 years ago
United States

To whom it may concern,

 

I am a statistician at the Fred Hutchinson Cancer Research Center and am trying to use the Bioconductor library "VariantAnnotation" to read in the data from the 1000 genomes project (http://www.1000genomes.org/).  I am getting an error I am not able to decipher and would appreciate any help in solving the problem.  In case it is relevant, I am using a shared server on our campus with almost 400gb total memory.

 

Below are the commands I am using:

library(VariantAnnotation)

chr21 <- readVcf(file = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz", genome = "hg19")

 

Below is the error output:

Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
 path: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
 index: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz.tbi
  path: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz

 

Below is the session info:

R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] VariantAnnotation_1.10.5 Rsamtools_1.16.1         Biostrings_2.32.1       
[4] XVector_0.4.0            GenomicRanges_1.16.4     GenomeInfoDb_1.0.2      
[7] IRanges_1.22.10          BiocGenerics_0.10.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.26.1    base64enc_0.1-2         BatchJobs_1.4          
 [4] BBmisc_1.7              Biobase_2.24.0          BiocParallel_0.6.1     
 [7] biomaRt_2.20.0          bitops_1.0-6            brew_1.0-6             
[10] BSgenome_1.32.0         checkmate_1.5.0         codetools_0.2-9        
[13] DBI_0.3.1               digest_0.6.4            fail_1.2               
[16] foreach_1.4.2           GenomicAlignments_1.0.6 GenomicFeatures_1.16.3
[19] iterators_1.0.7         RCurl_1.95-4.3          RSQLite_0.11.4         
[22] rtracklayer_1.24.2      sendmailR_1.2-1         stats4_3.1.1           
[25] stringr_0.6.2           tools_3.1.1             XML_3.98-1.1           
[28] zlibbioc_1.10.0 

 

Thank you for your time and consideration.

 

Wade

variantannotation • 3.5k views
ADD COMMENT
2
Entering edit mode
@martin-morgan-1513
Last seen 3 months ago
United States

You might try readGeno() to read just a single GENO field. Or use VcfFile(fl, yieldSize=10000) as well as the ScanBamParam() you're using. The code would look something like

vcffile = open(VcfFile(fl, yieldSize=10000))
repeat {
    vcf = readVcf(vcffile, "hg19", param=param)
    if (nrow(vcf) == 0)
        break
    ## do work on chunk
}

This iteration approach also works with readGeno().

ADD COMMENT
1
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States

Hello,

There are a couple of approaches to working with large VCF files. One method is to read in data subsets and the other is to use a 'yieldSize'. Examples of how to yield through a file are at the bottom of the ?readVcf man page (section titled 'Iterate through VCF with yieldSize').

The more common situation is that only a portion of the data is really needed for an analysis. In this case you can use the ScanVcfParam to read in subsets by range, variable or sample. See ?ScanVcfParam for full details.

fl <- "ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz"

For selecting subsets it's useful to look at the header to get variable and sample names. See ?VCFHeader.

hdr <- scanVcfHeader(fl)
hdr
class: VCFHeader 
samples(2504): HG00096 HG00097 ... NA21143 NA21144
meta(2): META contig
fixed(2): FILTER ALT
info(24): CIEND CIPOS ... DP AA
geno(1): GT

Read in a couple of INFO fields but no genotype data.

param <- ScanVcfParam(info = c("DP", "AA"), geno=NA)
vcf <- readVcf(fl, "b37", param=param)
dim(vcf)
[1] 1105538       0
head(info(vcf))
DataFrame with 6 rows and 2 columns
                      DP          AA
               <integer> <character>
21:9411239_G/A     20675           .
rs181691356        22649           .
21:9411264_A/C     28638           .
21:9411267_G/T     29348           .
21:9411302_G/T     30980           .
21:9411313_G/A     30734           .

Calling range() on the GRanges in rowData() gives you an idea of the total range of positions. You may already have a region of interest in mind but if not, this is informative.

range(rowData(vcf))
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]       21 [9411239, 48119740]      *
  -------
  seqinfo: 86 sequences from hg19 genome

Read in all variables in a given range.

param <- ScanVcfParam(which = GRanges("21", IRanges(9e7, 9.5e7)))
vcf_range <- readVcf(fl, "b37", param=param)
dim(vcf_range)
[1] 19605  2504
names(info(vcf_range))
 [1] "CIEND"     "CIPOS"     "CS"        "END"       "IMPRECISE" "MC"       
 [7] "MEINFO"    "MEND"      "MLEN"      "MSTART"    "SVLEN"     "SVTYPE"   
[13] "TSD"       "AC"        "AF"        "NS"        "AN"        "EAS_AF"   
[19] "EUR_AF"    "AFR_AF"    "AMR_AF"    "SAS_AF"    "DP"        "AA"       
geno(vcf_range)
List of length 1
names(1): GT

Other functions read in single variables as a matrix, such as readGT() or readInfo(). These avoid the overhead of parsing all data into a VCF object and are useful when you want only a single variable. See ?readGT for details.

Valerie

ADD COMMENT
0
Entering edit mode
@lescai-francesco-5078
Last seen 6.2 years ago
Denmark

Hi Valerie,

I seem to have the same problem (it's a very large VCF file, with 4,500 samples), but I am indeed subsetting the file.
I compressed it and tabix'ed it. and I'm requesting
fl <- "exome.recalibrated.vcf.gz"
hdr <- scanVcfHeader(fl)
vcf <- readVcf(fl, genome="hg38", param=ScanVcfParam(info="DP", geno=c("GT","DP")))

However, I'm getting the same error.
Any suggestions?

thanks,
Francesco
 

Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
 path: exome.recalibrated.vcf.gz
 index: exome.recalibrated.vcf.gz.tbi
  path: exome.recalibrated.vcf.gz

 

R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] VariantAnnotation_1.10.5 Rsamtools_1.16.1         Biostrings_2.32.1       
[4] XVector_0.4.0            GenomicRanges_1.16.4     GenomeInfoDb_1.0.2      
[7] IRanges_1.22.10          BiocGenerics_0.10.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.26.1    base64enc_0.1-2         BatchJobs_1.6          
 [4] BBmisc_1.9              Biobase_2.24.0          BiocParallel_0.6.1     
 [7] biomaRt_2.20.0          bitops_1.0-6            brew_1.0-6             
[10] BSgenome_1.32.0         checkmate_1.6.0         codetools_0.2-8        
[13] DBI_0.3.1               digest_0.6.8            fail_1.2               
[16] foreach_1.4.2           GenomicAlignments_1.0.6 GenomicFeatures_1.16.3
[19] iterators_1.0.7         RCurl_1.95-4.7          RSQLite_1.0.0          
[22] rtracklayer_1.24.2      sendmailR_1.2-1         stats4_3.1.0           
[25] stringr_0.6.2           tools_3.1.0             XML_3.98-1.3           
[28] zlibbioc_1.10.0        

 

 

 

 

ADD COMMENT
0
Entering edit mode
@lescai-francesco-5078
Last seen 6.2 years ago
Denmark

I also tried to use a more updated version, but the error is the same

 

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] VariantAnnotation_1.16.4   Rsamtools_1.22.0          
 [3] Biostrings_2.38.4          XVector_0.10.0            
 [5] SummarizedExperiment_1.0.2 Biobase_2.30.0            
 [7] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3        
 [9] IRanges_2.4.8              S4Vectors_0.8.11          
[11] BiocGenerics_0.16.1        BiocInstaller_1.20.1      

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.32.3    GenomicAlignments_1.6.3 zlibbioc_1.16.0        
 [4] BiocParallel_1.4.3      BSgenome_1.38.0         tools_3.2.2            
 [7] DBI_0.3.1               lambda.r_1.1.7          futile.logger_1.4.1    
[10] rtracklayer_1.30.2      futile.options_1.0.0    bitops_1.0-6           
[13] RCurl_1.95-4.8          biomaRt_2.26.1          RSQLite_1.0.0          
[16] GenomicFeatures_1.22.13 XML_3.98-1.4           

 

ADD COMMENT
0
Entering edit mode

Have you figured this out?  I was trying to read a single chromosome (22) in today (file is 205mb) and getting the same memory error, also on a large server. 

 

ADD REPLY
0
Entering edit mode

Take the strategies outlined above -- read only what you want (using ScanVcfParam), iterate (using yieldSize) through large files.

ADD REPLY

Login before adding your answer.

Traffic: 564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6