Search
Question: Error in reading 1000 genomes data
2
gravatar for Wade Copeland
3.9 years ago by
United States
Wade Copeland20 wrote:

To whom it may concern,

 

I am a statistician at the Fred Hutchinson Cancer Research Center and am trying to use the Bioconductor library "VariantAnnotation" to read in the data from the 1000 genomes project (http://www.1000genomes.org/).  I am getting an error I am not able to decipher and would appreciate any help in solving the problem.  In case it is relevant, I am using a shared server on our campus with almost 400gb total memory.

 

Below are the commands I am using:

library(VariantAnnotation)

chr21 <- readVcf(file = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz", genome = "hg19")

 

Below is the error output:

Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
 path: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
 index: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz.tbi
  path: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz

 

Below is the session info:

R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] VariantAnnotation_1.10.5 Rsamtools_1.16.1         Biostrings_2.32.1       
[4] XVector_0.4.0            GenomicRanges_1.16.4     GenomeInfoDb_1.0.2      
[7] IRanges_1.22.10          BiocGenerics_0.10.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.26.1    base64enc_0.1-2         BatchJobs_1.4          
 [4] BBmisc_1.7              Biobase_2.24.0          BiocParallel_0.6.1     
 [7] biomaRt_2.20.0          bitops_1.0-6            brew_1.0-6             
[10] BSgenome_1.32.0         checkmate_1.5.0         codetools_0.2-9        
[13] DBI_0.3.1               digest_0.6.4            fail_1.2               
[16] foreach_1.4.2           GenomicAlignments_1.0.6 GenomicFeatures_1.16.3
[19] iterators_1.0.7         RCurl_1.95-4.3          RSQLite_0.11.4         
[22] rtracklayer_1.24.2      sendmailR_1.2-1         stats4_3.1.1           
[25] stringr_0.6.2           tools_3.1.1             XML_3.98-1.1           
[28] zlibbioc_1.10.0 

 

Thank you for your time and consideration.

 

Wade

ADD COMMENTlink modified 2.6 years ago by Martin Morgan ♦♦ 22k • written 3.9 years ago by Wade Copeland20
2
gravatar for Martin Morgan
2.6 years ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

You might try readGeno() to read just a single GENO field. Or use VcfFile(fl, yieldSize=10000) as well as the ScanBamParam() you're using. The code would look something like

vcffile = open(VcfFile(fl, yieldSize=10000))
repeat {
    vcf = readVcf(vcffile, "hg19", param=param)
    if (nrow(vcf) == 0)
        break
    ## do work on chunk
}

This iteration approach also works with readGeno().

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Martin Morgan ♦♦ 22k
1
gravatar for Valerie Obenchain
3.9 years ago by
Valerie Obenchain ♦♦ 6.6k
United States
Valerie Obenchain ♦♦ 6.6k wrote:

Hello,

There are a couple of approaches to working with large VCF files. One method is to read in data subsets and the other is to use a 'yieldSize'. Examples of how to yield through a file are at the bottom of the ?readVcf man page (section titled 'Iterate through VCF with yieldSize').

The more common situation is that only a portion of the data is really needed for an analysis. In this case you can use the ScanVcfParam to read in subsets by range, variable or sample. See ?ScanVcfParam for full details.

fl <- "ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz"

For selecting subsets it's useful to look at the header to get variable and sample names. See ?VCFHeader.

hdr <- scanVcfHeader(fl)
hdr
class: VCFHeader 
samples(2504): HG00096 HG00097 ... NA21143 NA21144
meta(2): META contig
fixed(2): FILTER ALT
info(24): CIEND CIPOS ... DP AA
geno(1): GT

Read in a couple of INFO fields but no genotype data.

param <- ScanVcfParam(info = c("DP", "AA"), geno=NA)
vcf <- readVcf(fl, "b37", param=param)
dim(vcf)
[1] 1105538       0
head(info(vcf))
DataFrame with 6 rows and 2 columns
                      DP          AA
               <integer> <character>
21:9411239_G/A     20675           .
rs181691356        22649           .
21:9411264_A/C     28638           .
21:9411267_G/T     29348           .
21:9411302_G/T     30980           .
21:9411313_G/A     30734           .

Calling range() on the GRanges in rowData() gives you an idea of the total range of positions. You may already have a region of interest in mind but if not, this is informative.

range(rowData(vcf))
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]       21 [9411239, 48119740]      *
  -------
  seqinfo: 86 sequences from hg19 genome

Read in all variables in a given range.

param <- ScanVcfParam(which = GRanges("21", IRanges(9e7, 9.5e7)))
vcf_range <- readVcf(fl, "b37", param=param)
dim(vcf_range)
[1] 19605  2504
names(info(vcf_range))
 [1] "CIEND"     "CIPOS"     "CS"        "END"       "IMPRECISE" "MC"       
 [7] "MEINFO"    "MEND"      "MLEN"      "MSTART"    "SVLEN"     "SVTYPE"   
[13] "TSD"       "AC"        "AF"        "NS"        "AN"        "EAS_AF"   
[19] "EUR_AF"    "AFR_AF"    "AMR_AF"    "SAS_AF"    "DP"        "AA"       
geno(vcf_range)
List of length 1
names(1): GT

Other functions read in single variables as a matrix, such as readGT() or readInfo(). These avoid the overhead of parsing all data into a VCF object and are useful when you want only a single variable. See ?readGT for details.

Valerie

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Valerie Obenchain ♦♦ 6.6k
0
gravatar for Lescai, Francesco
2.6 years ago by
Denmark
Lescai, Francesco370 wrote:

Hi Valerie,

I seem to have the same problem (it's a very large VCF file, with 4,500 samples), but I am indeed subsetting the file.
I compressed it and tabix'ed it. and I'm requesting
fl <- "exome.recalibrated.vcf.gz"
hdr <- scanVcfHeader(fl)
vcf <- readVcf(fl, genome="hg38", param=ScanVcfParam(info="DP", geno=c("GT","DP")))

However, I'm getting the same error.
Any suggestions?

thanks,
Francesco
 

Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
 path: exome.recalibrated.vcf.gz
 index: exome.recalibrated.vcf.gz.tbi
  path: exome.recalibrated.vcf.gz

 

R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] VariantAnnotation_1.10.5 Rsamtools_1.16.1         Biostrings_2.32.1       
[4] XVector_0.4.0            GenomicRanges_1.16.4     GenomeInfoDb_1.0.2      
[7] IRanges_1.22.10          BiocGenerics_0.10.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.26.1    base64enc_0.1-2         BatchJobs_1.6          
 [4] BBmisc_1.9              Biobase_2.24.0          BiocParallel_0.6.1     
 [7] biomaRt_2.20.0          bitops_1.0-6            brew_1.0-6             
[10] BSgenome_1.32.0         checkmate_1.6.0         codetools_0.2-8        
[13] DBI_0.3.1               digest_0.6.8            fail_1.2               
[16] foreach_1.4.2           GenomicAlignments_1.0.6 GenomicFeatures_1.16.3
[19] iterators_1.0.7         RCurl_1.95-4.7          RSQLite_1.0.0          
[22] rtracklayer_1.24.2      sendmailR_1.2-1         stats4_3.1.0           
[25] stringr_0.6.2           tools_3.1.0             XML_3.98-1.3           
[28] zlibbioc_1.10.0        

 

 

 

 

ADD COMMENTlink written 2.6 years ago by Lescai, Francesco370
0
gravatar for Lescai, Francesco
2.6 years ago by
Denmark
Lescai, Francesco370 wrote:

I also tried to use a more updated version, but the error is the same

 

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] VariantAnnotation_1.16.4   Rsamtools_1.22.0          
 [3] Biostrings_2.38.4          XVector_0.10.0            
 [5] SummarizedExperiment_1.0.2 Biobase_2.30.0            
 [7] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3        
 [9] IRanges_2.4.8              S4Vectors_0.8.11          
[11] BiocGenerics_0.16.1        BiocInstaller_1.20.1      

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.32.3    GenomicAlignments_1.6.3 zlibbioc_1.16.0        
 [4] BiocParallel_1.4.3      BSgenome_1.38.0         tools_3.2.2            
 [7] DBI_0.3.1               lambda.r_1.1.7          futile.logger_1.4.1    
[10] rtracklayer_1.30.2      futile.options_1.0.0    bitops_1.0-6           
[13] RCurl_1.95-4.8          biomaRt_2.26.1          RSQLite_1.0.0          
[16] GenomicFeatures_1.22.13 XML_3.98-1.4           

 

ADD COMMENTlink written 2.6 years ago by Lescai, Francesco370

Have you figured this out?  I was trying to read a single chromosome (22) in today (file is 205mb) and getting the same memory error, also on a large server. 

 

ADD REPLYlink written 2.1 years ago by skillcoyne0

Take the strategies outlined above -- read only what you want (using ScanVcfParam), iterate (using yieldSize) through large files.

ADD REPLYlink written 2.1 years ago by Martin Morgan ♦♦ 22k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 201 users visited in the last hour