Question

Error in reading 1000 genomes data

2

Entering edit mode

Wade Copeland ▴ 20

@wade-copeland-7044

Last seen 8.7 years ago

United States

To whom it may concern,

I am a statistician at the Fred Hutchinson Cancer Research Center and am trying to use the Bioconductor library "VariantAnnotation" to read in the data from the 1000 genomes project (http://www.1000genomes.org/). I am getting an error I am not able to decipher and would appreciate any help in solving the problem. In case it is relevant, I am using a shared server on our campus with almost 400gb total memory.

Below are the commands I am using:

library(VariantAnnotation)

chr21 <- readVcf(file = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz", genome = "hg19")

Below is the error output:

Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
path: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
index: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz.tbi
path: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz

Below is the session info:

R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] VariantAnnotation_1.10.5 Rsamtools_1.16.1         Biostrings_2.32.1
[4] XVector_0.4.0            GenomicRanges_1.16.4     GenomeInfoDb_1.0.2
[7] IRanges_1.22.10          BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.1    base64enc_0.1-2         BatchJobs_1.4
[4] BBmisc_1.7              Biobase_2.24.0          BiocParallel_0.6.1
[7] biomaRt_2.20.0          bitops_1.0-6            brew_1.0-6
[10] BSgenome_1.32.0         checkmate_1.5.0         codetools_0.2-9
[13] DBI_0.3.1               digest_0.6.4            fail_1.2
[16] foreach_1.4.2           GenomicAlignments_1.0.6 GenomicFeatures_1.16.3
[19] iterators_1.0.7         RCurl_1.95-4.3          RSQLite_0.11.4
[22] rtracklayer_1.24.2      sendmailR_1.2-1         stats4_3.1.1
[25] stringr_0.6.2           tools_3.1.1             XML_3.98-1.1
[28] zlibbioc_1.10.0

Thank you for your time and consideration.

Wade

variantannotation • 4.1k views

ADD COMMENT • link updated 9.9 years ago by Martin Morgan 25k • written 11.2 years ago by Wade Copeland ▴ 20

score 2 · Answer 1 · 2016-03-05

You might try readGeno() to read just a single GENO field. Or use VcfFile(fl, yieldSize=10000) as well as the ScanBamParam() you're using. The code would look something like

vcffile = open(VcfFile(fl, yieldSize=10000))
repeat {
    vcf = readVcf(vcffile, "hg19", param=param)
    if (nrow(vcf) == 0)
        break
    ## do work on chunk
}

This iteration approach also works with readGeno().

score 1 · Answer 2 · 2014-11-14

Hello,

There are a couple of approaches to working with large VCF files. One method is to read in data subsets and the other is to use a 'yieldSize'. Examples of how to yield through a file are at the bottom of the ?readVcf man page (section titled 'Iterate through VCF with yieldSize').

The more common situation is that only a portion of the data is really needed for an analysis. In this case you can use the ScanVcfParam to read in subsets by range, variable or sample. See ?ScanVcfParam for full details.

fl <- "ALL.chr21.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz"

For selecting subsets it's useful to look at the header to get variable and sample names. See ?VCFHeader.

hdr <- scanVcfHeader(fl)
hdr
class: VCFHeader 
samples(2504): HG00096 HG00097 ... NA21143 NA21144
meta(2): META contig
fixed(2): FILTER ALT
info(24): CIEND CIPOS ... DP AA
geno(1): GT

Read in a couple of INFO fields but no genotype data.

param <- ScanVcfParam(info = c("DP", "AA"), geno=NA)
vcf <- readVcf(fl, "b37", param=param)
dim(vcf)
[1] 1105538       0
head(info(vcf))
DataFrame with 6 rows and 2 columns
                      DP          AA
               <integer> <character>
21:9411239_G/A     20675           .
rs181691356        22649           .
21:9411264_A/C     28638           .
21:9411267_G/T     29348           .
21:9411302_G/T     30980           .
21:9411313_G/A     30734           .

Calling range() on the GRanges in rowData() gives you an idea of the total range of positions. You may already have a region of interest in mind but if not, this is informative.

range(rowData(vcf))
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]       21 [9411239, 48119740]      *
  -------
  seqinfo: 86 sequences from hg19 genome

Read in all variables in a given range.

param <- ScanVcfParam(which = GRanges("21", IRanges(9e7, 9.5e7)))
vcf_range <- readVcf(fl, "b37", param=param)
dim(vcf_range)
[1] 19605  2504
names(info(vcf_range))
 [1] "CIEND"     "CIPOS"     "CS"        "END"       "IMPRECISE" "MC"       
 [7] "MEINFO"    "MEND"      "MLEN"      "MSTART"    "SVLEN"     "SVTYPE"   
[13] "TSD"       "AC"        "AF"        "NS"        "AN"        "EAS_AF"   
[19] "EUR_AF"    "AFR_AF"    "AMR_AF"    "SAS_AF"    "DP"        "AA"       
geno(vcf_range)
List of length 1
names(1): GT

Other functions read in single variables as a matrix, such as readGT() or readInfo(). These avoid the overhead of parsing all data into a VCF object and are useful when you want only a single variable. See ?readGT for details.

Valerie

score 0 · Answer 3 · 2016-03-04

Hi Valerie,

I seem to have the same problem (it's a very large VCF file, with 4,500 samples), but I am indeed subsetting the file.
I compressed it and tabix'ed it. and I'm requesting
fl <- "exome.recalibrated.vcf.gz"
hdr <- scanVcfHeader(fl)
vcf <- readVcf(fl, genome="hg38", param=ScanVcfParam(info="DP", geno=c("GT","DP")))

However, I'm getting the same error.
Any suggestions?

thanks,
Francesco

Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
path: exome.recalibrated.vcf.gz
index: exome.recalibrated.vcf.gz.tbi
path: exome.recalibrated.vcf.gz

R version 3.1.0 (2014-04-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] VariantAnnotation_1.10.5 Rsamtools_1.16.1         Biostrings_2.32.1
[4] XVector_0.4.0            GenomicRanges_1.16.4     GenomeInfoDb_1.0.2
[7] IRanges_1.22.10          BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.1    base64enc_0.1-2         BatchJobs_1.6
[4] BBmisc_1.9              Biobase_2.24.0          BiocParallel_0.6.1
[7] biomaRt_2.20.0          bitops_1.0-6            brew_1.0-6
[10] BSgenome_1.32.0         checkmate_1.6.0         codetools_0.2-8
[13] DBI_0.3.1               digest_0.6.8            fail_1.2
[16] foreach_1.4.2           GenomicAlignments_1.0.6 GenomicFeatures_1.16.3
[19] iterators_1.0.7         RCurl_1.95-4.7          RSQLite_1.0.0
[22] rtracklayer_1.24.2      sendmailR_1.2-1         stats4_3.1.0
[25] stringr_0.6.2           tools_3.1.0             XML_3.98-1.3
[28] zlibbioc_1.10.0

score 0 · Answer 4 · 2016-03-04

I also tried to use a more updated version, but the error is the same

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] VariantAnnotation_1.16.4   Rsamtools_1.22.0
[3] Biostrings_2.38.4          XVector_0.10.0
[5] SummarizedExperiment_1.0.2 Biobase_2.30.0
[7] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3
[9] IRanges_2.4.8              S4Vectors_0.8.11
[11] BiocGenerics_0.16.1        BiocInstaller_1.20.1

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.32.3    GenomicAlignments_1.6.3 zlibbioc_1.16.0
[4] BiocParallel_1.4.3      BSgenome_1.38.0         tools_3.2.2
[7] DBI_0.3.1               lambda.r_1.1.7          futile.logger_1.4.1
[10] rtracklayer_1.30.2      futile.options_1.0.0    bitops_1.0-6
[13] RCurl_1.95-4.8          biomaRt_2.26.1          RSQLite_1.0.0
[16] GenomicFeatures_1.22.13 XML_3.98-1.4