Question

Gviz and AnnotateTrack and memory usage

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 6 months ago

Palo Alto, CA, USA

Dear all,

I am using GVIZ package in order to read a list of ChIP-seq peaks (30 000 peaks) in BED format; when I transform the data into the AnnotationTrack in GVIZ, it says "cannot allocate vector of size 35.3 Gb" ;)

It is difficult to imagine how a list of 30 000 peaks require 35Gb memory; please could you advise if there is a way to optimize the R code. thanks ;)

-- bogdan

The R code is :

library(ggbio)
library(Gviz)
library(BSgenome)

library(GenomicRanges)
library(GenomicFeatures)
library(biovizBase)
library(rtracklayer)

library(BSgenome.Mmusculus.UCSC.mm9)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)

PROTEIN_CHIP <-rtracklayer::import("file.bed",format="bed")
PROTEIN_CHIP_track <- AnnotationTrack(PROTEIN_CHIP, genome = "mm9",name="PROTEIN")
Error: cannot allocate vector of size 35.3 Gb

gviz • 2.1k views

ADD COMMENT • link updated 8.8 years ago by florian.hahne@novartis.com ★ 1.6k • written 9.2 years ago by Bogdan ▴ 670

0

Entering edit mode

Please add the output of sessionInfo().

ADD REPLY • link 9.2 years ago Dan Tenenbaum ★ 8.2k

score 0 · Answer 1 · 2015-01-27

Thanks Dan. It works now with DataTrack, however it does not display any of the tracks, shall I use :

plotTracks(PROTEIN_CHIP,chromosome="chr1"). Any insights on why this is happening ? The sessionInfo() is :

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 grid parallel stats graphics grDevices utils
[8] datasets methods base

other attached packages:
[1] BSgenome.Mmusculus.UCSC.mm9_1.4.0 biovizBase_1.14.1
[3] GenomicFeatures_1.18.2            AnnotationDbi_1.28.1
[5] Biobase_2.26.0                    BSgenome_1.34.0
[7] rtracklayer_1.26.2                Biostrings_2.34.1
[9] XVector_0.6.0                     Gviz_1.10.3
[11] GenomicRanges_1.18.3              GenomeInfoDb_1.2.3
[13] IRanges_2.0.1                     S4Vectors_0.4.0
[15] ggbio_1.14.0                      ggplot2_1.0.0
[17] BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] acepack_1.3-3.3          base64enc_0.1-2          BatchJobs_1.5
[4] BBmisc_1.8               BiocParallel_1.0.0       biomaRt_2.22.0
[7] bitops_1.0-6             brew_1.0-6               checkmate_1.5.1
[10] cluster_1.15.3           codetools_0.2-10         colorspace_1.2-4
[13] DBI_0.3.1                dichromat_2.0-0          digest_0.6.6
[16] fail_1.2                 foreach_1.4.2            foreign_0.8-62
[19] Formula_1.1-2            GenomicAlignments_1.2.1 GGally_0.5.0
[22] graph_1.44.1             gridExtra_0.9.1          gtable_0.1.2
[25] Hmisc_3.14-6             iterators_1.0.7          lattice_0.20-29
[28] latticeExtra_0.6-26      MASS_7.3-37              matrixStats_0.12.2
[31] munsell_0.4.2            nnet_7.3-8               OrganismDbi_1.8.0
[34] plyr_1.8.1               proto_0.3-10             RBGL_1.42.0
[37] RColorBrewer_1.1-2       Rcpp_0.11.3              RCurl_1.95-4.5
[40] reshape_0.8.5            reshape2_1.4.1           R.methodsS3_1.6.1
[43] rpart_4.1-8              Rsamtools_1.18.2         RSQLite_1.0.0
[46] scales_0.2.4             sendmailR_1.2-1          splines_3.1.2
[49] stringr_0.6.2            survival_2.37-7          tools_3.1.2
[52] VariantAnnotation_1.12.7 XML_3.98-1.1             zlibbioc_1.12.0

score 0 · Answer 2 · 2015-01-28

Hi Bogdan,

you will have to be a bit more specific about what is actually failing with the DataTrack solution. What is the actual code you tried, and what is the error you are seeing? It would also be helpful to see the first couple of lines of your PROTEIN_CHIP GRanges object to know what you have actually read in. In general, I would argue against loading complete genome-wide data into any R object since the copy-by-value semantic will quickly get you into serious memory issues. Instead, use an indexed file type for you ChIPSeq data and let Gviz worry about the data retrieval on demand. The obvious choice here would be the bigWig format. BED is a notoriously bad choice for numeric data (which I assume your ChiPSeq peaks are, or are you only talking about peak locations here), you may at least want to try BedGraph (which can't be indexed if I am not mistaken). UCSC has a nice listing of the different file types: http://genome.ucsc.edu/FAQ/FAQformat.html

Please also see the package vignette about creating AnnotationTracks or DataTracks directly from files. For the typical file types you will not have to do that detour via rtracklayer.

score 0 · Answer 3 · 2015-01-28

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 6 months ago

Palo Alto, CA, USA

Hi Florian,

thank you for your email and suggestions ! if I may, I will send you the file with the list of peaks in a private email ;)

-- bogdan

ADD COMMENT • link 9.2 years ago Bogdan ▴ 670

score 0 · Answer 4 · 2015-01-29

0

Entering edit mode

florian.hahne@novartis.com ★ 1.6k

@florianhahnenovartiscom-3784

Last seen 5.6 years ago

Switzerland

Sure.

And now I still need to add a sentence because for whatever strange reason the portal wants me to write at least 20 characters...

ADD COMMENT • link 9.2 years ago florian.hahne@novartis.com ★ 1.6k

score 0 · Answer 5 · 2015-07-08

Thanks Florian again for the insights on the memory usage and AnnotationTrack and DataTrack settinsg and issues.

Wondering if there is any other way to import the long BED files into my Gviz session without using wig or graph files.

To be more specific : I have a list of 30000-40000 peaks of CTCF in BED format (optionally I could add a score of 10 to each). The list of peaks is below, and the code I use is :

CTCF_minusE2 <- read.delim("CTCF_minusE2_HOMERpeaks.m1e1.converted-hg19.with-score10", header=TRUE)
CTCF_minusE2_ranges <- GRanges(seqnames = CTCF_minusE2$chr, ranges = IRanges(CTCF_minusE2$start,CTCF_minusE2$end), strand ="*", score=CTCF_minusE2$score)

CTCF_minusE2_annotation_track <- AnnotationTrack(CTCF_minusE2_ranges, genome = "hg19", chromosome = chr, start = start, end = end, name="CTCF_minusE2", fill = "red", stacking="dense")
CTCF_minusE2_data_track <- DataTrack(CTCF_minusE2_ranges, genome = "hg19", chromosome = chr, start = start, end = end, name="CTCF_minusE2", fill = "red", type ="p")

For CTCF_minusE2_annotation_track, the Error is : " cannot allocate vector of size 10.8 Gb". Is there any way to circumvent it ? Thanks a lot !

Here is the list of BED peaks in the file : "CTCF_minusE2_HOMERpeaks.m1e1.converted-hg19.with-score10" (I could send you the full list by regular email or via this website, if the supportBioC list accepts 70 000 peaks).

chr   start   end   score
chr1   10225   12362   10
chr1   14655   16655   10
chr1   90387   92387   10
chr1   104044   106044   10
chr1   136442   138442   10
chr1   139325   141325   10
chr1   236690   238690   10
chr1   250444   252444   10
chr1   324041   326041   10
chr1   368522   370522   10
chr1   388021   390021   10
chr1   415688   417688   10
chr1   422283   424283   10
chr1   436849   438849   10
chr1   455066   457066   10
chr1   460944   462944   10
chr1   526707   528707   10
chr1   544138   546138   10
chr1   565718   567718   10
chr1   572799   574799   10
chr1   600518   602518   10
chr1   663702   665702   10

score 0 · Answer 6 · 2015-07-09

0

Entering edit mode

florian.hahne@novartis.com ★ 1.6k

@florianhahnenovartiscom-3784

Last seen 5.6 years ago

Switzerland

Hm,

if there are really just 40-50k rows in the GRanges object this 10.8G assignment looks odd. I didn't do much memory optimization in Gviz yet, but there is nothing obvious that comes to mind. Could you send that BED file via email address, please (florian dot hahne add novartis dot com) I'll take a closer look to see what goes wrong here.

ADD COMMENT • link 8.8 years ago florian.hahne@novartis.com ★ 1.6k

score 0 · Answer 7 · 2015-07-09

0

Entering edit mode

florian.hahne@novartis.com ★ 1.6k

@florianhahnenovartiscom-3784

Last seen 5.6 years ago

Switzerland

Oh, and please gzip the BED file before sending.

ADD COMMENT • link 8.8 years ago florian.hahne@novartis.com ★ 1.6k