Gviz and AnnotateTrack and memory usage
7
0
Entering edit mode
Bogdan ▴ 670
@bogdan-2367
Last seen 6 months ago
Palo Alto, CA, USA

Dear all,

I am using GVIZ package in order to read a list of ChIP-seq peaks (30 000 peaks) in BED format; when I transform the data into the AnnotationTrack in GVIZ, it says "cannot allocate vector of size 35.3 Gb" ;)

It is difficult to imagine how a list of 30 000 peaks require 35Gb memory; please could you advise if there is a way to optimize the R code. thanks ;)

-- bogdan

The R code is :

 

library(ggbio)
library(Gviz)
library(BSgenome)

library(GenomicRanges)
library(GenomicFeatures)
library(biovizBase)
library(rtracklayer)

library(BSgenome.Mmusculus.UCSC.mm9)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)

PROTEIN_CHIP <-rtracklayer::import("file.bed",format="bed")
PROTEIN_CHIP_track <- AnnotationTrack(PROTEIN_CHIP, genome = "mm9",name="PROTEIN")
Error: cannot allocate vector of size 35.3 Gb
 

 

gviz • 2.1k views
ADD COMMENT
0
Entering edit mode

Please add the output of sessionInfo().

ADD REPLY
0
Entering edit mode
Bogdan ▴ 670
@bogdan-2367
Last seen 6 months ago
Palo Alto, CA, USA

Thanks Dan. It works now with DataTrack, however it does not display any of  the tracks, shall I use :

plotTracks(PROTEIN_CHIP,chromosome="chr1"). Any insights on why this is happening ? The sessionInfo() is :

 

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    grid      parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm9_1.4.0 biovizBase_1.14.1                
 [3] GenomicFeatures_1.18.2            AnnotationDbi_1.28.1             
 [5] Biobase_2.26.0                    BSgenome_1.34.0                  
 [7] rtracklayer_1.26.2                Biostrings_2.34.1                
 [9] XVector_0.6.0                     Gviz_1.10.3                      
[11] GenomicRanges_1.18.3              GenomeInfoDb_1.2.3               
[13] IRanges_2.0.1                     S4Vectors_0.4.0                  
[15] ggbio_1.14.0                      ggplot2_1.0.0                    
[17] BiocGenerics_0.12.1              

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3          base64enc_0.1-2          BatchJobs_1.5           
 [4] BBmisc_1.8               BiocParallel_1.0.0       biomaRt_2.22.0          
 [7] bitops_1.0-6             brew_1.0-6               checkmate_1.5.1         
[10] cluster_1.15.3           codetools_0.2-10         colorspace_1.2-4        
[13] DBI_0.3.1                dichromat_2.0-0          digest_0.6.6            
[16] fail_1.2                 foreach_1.4.2            foreign_0.8-62          
[19] Formula_1.1-2            GenomicAlignments_1.2.1  GGally_0.5.0            
[22] graph_1.44.1             gridExtra_0.9.1          gtable_0.1.2            
[25] Hmisc_3.14-6             iterators_1.0.7          lattice_0.20-29         
[28] latticeExtra_0.6-26      MASS_7.3-37              matrixStats_0.12.2      
[31] munsell_0.4.2            nnet_7.3-8               OrganismDbi_1.8.0       
[34] plyr_1.8.1               proto_0.3-10             RBGL_1.42.0             
[37] RColorBrewer_1.1-2       Rcpp_0.11.3              RCurl_1.95-4.5          
[40] reshape_0.8.5            reshape2_1.4.1           R.methodsS3_1.6.1       
[43] rpart_4.1-8              Rsamtools_1.18.2         RSQLite_1.0.0           
[46] scales_0.2.4             sendmailR_1.2-1          splines_3.1.2           
[49] stringr_0.6.2            survival_2.37-7          tools_3.1.2             
[52] VariantAnnotation_1.12.7 XML_3.98-1.1             zlibbioc_1.12.0

 

ADD COMMENT
0
Entering edit mode
@florianhahnenovartiscom-3784
Last seen 5.6 years ago
Switzerland

Hi Bogdan,

you will have to be a bit more specific about what is actually failing with the DataTrack solution. What is the actual code you tried, and what is the error you are seeing? It would also be helpful to see the first couple of lines of your PROTEIN_CHIP GRanges object to know what you have actually read in. In general, I would argue against loading complete genome-wide data into any R object since the copy-by-value semantic will quickly get you into serious memory issues. Instead, use an indexed file type for you ChIPSeq data and let Gviz worry about the data retrieval on demand. The obvious choice here would be the bigWig format. BED is a notoriously bad choice for numeric data (which I assume your ChiPSeq peaks are, or are you only talking about peak locations here), you may at least want to try BedGraph (which can't be indexed if I am not mistaken). UCSC has a nice listing of the different file types: http://genome.ucsc.edu/FAQ/FAQformat.html

Please also see the package vignette about creating AnnotationTracks or DataTracks directly from files. For the typical file types you will not have to do that detour via rtracklayer.

ADD COMMENT
0
Entering edit mode
Bogdan ▴ 670
@bogdan-2367
Last seen 6 months ago
Palo Alto, CA, USA

Hi Florian,

thank you for your email and suggestions ! if I may, I will send you the file with the list of peaks in a private email ;)

 

-- bogdan

ADD COMMENT
0
Entering edit mode
@florianhahnenovartiscom-3784
Last seen 5.6 years ago
Switzerland

Sure.

And now I still need to add a sentence because for whatever strange reason the portal wants me to write at least 20 characters...

ADD COMMENT
0
Entering edit mode
Bogdan ▴ 670
@bogdan-2367
Last seen 6 months ago
Palo Alto, CA, USA

Thanks Florian again for the insights on the memory usage and AnnotationTrack and DataTrack settinsg and issues.

Wondering if there is any other way to import the long BED files into my Gviz session without using wig or graph files.

To be more specific : I have a list of 30000-40000 peaks of CTCF in BED format (optionally I could add a score of 10 to each). The list of peaks is below, and the code I use is :

CTCF_minusE2 <-   read.delim("CTCF_minusE2_HOMERpeaks.m1e1.converted-hg19.with-score10", header=TRUE)
CTCF_minusE2_ranges <- GRanges(seqnames = CTCF_minusE2$chr, ranges = IRanges(CTCF_minusE2$start,CTCF_minusE2$end), strand ="*", score=CTCF_minusE2$score)

CTCF_minusE2_annotation_track <- AnnotationTrack(CTCF_minusE2_ranges, genome = "hg19", chromosome = chr, start = start, end = end, name="CTCF_minusE2", fill = "red", stacking="dense")
CTCF_minusE2_data_track <- DataTrack(CTCF_minusE2_ranges, genome = "hg19",  chromosome = chr, start = start, end = end, name="CTCF_minusE2", fill = "red", type ="p")

For CTCF_minusE2_annotation_track, the Error is : " cannot allocate vector of size 10.8 Gb". Is there any way to circumvent it ? Thanks a lot !

 

Here is the list of BED peaks in the file : "CTCF_minusE2_HOMERpeaks.m1e1.converted-hg19.with-score10" (I could send you the full list by regular email or via this website, if the supportBioC list accepts 70 000 peaks).

chr    start    end    score
chr1    10225    12362    10
chr1    14655    16655    10
chr1    90387    92387    10
chr1    104044    106044    10
chr1    136442    138442    10
chr1    139325    141325    10
chr1    236690    238690    10
chr1    250444    252444    10
chr1    324041    326041    10
chr1    368522    370522    10
chr1    388021    390021    10
chr1    415688    417688    10
chr1    422283    424283    10
chr1    436849    438849    10
chr1    455066    457066    10
chr1    460944    462944    10
chr1    526707    528707    10
chr1    544138    546138    10
chr1    565718    567718    10
chr1    572799    574799    10
chr1    600518    602518    10
chr1    663702    665702    10

 

 

 

 

 

ADD COMMENT
0
Entering edit mode
@florianhahnenovartiscom-3784
Last seen 5.6 years ago
Switzerland

Hm,

if there are really just 40-50k rows in the GRanges object this 10.8G assignment looks odd. I didn't do much memory optimization in Gviz yet, but there is nothing obvious that comes to mind. Could you send that BED file via email address, please (florian dot hahne add novartis dot com) I'll take a closer look to see what goes wrong here.

 

ADD COMMENT
0
Entering edit mode
@florianhahnenovartiscom-3784
Last seen 5.6 years ago
Switzerland

Oh, and please gzip the BED file before sending.

ADD COMMENT

Login before adding your answer.

Traffic: 787 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6