Question

Problem with summarizeOverlaps() when reading >1 BAM file: "stop worker failed"

2

ErickF ▴ 40

@erickf-11032

Last seen 9.5 years ago

Hi,

I recently started working with RNA-seq data. I used the code below to try to read 2-4 BAM files (each BAM and its BAI index in the same directory), but I repeatedly get the following error when running summarizeOverlaps():

Error: stop worker failed: 'clear_cluster' receive data failed: reached elapsed time limit

One other time I got this error (with the same code):

Error: 'bplapply' receive data failed: error reading from connection

The BAM files each hold ~40M single-end 75 bp reads and are ~2-2.5 GB in size (aligned with tophat2/bowtie2 against the hg19 reference genome). The code, sessionInfo(), and the last lines from traceback() are below. Of note, everything works fine if I process just one BAM file:

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> grl <- exonsBy(txdb, by="gene")
> bamLst
  BamFileList of length 4
  names(4): file1.bam file2.bam file3.bam file4.bam
> experiment2 <- summarizeOverlaps(features=grl, reads=bamLst, ignore.strand=T, singleEnd=T)
  Error: stop worker failed:
    'clear_cluster' receive data failed:
    reached elapsed time limit

> traceback()  
16: stop(.error_worker_comm(e, "stop worker failed"))  
15: value[[3L]](cond)  
14: tryCatchOne(expr, names, parentenv, handlers[[1L]])  
13: tryCatchList(expr, classes, parentenv, handlers)  

> sessionInfo()  
R version 3.3.1 (2016-06-21)  
Platform: x86_64-apple-darwin13.4.0 (64-bit)  
Running under: OS X 10.11.4 (El Capitan)  
attached base packages:  
[1] stats4  parallel  stats  graphics  grDevices utils  datasets  methods   base  
other attached packages:  
 [1] GenomicAlignments_1.8.3 Rsamtools_1.24.0           Biostrings_2.40.2  
 [4] XVector_0.12.0          SummarizedExperiment_1.2.3 Biobase_2.32.0  
 [7] GenomicRanges_1.24.2    GenomeInfoDb_1.8.1         IRanges_2.6.1  
[10] S4Vectors_0.10.1        BiocGenerics_0.18.0

It seems to me that this may be related to the computer's memory (8 GB), its number of cores (4), or something similar. Beyond using a more powerful computer, is there any way to fix (or circumvent) this?

summarizeoverlaps rnaseq bplapply rangedsummarizedexperiment read counting • 3.4k views

9.5 years ago • ErickF ▴ 40

score 2 · Accepted Answer · 2016-07-01

Update: It seems this was indeed related to computing resources (memory, cores, or something similar). I tried with smaller files and it worked. So I added a yieldSize argument to BamFileList() when creating "bamLst", to limit the number of reads read from each file at one time:

bamLst <- BamFileList(files1, yieldSize=7500000)

That seems to have fixed the problem, although I wonder whether it also makes things run slower. If anyone has other suggestions, let me know!
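
For anyone who wants to experiment with the chunk size, here is a minimal sketch of the same idea wrapped in a function. The helper name count_in_chunks and the default of one million reads per chunk are illustrative assumptions, not values from this thread; the right setting depends on available memory and on how many workers read files at the same time.

```r
library(Rsamtools)          # BamFileList()
library(GenomicAlignments)  # summarizeOverlaps()

# Hypothetical helper (not part of any package): count reads per feature
# while capping how many records are read from each BAM file at one time.
count_in_chunks <- function(bam_paths, features, yield_size = 1e6) {
    # yieldSize limits the number of records pulled from a file per chunk;
    # smaller values lower peak memory at the cost of more iterations.
    bam_list <- BamFileList(bam_paths, yieldSize = yield_size)
    summarizeOverlaps(features = features, reads = bam_list,
                      ignore.strand = TRUE, singleEnd = TRUE)
}

# Usage (paths are placeholders for your own indexed BAM files):
# se <- count_in_chunks(c("file1.bam", "file2.bam"), grl)
```

Because files are processed in parallel by default, the memory in use at any moment grows with both the chunk size and the number of workers, so lowering either one should reduce the pressure.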

9.5 years ago • ErickF ▴ 40

0

I would expect a yieldSize of > 100000 to be OK for speed. You could also process the files in serial with

BiocParallel::register(BiocParallel::SerialParam())

or perhaps see Rsubread::featureCounts() or the bamsignals package.
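
A minimal sketch of the serial route, assuming the bamLst and grl objects from the question. The wrapper name count_serially is hypothetical; it only registers SerialParam() for the duration of the call and then restores whatever back-end was the default before.

```r
library(BiocParallel)       # register(), bpparam(), SerialParam()
library(GenomicAlignments)  # summarizeOverlaps()

# Hypothetical helper (not part of any package): run summarizeOverlaps()
# one file at a time, so only a single chunk of reads is in memory.
count_serially <- function(bam_list, features) {
    previous <- bpparam()                   # current default back-end
    on.exit(register(previous), add = TRUE) # restore it when done
    register(SerialParam())                 # no worker processes
    summarizeOverlaps(features = features, reads = bam_list,
                      ignore.strand = TRUE, singleEnd = TRUE)
}

# Usage, with the objects from the original post:
# experiment2 <- count_serially(bamLst, grl)
```

Serial evaluation avoids the worker communication step where the original errors were raised, at the cost of handling the files one after another.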

9.5 years ago • Martin Morgan 25k

0

Will definitely try those, thanks!

9.5 years ago • ErickF ▴ 40