GenomicAlignments is too slow with summarizeOverlaps
bioprog • 0
@bioprog-7204
Last seen 6.7 years ago

I hope someone can help in this issue.

I have 8 BAM files from an mm9 alignment, each ~4-5 GB in size. When I run summarizeOverlaps over 3 of the files, it takes 2-3 hours to finish and it works, although my computer almost freezes up. But when I run summarizeOverlaps on all 8 BAM files together and leave it overnight (as it takes too long to wait), the computer freezes (even though it is a 16 GB i7 Mac, so supposedly powerful enough) and the command never returns anything. I even let it run for 30 hours; it looked like it was consuming memory (~600 MB of RAM) but still produced nothing, and I had to reboot the laptop.

I am making my own TxDb file from the GTF I used for the alignment, so that the chromosome naming matches (script is below).

Do you have any tips on how I can get summarizeOverlaps to work on the 8 files and create one SummarizedExperiment without freezing up the computer? I have been trying to do this for the past two weeks, always with the same result.

Any input is appreciated.

here’s the script:

library("DESeq2")
library("GenomicFeatures")
library("Rsamtools")
library("GenomicAlignments")
library("GenomicRanges")

mm9_from_cluster_gtf_txdb <- makeTranscriptDbFromGFF(file="~/Desktop/genes.gtf", format="gtf")
saveDb(mm9_from_cluster_gtf_txdb, file="/Path/To/Libraries/TxDB/mm9_from_cluster_Ensembl_txdb.sqlite")
exonsByGene <- exonsBy(mm9_from_cluster_gtf_txdb, by="gene")
seqinfo(exonsByGene)

fls <- list.files("Path/To/BamFiles", pattern="paired.accepted_hits.bam", full= TRUE)
fls

Experiment <- c(fls[2:8], fls[1])
Experiment

bamLst_experiment <- BamFileList(Experiment, yieldSize=100000)
seqinfo(bamLst_experiment)

se_test_experiment <- summarizeOverlaps(exonsByGene, bamLst_experiment, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE)  # <-- this is the step that freezes the computer when I run all 8 files together

sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicAlignments_1.2.0 Rsamtools_1.18.1 Biostrings_2.34.0 XVector_0.6.0
[5] GenomicRanges_1.18.3 GenomeInfoDb_1.2.2 IRanges_2.0.0 S4Vectors_0.4.0
[9] BiocGenerics_0.12.0 BiocInstaller_1.16.1

loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 BiocParallel_1.0.0 bitops_1.0-6
[6] brew_1.0-6 checkmate_1.5.0 codetools_0.2-9 DBI_0.3.1 digest_0.6.4
[11] fail_1.2 foreach_1.4.2 iterators_1.0.7 RSQLite_1.0.0 sendmailR_1.2-1
[16] stringr_0.6.2 tools_3.1.2 zlibbioc_1.12.0
@valerie-obenchain-4275
Last seen 8 months ago
United States

Hi,

The summarizeOverlaps() method for BamFileList runs the files in parallel with bplapply() from the BiocParallel package. This is described on the man page in the first paragraph of 'Details'.

The default number of cores used by bplapply() is determined by detectCores(), which is generally the maximum available. While 16 GB was enough to process 3 files concurrently, it isn't enough to process all 8 (or as many files as you have cores).

You can control the number of cores used by registering a BPPARAM. The default param on Unix and Mac is MulticoreParam; on Windows it is SnowParam. On my machine you can see the default is a MulticoreParam with 4 workers.

> library(BiocParallel)
> registered()
$MulticoreParam
class: MulticoreParam; bpisup: TRUE; bpworkers: 4; catch.errors: TRUE
setSeed: TRUE; recursive: TRUE; cleanup: TRUE; cleanupSignal: 15; verbose: FALSE

$SnowParam
class: SnowParam; bpisup: FALSE; bpworkers: 4; catch.errors: TRUE
cluster spec: 4; type: PSOCK

$BatchJobsParam
class: BatchJobsParam; bpisup: TRUE; bpworkers: NA; catch.errors: TRUE
cleanup: TRUE; stop.on.error: FALSE; progressbar: TRUE

$SerialParam
class: SerialParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE

Change the number of workers (and other parameters) by registering a param:

> register(MulticoreParam(workers = 2))
> registered()
$MulticoreParam
class: MulticoreParam; bpisup: TRUE; bpworkers: 2; catch.errors: TRUE
setSeed: TRUE; recursive: TRUE; cleanup: TRUE; cleanupSignal: 15; verbose: FALSE

$SnowParam
class: SnowParam; bpisup: FALSE; bpworkers: 4; catch.errors: TRUE
cluster spec: 4; type: PSOCK

$BatchJobsParam
class: BatchJobsParam; bpisup: TRUE; bpworkers: NA; catch.errors: TRUE
cleanup: TRUE; stop.on.error: FALSE; progressbar: TRUE

$SerialParam
class: SerialParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE

To run the code in serial,

register(MulticoreParam(workers = 1))

You may have to experiment with the number of cores and the 'yieldSize' of the BamFileList to find the optimal combination given the file sizes, cores and available memory. Let me know how it goes.
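Putting that together, the two tuning knobs might look roughly like this (a sketch only; the worker count, yieldSize and paths are placeholder values, and exonsByGene comes from the earlier script):

```r
## Sketch: register fewer workers and use a smaller yieldSize so that
## fewer reads are held in memory per file at any one time.
library(BiocParallel)
library(Rsamtools)
library(GenomicAlignments)

register(MulticoreParam(workers = 2))          # limit concurrent files

fls <- list.files("Path/To/BamFiles",
                  pattern = "paired.accepted_hits.bam", full = TRUE)
bamLst <- BamFileList(fls, yieldSize = 50000)  # smaller chunk per iteration

se <- summarizeOverlaps(exonsByGene, bamLst, mode = "Union",
                        singleEnd = FALSE, ignore.strand = TRUE,
                        fragments = TRUE)
```

Lowering yieldSize trades speed for memory: each worker reads fewer records per chunk, so peak RAM per worker drops.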

Valerie

@gordon-smyth
Last seen 10 minutes ago
WEHI, Melbourne, Australia

You might try the featureCounts() function in the Rsubread package. Comparisons published in:

show that it is considerably faster than summarizeOverlaps and uses considerably less memory. The efficiency gain is more than an order of magnitude for both time and memory, and is even greater if you have an unusually large number of genomic features. featureCounts() can make use of multiple cores (even without the BiocParallel package).
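For this use case, a minimal featureCounts() call might look like the following sketch; the annotation path, file list and thread count are assumptions based on the original post, not tested values:

```r
## Sketch: gene-level counting of paired-end BAM files with Rsubread,
## using the same GTF that was used for the alignment.
library(Rsubread)

fls <- list.files("Path/To/BamFiles",
                  pattern = "paired.accepted_hits.bam", full = TRUE)

fc <- featureCounts(files = fls,
                    annot.ext = "~/Desktop/genes.gtf",
                    isGTFAnnotationFile = TRUE,
                    GTF.featureType = "exon",
                    GTF.attrType = "gene_id",   # summarize exons to genes
                    isPairedEnd = TRUE,         # count fragments, not reads
                    nthreads = 4)

head(fc$counts)   # genes x samples count matrix
```

The fc$counts matrix can be passed to DESeq2 via DESeqDataSetFromMatrix().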

bioprog • 0
@bioprog-7204
Last seen 6.7 years ago

Hi Valerie-

Thank you for your prompt response. I ran the following

> library(BiocParallel)
> registered()

OUTPUT:
$MulticoreParam
class: MulticoreParam; bpisup: TRUE; bpworkers: 8; catch.errors: TRUE
setSeed: TRUE; recursive: TRUE; cleanup: TRUE; cleanupSignal: 15; verbose: FALSE

$SnowParam
class: SnowParam; bpisup: FALSE; bpworkers: 8; catch.errors: TRUE
cluster spec: 8; type: PSOCK

$BatchJobsParam
class: BatchJobsParam; bpisup: TRUE; bpworkers: NA; catch.errors: TRUE
cleanup: TRUE; stop.on.error: FALSE; progressbar: TRUE

$SerialParam
class: SerialParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE

I will try the script again after using:

register(MulticoreParam(workers = 1))

I will run this overnight now. But what did you mean by "To run the code in serial ..."? Does this mean if I want to run the 8 samples? I am not computer savvy, so I am not sure what most of the registered() output indicates. After choosing workers = 1, I got the following:

$MulticoreParam
class: MulticoreParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE
setSeed: TRUE; recursive: TRUE; cleanup: TRUE; cleanupSignal: 15; verbose: FALSE

$SnowParam
class: SnowParam; bpisup: FALSE; bpworkers: 8; catch.errors: TRUE
cluster spec: 8; type: PSOCK

$BatchJobsParam
class: BatchJobsParam; bpisup: TRUE; bpworkers: NA; catch.errors: TRUE
cleanup: TRUE; stop.on.error: FALSE; progressbar: TRUE

$SerialParam
class: SerialParam; bpisup: TRUE; bpworkers: 1; catch.errors: TRUE

Thank you for your promptness. I've been stuck for almost 2 weeks now at the same step.


'Serial' means to run one file at a time, with no parallel execution. This is what happens when you set the number of workers to 1. I meant that example more as a useful way to debug or test code, not what I would recommend for a final run. I should have explained that more clearly.

The output of registered() reports which parallel back-end will be used when code is run with a function from the BiocParallel package. Because you are on a Mac you can take advantage of shared memory with Multicore, so that appears as your default (vs a Snow cluster or BatchJobs). When you call register(MulticoreParam(workers = 2)) you are specifying that you want to use 2 workers instead of your default of 8.

You said a previous run of 3 files took 2-3 hours and memory was taxed (i.e., "computer almost freezes up"). Try running 2 files with a MulticoreParam(workers = 2), leaving yieldSize as is. If that goes OK, then try the 8 files with the same param. Otherwise, try reducing the yieldSize.

Valerie


Thanks Valerie- I am now setting the workers to 2 and trying 2 files. If that works, I will keep everything as is and try with 8 files, then decrease the yieldSize if it is still slow.

I was wondering: if I run each file separately (not in parallel), is there a way to combine the outcomes into a single table that I can input into DESeq2? My ultimate goal is to use DESeq2 for analysis of the RNA-Seq data.

Thanks


Using register(MulticoreParam(workers = 1)) with a BamFileList of all 8 files will run one file at a time and return all results together as a SummarizedExperiment object that can be used in DESeq2. The counts will be a matrix (with 8 columns) that you can access with assays(SummarizedExperiment).
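A sketch of that hand-off to DESeq2, assuming 'se' is the SummarizedExperiment returned by summarizeOverlaps(); the condition labels and their order are placeholders for the actual experimental design:

```r
## Sketch: attach sample information to the SummarizedExperiment and
## build a DESeq2 data set from it.
library(DESeq2)

## Placeholder design: first 4 samples control, last 4 treated --
## replace with the real sample-to-condition mapping.
colData(se)$condition <- factor(c(rep("control", 4), rep("treated", 4)))

dds <- DESeqDataSet(se, design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)   # per-gene differential expression results
```

The key point is that DESeqDataSet() consumes the SummarizedExperiment directly, so no manual table merging is needed.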

Valerie


How long should this process take for BAM files of ~5 GB each of paired-end data? Beyond that point I would know that I should probably restart the computer. Please advise.

Hussein


I used a 4.4 GB paired-end test file sorted by position. It took < 9.5 minutes and ~0.5 GB of RAM to run summarizeOverlaps with a yieldSize of 100000.

When you run into memory limitations or any performance issue, it's best to break the problem down. Try 1 file (or a subset of one, see ?ScanBamParam) instead of all 8. Trying out different values for yieldSize is much easier with 1 file than waiting for all 8 to finish. Once you've got a feel for a good yieldSize and how much memory is used, add another worker and go from there.
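As a sketch of that kind of scaled-down test run (the chromosome name and range are arbitrary placeholders, and 'exonsByGene'/'fls' come from the original script):

```r
## Sketch: count only one region of the first BAM file, to calibrate
## yieldSize and memory use before committing to all 8 files.
library(Rsamtools)
library(GenomicAlignments)

## Restrict counting to the first 50 Mb of chr1 (placeholder region).
param <- ScanBamParam(which = GRanges("chr1", IRanges(1, 50000000)))

bam1 <- BamFileList(fls[1], yieldSize = 100000)

se_sub <- summarizeOverlaps(exonsByGene, bam1, mode = "Union",
                            singleEnd = FALSE, ignore.strand = TRUE,
                            fragments = TRUE, param = param)
```

Note that 'which' requires the BAM file to be indexed (a .bai file alongside it).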

Valerie


thanks Valerie-

I will try all algorithms then summarize my findings here so that it may be of use to others. Much appreciated.

bioprog • 0
@bioprog-7204
Last seen 6.7 years ago

By the way, do you recommend that I increase or decrease the yieldSize? Frankly, my choice of 100,000 was totally arbitrary; I did not find a good justification for it.


Valerie
bioprog • 0
@bioprog-7204
Last seen 6.7 years ago

Thank you Valerie for your feedback. It finally worked. What I did was use workers = 1 and let the run go overnight. It worked without interruption; I checked 7 hours later and the results were there (it probably took less time). I kept the yieldSize at 100000. I guess it is best to run in serial (not parallel) in case many files would slow it down.

Thank you-

@valerie-obenchain-4275
Last seen 8 months ago
United States

You're welcome. I'm glad it worked for you.

The number of workers to use will depend on file size, the operation and available RAM. You're probably anxious to move to the next step with DESeq2 but I think with some exploration you could use multiple cores. Much of the code in Bioconductor runs in parallel if the resources are available. The general ideas for managing runs are those of yieldSize (or chunking) and choosing the number of workers. You can also use the functions directly from BiocParallel to run your own analysis in parallel. If you have funding available, another resource for large data computing is the Bioconductor AMI:

http://www.bioconductor.org/help/bioconductor-cloud-ami/
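As a sketch of calling BiocParallel directly (the worker count is a placeholder, and 'fls'/'exonsByGene' come from the earlier script), each file can be counted in its own worker and the per-file results combined at the end:

```r
## Sketch: one summarizeOverlaps() job per BAM file via bplapply(),
## then cbind the per-file SummarizedExperiments into one object.
library(BiocParallel)

countOne <- function(f) {
    bf <- Rsamtools::BamFileList(f, yieldSize = 100000)
    GenomicAlignments::summarizeOverlaps(exonsByGene, bf, mode = "Union",
                                         singleEnd = FALSE,
                                         ignore.strand = TRUE,
                                         fragments = TRUE)
}

seList <- bplapply(fls, countOne, BPPARAM = MulticoreParam(workers = 2))

## The row ranges are identical (same exonsByGene), so the column-wise
## combine yields one SummarizedExperiment with 8 columns.
se_all <- do.call(cbind, seList)
```

This is essentially what the BamFileList method does internally, but doing it by hand gives finer control over which files run when.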

Valerie

bioprog • 0
@bioprog-7204
Last seen 6.7 years ago

Thanks Valerie-

I think you make a good point. I will try running things with 2 workers and a yieldSize of 100000 and see how it goes. I will also look more into BiocParallel to run the analysis. Thanks a lot!