Question: DESeq2 "summarizeOverlaps" "se" not working, freezing, or too slow
gravatar for BioinfStudent
23 months ago by
BioinfStudent0 wrote:

Hey everyone, I'm having an issue with running DESeq2 and hoping someone can help or give some advice. I'm very new to this and don't know a lot about CPU memory/parallel or series program running, etc. It's similar to a question posted 2 years ago @ GenomicAlignments is too slow with summarizeOverlaps . Sorry in advance for posting such a wordy Q!

Basically, I have a folder with 26 bam files (13 paired comparisons of 2 tissue types) and one GTF file. I first tried running DESeq2 with just one comparison, so the 2 bam files with GTF.  After the first attempt froze my computer, I changed the yieldSize = 1000000 -as suggested in previous post- and ran it just on my home desktop. It did work, with the last step "summarizeOverlaps" taking about 2 hours to yield a result.

So I then moved on to try to run all comparisons in one go. We have access to two remote desktops which have far greater memory and capacity then my home desktop, so I ran this on our remote desktop. The files themselves are LARGE. They range from 13 to 29 GB and the GTF 1.25 GB. I would have to ask about the memory capacity of our remote desktop, but we have run similar files with other programs there before with success so I am not especially concerned about memory or CPU capacity.

What I'm wondering is if anyone has any advice for me. I have changed the yield size and played around a bit with MultiParam as suggested in the previous post, but I'm not having any luck. I've tried running it twice overnight without success and again I ran the summarizeOverlaps se command over the weekend from Friday-Monday but still it's either still running, froze, or just not working. The commands I ran in R are here:

*note-- they are the same commands I ran with the 2 samples successfully*


library (Rsamtools)
library (GenomicFeatures)
library (GenomicAlignments)

> Table <- data.frame(
+ sampleName = c("11-1732a", "11-1732b", "11-2033a", "11-2033b", "11-2104a", "11-2104b", "11-2154a", "11-2154b", "11-2253a", "11-2253b", "12-0046a", "12-0046b", "12-0210a", "12-0210b", "12-0276a", "12-0276b", "12-0396a", "12-0396b", "12-0408a", "12-0408b", "12-0530a", "12-0530b", "12-0538a", "12-0538b", "12-0732a", "12-0732b"),
+ fileName=c("11-1732-PR.hg38.coord.bam", "11-1732-PRCS.hg38.coord.bam", "11-2033-PR.hg38.coord.bam", "11-2033-PRCS.hg38.coord.bam", "11-2104-PR.hg38.coord.bam", "11-2104-PRCS.hg38.coord.bam", "11-2154-PR.hg38.coord.bam", "11-2154-PRCS.hg38.coord.bam", "11-2253-PR.hg38.coord.bam", "11-2253-PRCS.hg38.coord.bam", "12-0046-PR.hg38.coord.bam", "12-0046-PRCS.hg38.coord.bam", "12-0210-PR.hg38.coord.bam", "12-0210-PRCS.hg38.coord.bam", "12-0276-PR.hg38.coord.bam", "12-0276-PRCS.hg38.coord.bam", "12-0396-PR.hg38.coord.bam", "12-0396-PRCS.hg38.coord.bam", "12-0408-PR.hg38.coord.bam", "12-0408-PRCS.hg38.coord.bam", "12-0530-PR.hg38.coord.bam", "12-0530-PRCS.hg38.coord.bam", "12-0538-PR.hg38.coord.bam", "12-0538-PRCS.hg38.coord.bam", "12-0732-PR.hg38.coord.bam", "12-0732-PRCS.hg38.coord.bam"),
+ condition=c("PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS")
+ )

> bamFiles <- file.path(Table$fileName)
> seqinfo(BamFile(bamFiles[1]))
> hse <- makeTxDbFromGFF("filepathway/gencode.v25.annotation.gtf", format="gtf")
> exonsByGene <- exonsBy (hse, by="gene")
> se <- summarizeOverlaps (exonsByGene, BamFileList (bamFiles, yieldSize = 1000000), mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE)

Any advice or comments are welcomed.


ADD COMMENTlink modified 23 months ago by Michael Love20k • written 23 months ago by BioinfStudent0

Addn: The remote desktop I am using has 32 GB RAM.

ADD REPLYlink written 23 months ago by BioinfStudent0

can you add sessioInfo() to your output? also indicate operating system of the 'remote desktop'.

ADD REPLYlink written 23 months ago by Martin Morgan ♦♦ 22k

I can, and I will do so as soon as I can. We are going through some issues with our firewall but when things are running smoothly, I'll post!

ADD REPLYlink written 23 months ago by BioinfStudent0

The operating system is CentOS 6. Admin is in the process of upgrading the system to CentOS 7. Here is what was given after including sessionInfo():

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
[9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base 



ADD REPLYlink written 23 months ago by BioinfStudent0
gravatar for Michael Love
23 months ago by
Michael Love20k
United States
Michael Love20k wrote:


If you continue to have difficulty using summarizeOverlaps, another option which I am now recommending for new users, is to quantify at the transcript level using a method such as Salmon or kallisto, and then to use tximport to generate a DESeqDataSet.

All you need for this pipeline are:

* a FASTA file of the transcript sequences
* a GTF file which links transcripts to genes

The transcript quantifiers take a few minutes to run per sample.

Once you've run one of the transcript quantifiers, you can follow the code here to produce a DESeqDataSet:

ADD COMMENTlink written 23 months ago by Michael Love20k

Thank you for the feedback- I will look into running this program next time if I am unable to get summarizeOverlaps to work with my data. I appreciate the input!

ADD REPLYlink written 23 months ago by BioinfStudent0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 280 users visited in the last hour