DESeq2 "summarizeOverlaps" "se" not working, freezing, or too slow
1
0
Entering edit mode
@bioinfstudent-12010
Last seen 7.9 years ago

Hey everyone, I'm having an issue with running DESeq2 and hoping someone can help or give some advice. I'm very new to this and don't know a lot about CPU memory/parallel or series program running, etc. It's similar to a question posted 2 years ago @ GenomicAlignments is too slow with summarizeOverlaps . Sorry in advance for posting such a wordy Q!

Basically, I have a folder with 26 bam files (13 paired comparisons of 2 tissue types) and one GTF file. I first tried running DESeq2 with just one comparison, so the 2 bam files with GTF.  After the first attempt froze my computer, I changed the yieldSize = 1000000 -as suggested in previous post- and ran it just on my home desktop. It did work, with the last step "summarizeOverlaps" taking about 2 hours to yield a result.

So I then moved on to try to run all comparisons in one go. We have access to two remote desktops which have far greater memory and capacity then my home desktop, so I ran this on our remote desktop. The files themselves are LARGE. They range from 13 to 29 GB and the GTF 1.25 GB. I would have to ask about the memory capacity of our remote desktop, but we have run similar files with other programs there before with success so I am not especially concerned about memory or CPU capacity.

What I'm wondering is if anyone has any advice for me. I have changed the yield size and played around a bit with MultiParam as suggested in the previous post, but I'm not having any luck. I've tried running it twice overnight without success and again I ran the summarizeOverlaps se command over the weekend from Friday-Monday but still it's either still running, froze, or just not working. The commands I ran in R are here:

*note-- they are the same commands I ran with the 2 samples successfully*

 

source("https://bioconductor.org/biocLite.R")
library (Rsamtools)
library (GenomicFeatures)
library (GenomicAlignments)

> Table <- data.frame(
+ sampleName = c("11-1732a", "11-1732b", "11-2033a", "11-2033b", "11-2104a", "11-2104b", "11-2154a", "11-2154b", "11-2253a", "11-2253b", "12-0046a", "12-0046b", "12-0210a", "12-0210b", "12-0276a", "12-0276b", "12-0396a", "12-0396b", "12-0408a", "12-0408b", "12-0530a", "12-0530b", "12-0538a", "12-0538b", "12-0732a", "12-0732b"),
+ fileName=c("11-1732-PR.hg38.coord.bam", "11-1732-PRCS.hg38.coord.bam", "11-2033-PR.hg38.coord.bam", "11-2033-PRCS.hg38.coord.bam", "11-2104-PR.hg38.coord.bam", "11-2104-PRCS.hg38.coord.bam", "11-2154-PR.hg38.coord.bam", "11-2154-PRCS.hg38.coord.bam", "11-2253-PR.hg38.coord.bam", "11-2253-PRCS.hg38.coord.bam", "12-0046-PR.hg38.coord.bam", "12-0046-PRCS.hg38.coord.bam", "12-0210-PR.hg38.coord.bam", "12-0210-PRCS.hg38.coord.bam", "12-0276-PR.hg38.coord.bam", "12-0276-PRCS.hg38.coord.bam", "12-0396-PR.hg38.coord.bam", "12-0396-PRCS.hg38.coord.bam", "12-0408-PR.hg38.coord.bam", "12-0408-PRCS.hg38.coord.bam", "12-0530-PR.hg38.coord.bam", "12-0530-PRCS.hg38.coord.bam", "12-0538-PR.hg38.coord.bam", "12-0538-PRCS.hg38.coord.bam", "12-0732-PR.hg38.coord.bam", "12-0732-PRCS.hg38.coord.bam"),
+ condition=c("PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS")
+ )

> bamFiles <- file.path(Table$fileName)
> seqinfo(BamFile(bamFiles[1]))
> hse <- makeTxDbFromGFF("filepathway/gencode.v25.annotation.gtf", format="gtf")
> exonsByGene <- exonsBy (hse, by="gene")
> se <- summarizeOverlaps (exonsByGene, BamFileList (bamFiles, yieldSize = 1000000), mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE)

Any advice or comments are welcomed.

 

bioclite bioconductor summarizeoverlaps genomicfeatures genomicalignments • 1.6k views
ADD COMMENT
0
Entering edit mode

Addn: The remote desktop I am using has 32 GB RAM.

ADD REPLY
0
Entering edit mode

can you add sessioInfo() to your output? also indicate operating system of the 'remote desktop'.

ADD REPLY
0
Entering edit mode

I can, and I will do so as soon as I can. We are going through some issues with our firewall but when things are running smoothly, I'll post!

ADD REPLY
0
Entering edit mode

The operating system is CentOS 6. Admin is in the process of upgrading the system to CentOS 7. Here is what was given after including sessionInfo():

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base 

 

 

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 10 hours ago
United States

hi,

If you continue to have difficulty using summarizeOverlaps, another option which I am now recommending for new users, is to quantify at the transcript level using a method such as Salmon or kallisto, and then to use tximport to generate a DESeqDataSet.

All you need for this pipeline are:

* a FASTA file of the transcript sequences
* a GTF file which links transcripts to genes

The transcript quantifiers take a few minutes to run per sample.

Once you've run one of the transcript quantifiers, you can follow the code here to produce a DESeqDataSet:

http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html

ADD COMMENT
0
Entering edit mode

Thank you for the feedback- I will look into running this program next time if I am unable to get summarizeOverlaps to work with my data. I appreciate the input!

ADD REPLY

Login before adding your answer.

Traffic: 600 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6