Hey everyone, I'm having an issue with running DESeq2 and hoping someone can help or give some advice. I'm very new to this and don't know a lot about CPU memory/parallel or series program running, etc. It's similar to a question posted 2 years ago @ GenomicAlignments is too slow with summarizeOverlaps . Sorry in advance for posting such a wordy Q!
Basically, I have a folder with 26 bam files (13 paired comparisons of 2 tissue types) and one GTF file. I first tried running DESeq2 with just one comparison, so the 2 bam files with GTF. After the first attempt froze my computer, I changed the yieldSize = 1000000 -as suggested in previous post- and ran it just on my home desktop. It did work, with the last step "summarizeOverlaps" taking about 2 hours to yield a result.
So I then moved on to try to run all comparisons in one go. We have access to two remote desktops which have far greater memory and capacity then my home desktop, so I ran this on our remote desktop. The files themselves are LARGE. They range from 13 to 29 GB and the GTF 1.25 GB. I would have to ask about the memory capacity of our remote desktop, but we have run similar files with other programs there before with success so I am not especially concerned about memory or CPU capacity.
What I'm wondering is if anyone has any advice for me. I have changed the yield size and played around a bit with MultiParam as suggested in the previous post, but I'm not having any luck. I've tried running it twice overnight without success and again I ran the summarizeOverlaps se command over the weekend from Friday-Monday but still it's either still running, froze, or just not working. The commands I ran in R are here:
*note-- they are the same commands I ran with the 2 samples successfully*
source("https://bioconductor.org/biocLite.R")
library (Rsamtools)
library (GenomicFeatures)
library (GenomicAlignments)
> Table <- data.frame(
+ sampleName = c("11-1732a", "11-1732b", "11-2033a", "11-2033b", "11-2104a", "11-2104b", "11-2154a", "11-2154b", "11-2253a", "11-2253b", "12-0046a", "12-0046b", "12-0210a", "12-0210b", "12-0276a", "12-0276b", "12-0396a", "12-0396b", "12-0408a", "12-0408b", "12-0530a", "12-0530b", "12-0538a", "12-0538b", "12-0732a", "12-0732b"),
+ fileName=c("11-1732-PR.hg38.coord.bam", "11-1732-PRCS.hg38.coord.bam", "11-2033-PR.hg38.coord.bam", "11-2033-PRCS.hg38.coord.bam", "11-2104-PR.hg38.coord.bam", "11-2104-PRCS.hg38.coord.bam", "11-2154-PR.hg38.coord.bam", "11-2154-PRCS.hg38.coord.bam", "11-2253-PR.hg38.coord.bam", "11-2253-PRCS.hg38.coord.bam", "12-0046-PR.hg38.coord.bam", "12-0046-PRCS.hg38.coord.bam", "12-0210-PR.hg38.coord.bam", "12-0210-PRCS.hg38.coord.bam", "12-0276-PR.hg38.coord.bam", "12-0276-PRCS.hg38.coord.bam", "12-0396-PR.hg38.coord.bam", "12-0396-PRCS.hg38.coord.bam", "12-0408-PR.hg38.coord.bam", "12-0408-PRCS.hg38.coord.bam", "12-0530-PR.hg38.coord.bam", "12-0530-PRCS.hg38.coord.bam", "12-0538-PR.hg38.coord.bam", "12-0538-PRCS.hg38.coord.bam", "12-0732-PR.hg38.coord.bam", "12-0732-PRCS.hg38.coord.bam"),
+ condition=c("PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS", "PR", "PRCS")
+ )
> bamFiles <- file.path(Table$fileName)
> seqinfo(BamFile(bamFiles[1]))
> hse <- makeTxDbFromGFF("filepathway/gencode.v25.annotation.gtf", format="gtf")
> exonsByGene <- exonsBy (hse, by="gene")
> se <- summarizeOverlaps (exonsByGene, BamFileList (bamFiles, yieldSize = 1000000), mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE)
Any advice or comments are welcomed.
Addn: The remote desktop I am using has 32 GB RAM.
can you add sessioInfo() to your output? also indicate operating system of the 'remote desktop'.
I can, and I will do so as soon as I can. We are going through some issues with our firewall but when things are running smoothly, I'll post!
The operating system is CentOS 6. Admin is in the process of upgrading the system to CentOS 7. Here is what was given after including sessionInfo():
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base