Hi,
I'm having major difficulties trying to get DEXSeq to run properly.
I've got 84 RNA-seq samples spread across two groups (49 in one, 35 in another) and my goal is to test for evidence of differential exon usage between the two. I'm using the GENCODE v.33 comprehensive annotation GTF (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencodehuman/release33/gencode.v33.annotation.gtf.gz) which I've processed into GFF using the provided Python script. I'm trying to run DEXSeq with the entire annotation rather than a subset, though I'm only interested in differential exon usage across a set of about 900 genes.
My code is:
library("DEXSeq")
countFiles <- c(list.files("/media/drive2/diff_exon_usage_dexseq/young/",full.names = T),list.files("/media/drive2/diff_exon_usage_dexseq/old/",full.names = T))
flattenedFile <- list.files("/home/oates_binf_1/ncbi/reference/",pattern="gff$",full.names=T)
rNames <- basename(countFiles)
rNames <- gsub(".bam_Aligned.out.sorted.bam_.*_dexseq.txt","",rNames)
rNames <- gsub("-","_",rNames)
sampleTable <- data.frame(row.names=rNames,condition=c(rep("young", 49),rep("old",35)))
dxd <- DEXSeqDataSetFromHTSeq(countFiles,sampleData = sampleTable,design = ~ sample + exon + condition:exon, flattenedfile = flattenedFile)
#normalisation
dxd <- estimateSizeFactors(dxd)
#multicore use
BPPARAM = MulticoreParam(8)
# estimate and plot dispersions
dxd <- estimateDispersions(dxd,BPPARAM=BPPARAM)
plotDispEsts(dxd)
#test for diff exon usage
dxd <- testForDEU( dxd, BPPARAM=BPPARAM)
dxd <- estimateExonFoldChanges(dxd,fitExpToVar="condition",BPPARAM=BPPARAM)
#results
dxr1 = DEXSeqResults(dxd)
Running this locally on 8 cores with 64GB RAM causes R to get stuck at the estimateDispersions() step. After running some tests locally I was able to determine that the most my computer can handle is 16 samples, which takes about 2 hours to complete. Going up to 18 samples causes R to get stuck. I also tried subsetting the dxd object to only have ~500 rows and running that on the full 84 samples, but to no avail.
Finally, I tried running this on my university's HPC cluster with access to 40 CPUs and 256GB RAM (setting BPPARAM = MulticoreParam(40)). DEXSeq used all the CPUs and most of the RAM with 84 samples, ran for 24hrs walltime and then the job died, having gotten stuck on the estimateDispersions() step.
My main question is: what on earth do I need to do to get DEXSeq to run on my whole sample set?
Supplementary questions:
- Am I asking the computer to perform some astronomically ridiculous calculation with estimateDispersions() on 84 samples, and if so, how do I speed up/optimise this?
- Am I configuring something wrong when running this on an HPC cluster? The manual mentions that it maybe necessary to configure BatchJobsParam() classes when running on a cluster - how do I do this and will it make a difference?
Thanks :) This has really been driving me up the wall.
Thanks Alejandro!
I'll give the alternative implementation a try. Re: BatchJobsParam() - apologies for my ignorance, but I've never had to implement BiocParallel on a cluster before. Would the implementation with DEXSeq be the same as for MulticoreParam(), i.e.:
followed by
and so on, or is it something else? I couldn't quite find a clear answer from looking at the various manual pages.
Thanks again!
Exactly, that would be it. The BiocParallel and batchtools vignettes have several templates to configure the cluster functions.