DEXSeq with large(ish) number of samples
1
0
Entering edit mode
@andreismolnikov00-23426
Last seen 19 months ago

Hi,

I'm having major difficulties trying to get DEXSeq to run properly.

I've got 84 RNA-seq samples spread across two groups (49 in one, 35 in another) and my goal is to test for evidence of differential exon usage between the two. I'm using the GENCODE v.33 comprehensive annotation GTF (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencodehuman/release33/gencode.v33.annotation.gtf.gz) which I've processed into GFF using the provided Python script. I'm trying to run DEXSeq with the entire annotation rather than a subset, though I'm only interested in differential exon usage across a set of about 900 genes.

My code is:

library("DEXSeq")
countFiles <-  c(list.files("/media/drive2/diff_exon_usage_dexseq/young/",full.names = T),list.files("/media/drive2/diff_exon_usage_dexseq/old/",full.names = T))
flattenedFile <- list.files("/home/oates_binf_1/ncbi/reference/",pattern="gff\$",full.names=T)
rNames <- basename(countFiles)
rNames <- gsub(".bam_Aligned.out.sorted.bam_.*_dexseq.txt","",rNames)
rNames <- gsub("-","_",rNames)
sampleTable <- data.frame(row.names=rNames,condition=c(rep("young", 49),rep("old",35)))
dxd <- DEXSeqDataSetFromHTSeq(countFiles,sampleData = sampleTable,design = ~ sample + exon + condition:exon, flattenedfile = flattenedFile)
#normalisation
dxd <- estimateSizeFactors(dxd)
#multicore use
BPPARAM = MulticoreParam(8)
# estimate and plot dispersions
dxd <- estimateDispersions(dxd,BPPARAM=BPPARAM)
plotDispEsts(dxd)
#test for diff exon usage
dxd <- testForDEU( dxd, BPPARAM=BPPARAM)
dxd <- estimateExonFoldChanges(dxd,fitExpToVar="condition",BPPARAM=BPPARAM)
#results
dxr1 = DEXSeqResults(dxd)


Running this locally on 8 cores with 64GB RAM causes R to get stuck at the estimateDispersions() step. After running some tests locally I was able to determine that the most my computer can handle is 16 samples, which takes about 2 hours to complete. Going up to 18 samples causes R to get stuck. I also tried subsetting the dxd object to only have ~500 rows and running that on the full 84 samples, but to no avail.

Finally, I tried running this on my university's HPC cluster with access to 40 CPUs and 256GB RAM (setting BPPARAM = MulticoreParam(40)). DEXSeq used all the CPUs and most of the RAM with 84 samples, ran for 24hrs walltime and then the job died, having gotten stuck on the estimateDispersions() step.

My main question is: what on earth do I need to do to get DEXSeq to run on my whole sample set?

Supplementary questions:

1. Am I asking the computer to perform some astronomically ridiculous calculation with estimateDispersions() on 84 samples, and if so, how do I speed up/optimise this?
2. Am I configuring something wrong when running this on an HPC cluster? The manual mentions that it maybe necessary to configure BatchJobsParam() classes when running on a cluster - how do I do this and will it make a difference?

Thanks :) This has really been driving me up the wall.

1
Entering edit mode
Alejandro Reyes ★ 1.8k
@alejandro-reyes-5124
Last seen 8 days ago
Novartis Institutes for BioMedical Rese…

Hi @andreismolnikov00,

Thanks for reporting this. We are aware of the scaling problems that DEXSeq has and are working on them. Answers to your questions:

My main question is: what on earth do I need to do to get DEXSeq to run on my whole sample set?

A very fast GLM fitter or an equivalent model that does not fit a coefficient for each sample. The first one does not exist to my knowledge. The second, we and others are working on it.

Am I asking the computer to perform some astronomically ridiculous calculation with estimateDispersions() on 84 samples, and if so, how do I speed up/optimise this?

I don't know if it is "astronomically ridiculous", but it is a quite big computation. I would recommend you to have a look at this alternative implementation of DEXSeq: https://github.com/areyesq89/DEXSeqAlt?files=1

It uses a faster GLM and skips some of the steps (like information sharing) that are not really needed for large number of samples.

Am I configuring something wrong when running this on an HPC cluster? The manual mentions that it maybe necessary to configure BatchJobsParam() classes when running on a cluster - how do I do this and will it make a difference?

Yes, BatchJobsParam will make a difference. I would recommend you to use this in combination with the DEXSeq version of the github repository.

Alejandro

0
Entering edit mode

Thanks Alejandro!

I'll give the alternative implementation a try. Re: BatchJobsParam() - apologies for my ignorance, but I've never had to implement BiocParallel on a cluster before. Would the implementation with DEXSeq be the same as for MulticoreParam(), i.e.:

BPPARAM = BatchJobsParam(*options*)


followed by

dxd <- estimateDispersions(dxd,BPPARAM=BPPARAM)


and so on, or is it something else? I couldn't quite find a clear answer from looking at the various manual pages.

Thanks again!

0
Entering edit mode

Exactly, that would be it. The BiocParallel and batchtools vignettes have several templates to configure the cluster functions.