Pair-wise DESeq2 using tximport (running time and multi-threading)
1
0
Entering edit mode
Marcus • 0
@marcus-24886
Last seen 7 months ago
Hong Kong

Hi guys,

I've encountered one problem when doing pair-wise transcript-level differential expression analysis using DESeq2.

I used tximport to import counts from Salmon quantification results, which consists of 65 pairs of samples (130 samples in total).

samples = read.table('Sample_beforeAfter_paired.txt', header=TRUE)
data_dir = "Salmon_cleaned_subset"
files_salmon = file.path(data_dir, samples$Sample_id, "quant.sf") names(files_salmon) = samples$Sample_id
samples$State = factor(samples$State, levels=c("BEFORE","AFTER"))
samples$State = relevel(samples$State, ref="BEFORE")
samples$Paired_group = as.factor(samples$Paired_group)

txi = tximport(files_salmon, type="salmon", txOut=TRUE)
dds = DESeqDataSetFromTximport(txi, samples, design = ~ Paired_group + State)

dds = DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(35))


The 'samples' file is like:

Sample_id   Paired_group    State
paired_01_1 paired_01   BEFORE
paired_01_2 paired_01   AFTER
paired_02_1 paired_02   BEFORE
paired_02_2 paired_02   AFTER
paired_03_1 paired_03   BEFORE
paired_03_2 paired_03   AFTER


I'm using a server with 64 cores and 500G RAM in total, while I registered for 35 cores so that other people and also use the server. The total number of transcripts is around 200k.

My question is: 1) Even though I use DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(35)), the program actually takes all the cores for running (at least for estimateDispersions), which can be found using 'htop' where 64 cores are all running with 100% capacity. Is there a method that can limit the number of cores used by this step?

2) The 'estimate dispersion' step took too much time than I expected. For this 65 paired samples, it took 64 cores running for over 24 hours, after which I terminated the program and tried to find out what happened. I remember when I was using DESeq2 without tximport (that said, input a count matrix with DESeqDataSetFromMatrix), it didn't take this much time. The interesting thing is, from my experience, using tximport is often faster than giving a count matrix directly for the step 'DESeq'. So I'm really confused whether it is normal and should let it run.

Additional information: I've tried to perform a subset of 25 pairs (50 samples) using the same code, and it takes around 5 minutes. I've also checked the integrity of the Salmon output file (quant.sf) and that looks just fine.

Does anyone has suggestions for this situation? I appreciate it a lot.

Regards

Marcus

tximport DESeq2 • 285 views
0
Entering edit mode
@mikelove
Last seen 49 minutes ago
United States

DESeq2 only uses BiocParallel for its parallel step so it's not going outside that mechanism. We've seen before on the support site, it seems that perhaps when the underlying linear algebra libraries are used, those may be configured on certain clusters in a way to use multiple cores. But it's not DESeq2 asking for more than the number of cores specified.

I would mention, I haven't seen benefits in using much more than 10 cores. I think at some point the sending of data around a network leads to sub-optimal performance. I'd recommend trying out your code with something like 5-10 cores on the first 500 transcripts to get a sense for the speed.

Also you probably don't have 200,000 transcripts with sufficient expression level, so you can pre-filter, e.g.:

keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep,]
dds <- DESeq(dds)


For your dataset, it may make sense to require that e.g. 15 or more samples have minimal expression level.

I remember when I was using DESeq2 without tximport (that said, input a count matrix with DESeqDataSetFromMatrix), it didn't take this much time.

These steps are totally unrelated, so it was probably another factor.

Finally, are you using the latest version of DESeq2? The speed has been improved a lot in the past year, so make sure to keep your software up to date.

0
Entering edit mode

See the thread below, where I show that for a paired dataset with 70 pairs (total sample size 140), DESeq2 takes 1 minute per 500 genes (this is benefiting from recent code improvements over the past year), and an additional speed up allows for cutting that run time in half using glmGamPoi.

Answer: perform DESeq2 based on paired sample