Hi guys,
I've encountered one problem when doing pair-wise transcript-level differential expression analysis using DESeq2.
I used tximport to import counts from Salmon quantification results, which consists of 65 pairs of samples (130 samples in total).
samples = read.table('Sample_beforeAfter_paired.txt', header=TRUE)
data_dir = "Salmon_cleaned_subset"
files_salmon = file.path(data_dir, samples$Sample_id, "quant.sf")
names(files_salmon) = samples$Sample_id
samples$State = factor(samples$State, levels=c("BEFORE","AFTER"))
samples$State = relevel(samples$State, ref="BEFORE")
samples$Paired_group = as.factor(samples$Paired_group)
txi = tximport(files_salmon, type="salmon", txOut=TRUE)
dds = DESeqDataSetFromTximport(txi, samples, design = ~ Paired_group + State)
dds = DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(35))
The 'samples' file is like:
Sample_id Paired_group State
paired_01_1 paired_01 BEFORE
paired_01_2 paired_01 AFTER
paired_02_1 paired_02 BEFORE
paired_02_2 paired_02 AFTER
paired_03_1 paired_03 BEFORE
paired_03_2 paired_03 AFTER
I'm using a server with 64 cores and 500G RAM in total, while I registered for 35 cores so that other people and also use the server. The total number of transcripts is around 200k.
My question is:
1) Even though I use DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(35))
, the program actually takes all the cores for running (at least for estimateDispersions), which can be found using 'htop' where 64 cores are all running with 100% capacity.
Is there a method that can limit the number of cores used by this step?
2) The 'estimate dispersion' step took too much time than I expected. For this 65 paired samples, it took 64 cores running for over 24 hours, after which I terminated the program and tried to find out what happened. I remember when I was using DESeq2 without tximport (that said, input a count matrix with DESeqDataSetFromMatrix), it didn't take this much time. The interesting thing is, from my experience, using tximport is often faster than giving a count matrix directly for the step 'DESeq'. So I'm really confused whether it is normal and should let it run.
Additional information: I've tried to perform a subset of 25 pairs (50 samples) using the same code, and it takes around 5 minutes. I've also checked the integrity of the Salmon output file (quant.sf) and that looks just fine.
Does anyone has suggestions for this situation? I appreciate it a lot.
Regards
Marcus
See the thread below, where I show that for a paired dataset with 70 pairs (total sample size 140), DESeq2 takes 1 minute per 500 genes (this is benefiting from recent code improvements over the past year), and an additional speed up allows for cutting that run time in half using glmGamPoi.
Answer: perform DESeq2 based on paired sample