I've encountered one problem when doing pair-wise transcript-level differential expression analysis using DESeq2.
I used tximport to import counts from Salmon quantification results, which consists of 65 pairs of samples (130 samples in total).
samples = read.table('Sample_beforeAfter_paired.txt', header=TRUE) data_dir = "Salmon_cleaned_subset" files_salmon = file.path(data_dir, samples$Sample_id, "quant.sf") names(files_salmon) = samples$Sample_id samples$State = factor(samples$State, levels=c("BEFORE","AFTER")) samples$State = relevel(samples$State, ref="BEFORE") samples$Paired_group = as.factor(samples$Paired_group) txi = tximport(files_salmon, type="salmon", txOut=TRUE) dds = DESeqDataSetFromTximport(txi, samples, design = ~ Paired_group + State) dds = DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(35))
The 'samples' file is like:
Sample_id Paired_group State paired_01_1 paired_01 BEFORE paired_01_2 paired_01 AFTER paired_02_1 paired_02 BEFORE paired_02_2 paired_02 AFTER paired_03_1 paired_03 BEFORE paired_03_2 paired_03 AFTER
I'm using a server with 64 cores and 500G RAM in total, while I registered for 35 cores so that other people and also use the server. The total number of transcripts is around 200k.
My question is:
1) Even though I use
DESeq(dds, parallel=TRUE, BPPARAM=MulticoreParam(35)), the program actually takes all the cores for running (at least for estimateDispersions), which can be found using 'htop' where 64 cores are all running with 100% capacity.
Is there a method that can limit the number of cores used by this step?
2) The 'estimate dispersion' step took too much time than I expected. For this 65 paired samples, it took 64 cores running for over 24 hours, after which I terminated the program and tried to find out what happened. I remember when I was using DESeq2 without tximport (that said, input a count matrix with DESeqDataSetFromMatrix), it didn't take this much time. The interesting thing is, from my experience, using tximport is often faster than giving a count matrix directly for the step 'DESeq'. So I'm really confused whether it is normal and should let it run.
Additional information: I've tried to perform a subset of 25 pairs (50 samples) using the same code, and it takes around 5 minutes. I've also checked the integrity of the Salmon output file (quant.sf) and that looks just fine.
Does anyone has suggestions for this situation? I appreciate it a lot.