Hi, I recently upgraded from R 4.1 to the latest R 4.3. While running my DEA workflow on scRNA-seq data, I noticed it was dramatically slower with the new setup. I wonder if anyone is aware of changes that may have contributed to this, and how I can make the dispersion estimation step run as fast as before.
Using a small dataset of 49 cells as a demo (22 vs. 27 cells in two tissues), here's how I run DESeq():
dds <- DESeqDataSetFromMatrix(countData = counts(sce),
                              colData = droplevels(colData(sce)[, c("Barcode", "tissue")]),
                              rowData = rowData(sce), design = ~ tissue)
sizeFactors(dds) <- sizeFactors(sce)
dds
system.time({
  dds <- DESeq(dds, test = "LRT", useT = TRUE, minmu = 1e-6, minReplicatesForReplace = Inf,
               fitType = "glmGamPoi", sfType = "poscounts", reduced = ~1, quiet = FALSE)
})
# dds
class: DESeqDataSet
dim: 17881 49
metadata(1): version
assays(1): counts
rownames(17881): Xkr4 Gm37381 ... CAAA01147332.1 AC149090.1
rowData names(5): ID Symbol Type SEQNAME is_mito
colnames(49): LR03_AGGCCGTTCACCTCGT-1 LR03_AGGGATGAGTGCGATG-1 ... LR04_TTGACTTAGGTGTTAA-1 LR04_TTGACTTGTCTGCGGT-1
colData names(3): Barcode tissue sizeFactor
When using R version 4.1.3, DESeq2_1.34.0 and glmGamPoi_1.6.0, the run time is:
   user  system elapsed
 10.996   6.414   9.632
When using R version 4.3.2, DESeq2_1.42.0 and glmGamPoi_1.14.3, the run time is:
    user   system  elapsed
1661.721   25.540 1679.590
Edit: Digging further into the code itself, there seems to be a change in how initial_lp and last_lp are calculated, which greatly increases the amount of time vapply takes to perform the calculation.
In the current version, the whole fitMu matrix is used as mu:
initial_lp <- vapply(which(fitidx), function(idx){
  sum(dnbinom(Counts[idx, ], mu = fitMu, size = 1 / alpha_hat[idx], log = TRUE))
}, FUN.VALUE = 0.0)
last_lp <- vapply(which(fitidx), function(idx){
  sum(dnbinom(Counts[idx, ], mu = fitMu, size = 1 / alpha_hat_new[idx], log = TRUE))
}, FUN.VALUE = 0.0)
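If it helps to see the effect in isolation, here is a minimal base-R sketch (hypothetical sizes, constant fitMu for simplicity) of what passing the whole matrix as mu does: dnbinom() recycles the 49-element count row against every element of fitMu, so each per-gene call evaluates n_genes * n_cells densities instead of n_cells, and the summed "log-likelihood" is no longer the per-gene value.

```r
set.seed(1)
n_genes <- 300; n_cells <- 49
Counts <- matrix(rnbinom(n_genes * n_cells, mu = 5, size = 2), nrow = n_genes)
fitMu  <- matrix(5, nrow = n_genes, ncol = n_cells)
alpha  <- rep(0.5, n_genes)

# mu = fitMu: dnbinom() recycles the 49 counts against the whole matrix,
# so each gene sums over n_genes * n_cells densities.
lp_new <- vapply(seq_len(n_genes), function(idx)
  sum(dnbinom(Counts[idx, ], mu = fitMu, size = 1 / alpha[idx], log = TRUE)),
  FUN.VALUE = 0.0)

# mu = fitMu[idx, ]: only n_cells densities per gene.
lp_old <- vapply(seq_len(n_genes), function(idx)
  sum(dnbinom(Counts[idx, ], mu = fitMu[idx, ], size = 1 / alpha[idx], log = TRUE)),
  FUN.VALUE = 0.0)

# n_genes times more work per gene, and an inflated total: with a constant
# fitMu, lp_new comes out as exactly n_genes * lp_old.
```

With ~18,000 genes rather than 300, that factor alone would account for a slowdown of the magnitude I'm seeing.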
Whereas in the previous version, the per-gene vector fitMu[idx, ] is used as mu, which I think makes more sense. I believe this matches how fitMu is used in the fitting process in DESeq2.cpp:
initial_lp <- vapply(which(fitidx), function(idx){
  sum(dnbinom(Counts[idx, ], mu = fitMu[idx, ], size = 1 / alpha_hat[idx], log = TRUE))
}, FUN.VALUE = 0.0)
last_lp <- vapply(which(fitidx), function(idx){
  sum(dnbinom(Counts[idx, ], mu = fitMu[idx, ], size = 1 / alpha_hat_new[idx], log = TRUE))
}, FUN.VALUE = 0.0)
Also, why is the whole fitMu matrix used as the mean input when running glmGamPoi::overdispersion_mle(), rather than the subset fitMu[fitidx, ] as in the previous version?
# New version
dispersion_fits <- glmGamPoi::overdispersion_mle(Counts[fitidx, ], mean = fitMu,
                                                 model_matrix = modelMatrix, verbose = !quiet)
# Old version
dispersion_fits <- glmGamPoi::overdispersion_mle(Counts[fitidx, ], mean = fitMu[fitidx, ],
                                                 model_matrix = modelMatrix, verbose = !quiet)
Thanks, Michael Love.