Question

Pseudobulk running all groups together vs subsetting first

0

Entering edit mode

nhaus ▴ 70

@789c70a6

Last seen 6 months ago

Switzerland

Hi,

I am not sure if this is the right place to ask, so feel free to delete or redirect me to a more suitable place.

I am currently performing pseudobulk differential expression for single cell data and am working through chapter 4 of the OSCA multisample book.

Right at the beginning of section 4.4, the authors subset the dataset to only contain one single celltype and then proceed with the very normal differential gene expression analysis. I was wondering why this was done in the first place. If I remember correctly, it is advisable to run all groups together and afterwards use contrasts to extract the comparisons you are interested in (i.e. the comparisons between the celltypes), because this gives you more confidence when estimating the dispersion parameters for your genes.

If I understand everything correctly this setup:

# subset dataset 
label <- "Mesenchyme"
current <- summed[,label==summed$celltype.mapped]
...
...
design <- model.matrix(~ treatment, y$samples) # because we subsetted the data, we dont have to include the "celltype" here
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=ncol(design))
....

and this setup:

# no subsetting, use all data.

design <- model.matrix(~ treatment*celltype, y$samples) # include the "celltype" here because no subsetting


y <- estimateDisp(y, design)
y <- glmQLFit(y, design, robust=TRUE)

contr <- limma::makeContrasts(`treatment_drugB` + `treatment_drugB_celltype_Mesenchyme`, 
                              levels = y$design))
res_edgeR <- glmQLFTest(fit_edgeR, contrast = contr)
...

are both answering the same question.

Specifically: Which genes change the most in Mesenchyme cells when we treat with drugB.

Any help is much appreciated!

DESeq2 edgeR SingleCell • 1.1k views

ADD COMMENT • link updated 2.1 years ago by Wolfgang Huber ★ 13k • written 2.1 years ago by nhaus ▴ 70

score 1 · Answer 1 · 2023-02-11

As you say yourself ("it is advisable ..."), this choice is not about hard scientific fact or mathematical proof, but a bit philosophical. Perhaps the first thing to establish is whether it even matters. Ideally it does not, and problem solved.

If it does, why? Make some plots whether the dispersion estimation estimated from all cell types is really so different from that estimated from the mesenchyme cells only. Then: which seems more reasonable?