Hi,
I am not sure if this is the right place to ask, so feel free to delete or redirect me to a more suitable place.
I am currently performing pseudobulk differential expression for single cell data and am working through chapter 4 of the OSCA multisample book.
Right at the beginning of section 4.4, the authors subset the dataset to only contain one single celltype and then proceed with the very normal differential gene expression analysis. I was wondering why this was done in the first place. If I remember correctly, it is advisable to run all groups together and afterwards use contrasts to extract the comparisons you are interested in (i.e. the comparisons between the celltypes), because this gives you more confidence when estimating the dispersion parameters for your genes.
If I understand everything correctly this setup:
# subset dataset
label <- "Mesenchyme"
current <- summed[,label==summed$celltype.mapped]
...
...
design <- model.matrix(~ treatment, y$samples) # because we subsetted the data, we dont have to include the "celltype" here
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=ncol(design))
....
and this setup:
# no subsetting, use all data.
design <- model.matrix(~ treatment*celltype, y$samples) # include the "celltype" here because no subsetting
y <- estimateDisp(y, design)
y <- glmQLFit(y, design, robust=TRUE)
contr <- limma::makeContrasts(`treatment_drugB` + `treatment_drugB_celltype_Mesenchyme`,
levels = y$design))
res_edgeR <- glmQLFTest(fit_edgeR, contrast = contr)
...
are both answering the same question.
Specifically: Which genes change the most in Mesenchyme
cells when we treat with drugB
.
Any help is much appreciated!