Pseudobulk running all groups together vs subsetting first
nhaus ▴ 70
@789c70a6
Switzerland

Hi,

I am not sure if this is the right place to ask, so feel free to delete or redirect me to a more suitable place.

I am currently performing pseudobulk differential expression for single cell data and am working through chapter 4 of the OSCA multisample book.

Right at the beginning of section 4.4, the authors subset the dataset to a single cell type and then proceed with a standard differential expression analysis. I was wondering why this is done in the first place. If I remember correctly, it is advisable to run all groups together and afterwards use contrasts to extract the comparisons you are interested in (i.e. the comparisons between the cell types), because using all samples gives you more confidence when estimating the dispersion parameters for your genes.

If I understand everything correctly this setup:

# Subset the dataset to a single cell type
label <- "Mesenchyme"
current <- summed[, label == summed$celltype.mapped]
...
...
# Because the data were subsetted, "celltype" does not need to be in the design
design <- model.matrix(~ treatment, y$samples)
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=ncol(design))
....

and this setup:

# No subsetting, use all data.

# Include "celltype" in the design because the data are not subsetted
design <- model.matrix(~ treatment*celltype, y$samples)

y <- estimateDisp(y, design)
fit_edgeR <- glmQLFit(y, design, robust=TRUE)

# Treatment effect in Mesenchyme = treatment main effect + interaction term.
# The coefficient names used here must match colnames(design).
contr <- limma::makeContrasts(`treatment_drugB` + `treatment_drugB_celltype_Mesenchyme`,
                              levels = design)
res_edgeR <- glmQLFTest(fit_edgeR, contrast = contr)
...

are both answering the same question.

Specifically: which genes change the most in Mesenchyme cells when we treat with drugB?
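
To make the comparison concrete, here is a minimal sketch in base R of why the contrast "treatment main effect + interaction" targets the same quantity as the treatment coefficient in the subsetted model. It uses an ordinary linear model on made-up log-expression values for one gene rather than edgeR, and the sample table, the level names (`ctrl`, `Endothelium`) and the variable names are assumptions for illustration only.

```r
# Toy sample table: 2 treatments x 2 cell types, 3 replicates each.
# All values are simulated; nothing here comes from the OSCA dataset.
set.seed(1)
samples <- expand.grid(rep = 1:3,
                       treatment = c("ctrl", "drugB"),
                       celltype = c("Endothelium", "Mesenchyme"))
samples$logexpr <- rnorm(nrow(samples), mean = 5)

# Setup 2: all samples, interaction design.
design_full <- model.matrix(~ treatment * celltype, samples)
# Make the column names syntactically valid (":" becomes ".").
colnames(design_full) <- make.names(colnames(design_full))
coef_full <- lm.fit(design_full, samples$logexpr)$coefficients

# Treatment effect in Mesenchyme = main effect + interaction term.
effect_full <- unname(coef_full["treatmentdrugB"] +
                      coef_full["treatmentdrugB.celltypeMesenchyme"])

# Setup 1: Mesenchyme samples only, treatment-only design.
mes <- samples[samples$celltype == "Mesenchyme", ]
design_sub <- model.matrix(~ treatment, mes)
coef_sub <- lm.fit(design_sub, mes$logexpr)$coefficients
effect_subset <- unname(coef_sub["treatmentdrugB"])

# Both setups estimate the same drugB-vs-ctrl difference in Mesenchyme.
print(all.equal(effect_full, effect_subset))
```

In this linear toy model the two point estimates coincide because the interaction design fits one mean per treatment and cell type combination. What differs between the two setups is which samples contribute to the variance (in edgeR, dispersion) estimate and how many residual degrees of freedom are available, which is the part the question is about.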

Any help is much appreciated!

DESeq2 edgeR SingleCell • 1.0k views
@wolfgang-huber-3550
EMBL European Molecular Biology Laborat…

As you say yourself ("it is advisable ..."), this choice is not about hard scientific fact or mathematical proof, but a bit philosophical. Perhaps the first thing to establish is whether it even matters. Ideally it does not, and problem solved.

If it does, why? Make some plots to check whether the dispersion estimates obtained from all cell types are really so different from those obtained from the mesenchyme cells only. Then: which seems more reasonable?
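
One possible way to make such a plot is sketched below, assuming edgeR is installed. The counts are simulated here only so that the example runs on its own; in practice the pseudobulk `DGEList` from the question would take their place, and the object names (`y_all`, `y_mes`) and the `ctrl` and `Endothelium` levels are hypothetical.

```r
library(edgeR)

# Simulated pseudobulk counts (NOT real data): 500 genes, 12 samples,
# negative binomial with mean 100 and dispersion 0.1 (size = 1 / 0.1).
set.seed(42)
n_genes <- 500
samples <- expand.grid(rep = 1:3,
                       treatment = c("ctrl", "drugB"),
                       celltype = c("Endothelium", "Mesenchyme"))
counts <- matrix(rnbinom(n_genes * nrow(samples), mu = 100, size = 10),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Dispersions estimated from all cell types with the interaction design.
y_all <- DGEList(counts, samples = samples)
y_all <- calcNormFactors(y_all)
design_all <- model.matrix(~ treatment * celltype, y_all$samples)
y_all <- estimateDisp(y_all, design_all)

# Dispersions estimated from the Mesenchyme samples only.
keep <- samples$celltype == "Mesenchyme"
y_mes <- DGEList(counts[, keep], samples = samples[keep, ])
y_mes <- calcNormFactors(y_mes)
design_mes <- model.matrix(~ treatment, y_mes$samples)
y_mes <- estimateDisp(y_mes, design_mes)

# Gene-wise comparison; points on the red line have identical estimates.
plot(y_all$tagwise.dispersion, y_mes$tagwise.dispersion, log = "xy",
     xlab = "Tagwise dispersion, all cell types",
     ylab = "Tagwise dispersion, Mesenchyme only")
abline(0, 1, col = "red")
```

If the points scatter tightly around the diagonal, the choice between the two setups is unlikely to matter much. A systematic shift away from it would suggest that the other cell types have a different variability from the mesenchyme cells, which bears on which estimate seems more reasonable.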
