Question

Pseudo-bulk Limma Approaches

1

Entering edit mode

andrew.j.skelton73 ▴ 370

@andrewjskelton73-7074

Last seen 6 weeks ago

United Kingdom

I'm experimenting with Limma to perform differential expression of Pseudo-bulk'd single cell gene expression data. Workflow roughly as follows:

Assign clusters / inferred cell types from a case / control experiment with 6 biological replicates
Sum counts by cluster + condition + donor
Paired Design Matrix (Primary Term is a concatenation of Cluster_Condition)
Remove any samples with particularly high scaling factors
Filter by Expression
Fit model + topTable

I've tried two different approaches here, and got some quite different results. I've got an idea of what the right thing to do is, but any comments from the community are very much welcome to understand what's going on under the hood.

Approach 1 - Use all cluster / inferred cell types as one big experiment, derive a contrast matrix to test case / control for each cluster

Approach 2 - Create a new model for each cluster

Approach 2 seems the right thing to do given the results I see with approach 1. Approach 1 has in some tests given me differentially expressed genes where for both conditions being tested, the expression of those genes is zero. I suspect this is likely an artefact of having highly heterogeneous samples, in that there will be some with zero gene expression, some with very high, and that's between sample variability rather than between condition (case / control).

My only hesitation is that Approach 1 in some cases appears (artificially or not), more powerful in some core comparisons where I have a good idea of the biology at play. These same comparisons in Approach 2 barely reach statistical significance. I'd like to understand more what's going on here and if there's any way I could improve Approach 1.

Thanks

limma • 1.8k views

ADD COMMENT • link updated 23 months ago by Gordon Smyth 50k • written 23 months ago by andrew.j.skelton73 ▴ 370

1

Entering edit mode

You might find the DE analysis section of the OSCA book helpful. In particular, Aaron’s reasoning for fitting separate models - “We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch.”

ADD REPLY • link 23 months ago jma1991 ▴ 70

score 4 · Answer 1 · 2022-05-31

I have only ever used Approach 1 but I first remove clusters with very small cell numbers. I also tend to define conservative clusters in the first place (with lower than default clustering resolution) so that I don't tend to have many small clusters. You could use Approach 2, I wouldn't rule it out, and if you are being aggressive with the clustering, so you have lots of small clusters, then Approach 2 might be wise.

I've never seen a DE result for a non-expressed gene, not for pseudo-bulk or any other analysis. But, if you have genes that are entirely unobserved in some groups, then edgeR::voomLmFit() will behave better than voom() and lmFit() separately. You might also try TMMwsp instead of TMM.