Pseudo-bulk Limma Approaches
1
1
Entering edit mode
@andrewjskelton73-7074
Last seen 4 months ago
United Kingdom

I'm experimenting with Limma to perform differential expression of Pseudo-bulk'd single cell gene expression data. Workflow roughly as follows:

  • Assign clusters / inferred cell types from a case / control experiment with 6 biological replicates
  • Sum counts by cluster + condition + donor
  • Paired Design Matrix (Primary Term is a concatenation of Cluster_Condition)
  • Remove any samples with particularly high scaling factors
  • Filter by Expression
  • Fit model + topTable

I've tried two different approaches here, and got some quite different results. I've got an idea of what the right thing to do is, but any comments from the community are very much welcome to understand what's going on under the hood.

Approach 1 - Use all cluster / inferred cell types as one big experiment, derive a contrast matrix to test case / control for each cluster

Approach 2 - Create a new model for each cluster

Approach 2 seems the right thing to do given the results I see with approach 1. Approach 1 has in some tests given me differentially expressed genes where for both conditions being tested, the expression of those genes is zero. I suspect this is likely an artefact of having highly heterogeneous samples, in that there will be some with zero gene expression, some with very high, and that's between sample variability rather than between condition (case / control).

My only hesitation is that Approach 1 in some cases appears (artificially or not), more powerful in some core comparisons where I have a good idea of the biology at play. These same comparisons in Approach 2 barely reach statistical significance. I'd like to understand more what's going on here and if there's any way I could improve Approach 1.

Thanks

limma • 2.0k views
ADD COMMENT
1
Entering edit mode

You might find the DE analysis section of the OSCA book helpful. In particular, Aaron’s reasoning for fitting separate models - “We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch.”

ADD REPLY
4
Entering edit mode
@gordon-smyth
Last seen 35 minutes ago
WEHI, Melbourne, Australia

I have only ever used Approach 1 but I first remove clusters with very small cell numbers. I also tend to define conservative clusters in the first place (with lower than default clustering resolution) so that I don't tend to have many small clusters. You could use Approach 2, I wouldn't rule it out, and if you are being aggressive with the clustering, so you have lots of small clusters, then Approach 2 might be wise.

I've never seen a DE result for a non-expressed gene, not for pseudo-bulk or any other analysis. But, if you have genes that are entirely unobserved in some groups, then edgeR::voomLmFit() will behave better than voom() and lmFit() separately. You might also try TMMwsp instead of TMM.

ADD COMMENT

Login before adding your answer.

Traffic: 514 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6