Question

Guidance on limma for pseudobulk scRNA-seq with multi-factor design

0

Entering edit mode

abaed • 0

@292e8ed4

Last seen 11 weeks ago

london

Hi everyone, I would be very grateful for your advice on the experimental design described below. Perhaps Gordon Smyth could also share his thoughts.

I generated a pseudobulk expression matrix from scRNA-seq data. The design can be described as a 3-factor setting:

Regions: 2 sampled regions from the same tissue/organ
Disease condition: 2 groups (15 healthy vs. 15 diseased donors)
Cell types: approximately 10 cell types or clusters

In total, I have 60 biological samples derived from 30 individuals (each individual contributes samples from 2 regions). As is common in single-cell data, not all cell types are represented in every donor or region, resulting in an unbalanced design. Please correct me if I am misusing terminology, as I am still new to this type of analysis.

My main concern relates to within-individual correlation. Since samples from the same donor are not independent, my understanding is that I should use voomLmFit with the block argument to model donor as a random effect, allowing intra-individual correlation to be handled appropriately via duplicateCorrelation.

For the design matrix, my plan is to create a combined "group" variable encoding all possible combinations of region, disease, and cell type, and then set up contrasts for my hypotheses of interest (for example, global disease vs healthy effects, disease vs healthy within a specific cell type, or within a specific region). From what I have read, this is generally recommended over using a factorial design with interaction terms. Could you confirm this is the best practice here?

A smaller technical question: I saw a post from another user raising doubts about whether to scale or center numeric variables such as age or RIN before including them in the model. I had assumed scaling was unnecessary, but I wanted to confirm your view on this.

Finally, do you consider limma-voom to be more appropriate than edgeR in this setting, mainly because I would like to use duplicateCorrelation to account for repeated measures, which edgeR does not support?

limma RNASeq scRNAseq • 1.3k views

ADD COMMENT • link updated 3 months ago by Gordon Smyth 53k • written 3 months ago by abaed • 0

score 3 · Accepted Answer · 2025-09-11

3

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

What you describe sounds a good approach to me. I would also use sample weights (sample.weights=TRUE when running voomLmFit) and robust empirical Bayes (robust=TRUE when running eBayes). I would use sample weights in preference to RIN.

Scaling or centering covariates makes no difference to the DE results. However, centering would allow the group fitted coefficients to be interpretted as group means if you wish to do that.

ADD COMMENT • link 3 months ago Gordon Smyth 53k

0

Entering edit mode

Thank you so much for your helpful response!

I have a follow-up question regarding preprocessing of the count matrix. Should steps like gene filtering with filterByExpr() and normalization using CPM/normalization factors be performed separately for each cell type (e.g., by iterating over cell types or using the group argument)? Or is it acceptable to apply these steps globally across all samples regardless of cell type?

Thanks again for your guidance!

ADD REPLY • link 3 months ago abaed • 0

2

Entering edit mode

Normalization or filtering are never done for subsets of samples in any limma or edgeR analysis. These things must always be done globally.

ADD REPLY • link 3 months ago Gordon Smyth 53k

0

Entering edit mode

Understood, thanks. The issue I'm seeing is that even after normalization, the MDSplot still orders samples by library size. I suspect cell type heterogeneity is biasing the normalization factors: marker genes highly expressed in one cell type but absent in others could skew M-values when comparing across cell types. Would you recommend adjusting TMM parameters to mitigate this? I tried TMMwsp, but results were nearly identical to default TMM. Haven't seen much discussion on this, so thought it was worth raising.

Each dot represents a pseudobulk sample (only a specific cell type is being represented there). Despite normalization, the sample distribution appears to follow library size

ADD REPLY • link 3 months ago abaed • 0

0

Entering edit mode

It is not uncommon for library size to be apparent in an MDS plot. You can use sva or RUV to account for that.

ADD REPLY • link 3 months ago James W. MacDonald 68k

0

Entering edit mode

Nice. That's great to hear. sva was actually the next step I was planning to try. My colleagues haven't seen this kind of pattern before since they've mostly worked with bulk RNA-seq, so it's reassuring to know it's not uncommon in pseudobulk. Thanks for the suggestion, James.

ADD REPLY • link 3 months ago abaed • 0

0

Entering edit mode

It may not be that common for bulk RNA-Seq, but in my experience it's very common for bulk miRNA-Seq.

ADD REPLY • link 3 months ago James W. MacDonald 68k

0

Entering edit mode

How did you make this plot? I've never seen an MDS plot from limma/edgeR with axis scale going from -200 to 200.

Personally, I would go ahead and do the DE analysis and perhaps try log-lib-size as a covariate in the analysis to see if there a non-ignorable trend. The DE analysis might be less sensitive than the MDS plot. The MDS plot is designed to highlight things like this rather than to hide them, and library size normalization won't reduce the effect for the purpose of the plot. The MDS plot doesn't take into account voom weights or sample weights.

ADD REPLY • link 3 months ago Gordon Smyth 53k

0

Entering edit mode

Apologies for not clarifying earlier, you're right: it is a PCA, not an MDS. In local machine, computing the MDS takes quite a long time, PCA was a faster approach (and shows the same lib.size pattern). Below is the MDS applied to the same normalized expression matrix used for the PCA.

enter image description here

ADD REPLY • link 3 months ago abaed • 0

0

Entering edit mode

MDS of what quantity? LogCPM from cpm()?

I don't have a definitive answer, but my feeling that is that library size effects are to be expected in pseudobulk and are not necessarily something you need to correct for. In pseudobulk, library size primarily reflects the number of cells that are aggregated, so it is proportional to cell frequency. Cell frequence in turn is likely to be correlated with gene expression. Some clusters in particular will have more cells, and any expression changes in that cluster will be correlated therefore with library size. So library size may reflect real biological differences that you shouldn't correct for. I haven't explored this issue very deeply, and that is my gut feeling.

ADD REPLY • link 3 months ago Gordon Smyth 53k

0

Entering edit mode

Exactly, MDS was computed on logCPM values from cpm(x, log=TRUE) after running calcNormFactors(method="TMMwsp") on the DGE object. I also tried plotMDS directly on the DGE and observed the same pattern: the first component is strongly driven by library size.

As you mention, this seems to be due to cell frequency, but I initially thought that library size normalization would be sufficient. Perhaps what's happening is that pseudobulk samples with fewer cells have many zero counts across genes, while those with more cells preserve additional signal. This difference in information content is likely what PC1 is capturing in PCA/MDS.

What I find interesting, though, is that when I run PCA using glmpca(x, L = 2, fam = "nb") from this paper, the bias essentially disappears. Since glmPCA models raw counts and includes an offset for library size, it seems to correct for the effect. That makes me suspect the issue is not just a lack of information in smaller samples, but rather that TMM/TMMwsp normalization alone does not fully remove the library size bias.

In any case, I'll follow your recommendation and include library size in the linear model, which I expect should be sufficient to address this. Thanks again!

glmPCA applied to raw counts

ADD REPLY • link 3 months ago abaed • 0

0

Entering edit mode

Hello again Gordon Smyth,

I've now applied the analysis using a more detailed annotation of cell types/clusters, which includes about 100 groups. This results in a matrix of ~30,000 genes x ~6,000 samples. The run has been going for 5 days and still hasn't completed the first step of estimating sample weights.

Do you have any advice for situations like this? The strategies I can think of (assuming I want to keep this fine-grained resolution) are:

Applying a very strict gene filter with edgeR to reduce the number of genes. However, given the large number of samples, I suspect this will have only a minor effect.
Running separate models for each cell type. The downside is losing statistical power, but the advantage is that I could parallelize across 100 analyses, which would run much faster.

Of course, if I collapsed the data into broader cell types, the number of samples would drop considerably, but I'm particularly interested in maintaining this higher resolution. Sequencing depth is sufficient, so that's not a limiting factor here.

Thanks in advance for your guidance.

ADD REPLY • link 3 months ago abaed • 0

0

Entering edit mode

I'm not sure that I understand. You have a dataset that, even after pseudo-bulking, still has 6000 samples? That is surprising.

Retaining 30,000 genes for a single cell analysis is also a bit surprising to me. I wouldn't do that even for a regular bulk RNA-seq analysis.

Anyway, I don't have any experience with this sort of scenario, so I can't give any advice. We have not yet found to a way to run voomLmFit with block correlations for such large datasets.

ADD REPLY • link 3 months ago Gordon Smyth 53k