Guidance on limma for pseudobulk scRNA-seq with multi-factor design
1
0
Entering edit mode
abaed • 0
@292e8ed4
Last seen 10 minutes ago
london

Hi everyone, I would be very grateful for your advice on the experimental design described below. Perhaps Gordon Smyth could also share his thoughts.

I generated a pseudobulk expression matrix from scRNA-seq data. The design can be described as a 3-factor setting:

  • Regions: 2 sampled regions from the same tissue/organ
  • Disease condition: 2 groups (15 healthy vs. 15 diseased donors)
  • Cell types: approximately 10 cell types or clusters

In total, I have 60 biological samples derived from 30 individuals (each individual contributes samples from 2 regions). As is common in single-cell data, not all cell types are represented in every donor or region, resulting in an unbalanced design. Please correct me if I am misusing terminology, as I am still new to this type of analysis.

My main concern relates to within-individual correlation. Since samples from the same donor are not independent, my understanding is that I should use voomLmFit with the block argument to model donor as a random effect, allowing intra-individual correlation to be handled appropriately via duplicateCorrelation.

For the design matrix, my plan is to create a combined "group" variable encoding all possible combinations of region, disease, and cell type, and then set up contrasts for my hypotheses of interest (for example, global disease vs healthy effects, disease vs healthy within a specific cell type, or within a specific region). From what I have read, this is generally recommended over using a factorial design with interaction terms. Could you confirm this is the best practice here?

A smaller technical question: I saw a post from another user raising doubts about whether to scale or center numeric variables such as age or RIN before including them in the model. I had assumed scaling was unnecessary, but I wanted to confirm your view on this.

Finally, do you consider limma-voom to be more appropriate than edgeR in this setting, mainly because I would like to use duplicateCorrelation to account for repeated measures, which edgeR does not support?

limma RNASeq scRNAseq • 423 views
ADD COMMENT
3
Entering edit mode
@gordon-smyth
Last seen 41 minutes ago
WEHI, Melbourne, Australia

What you describe sounds a good approach to me. I would also use sample weights (sample.weights=TRUE when running voomLmFit) and robust empirical Bayes (robust=TRUE when running eBayes). I would use sample weights in preference to RIN.

Scaling or centering covariates makes no difference to the DE results. However, centering would allow the group fitted coefficients to be interpretted as group means if you wish to do that.

ADD COMMENT
0
Entering edit mode

Thank you so much for your helpful response!

I have a follow-up question regarding preprocessing of the count matrix. Should steps like gene filtering with filterByExpr() and normalization using CPM/normalization factors be performed separately for each cell type (e.g., by iterating over cell types or using the group argument)? Or is it acceptable to apply these steps globally across all samples regardless of cell type?

Thanks again for your guidance!

ADD REPLY
2
Entering edit mode

Normalization or filtering are never done for subsets of samples in any limma or edgeR analysis. These things must always be done globally.

ADD REPLY
0
Entering edit mode

Understood, thanks. The issue I'm seeing is that even after normalization, the MDSplot still orders samples by library size. I suspect cell type heterogeneity is biasing the normalization factors: marker genes highly expressed in one cell type but absent in others could skew M-values when comparing across cell types. Would you recommend adjusting TMM parameters to mitigate this? I tried TMMwsp, but results were nearly identical to default TMM. Haven't seen much discussion on this, so thought it was worth raising.

Each dot represents a pseudobulk sample (only a specific cell type is being represented there). Despite normalization, the sample distribution appears to follow library size

ADD REPLY
0
Entering edit mode

It is not uncommon for library size to be apparent in an MDS plot. You can use sva or RUV to account for that.

0
Entering edit mode

Nice. That's great to hear. sva was actually the next step I was planning to try. My colleagues haven't seen this kind of pattern before since they've mostly worked with bulk RNA-seq, so it's reassuring to know it's not uncommon in pseudobulk. Thanks for the suggestion, James.

ADD REPLY
0
Entering edit mode

It may not be that common for bulk RNA-Seq, but in my experience it's very common for bulk miRNA-Seq.

0
Entering edit mode

How did you make this plot? I've never seen an MDS plot from limma/edgeR with axis scale going from -200 to 200.

Personally, I would go ahead and do the DE analysis and perhaps try log-lib-size as a covariate in the analysis to see if there a non-ignorable trend. The DE analysis might be less sensitive than the MDS plot. The MDS plot is designed to highlight things like this rather than to hide them, and library size normalization won't reduce the effect for the purpose of the plot. The MDS plot doesn't take into account voom weights or sample weights.

ADD REPLY
0
Entering edit mode

Apologies for not clarifying earlier, you're right: it is a PCA, not an MDS. In local machine, computing the MDS takes quite a long time, PCA was a faster approach (and shows the same lib.size pattern). Below is the MDS applied to the same normalized expression matrix used for the PCA.

enter image description here

ADD REPLY
0
Entering edit mode

MDS of what quantity? LogCPM from cpm()?

I don't have a definitive answer, but my feeling that is that library size effects are to be expected in pseudobulk and are not necessarily something you need to correct for. In pseudobulk, library size primarily reflects the number of cells that are aggregated, so it is proportional to cell frequency. Cell frequence in turn is likely to be correlated with gene expression. Some clusters in particular will have more cells, and any expression changes in that cluster will be correlated therefore with library size. So library size may reflect real biological differences that you shouldn't correct for. I haven't explored this issue very deeply, and that is my gut feeling.

ADD REPLY
0
Entering edit mode

Exactly, MDS was computed on logCPM values from cpm(x, log=TRUE) after running calcNormFactors(method="TMMwsp") on the DGE object. I also tried plotMDS directly on the DGE and observed the same pattern: the first component is strongly driven by library size.

As you mention, this seems to be due to cell frequency, but I initially thought that library size normalization would be sufficient. Perhaps what's happening is that pseudobulk samples with fewer cells have many zero counts across genes, while those with more cells preserve additional signal. This difference in information content is likely what PC1 is capturing in PCA/MDS.

What I find interesting, though, is that when I run PCA using glmpca(x, L = 2, fam = "nb") from this paper, the bias essentially disappears. Since glmPCA models raw counts and includes an offset for library size, it seems to correct for the effect. That makes me suspect the issue is not just a lack of information in smaller samples, but rather that TMM/TMMwsp normalization alone does not fully remove the library size bias.

In any case, I'll follow your recommendation and include library size in the linear model, which I expect should be sufficient to address this. Thanks again!

glmPCA applied to raw counts

ADD REPLY

Login before adding your answer.

Traffic: 1304 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6