Hi everyone, I would be very grateful for your advice on the experimental design described below. Perhaps Gordon Smyth could also share his thoughts.
I generated a pseudobulk expression matrix from scRNA-seq data. The design can be described as a 3-factor setting:
- Regions: 2 sampled regions from the same tissue/organ
- Disease condition: 2 groups (15 healthy vs. 15 diseased donors)
- Cell types: approximately 10 cell types or clusters
In total, I have 60 biological samples derived from 30 individuals (each individual contributes samples from 2 regions). As is common in single-cell data, not all cell types are represented in every donor or region, resulting in an unbalanced design. Please correct me if I am misusing terminology, as I am still new to this type of analysis.
My main concern relates to within-individual correlation. Since samples from the same donor are not independent, my understanding is that I should use voomLmFit with the block argument to model donor as a random effect, allowing intra-individual correlation to be handled appropriately via duplicateCorrelation.
For the design matrix, my plan is to create a combined "group" variable encoding all possible combinations of region, disease, and cell type, and then set up contrasts for my hypotheses of interest (for example, global disease vs healthy effects, disease vs healthy within a specific cell type, or within a specific region). From what I have read, this is generally recommended over using a factorial design with interaction terms. Could you confirm this is the best practice here?
A smaller technical question: I saw a post from another user raising doubts about whether to scale or center numeric variables such as age or RIN before including them in the model. I had assumed scaling was unnecessary, but I wanted to confirm your view on this.
Finally, do you consider limma-voom to be more appropriate than edgeR in this setting, mainly because I would like to use duplicateCorrelation to account for repeated measures, which edgeR does not support?
Thank you so much for your helpful response!
I have a follow-up question regarding preprocessing of the count matrix. Should steps like gene filtering with filterByExpr() and normalization using CPM/normalization factors be performed separately for each cell type (e.g., by iterating over cell types or using the group argument)? Or is it acceptable to apply these steps globally across all samples regardless of cell type?
Thanks again for your guidance!
Normalization or filtering are never done for subsets of samples in any limma or edgeR analysis. These things must always be done globally.
Understood, thanks. The issue I'm seeing is that even after normalization, the MDSplot still orders samples by library size. I suspect cell type heterogeneity is biasing the normalization factors: marker genes highly expressed in one cell type but absent in others could skew M-values when comparing across cell types. Would you recommend adjusting TMM parameters to mitigate this? I tried TMMwsp, but results were nearly identical to default TMM. Haven't seen much discussion on this, so thought it was worth raising.
It is not uncommon for library size to be apparent in an MDS plot. You can use
sva
orRUV
to account for that.Nice. That's great to hear.
sva
was actually the next step I was planning to try. My colleagues haven't seen this kind of pattern before since they've mostly worked with bulk RNA-seq, so it's reassuring to know it's not uncommon in pseudobulk. Thanks for the suggestion, James.It may not be that common for bulk RNA-Seq, but in my experience it's very common for bulk miRNA-Seq.
How did you make this plot? I've never seen an MDS plot from limma/edgeR with axis scale going from -200 to 200.
Personally, I would go ahead and do the DE analysis and perhaps try log-lib-size as a covariate in the analysis to see if there a non-ignorable trend. The DE analysis might be less sensitive than the MDS plot. The MDS plot is designed to highlight things like this rather than to hide them, and library size normalization won't reduce the effect for the purpose of the plot. The MDS plot doesn't take into account voom weights or sample weights.
Apologies for not clarifying earlier, you're right: it is a PCA, not an MDS. In local machine, computing the MDS takes quite a long time, PCA was a faster approach (and shows the same lib.size pattern). Below is the MDS applied to the same normalized expression matrix used for the PCA.
MDS of what quantity? LogCPM from
cpm()
?I don't have a definitive answer, but my feeling that is that library size effects are to be expected in pseudobulk and are not necessarily something you need to correct for. In pseudobulk, library size primarily reflects the number of cells that are aggregated, so it is proportional to cell frequency. Cell frequence in turn is likely to be correlated with gene expression. Some clusters in particular will have more cells, and any expression changes in that cluster will be correlated therefore with library size. So library size may reflect real biological differences that you shouldn't correct for. I haven't explored this issue very deeply, and that is my gut feeling.
Exactly, MDS was computed on logCPM values from
cpm(x, log=TRUE)
after runningcalcNormFactors(method="TMMwsp")
on the DGE object. I also triedplotMDS
directly on the DGE and observed the same pattern: the first component is strongly driven by library size.As you mention, this seems to be due to cell frequency, but I initially thought that library size normalization would be sufficient. Perhaps what's happening is that pseudobulk samples with fewer cells have many zero counts across genes, while those with more cells preserve additional signal. This difference in information content is likely what PC1 is capturing in PCA/MDS.
What I find interesting, though, is that when I run PCA using
glmpca(x, L = 2, fam = "nb")
from this paper, the bias essentially disappears. Since glmPCA models raw counts and includes an offset for library size, it seems to correct for the effect. That makes me suspect the issue is not just a lack of information in smaller samples, but rather thatTMM
/TMMwsp
normalization alone does not fully remove the library size bias.In any case, I'll follow your recommendation and include library size in the linear model, which I expect should be sufficient to address this. Thanks again!