Question

How to model DE for RNA-seq in concordant monozygotic twins?

0

Entering edit mode

ilaria.massaiu • 0

@7d5aa995

Last seen 3 months ago

Italy

Hello everyone, I'm seeking guidance on how best to model differential expression for a twin-based RNA-seq study with a few design complications.

Study design:

Whole-blood RNA-seq from concordant monozygotic twin pairs: each pair is either both healthy (7 pairs) or both diseased (6 pairs).
N = 26 samples.
Each pair was sequenced in the same run (run1, run2, run3). Group x Run: diseased = {run1: 6, run2: 0, run3: 6}; healthy = {run1: 2, run2: 4, run3: 8} -> run2 contains only healthy.
Sex is balanced across groups (healthy F/M = 8/6; diseased F/M = 6/6) but drives clear clustering in PCA/heatmaps.
Extensive clinical information is available per subject (e.g., cardiometabolic risk factors, laboratory parameters, lifestyle variables).

Modeling considerations:

Surrogate variables (RUVr): when retained, several SVs were significantly associated with Pair (linear model/ANOVA), and including them yielded many DE genes (BH FDR <= 0.05).
limma-voom + duplicateCorrelation (Pair as blocking factor) with group and Sex as fixed effects (+/- Run) identified no DE genes at BH FDR <= 0.05.
variancePartition::dream with Pair modeled as a random effect (~ group + Sex +/- Run + (1|Pair)) likewise yielded no DE genes after BH FDR correction.

Questions: 1) Given this design (concordant MZ twins, Pair dependence, a mono-group run, strong Sex effect, small effective n), which DE approach is best practice and most robust for inference (limma+duplicateCorrelation vs dream vs pair-level aggregation, etc.)? 2) Since the contrast (diseased vs healthy) does not rely on within-pair comparisons, should I still account for twin pairing (block = Pair or (1|Pair)) in the DE model? What are the consequences of ignoring Pair?

Thanks in advance!

RNASeq variancePartition RUVSeq limma edgeR • 256 views

ADD COMMENT • link updated 3 months ago by Gordon Smyth 53k • written 3 months ago by ilaria.massaiu • 0

score 0 · Answer 1 · 2025-10-02

Your question is more about science than it is about software syntax, and I can't give advice about how to conduct a scientific analysis, not being familiar with your data.

I will make a few general points. The comparison of diseased vs healthy does not leverage anything useful from the monozygotic twin structure, which makes me wonder why you are using twin data. Run is substantially confounded with diesase group, so you will loose considerable statistical power if you have to adjust for a run effect. Finally, you obviously cannot ignore pair if you include both twins in your analysis, because monozygotic twin effects are so strong. You could use a limma duplicateCorrelation analysis, but you would nearly as well by randomly choosing just one of each twin pair to include in the analysis. The only way that the twins could contribute useful information to a diseased vs healthy analysis would be to estimate the effect of clinical information covariates.