Hello everyone, I'm seeking guidance on how best to model differential expression for a twin-based RNA-seq study with a few design complications.
Study design:
- Whole-blood RNA-seq from concordant monozygotic twin pairs: each pair is either both healthy (7 pairs) or both diseased (6 pairs).
- N = 26 samples.
- Each pair was sequenced in the same run (run1, run2, run3). Group x Run: diseased = {run1: 6, run2: 0, run3: 6}; healthy = {run1: 2, run2: 4, run3: 8} -> run2 contains only healthy.
- Sex is balanced across groups (healthy F/M = 8/6; diseased F/M = 6/6) but drives clear clustering in PCA/heatmaps.
- Extensive clinical information is available per subject (e.g., cardiometabolic risk factors, laboratory parameters, lifestyle variables).
Modeling considerations:
- Surrogate variables (RUVr): when retained, several SVs were significantly associated with Pair (linear model/ANOVA), and including them yielded many DE genes (BH FDR <= 0.05).
- limma-voom + duplicateCorrelation (Pair as blocking factor) with group and Sex as fixed effects (+/- Run) identified no DE genes at BH FDR <= 0.05.
- variancePartition::dream with Pair modeled as a random effect (~ group + Sex +/- Run + (1|Pair)) likewise yielded no DE genes after BH FDR correction.
Questions: 1) Given this design (concordant MZ twins, Pair dependence, a mono-group run, strong Sex effect, small effective n), which DE approach is best practice and most robust for inference (limma+duplicateCorrelation vs dream vs pair-level aggregation, etc.)? 2) Since the contrast (diseased vs healthy) does not rely on within-pair comparisons, should I still account for twin pairing (block = Pair or (1|Pair)) in the DE model? What are the consequences of ignoring Pair?
Thanks in advance!
