Question

Design matrix for longitudinal discordant twin analysis with limma

0

Entering edit mode

ist • 0

@ist-14402

Last seen 6.4 years ago

I have data from two twin pairs, one disease discordant, one not, and including longitudinal samples (with mostly matched time points) for the discordant set.

disease <- c(rep("disease", 3), rep("control",4))
twin_pair <- c(rep(1, 5), 2, 2)
time <- c(1, 2, 3, 2, 3, NA, NA)
subject <- c(1, 1, 1, 2, 2, 3, 4)
(df <- data.frame(disease, twin_pair, time, subject))

  disease twin_pair time subject
1 disease         1    1       1
2 disease         1    2       1
3 disease         1    3       1
4 control         1    2       2
5 control         1    3       2
6 control         2   NA       3
7 control         2   NA       4

We are primarily interested in genes differently expressed in disease compared to controls, but also differences between diseases and controls at the two matched time points.

I am uncertain how to model the multilevel aspect of time within subject within twin. Should I use duplicateCorrelation, and if so, on which level? Any advice as to how best to set this model up would be appreciated.

limma limma design matrix limma paired analysis twin analysis • 1.0k views

ADD COMMENT • link updated 6.4 years ago by Aaron Lun ★ 28k • written 6.4 years ago by ist • 0

score 0 · Answer 1 · 2017-11-21

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 12 hours ago

The city by the bay

This is a pretty limited data set and I don't think you will be able to do most of the things you want to do. There's too many factors compared to samples. For your ideal design that accounts for time-specific condition effects, duplicateCorrelation will just spit out NAs. This is not surprising because you barely have any residual degrees of freedom to estimate the variance from design, and effectively none after blocking on subject.

group <- paste0(disease, time) # time/condition interactions
design <- model.matrix(~0 + group)
y <- matrix(rnorm(7000), ncol=7)
duplicateCorrelation(y, design, block=subject)$consensus
## NaN
duplicateCorrelation(y, design, block=twin_pair)$consensus
## NaN

The only way to proceed is to give up on one of the factors. If you're willing to assume that time has no effect, then you can average the samples for each individuals with avereps (assuming these are microarrays). Then you can use an additive design with condition and twin. The disease/control log-fold change is fully defined by the first twin pair while the second twin pair provides residual d.f. for variance estimation. You can also use arrayWeights to account for the fact that you're averaging different numbers of samples.

new.disease <- c("disease", "control", "control", "control")
new.twin <- factor(c(1,1,2,2))
new.design <- model.matrix(~new.disease + new.twin)

Alternatively, you can give up on disease and test for differences between time points. This involves discarding samples 6 and 7 as the time points are unknown. Those two samples wouldn't be used anyway after blocking on subject.

new.design2 <- model.matrix(~factor(time) + factor(subject))

ADD COMMENT • link 6.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks, disregarding time sounds like the best option. I should have mentioned that this is RNA-seq - this might be a silly question, but should I then not use avereps?

ADD REPLY • link 6.4 years ago ist • 0

0

Entering edit mode

Having thought about it a bit more (and after getting some dinner!), I realized that you can make it work by just not running duplicateCorrelation at all:

group <- paste0(disease, time) # time/condition interactions
design <- model.matrix(~0 + group)
# ... and go straight to lmFit().

You can now do all desired comparisons between condition/time combinations based on samples from twin pair 1, but with the variance estimated from the lone residual degrees of freedom from twin pair 2. There's no need to account for within-subject correlations, because all samples from the same subject get their own terms in the design matrix anyway. This means that the correlations will not affect the precision of the variance estimate, preserving the validity of our downstream inferences.

In the above model, the estimated variance is that between subjects. This means that comparisons between samples from different subjects (i.e., control.2 vs disease.2, control.3 vs disease.3) will be fine. However, comparisons between time points but within subjects will be affected by the aforementioned correlations and will probably be a bit conservative. This is because the correlations are likely to be positive, i.e., variance within subjects is likely smaller than that between subjects, so using the latter as the former would overstate the variance during testing. duplicateCorrelation would normally handle this kind of situation but, as mentioned in my original answer, this is not possible here.

Having said all that, keep in mind you are drawing inferences from mostly one twin pair. Put aside the fancy statistics and common sense applies regarding the potential generalisability (or lack thereof) of these results to the population.

ADD REPLY • link 6.4 years ago Aaron Lun ★ 28k