Question: Design matrix for longitudinal discordant twin analysis with limma
0
22 months ago by
ist0
ist0 wrote:

I have data from two twin pairs, one disease discordant, one not, and including longitudinal samples (with mostly matched time points) for the discordant set.

disease <- c(rep("disease", 3), rep("control",4))
twin_pair <- c(rep(1, 5), 2, 2)
time <- c(1, 2, 3, 2, 3, NA, NA)
subject <- c(1, 1, 1, 2, 2, 3, 4)
(df <- data.frame(disease, twin_pair, time, subject))
  disease twin_pair time subject
1 disease         1    1       1
2 disease         1    2       1
3 disease         1    3       1
4 control         1    2       2
5 control         1    3       2
6 control         2   NA       3
7 control         2   NA       4

We are primarily interested in genes differently expressed in disease compared to controls, but also differences between diseases and controls at the two matched time points.

I am uncertain how to model the multilevel aspect of time within subject within twin. Should I use duplicateCorrelation, and if so, on which level? Any advice as to how best to set this model up would be appreciated.

ADD COMMENTlink
modified 22 months ago by Aaron Lun24k • written 22 months ago by ist0
Answer: Design matrix for longitudinal discordant twin analysis with limma
0
22 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

This is a pretty limited data set and I don't think you will be able to do most of the things you want to do. There's too many factors compared to samples. For your ideal design that accounts for time-specific condition effects, duplicateCorrelation will just spit out NAs. This is not surprising because you barely have any residual degrees of freedom to estimate the variance from design, and effectively none after blocking on subject.

group <- paste0(disease, time) # time/condition interactions
design <- model.matrix(~0 + group)
y <- matrix(rnorm(7000), ncol=7)
duplicateCorrelation(y, design, block=subject)$consensus ## NaN duplicateCorrelation(y, design, block=twin_pair)$consensus
## NaN


The only way to proceed is to give up on one of the factors. If you're willing to assume that time has no effect, then you can average the samples for each individuals with avereps (assuming these are microarrays). Then you can use an additive design with condition and twin. The disease/control log-fold change is fully defined by the first twin pair while the second twin pair provides residual d.f. for variance estimation. You can also use arrayWeights to account for the fact that you're averaging different numbers of samples.

new.disease <- c("disease", "control", "control", "control")
new.twin <- factor(c(1,1,2,2))
new.design <- model.matrix(~new.disease + new.twin)


Alternatively, you can give up on disease and test for differences between time points. This involves discarding samples 6 and 7 as the time points are unknown. Those two samples wouldn't be used anyway after blocking on subject.

new.design2 <- model.matrix(~factor(time) + factor(subject))

ADD COMMENTlink modified 22 months ago • written 22 months ago by Aaron Lun24k

Thanks, disregarding time sounds like the best option. I should have mentioned that this is RNA-seq - this might be a silly question, but should I then not use avereps?

ADD REPLYlink written 22 months ago by ist0

Having thought about it a bit more (and after getting some dinner!), I realized that you can make it work by just not running duplicateCorrelation at all:

group <- paste0(disease, time) # time/condition interactions
design <- model.matrix(~0 + group)
# ... and go straight to lmFit().


You can now do all desired comparisons between condition/time combinations based on samples from twin pair 1, but with the variance estimated from the lone residual degrees of freedom from twin pair 2. There's no need to account for within-subject correlations, because all samples from the same subject get their own terms in the design matrix anyway. This means that the correlations will not affect the precision of the variance estimate, preserving the validity of our downstream inferences.

In the above model, the estimated variance is that between subjects. This means that comparisons between samples from different subjects (i.e., control.2 vs disease.2, control.3 vs disease.3) will be fine. However, comparisons between time points but within subjects will be affected by the aforementioned correlations and will probably be a bit conservative. This is because the correlations are likely to be positive, i.e., variance within subjects is likely smaller than that between subjects, so using the latter as the former would overstate the variance during testing. duplicateCorrelation would normally handle this kind of situation but, as mentioned in my original answer, this is not possible here.

Having said all that, keep in mind you are drawing inferences from mostly one twin pair. Put aside the fancy statistics and common sense applies regarding the potential generalisability (or lack thereof) of these results to the population.

ADD REPLYlink modified 22 months ago • written 22 months ago by Aaron Lun24k
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 145 users visited in the last hour