I have data from two twin pairs, one disease discordant, one not, and including longitudinal samples (with mostly matched time points) for the discordant set.
disease <- c(rep("disease", 3), rep("control",4))
twin_pair <- c(rep(1, 5), 2, 2)
time <- c(1, 2, 3, 2, 3, NA, NA)
subject <- c(1, 1, 1, 2, 2, 3, 4)
(df <- data.frame(disease, twin_pair, time, subject))
disease twin_pair time subject 1 disease 1 1 1 2 disease 1 2 1 3 disease 1 3 1 4 control 1 2 2 5 control 1 3 2 6 control 2 NA 3 7 control 2 NA 4
We are primarily interested in genes differently expressed in disease compared to controls, but also differences between diseases and controls at the two matched time points.
I am uncertain how to model the multilevel aspect of time within subject within twin. Should I use duplicateCorrelation, and if so, on which level? Any advice as to how best to set this model up would be appreciated.
Thanks, disregarding time sounds like the best option. I should have mentioned that this is RNA-seq - this might be a silly question, but should I then not use
avereps
?Having thought about it a bit more (and after getting some dinner!), I realized that you can make it work by just not running
duplicateCorrelation
at all:You can now do all desired comparisons between condition/time combinations based on samples from twin pair 1, but with the variance estimated from the lone residual degrees of freedom from twin pair 2. There's no need to account for within-subject correlations, because all samples from the same subject get their own terms in the design matrix anyway. This means that the correlations will not affect the precision of the variance estimate, preserving the validity of our downstream inferences.
In the above model, the estimated variance is that between subjects. This means that comparisons between samples from different subjects (i.e.,
control.2
vsdisease.2
,control.3
vsdisease.3
) will be fine. However, comparisons between time points but within subjects will be affected by the aforementioned correlations and will probably be a bit conservative. This is because the correlations are likely to be positive, i.e., variance within subjects is likely smaller than that between subjects, so using the latter as the former would overstate the variance during testing.duplicateCorrelation
would normally handle this kind of situation but, as mentioned in my original answer, this is not possible here.Having said all that, keep in mind you are drawing inferences from mostly one twin pair. Put aside the fancy statistics and common sense applies regarding the potential generalisability (or lack thereof) of these results to the population.