Hello,
I've been asked to analyze some microarray data from a fairly complex design involving a cross-over design with repeated measurements in each period. I've done a lot of searching on the internet, but haven't found much about the proper way to model this in regular R, let alone limma. The main effects of interest are diet (A and B) and an immune challenge (blood taken pre- and post-challenge). Each pig was subjected to one of the diets for 1 month, measured pre- and post-challenge, then allowed a wash-out period before being subjected to the other diet for a month and measured pre- and post-challenge. The targets file:
> targets <- data.frame(Sample = 1:32, Seq = rep(c("AB","BA"), each = 16), + Diet = rep(c("A","B"), each = 8, 2), + Time = rep(c("pre","post"), each = 4, 2), + Period = rep(c(1,2,1), c(8,16,8)), + Pig = c(rep(1:4,4), rep(5:8,4))) > > targets Sample Seq Diet Time Period Pig 1 1 AB A pre 1 1 2 2 AB A pre 1 2 3 3 AB A pre 1 3 4 4 AB A pre 1 4 5 5 AB A post 1 1 6 6 AB A post 1 2 7 7 AB A post 1 3 8 8 AB A post 1 4 9 9 AB B pre 2 1 10 10 AB B pre 2 2 11 11 AB B pre 2 3 12 12 AB B pre 2 4 13 13 AB B post 2 1 14 14 AB B post 2 2 15 15 AB B post 2 3 16 16 AB B post 2 4 17 17 BA A pre 2 5 18 18 BA A pre 2 6 19 19 BA A pre 2 7 20 20 BA A pre 2 8 21 21 BA A post 2 5 22 22 BA A post 2 6 23 23 BA A post 2 7 24 24 BA A post 2 8 25 25 BA B pre 1 5 26 26 BA B pre 1 6 27 27 BA B pre 1 7 28 28 BA B pre 1 8 29 29 BA B post 1 5 30 30 BA B post 1 6 31 31 BA B post 1 7 32 32 BA B post 1 8
The researchers are mainly interested in the interaction between diet and time, although they are also interested in the effects of Diet and Time (both main effects and pairwise within each level). I'm not sure how to specify the design matrix correctly for both the cross-over and the repeated measures. If there was just the post-challenge measurement, then I think the model for the simple cross-over design would be:
#subset samples targ.post <- targets[targets$Time == "post",] #make design matrix d.cross.post <- model.matrix(~Period + Seq + Diet, data=targ.post) colnames(d.cross.post) #[1] "(Intercept)" "Period" "SeqBA" "DietB" #Add pig in as random effect cor.pig.post <- duplicateCorrelation(normValues[,targets$Time == "post"], d.cross.post, block = targ.post$Pig) #Fit model fit.cross.post <- eBayes(lmFit(normValues[,targets$Time == "post"], d.cross.post, block = targ.post$Pig, cor = cor.pig.post$cor)) #Effect of interest, Diet B vs A: topTable(fit.cross.post, coef = "DietB")
If I only took the first time period so it was a 2-way ANOVA with repeated measures on one factor, I could follow the suggestion in Section 3.5 of the edgeRUsersGuide() and model it this way:
#subset samples targ.per1 <- targets[targets$Period == 1,] #Re-number pigs within each diet: targ.per1$Pig2 <- factor(rep(1:4,4)) #Re-level Time to have pre first: targ.per1$Time <- relevel(targ.per1$Time, ref = "pre") #make design matrix d.rep.P1 <- model.matrix(~Diet*Time + Diet:Pig2, data=targ.per1) colnames(d.rep.P1) # [1] "(Intercept)" "DietB" "Timepost" "DietB:Timepost" # [5] "DietA:Pig22" "DietB:Pig22" "DietA:Pig23" "DietB:Pig23" # [9] "DietA:Pig24" "DietB:Pig24" #Fit model: fit.rep.P1 <- eBayes(lmFit(normValues[,targets$Period == 1], d.rep.P1)) #Follow Section 9.5.4 in limmaUsersGuide() to properly pull out my effects of interest...
What I can't figure out is how to properly put them together. Any help would be appreciated.
Thanks in advance,
Jenny
Just to note: my approach models the pre-vs-post effects for a given diet with separate coefficients for each sequence group, which would allow you to determine whether there are significant differences depending on which diet comes first. If that's not something you care about, I believe Aaron's design produces a single pre-vs-post coefficient for each diet, whcih is the average of the effects from when that diet is first and second. This results in a simpler model and might be preferable if that's all you care about.
The design in my answer can also be extended to have a period- and diet-specific pre-vs-post log-fold change, as I've mentioned below. It's probably best used for checking that the order of the diets has no effect, i.e., that the diet-specific pre-vs-post log-fold change is the same between the different periods. In the long run, I'd imagine that most people would want to report the final diet-specific effects by averaging across the two periods and using all samples in the design.
Thanks, Aaron and Ryan for your help! I guessed that the answer wasn't straightforward when some of my Googling pulled up a heavy publication in a Statistics journal (1999) and a German dissertation (2004), both of which were over my head (don't have the links anymore, but I could try to find them if you want them). I've tried both the approaches, but it turns out that there just aren't any diet effects or diet:time interactions no matter how you slice it. This was dropped on me to find some sort of magical statistical solution after the researchers spent over a year trying to analyze it on their own. When they gave it to me, they described it as a simple 2x2 factorial design! It wasn't until after an initial analysis when I started asking them about an apparent batch effect in the data (turned out to be period) that they even mentioned the repeated measures per pig, and then more back and forth until they said it was a cross-over design!
Well, at least I learned a lot in all of this - I've been analyzing microarray & RNA-Seq data for 10 years, and probably only a dozen times have I gotten anything more complex than a 2way ANOVA.
Cheers,
Jenny