Cross-posted from Biostars: https://www.biostars.org/p/457094/
Please see the example below.
set.seed(1)
IDs <- as.factor(1:20)

## Phenotype data: two groups of 10 samples plus a continuous covariate
pheno <- cbind.data.frame(IDs = IDs,
                          gr = as.factor(c(rep("A", 10), rep("B", 10))),
                          oc = rnorm(20))

## Toy expression data: 4 variables (rows) by 20 samples (columns)
dat <- rbind.data.frame(varA = rnorm(20),
                        varB = rnorm(20),
                        varC = rnorm(20),
                        varD = rnorm(20))
colnames(dat) <- IDs
rownames(pheno) <- IDs

## Interaction design: intercept, oc slope, group main effect, oc:group interaction
form <- as.formula(~ oc * gr)
design <- model.matrix(form, data = pheno)
colnames(design) <- make.names(colnames(design))
colnames(design)
[1] "X.Intercept." "oc" "grB" "oc.grB"
library(limma)
fit <- lmFit(dat, design)
fit2 <- eBayes(fit)
topTable(fit2, coef = 1)  # Intercept = group A when oc == 0
topTable(fit2, coef = 2)  # oc = slope of oc for group A
topTable(fit2, coef = 3)  # grB = difference in intercept between group B and group A
topTable(fit2, coef = 4)  # interaction: oc.grB = difference in slope between group B and group A
The question is then: how do I get the within-group-A and within-group-B results, to further examine the nature of the interaction? E.g. 'post-hoc' testing, like fitting ~oc+gr twice after splitting the data by group. I know I can re-run the analyses stratified by group, but that seems unnecessary.
fit3 <- contrasts.fit(fit, c(0, 1, 0, 1))  # oc + oc.grB = slope of oc within group B
fit3 <- eBayes(fit3)
topTable(fit3)  # the within-group-A slope is simply coef = 2 above
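To pull out both within-group slopes in one pass, you could also hand contrasts.fit a contrast matrix rather than a single vector (a small sketch using the same fit object as above; the column names are my own labels):

## Columns are contrasts over (Intercept, oc, grB, oc.grB)
cm <- cbind(oc_in_A = c(0, 1, 0, 0),
            oc_in_B = c(0, 1, 0, 1))
fitW <- eBayes(contrasts.fit(fit, cm))
topTable(fitW, coef = "oc_in_A")  # slope of oc within group A
topTable(fitW, coef = "oc_in_B")  # slope of oc within group B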
Perfect! I know it is a different issue, but why do the within-group standard errors and p-values differ when computed from the interaction model, as opposed to splitting the groups and running two separate models (as you showed in the answer)? I see that the estimates are the same. I am not afraid of the math, so just bring it on! :)
The standard error of a coefficient is, roughly, the standard deviation of the residuals divided by a term that grows with the square root of the number of observations (for a regression slope, SE = s / sqrt(sum((x - mean(x))^2)), where s is the residual standard deviation). If you cut your number of samples in half, you have fewer samples from which to estimate the residual standard deviation (which will change what you get), and you divide by a smaller number, so the estimated standard error will, all else being equal, be larger. Since that's the denominator of your t-statistic, a larger denominator will give a smaller t-statistic when the numerator doesn't change.
In addition, the t-statistic will be based on half the degrees of freedom, so even if the t-statistic didn't change, the p-value would get larger. Plus, the degrees of freedom for the limma model fit will be larger than for the lm fit, because the eBayes step borrows information across genes to moderate the variance estimates and adds degrees of freedom in the process. See the limma User's Guide and the relevant papers referenced therein for more information.
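To see this concretely, here is a minimal base-R sketch (reusing the simulated pheno and dat from above; the first row of dat is picked arbitrarily) showing that the full interaction model and the stratified model agree on the group-B slope estimate but not on its standard error or degrees of freedom:

y <- unlist(dat[1, ])  # first simulated row ('varA')
d <- cbind(pheno, y = y)

full <- lm(y ~ oc * gr, data = d)             # all 20 samples; pooled residual variance, 16 df
strat <- lm(y ~ oc, data = d[d$gr == "B", ])  # group B only; 8 residual df

## Group-B slope from the full model is oc + oc:grB; its SE comes from the covariance matrix
b <- coef(full)
V <- vcov(full)
slopeB_full <- b["oc"] + b["oc:grB"]
seB_full <- sqrt(V["oc", "oc"] + V["oc:grB", "oc:grB"] + 2 * V["oc", "oc:grB"])

c(slopeB_full, seB_full)                # same estimate as below...
summary(strat)$coefficients["oc", 1:2]  # ...but a different standard error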
Absolutely wonderful! May I kindly also ask how to extract the results across both groups, i.e. the slope for oc across both groups?
I'm pretty sure I've already answered that question in my original answer.
My mistake, my English is probably insufficient. The answer you wrote is correct. What I meant was a new question: how to get the slope for oc treating all participants as one group, i.e. like fitting ~oc+gr on the full data set.
By fitting ~oc+gr on the genes for which the interaction isn't significant.
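A sketch of what that might look like (the 5% FDR cutoff is an arbitrary choice for illustration):

## Keep the genes whose interaction term is not significant, then refit
## the additive model to get a single common oc slope across both groups
int <- topTable(fit2, coef = "oc.grB", number = Inf)
keep <- rownames(int)[int$adj.P.Val > 0.05]

design2 <- model.matrix(~ oc + gr, data = pheno)
fitAdd <- eBayes(lmFit(dat[keep, , drop = FALSE], design2))
topTable(fitAdd, coef = "oc")  # slope of oc with the groups pooled (adjusted for gr)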