I am analysing a dataset where i have subjects who were tested at baseline and then after 2 weeks of calorie restriction. I would like to see the genes DE between baseline and calorie restriction taking into account the persons age and BMI. As there is a before and after condition is only 2 weeks the age and BMI do not change. I am not sure how to adjust for this in my model.
An example of my dataset is:
SubjectID
bmi
Age
Class
1
26.41
73
T0
1
26.41
73
T2weeks
14
23.31
80
T0
14
23.31
80
T2weeks
16
27.71
85
T0
16
27.71
85
T2weeks
17
25.67
76
T0
17
25.67
76
T2weeks
18
20.99
70
T0
18
20.99
70
T2weeks
For creating the design of limma i was thinking of the following design:
design <- model.matrix(~class:ID + age + BMI)
From this i run a normal limma analysis and extract the results from the class:ID term as this tells me the difference between baseline and after 2 weeks of CR for my genes. Is this correct?
I was originally using duplicate correlation to take into account subject correlation but the concern i have is that BMI and age will be the same regardless of being baseline or after 2weeks and that this is not the correct thing to do and i need some interaction term.
Don't think there's an issue with the continuous variables being constant within the subjects - does everything work ok if you have a design with of ~bmi + age + class, with a blocking factor of id?
I suppose you could add age:class (and/or bmi:class) interaction if you're worried that a gene's age-gradient might differ between CR states, but it's hard for me to conceptualise what a class:id interaction would add, other than looking at a subject-specific effect of class.
It sounds like you basically want to treat age and BMI as confounding factors and control for their influence, and you're not interested in the genes that are differentially expressed with respect to either one. If so, the design you probably want is just
design <- model.matrix(~class + ID)
This will allow you to test for differences between T0 and T2w while controlling for all differences between individuals. This encompasses any differences due to age and bmi, as well as any other differences between individuals.
Your proposed design matrix is far, far more complicated than it needs to be. Simply do this:
ID <- factor(rep(c(1,14,16,17,18), each=2))
class <- factor(rep(c("T0", "T2weeks"), 5))
design <- model.matrix(~ID + class)
The first five coefficients are subject-specific blocking factors that are uninteresting and can be ignored. The last coefficient represents the log-fold change in expression after two weeks on caloric restriction, compared to a zero-time control. There is no need to block on BMI or age, because this is redundant with blocking on ID. Any differences in expression due to BMI or age will be modelled by the ID blocking terms.
I have used this correlation function when i am looking at the same subjects or technical replicates but i guess the correlation function it only for the technical replicates (ones that are from the same sample but repeated ) but not for this case?
Thank you very much for your help- its been a long day!
If you block on the subject ID in the design matrix, you don't need to use duplicateCorrelation. This is because any subject-specific effects that might induce correlations in the paired samples are already modelled in the systematic component of the linear model. We generally recommend using duplicateCorrelation in situations where your blocking factor is confounded with the factor of interest. For example, if some subjects were control-only and other subjects were CR-only, then obviously you wouldn't be able to block on the subject in the design matrix, because you can't estimate the blocking factors and the treatment effect at the same time. In such cases, you would block via duplicateCorrelation.
I don't think is quite right - the class:ID term means that a coefficient will be estimated for each unique {ID, class} combination, so you'll get lots of coefficients corresponding to this term (unless you've left ID as a numeric variable - you must convert it to a factor before putting it into the design). You'd then need to drop the intercept to make all the coefficients estimable, and devise a contrast that examined your many coefficients for the class:ID combinations. Also, your dataset has a variable called SubjectID rather than ID, and the capitalisation is inconsistent. You should keep bmi and age as numeric variables, but you may want to mean-centre them so that the coefficients become more obviously interpretable
I'm guessing you want to get a T0 vs T2 comparison, accounting for the paired nature of your data, rather than a per-subject estimate of this effect (which is sort of what your original model is heading towards?). If so, then something like
~ ID + bmi + age + class
will roughly give you what you want, but you might want to investigate the duplicateCorrelation function in limma to take account of within-subject correlation.
If the OP uses that model formula, then limma will automatically remove bmi and age because they are completely confounded with ID. So he'll end up with a standard paired analysis:
Thank you so much for all your comments it is really helping with my learning of limma! I will try the ~ID + class and do abit more reading of the limma manual
Hi Gavin,
Thanks for your answer.
I was originally using duplicate correlation to take into account subject correlation but the concern i have is that BMI and age will be the same regardless of being baseline or after 2weeks and that this is not the correct thing to do and i need some interaction term.
Thanks
Danielle
Don't think there's an issue with the continuous variables being constant within the subjects - does everything work ok if you have a design with of
~bmi + age + class
, with a blocking factor of id?I suppose you could add age:class (and/or bmi:class) interaction if you're worried that a gene's age-gradient might differ between CR states, but it's hard for me to conceptualise what a class:id interaction would add, other than looking at a subject-specific effect of class.