Hi,
I am dealing with continuous covariate in my design matrix as well. I am using edgeR, I have created a following design matrix:
design1 <- model.matrix(~groups+groups:age)
where there are four different groups.
So, colname(design1)
would give:
(Intercept) group2 group3 group4 group1:age group2:age group3:age group4:age
If I were to do a comparison such as: lrt <- glmLRT(fit, coef=2)
, what am I comparing? Am I merely comparing group1 and group2? From what reading through forum questions/answers, it seems like this comparison is affected by the age covariate included in the design matrix.
But if I were to create a design matrix: design2 <- model.matrix(~group)
, and perform the same comparison between group1 and group2, would that give a same result?
Also, using design1 again, if I were to compare group2:age with intercept (coef=6), would that be testing for age effect on group2? And comparing group1:age and group2:age would be testing for the differences in age effect between group1 and group2?
Lastly, if I were to include another continuous variable(s) as covariates into the design matrix such that:
design3 <- model.matrix(~groups+groups:age+continuous1+continuous2)
in order to adjust for the covariates continuous1 and continuous2, other than being adjusted for them, would they affect how the above comparisons turn out.
My apologies for asking a lot of questions. I am quite confused by the usage of design matrix, especially with continuous variables interaction. If there is any confusion regarding the questions I ask, please tell me. Thank you.
Thank you Aaron for your reply. I think I better understand how the design matrix works now.
I do agree that given the continuous covariate is age, it wouldn't make sense to compare the intercept between group1 and group2. However, this was merely an example - the continuous covariate is glucose levels in patients and I want to find out the difference in effect of glucose levels between group1 and group 2 (for instance) - as you mentioned, this will be comparing
group1:age
togroup2:age
.I have tested the design matrix(s) and it looks like the addition of continuous variables severely diminishes the effect of glucose on each group. Using
design3
, I have obtained only 10 DE genes whencoef=6
was done (group2:glucose
against intercept) but the number of DE genes increased to ~300 when design1 was used withcoef=6
(group2:glucose
against intercept) again. I wonder what this means in terms of interaction between continuous covariates and glucose?I'll guess that one or both of the continuous covariates are quite strongly correlated with glucose levels, at least for group 2. Try plotting the former against the latter and see if any relationship shows up between them.
Thank you! I will try that.
Hi Aaron,
My apologies for trying to revive an old thread but I have a question related to what I asked above.
You mentioned that the decrease in the number of DE genes with the addition of continuous covariates may be due to it being strongly correlated with glucose levels. Since then, I have plotted the glucose levels against the continuous covariates but found no strong correlation. If this is the case, what does the decrease in DE genes imply?
Alternatively, if the addition of continuous covariates does nothing (no change in DE genes), then I would omit that continuous covariate from the model, and if it increase the number of DE genes, I would include it. Is that a right step for analysis?
It's just that I have multiple covariates that I want to adjust for, and I'm not entirely sure what is optimal.
Thank you again for your help.
As for the correlation between glucose and the continuous covariates - there's two covariates, so you'll get a decrease if there's a correlation between glucose and any linear combination of those covariates, even if there is no strong correlation between glucose and each individual covariate. Try including each covariate separately into the design matrix and dropping
coef=6
, and see if you get fewer DE genes. Other than that, the only thing I can think of as that you're losing power from the loss in residual d.f. when you include two extra terms. This should only be relevant if your residual d.f. is already low, i.e., the number of parameters is close to the number of samples.The "optimal" model is the simplest one that allows you to make useful inferences. Include all the factors that are important to interpretation of the experiment. If the dispersion estimates are low, and if you get a decent number of DE genes, then that model is probably good enough. If you've missed an important systematic effect, this will manifest as an inflated dispersion and fewer DE genes, so you can try adding a blocking factor to account for it; if it gives you lower estimates and more DE genes, then it improves the fit and usefulness of the model, so you should keep it in. If not, then leave it out.
Thank you Aaron,
Pardon my lack of knowledge but would adding a blocking factor be the same as adding a covariate (additive model)? If I were to adjust for gender differences, how would considering it as a blocking factor be different from adding it as a covariate?
Also, I'm facing a slight conundrum where I'm comparing between four groups (two at a time) but depending on what continuous covariate I'm adding, the number of DE genes can change, and the change is different between different comparisons. To be more specific, addition of the second covariate instead of first covariate leads to increase in the number of DE genes for Group A vs Group B, but decrease in the number of DE genes for Group B vs Group C. Would I have to decide on one model to fit every comparison, but can I make a change in the model depending on the comparison?
A factor would be a categorical variable, whereas a covariate would be a continuous variable. Sex is either male or female, hence a factor - other variables like age, height, etc. are continuous, hence they are covariates. It doesn't make sense to consider sex as a covariate, unless you have a good quantitative reason to describe an individual as being, e.g., half-male or half-female.
From the sounds of it, you're getting DE genes in each comparison, regardless of the model. Given that they're both working, I wouldn't bother fitting a different model for each comparison. Otherwise, it would become tricky to interpret, especially if you wanted to make a statement about the relative amount of DE in A vs B compared to B vs C. As to which model to use, ask yourself whether the covariates are biologically important, or important to the interpretation of the experiment. If they're not, then leave them out, because they don't seem to provide any extra benefit with respect to more DE genes or lower dispersion estimates.
Got it. I was confused with the terms used.
And yes, single model sounds like a good idea for what I'm trying to achieve. Thank you again.
Hi Aaron,
I am sorry to revive an old thread but I have a question related to con1 and con2. I did not get the difference between the two comparisons. coef 1 of con1 and con2 should output the same results, am I wrong?
Keifa
I have no idea of what you actually did. Please post a new question with your sample metadata and code.