I'm using glmLRT() from edgeR to analyze some RNA-seq data. My experimental design has one Treatment factor with three classes (A, B and C), one continuous Age covariate and one Batch factor with 19 levels.
> colnames(design)
[1] "(Intercept)" "ClassB" "ClassC" "Age"
[5] "Batch1" "Batch2" "Batch3" "Batch4"
[9] "Batch5" "Batch6" "Batch7" "Batch8"
[13] "Batch9" "Batch10" "Batch11" "Batch12"
[17] "Batch13" "Batch14" "Batch15" "Batch16"
[21] "Batch17" "Batch18" "Batch19”
This is an ANCOVA design, so when I make my comparisons between the three Treatment classes (A, B, C), I have been taught that I need to pick a specific covariate value to make the contrasts. Typically you might pick Age = 0, Age = min(Age), Age = mean(Age or Age = max(Age) for your contrasts. I've been estimating the contrasts for the mean Age. Let's suppose the mean age is 7.3 years for this example. I'm using the code
> lrt1 = glmLRT(fit,contrast=c(0,1,0,mean(Age),rep(0,19)))
... to compute my contrasts. The code runs and the output makes sense, but my collaborators would like some assurance that glmLRT() is computing the correct contrast. Unfortunately, there are no explanations of covariate effects and contrasts in the edgeR manuals or tutorials, and I can't find any information on Google either. Can glmLRT compute contrasts adjusted for a continuous covariate like this? Or am I computing nonsense here?
First, reply to answers using the "Add Comment" button.
Second, the contrast as posted makes no sense. The null hypothesis corresponding to this contrast is:
... where
m
is the mean age,classB
is the log-fold change of B over A, andAge
is the slope w.r.t. age.That is, the log-fold change of class A over B is equal to the gradient of expression w.r.t. time scaled by the mean age. This has no obvious scientific meaning to me. If you want to test for DE between classes, just drop the
Class
coefficients. Think ofAge
as a blocking factor; required for the model fit but not involved in any comparisons (at least, not between classes).Your advice to always use Age = 0 in my contrasts implies that I can only make contrasts at the Y-intercept. That's problematic if there is a significant Age*Class interaction, because the fold changes could change from minAge to meanAge to maxAge. Obviously, I should have specified the design as Y ~ Class*Age + Batch in my original email. That was my mistake. Now, I've coded an example to demonstrate what I was thinking with my contrasts. EdgeR uses a neg binomial family generalized linear model with eBayes "shrinking", but I'm going to approximate the edgeR model using Poisson regression from glm().
This is a simplified example of my edgeR analysis for one gene. ClassA and ClassB are mostly parallel, while ClassC (green) is negative and almost perpendicular to A and B. This is an obvious interaction effect. I fit the Poisson regression and plotted the regression lines. I compute 9 different predicted values for Classes A, B and C at Ages 1.8, 5.4 and 12.2, then I compute the contrasts of interest two ways. Again, if you only compute contrasts using Age = 0, then you only compare the lines at Y-int. That would only tell you part of the story where A < B < C. I specified the design and contrasts incorrectly in my initial post, but I don't think my overall idea was flawed. However, the edgeR model is more complicated than glm(), so I was hoping someone could confirm that the contrasts would also work in this way in edgeR.
For simplicity, I'm going to reparametrize your example in edgeR terms:
The renaming just keeps
makeContrasts
happy later. Looking at the column names gives us:The first three terms are the intercepts for each group, while the last three terms are the gradients with respect to age. If you want to test for DE between classes at a specific timepoint
X
, you would do:... which may or may not be interesting, depending on the importance of
X
. I presume you wanted to usemean(Age)
asX
in your original post, but there's no statistical reason that the mean age would be of particular importance.Usually, the contrast of greater interest/relevance is whether there is any DE between classes at any time point:
The null hypothesis here would be that the slope and intercept are equal for both classes.
Thank you. This is what I was looking for. You are right that mean(X) isn't especially important, but it is a helpful way to code "compare the groups somewhere in the middle of the range of X" without specifying an exact X value. Usually the 3 places you would want to look are
1. Somewhere near min X or zero
2. Somewhere near mean X or median X or any where in the middle
3. Somewhere near max X
Looking at the contrasts near min X and max X will allow you to look at the two higher leverage areas of your experiment. If the fold change difference are going to change direction, they are most likely to do it between min X and max X. Looking somewhere in the middle will get you a comparison away from the high leverage areas, in an area where the regression lines are most likely to cross.
What's the null hypothesis here? I don't see the statistical necessity of testing a whole bunch of points in the covariate range - at least, not with respect to the leverages. If the scientific question involves those specific ages, then I could understand; but if you want to make a statement about the differences in slopes/intercepts between classes, it would be better/simpler to do that directly. I usually just check the expression profiles manually for any genes of interest to make sure that they're not being driven by a few influential observations (or, in extreme cases, I repeat the DE analysis after tossing those observations out, to ensure that the results are robust).