Question

Computing contrasts with a covariate using edgeR glmLRT()

0

Entering edit mode

Jeff Skinner • 0

@jeff-skinner-12831

Last seen 7.0 years ago

USA / Rockville MD

I'm using glmLRT() from edgeR to analyze some RNA-seq data. My experimental design has one Treatment factor with three classes (A, B and C), one continuous Age covariate and one Batch factor with 19 levels.

> colnames(design)
 [1] "(Intercept)"    "ClassB"        "ClassC"         "Age"                       
 [5] "Batch1"         "Batch2"        "Batch3"         "Batch4"        
 [9] "Batch5"         "Batch6"        "Batch7"         "Batch8"         
[13] "Batch9"         "Batch10"       "Batch11"        "Batch12"           
[17] "Batch13"        "Batch14"       "Batch15"        "Batch16"           
[21] "Batch17"        "Batch18"       "Batch19”

This is an ANCOVA design, so when I make my comparisons between the three Treatment classes (A, B, C), I have been taught that I need to pick a specific covariate value to make the contrasts. Typically you might pick Age = 0, Age = min(Age), Age = mean(Age or Age = max(Age) for your contrasts. I've been estimating the contrasts for the mean Age. Let's suppose the mean age is 7.3 years for this example. I'm using the code

> lrt1 = glmLRT(fit,contrast=c(0,1,0,mean(Age),rep(0,19)))

... to compute my contrasts. The code runs and the output makes sense, but my collaborators would like some assurance that glmLRT() is computing the correct contrast. Unfortunately, there are no explanations of covariate effects and contrasts in the edgeR manuals or tutorials, and I can't find any information on Google either. Can glmLRT compute contrasts adjusted for a continuous covariate like this? Or am I computing nonsense here?

edger contrast covariate glmlrt() • 3.6k views

ADD COMMENT • link 7.0 years ago Jeff Skinner • 0

score 0 · Answer 1 · 2017-04-12

You are computing nonsense. The intercept and age covariates are estimating the intercept and slope for the A class samples. The ClassB and ClassC covariates are estimating the difference in the intercept between the B and A and C and A groups, respectively. Since you don't have an interaction term for age and group, you are constraining the slope for the B and C classes to be parallel to the A class, so the fold change between any of those three groups will be the same, regardless of the age you choose!

score 0 · Answer 2 · 2017-04-12

0

Entering edit mode

Jeff Skinner • 0

@jeff-skinner-12831

Last seen 7.0 years ago

USA / Rockville MD

Yeah, I noticed that after I posted it. Still, that would make the covariate coefficient of mean(Age) = 7.3 unnecessary, but it would not necessarily make the contrast itself nonsense. But suppose I had specified the interaction term here, would the contrast calculation represent an adjustment to make the comparison at covariate Age = 7.3? Or would it simply be a weird ratio in the contrast statement?

ADD COMMENT • link 7.0 years ago Jeff Skinner • 0

0

Entering edit mode

First, reply to answers using the "Add Comment" button.

Second, the contrast as posted makes no sense. The null hypothesis corresponding to this contrast is:

ClassB + m*Age = 0 # or
         m*Age = -ClassB

... where m is the mean age, classB is the log-fold change of B over A, and Age is the slope w.r.t. age.

That is, the log-fold change of class A over B is equal to the gradient of expression w.r.t. time scaled by the mean age. This has no obvious scientific meaning to me. If you want to test for DE between classes, just drop the Class coefficients. Think of Age as a blocking factor; required for the model fit but not involved in any comparisons (at least, not between classes).

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Your advice to always use Age = 0 in my contrasts implies that I can only make contrasts at the Y-intercept. That's problematic if there is a significant Age*Class interaction, because the fold changes could change from minAge to meanAge to maxAge. Obviously, I should have specified the design as Y ~ Class*Age + Batch in my original email. That was my mistake. Now, I've coded an example to demonstrate what I was thinking with my contrasts. EdgeR uses a neg binomial family generalized linear model with eBayes "shrinking", but I'm going to approximate the edgeR model using Poisson regression from glm().

# Sample Data

Y      = c(7,8,8,10,12,11,13,15,16,18,15,13,10,9,6)
lY     = log(Y)
Class  = factor(c(rep("A", 5),rep("B", 5),rep("C",5)),levels=c("A","B","C"))
Age    = c(0.6,2.1,3.3,4.2,7.7,2.9,5.4,6.5,8.0,12.8,1.4,4.7,6.8,8.5,13.1)

# Basic Plot

plot(x=Age,y=lY,pch=c(rep(17,5),rep(16,5),rep(15,5)),
col=c(rep("orange",5),rep("purple",5),rep("green",5)),
cex=2,ylim=c(1.6,3.0),yaxt="n",xlim=c(0,14),ylab="Count")
axis(side=2,at=log(c(5,10,15,20)),labels=c(5,10,15,20))
legend("bottomleft",c("A","B","C"),pch=17:15,col=c("orange","purple","green"),
title="Class")

# Fit Poisson regression model with glm()

poisfit  = glm(Y ~ Class*Age,family=poisson)
poiscoef = poisfit$coef
summary(poisfit)

# Compute regression equations for A, B and C

A.reg  = poisfit$coef[c(1,4)]
B.reg  = c(sum(poisfit$coef[1:2]),sum(poisfit$coef[4:5]))
C.reg  = c(sum(poisfit$coef[c(1,3)]),sum(poisfit$coef[c(4,6)]))

# Add regression lines to plot

abline(A.reg,col="orange")
abline(B.reg,col="purple")
abline(C.reg,col="green")

# Suppose I compute contrasts at Age = 1.8, Age = 5.4 and Age = 12.2

abline(v=c(1.8,5.4,12.2),lty=3,col="red")

A0     = sum(poiscoef*c(1,0,0,0,0,0))
B0     = sum(poiscoef*c(1,1,0,0,0,0))
C0     = sum(poiscoef*c(1,0,1,0,0,0))
A1.8   = sum(poiscoef*c(1,0,0,1.8,0,0))
B1.8   = sum(poiscoef*c(1,1,0,1.8,1.8,0))
C1.8   = sum(poiscoef*c(1,0,1,1.8,0,1.8))
A5.4   = sum(poiscoef*c(1,0,0,5.4,0,0))
B5.4   = sum(poiscoef*c(1,1,0,5.4,5.4,0))
C5.4   = sum(poiscoef*c(1,0,1,5.4,0,5.4))
A12.2  = sum(poiscoef*c(1,0,0,12.2,0,0))
B12.2  = sum(poiscoef*c(1,1,0,12.2,12.2,0))
C12.2  = sum(poiscoef*c(1,0,1,12.2,0,12.2))

# Check mean responses against regression line

points(x=rep(0, 3),y=c(A0,     B0,   C0),   pch="i",col="red")
points(x=rep(1.8, 3),y=c(A1.8, B1.8, C1.8), pch="1",col="red")
points(x=rep(5.4, 3),y=c(A5.4, B5.4, C5.4), pch="2",col="red")
points(x=rep(12.2,3),y=c(A12.2,B12.2,C12.2),pch="3",col="red")

# Compute 9 contrast differences using means

D1     = B1.8  - A1.8
D2     = C1.8  - A1.8
D3     = C1.8  - B1.8
D4     = B5.4  - A5.4
D5     = C5.4  - A5.4
D6     = C5.4  - B5.4
D7     = B12.2 - A12.2
D8     = C12.2 - A12.2
D9     = C12.2 - B12.2

# Compute 9 contrasts using model coefficients

d1     = sum(poiscoef*c(0,1,0, 0,1.8,0))
d2     = sum(poiscoef*c(0,0,1, 0,0,  1.8))
d3     = sum(poiscoef*c(0,-1,1,0,-1.8,1.8))
d4     = sum(poiscoef*c(0,1,0, 0,5.4,0))
d5     = sum(poiscoef*c(0,0,1, 0,0,5.4))
d6     = sum(poiscoef*c(0,-1,1,0,-5.4,5.4))
d7     = sum(poiscoef*c(0,1,0, 0,12.2,0))
d8     = sum(poiscoef*c(0,0,1, 0,0,12.2))
d9     = sum(poiscoef*c(0,-1,1,0,-12.2,12.2))

data.frame(DiffMeans=c(D1,D2,D3,D4,D5,D6,D7,D8,D9),
Contrasts=c(d1,d2,d3,d4,d5,d6,d7,d8,d9))

This is a simplified example of my edgeR analysis for one gene. ClassA and ClassB are mostly parallel, while ClassC (green) is negative and almost perpendicular to A and B. This is an obvious interaction effect. I fit the Poisson regression and plotted the regression lines. I compute 9 different predicted values for Classes A, B and C at Ages 1.8, 5.4 and 12.2, then I compute the contrasts of interest two ways. Again, if you only compute contrasts using Age = 0, then you only compare the lines at Y-int. That would only tell you part of the story where A < B < C. I specified the design and contrasts incorrectly in my initial post, but I don't think my overall idea was flawed. However, the edgeR model is more complicated than glm(), so I was hoping someone could confirm that the contrasts would also work in this way in edgeR.

ADD REPLY • link 7.0 years ago Jeff Skinner • 0

0

Entering edit mode

For simplicity, I'm going to reparametrize your example in edgeR terms:

design <- model.matrix(~0 + Class + Class:Age)
colnames(design) <- sub(":", "_", colnames(design))

The renaming just keeps makeContrasts happy later. Looking at the column names gives us:

[1] "ClassA"     "ClassB"     "ClassC"     "ClassA_Age" "ClassB_Age"
[6] "ClassC_Age"

The first three terms are the intercepts for each group, while the last three terms are the gradients with respect to age. If you want to test for DE between classes at a specific timepoint X, you would do:

con <- makeContrasts((ClassA + X * ClassA_Age) - 
                     (ClassB + X * ClassB_Age),  levels=design)

... which may or may not be interesting, depending on the importance of X. I presume you wanted to use mean(Age) as X in your original post, but there's no statistical reason that the mean age would be of particular importance.

Usually, the contrast of greater interest/relevance is whether there is any DE between classes at any time point:

con <- makeContrasts(ClassA_Age - ClassB_Age, ClassA - ClassB, levels=design)

The null hypothesis here would be that the slope and intercept are equal for both classes.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you. This is what I was looking for. You are right that mean(X) isn't especially important, but it is a helpful way to code "compare the groups somewhere in the middle of the range of X" without specifying an exact X value. Usually the 3 places you would want to look are

1. Somewhere near min X or zero

2. Somewhere near mean X or median X or any where in the middle

3. Somewhere near max X

Looking at the contrasts near min X and max X will allow you to look at the two higher leverage areas of your experiment. If the fold change difference are going to change direction, they are most likely to do it between min X and max X. Looking somewhere in the middle will get you a comparison away from the high leverage areas, in an area where the regression lines are most likely to cross.

ADD REPLY • link 7.0 years ago Jeff Skinner • 0

0

Entering edit mode

What's the null hypothesis here? I don't see the statistical necessity of testing a whole bunch of points in the covariate range - at least, not with respect to the leverages. If the scientific question involves those specific ages, then I could understand; but if you want to make a statement about the differences in slopes/intercepts between classes, it would be better/simpler to do that directly. I usually just check the expression profiles manually for any genes of interest to make sure that they're not being driven by a few influential observations (or, in extreme cases, I repeat the DE analysis after tossing those observations out, to ensure that the results are robust).

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k