Search
Question: Different results in edgeR using simple vs GLM
3
2.8 years ago by
dietahanson30
McGill University, Montreal, Canada
dietahanson30 wrote:

Hi there,

I'm using edgeR version 3.10.2 to analyze my RNAseq data. I have a design in which there are two factors (habitat and watershed) with two levels in each factor. I have 3 replicates for each of the four groups, to make 12 samples total. I am also interested in the interaction between the two factors. To test for differentially expressed genes between the habitat levels, the watershed levels, and the interaction between them, I am using a design matrix (shown below) modelled on section 3.3.4 of the edgeR users guide:

(Intercept) habStream shedRoberts habStream:shedRoberts
ML1           1         0           0                     0
ML2           1         0           0                     0
ML3           1         0           0                     0
MS1           1         1           0                     0
MS2           1         1           0                     0
MS3           1         1           0                     0
RL1           1         0           1                     0
RL2           1         0           1                     0
RL3           1         0           1                     0
RS1           1         1           1                     1
RS2           1         1           1                     1
RS3           1         1           1                     1

To first look at genes differentially expressed between the habitat levels, I use:

lrt.hab <- glmLRT(fit, coef=2)
topTags(lrt.hab)

because the second model coefficient is for the habitat term. This returns a list of the top de genes.

However, I was interested to see what would happen if I redid the analysis as a simple comparison between the two habitat levels, ignoring the other factor. To do this, I use the exactTest function. When I get the list of top de genes for this analysis, the first gene is the same as in the previous analysis, but all the others are different.

My question is, when using the glmLRT, is it working like a 2-way ANOVA in that it is keeping the other factor (watershed) constant while testing for de between habitats? Is the the reason I am getting two different lists? In that case, which method is more statistically sound?

Thanks for your help,

Dieta

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] statmod_1.4.21 edgeR_3.10.2   limma_3.24.15

loaded via a namespace (and not attached):
[1] tools_3.2.0   splines_3.2.0

modified 2.7 years ago • written 2.8 years ago by dietahanson30
2
2.8 years ago by
United States
James W. MacDonald46k wrote:

I wouldn't use that terminology, but when you fit the larger model, and contingent upon the lack of an interaction, the main effects are estimating one effect while controlling for the other effect. Obviously if there is a significant interaction term, then you cannot control for the other effect, and you cannot then interpret the main effects.

But if you don't have a significant interaction, then you are increasing your degrees of freedom, which increases power to detect differences.

2
2.8 years ago by
Yunshun Chen370
Australia
Yunshun Chen370 wrote:

If the watershed effect is included in the design, glmFit and glmLRT will estimate the effect and account for it in the test.

Suppose your two levels of habitat are hab1 and hab2, and the two levels of watershed are shed1 and shed2. According to your design, the second model coefficient is not for the overall habitat term. It actually represents the habitat effect within the group of shed1. So the list of DE genes from glmLRT are DE between hab1 and hab2 within shed1. However, in the exact test you ignore the watershed effect. So you got a list of genes DE between hab1 and hab2 in both shed1 and shed2.

You should probably make a MDS plot first to see whether there is any evidence of the watershed effect. If there is, you should probably include it in the design to increase the testing power.

You can still test for the overall habitat effect while accounting for the watershed effect. However, it would be a lot easier to use a one way layout where 12 samples are assigned into four groups, one hab-shed combination for each. Then you can compare the average of the two hab1 groups with the average of the two hab2 groups using glmLRT with a contrast.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Yunshun Chen370
0
2.8 years ago by
dietahanson30
McGill University, Montreal, Canada
dietahanson30 wrote:

That makes sense--I didn't even think of the fact of increasing the degrees of freedom.

Thanks a lot for your help!

1

I am going to have to disagree with certain aspects of both my and Yunshun's answers.

First, I didn't read your question closely, and didn't understand that you were fitting an exactTest() on all your data, ignoring the watershed. In that case you have more degrees of freedom than the factorial model you fit, and the differences are probably mostly due to biases that arise because you are ignoring important information. So Yunshun makes a good point that you should look at MDS plots to see if there are big differences due to watershed.

Second, I disagree with Yunshun's contention that the second coefficient measures the difference between the two habitats within one watershed. While in some technical sense this is true (you are setting one watershed as the 'baseline', and making comparisons based on that), this is still a main effect and is intended to measure the average difference between habitats after controlling for the average difference between watersheds. In other words, you could relevel() your watershed factor and re-fit the model and the estimated values of your habitat coefficient wouldn't change.

HOWEVER, this is only true for those genes that do not have a significant interaction term, and in the case that you don't actually fit the interaction. If the interaction term is significant, then the second coefficient isn't interpretable (neither is the third coefficient). It neither represents the habitat effect within a given watershed, nor a main effect. It's uninterpretable.

On the other hand, for those genes where you can drop the interaction term, you should do so, in which case the second term should be interpreted as the main effect of habitat (where you have blocked on watershed to control for inter-watershed differences).

This is a tricky thing that I am not sure many people think about when doing genomics analysis. If you fit a factorial model with an interaction, you cannot then look at the top results for the main effects because if there is a significant interaction, then the main effects are not interpretable. You could hypothetically fit your model, find the genes with a significant interaction, then exclude those genes and fit a 2-way ANOVA without an interaction on the remaining genes.

Or you could do the arguably smarter thing, which is what Yunshun recommends, and fit a one-way layout. In that scenario, you can still fit an interaction using a contrast, and you can make comparisons within the two different watersheds (or within streams). Comparisons between the two watersheds (or between streams, averaging over watersheds) can still be a bit fraught, as the average of habitat1 vs habitat2 is still uninterpretable if there is a significant interaction.

ADD REPLYlink written 2.8 years ago by James W. MacDonald46k

I'm pretty sure that with the parametrization shown, the second coefficient does indeed measure the difference between habitats within the baseline watershed, not the main effect of habitat across all watersheds. If you used sum-to-zero contrasts for watershed, then I think the second coefficient would represent the average effect across all watersheds (which is still not quite the same thing as the main effect unless the design is perfectly balanced).

But in any case, I definitely agree that using the one-way design avoids all this confusion entirely and is probably the preferred way to do this analysis.

ADD REPLYlink written 2.8 years ago by Ryan C. Thompson6.8k
1

I agree with Ryan and Andy in the interpretation of the second coefficient, in that it represents the log-fold change between MS and ML only. This is because the log-fold change between RL and RS will be modelled as a linear combination of the second and fourth coefficients. The second coefficient could take on any value it wants, without affecting the ability of the model to fit to the observed RL-RS log-fold change (as the fourth coefficient picks up the slack, so to speak). This suggests that the value of the second coefficient does not depend on what RL-RS are doing, only on the behaviour of ML-MS. Thus, the second coefficient shouldn't be interpreted as a main habitat effect in the full model.

Either that, or we're all wrong, and we'll have enough egg on our faces to make a decent-sized omelette.

Edit: Andy is Yunshun, by the way. That's why you never see them both in the same room at the same time.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Aaron Lun20k

Ugh. You're right. I'll scrape the egg off of my face and cook the omelette. And now you know why I never fit 2-way ANOVA...

ADD REPLYlink written 2.8 years ago by James W. MacDonald46k

Good to know that there will be no egg on my face...

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Yunshun Chen370
0
2.7 years ago by
dietahanson30
McGill University, Montreal, Canada
dietahanson30 wrote:

Thanks everyone for your responses (I didn't realize until now that the conversation had continued after James' response, so sorry for the late reply--I really appreciate your answers!).

I should clarify exactly what it is I am trying to accomplish by fitting the model that I did. I am looking for genes that show parallel differential expression in both watersheds. For other phenotypic traits, this has been done by fitting a similar 2-way ANOVA. If there is a significant interaction, the trait is interpreted as being non-parallel, but if the interaction is not significant AND the habitat term is significant, it is interpreted as being parallel. So, my idea was to basically do the same thing with my expression data--fit the model to each gene, throw out the genes that have a significant interaction, then look for genes that have a significant habitat term.

Using that method left me with a list of 196 genes. I then used a method that has been used in a few papers with similar experimental setups where two separate exact tests are used (one for each watershed), then genes which are DE in both tests and share the same sign for log-FC are considered to be parallel. Using this second method leaves me with a list of 30 genes, of which 27 are shared with method 1. So, I was worried that my original idea/method was a lot less conservative, but I didn't know if those extra 169 genes it found are type 1 errors, or if the second method just has a lot of type 2 errors.

But now if I understand correctly, you are saying that in my method 1, the habitat coefficient is only testing for DE between ML and MS. In which case, it is understandable why I'm getting so many more genes than in method 2, where they have to be DE in both watersheds. Edit: although now that I look at the design matrix, it looks to me as if the 2nd coefficient is indeed testing for a general effect of habitat, not just ML vs MS--both MS and RS have "1" in the 2nd column of the design matrix and the column itself is called "habStream". If it were only testing ML vs MS, shouldn't only MS have "1"'s and wouldn't the column be "MStream"? Maybe I'm just misinterpreting the design matrix, but intuitively that's how I would read it.

In response to the suggestion for a MDS plot--I did make one and the main axis of variation separates the two watersheds, and the second axis separates the habitats. So I know that there is some parallel expression going on here, I'm just trying to find exactly which genes (another thing I have to do is see if the genes I found with either method load highly on the second MDS axis, which they hopefully will).

I'm not 100% confident, but I think using Yunshun's method of testing average expression in lakes vs. streams wouldn't be able to test specifically for parallel expression. I can imagine a case where on average, expression is higher in one environment than in the other, but the magnitude of that difference might be significantly different in the two watersheds, meaning that expression changes would not be parallel (interaction present).

And thoughts/feedback would be greatly appreciated, again.

2

I assure you, the second coefficient in your original design matrix is not testing for a general habitat effect. The name is misleading, as is the fact that RS has entries of 1's. Consider a case where a gene is upregulated in MS compared to ML, and downregulated in RS compared to RL. This will be modelled by a large positive second coefficient and a large negative fourth coefficient. If you drop the second coefficient under the belief that it represents a general habitat effect, you would incorrectly conclude that the change in habitat resulted in upregulation, even though there is downregulation in watershed R (that could even be stronger than the upregulation in watershed S, i.e., there may be a net downregulation effect due to habitat across both watersheds, which would be totally contrary to your conclusion). Thus, you cannot interpret the second coefficient as the main habitat effect.

While I'm here, you can identify parallel effects easily with your current design, by feeding contrast vectors to glmLRT:

con1 <- c(0,0,0,1) # To test for interaction effects.
con2 <- c(0,1,0,0) # To test for significant MS-ML differences.
con3 <- c(0,1,0,1) # To test for significant RS-RL differences.

Find the genes that aren't significant in con1, and intersect them with those that are significant in con2 and con3. There's no need to fit multiple design matrices, which really complicates your analysis pipeline (as you then have to re-estimate the dispersions, keep track of the different matrices and their corresponding DGEList and DGEGLM objects, etc.).

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Aaron Lun20k

Thanks for your answer, I think I understand now what the second coefficient is testing (though the name is misleading!). Those three contrasts then are exactly what I'm looking for--glad to know I can just use my original design matrix.

Thanks again!