I have conducted an analysis differential expression analysis using edgeR, but am receiving some criticism of the design of the statistical tests. I am hoping someone can either verify that the approach I have taken is valid, or that I should be doing the type of analysis that has been suggested. Some background on the experiment:
I am interested in gene expression changes when individuals from two different populations are fed on different food sources. Specifically, each population typically specializes on a particular type of food that differs between the populations. The experimental design involved feeding individuals from each population on their normal food source as well as the food source typically used by the other population. In addition, each population was also fed an artificial food source.
Population: two levels
Food: three levels
Each combination had three biological replicates.
There are specific pre-planned comparisons I am interested in:
- Compare gene expression between populations on each of the two typical food sources (I am not interested in comparing the populations on the artificial food source).
- Compare gene expression on the food sources within each population (I am only interested in one comparison involving the artificial food source though)
- Compare gene expression between the populations across all food sources
Following the recommendations in the edgeR documentation (as I understood them in section 3.3.1), I set this up so that each treatment combination was a group and then made contrasts corresponding to the comparisons listed above. One of the follow-up goals was to look at reaction norms for DE genes from comparison one, looking for genes following a few specific patterns (basically constructing a generic interaction plot, though some of the patterns I’m looking for would actually involve no interaction). I reconstructed these by examining overlapping DE gene lists produced by comparison one and two. So, for example, I used comparison one to plot the relative location of points for each food in each population (e.g. spaced apart if DE or on top of each other if not DE), and comparison two to connect the lines (e.g. upward sloping, downward sloping, or flat if the gene was not DE in comparison two). Note: specific values are not of interest, but rather just the general pattern.
It has been suggested that there is a problem because I am not examining interactions formally, and using this as a starting point to then look at the reaction norms. I interpret this to mean that I should analyze the data in a traditional full factorial format where a significant interaction term would give license to further dissect the interaction. The first issue I have with this is that the artificial food source was only included for a few specific comparisons, and I wouldn’t be interested in some of the comparisons using this group that would naturally come out of a factorial analysis. I suppose that I could look at those experiments separately and then analyze the data with the other two foods in a factorial design. However, ultimately I think I would end up needing to construct the contrasts (one and two) that I set-up directly in the first place in order to investigate the reaction norms. Moreover, some of the comparisons would not come out as directly if I used the factorial method. For example, to get all the genes that differ between populations on one food source I would have to pull some of the genes from the interaction list and also the population main effect list. My understanding from the manual (and more so from the limma manual) is that these two approaches would basically be equivalent. But, I want to make sure I am not missing something by not doing the full factorial analysis as obviously I would like to do the most appropriate analysis. Is the approach I have taken valid? Am I really “ignoring” the interaction by the approach I have taken as has been suggested? Is there anything to be gained in this case by doing the full factorial and looking at interactions specifically? If so, I may need to ask a follow-up regarding specifically how to set this up. I apologize for the long post, and greatly appreciate any help anyone can give.
Thanks,
Jeremy
Jenny and Gordon,
Thanks to both of you for responding, these are very helpful answers. If I want to also make a contrast to look for a food effect (I only care about the two normal foods) would it be:
(P1.F1 +P2.F1) - (P1.F2 + P2.F2)
Likewise, would a population effect be:
(P1.F1 + P1.F2) - (P2.F1 + P2.F2)
Thank you very much for your attention to this, I really appreciate your willingness to help. Thanks!
You're close. If you want the traditional main effect of food (average of F1 - minus average of F2), you'd need:
((P1.F1 +P2.F1) - (P1.F2 + P2.F2)) / 2
Likewise, the main population effect would be:
((P1.F1 + P1.F2) - (P2.F1 + P2.F2)) /2
Jenny
Sorry to ask another question, but I'm wondering how to write an interaction contrast that includes all three levels of food (i.e. as above but with the artificial food included)? I was unable to find an example of this in edgeR or Limma documentation. I am interested in gene expression differences between populations that are consistent across all three foods, so I think I need to look at the interaction before doing the population comparison. Thank you once again for your time, I really appreciate it!
Following on from the answers above, the contrast that includes all three levels of food would look like:
Note that this mightn't necessarily find differences between populations that are consistent across foods. For example, the contrast above might detect a DE gene that is up between populations for F1 and F2, but down for F3. If you want to guarantee consistent differences, you'll have to do contrasts separately for each food type and intersect the resulting DE lists with the same sign of log-fold change.
Thank you for the advice. Just to clarify, I have done the contrast you described above, but since I'm also interested in consistent differences across foods I thought I should look at the interaction contrast between population and food that includes all three levels of food. My idea was that any gene that was significant in the population contrast but did not have a significant interaction could be considered consistently different between the populations. I'm just not sure how to write that interaction contrast, so would appreciate any guidance. Also, would this approach be reasonable or would doing the contrasts separately be a better approach? Thanks!
Aaron's contrast gives the overall main effect of population 1 vs. population 2, averaged over all 3 food levels, but as he said this may not give you the genes that are consistently different across all 3 food types. However, instead of intersecting the separate contrasts within each food type, you can calculate the overall interaction term by making 2 interaction terms between the 3 food types and then getting an F-statistic on both together :
Then look for the genes that have a significant main effect of population but a non-significant overall interaction term.
Jenny
Great, that's exactly what I was hoping to do. Thanks to both of you. Your time spent on these forums is really appreciated.
In sitting down to do this, we realized the instructions above appear to be for Limma. We would like to do this in edgeR since all the rest of the analyses are done with it. How would we write this overall contrast in edgeR? Thanks!
The
glmLRT
function in edgeR will happily accept a contrast matrix generated withmakeContrasts
. So, if you've already runglmFit
to get afit
object, all you have to do is:assuming that you want to do the same thing that Jenny's code does for
limma
.Ok, after doing the analyses suggested I've run into some additional issues that I hope someone can weigh in on. Two suggestions were given above about how to find genes with consistent differences between the populations across the three food types. One involved doing three separate contrasts on each food type and intersecting the gene lists. The other involved finding genes with a significant population effect but no significant population x food interaction. I initially thought I would do the latter analysis. However, when I do this, I run into the following issue: I find 211 genes with a population effect but no interaction--so I could concluded that these show consistent differences across populations. However, I'm also interested in reporting some of the specific comparisons between populations on individual food types. For these I get fewer genes that are significantly differentially regulated (e.g. comparing the populations on food type one I find 81 genes). It is a little non-intuitive that I could have 211 genes with a consistent difference across food types when a comparison on one food type has only 81 (it seems as though that should be the upper-limit for consistent differences). I think I understand how this can happen since the tests are constructed differently even though they are aimed at answering the same question. But, this is making me think I might be better off intersecting the three gene lists instead of the other approach so that I don't report this seemingly non-intuitive result. I just want to verify that either of these ways of doing it would be statistically sound. Is there an advantage of one way over the other (e.g. multiple testing issues), or are they both reasonable approaches to get at this question that could be justified? Thank you!
Actually, it's not unreasonable to get more genes with a significant overall population effect, compared to a comparison between populations on one specific food type. Consider a gene for which there is a small but consistent difference between populations at all food types. The small difference would not be enough to reject the null at any single food type; however, if you combine the evidence across all food types, you've now got enough evidence for significance.
In contrast, the intersection-based approach would never detect more significant genes than any one of the comparisons on the single food types. This is obvious, as the intersection of multiple sets cannot be larger than any one of the contributing sets.
In general, both strategies are reasonable for solving your question. I prefer to use the intersection approach, as I feel that it's easier to interpret (albeit at the cost of conservativeness). Also, the interaction-based method requires you to identify genes with no population-food interaction. This is somewhat difficult to do in a statistically rigorous manner, as the machinery in edgeR is built to detect significant differences rather than significant similarity.