Question

EdgeR - RNA-Seq time course analysis - Testing main and interaction effects and Post hoc Test

0

Entering edit mode

padmaswami15 • 0

@padmaswami15-9002

Last seen 8.6 years ago

United States

Hi All,

I am using edgeR to analyze data from my RNA-Seq time course (4 time points) experiment.

This is my experimental design. It is a factorial design 2*2*4*2*2 with a total of 64 samples (16 samples with 2 replicates each so a total of 32 samples from year1 and 16 samples with 2 replicates each so a total of 32 samples from year 2) comprising of four factors. They are Genotypes with 2 levels, Tissues with 2 levels, Time points with 4 levels and Year with 2 levels. I have the following code to create a design matrix and fit negative binomial GLM and conduct likelihood ratio test.

The following is my code using EdgeR package:

library(edgeR)

#The following is the count data
year<-read.csv("C:/berryallsamplestotalexonscountyear.csv",row.names="Symbol")

group<-factor(c("FP1","FP1","FP2","FP2","FP3","FP3","FP4","FP4","FS1","FS1","FS2","FS2","FS3","FS3","FS4","FS4","MP1","MP1","MP2","MP2","MP3","MP3","MP4","MP4","MS1","MS1","MS2","MS2","MS3","MS3","MS4","MS4","FPA","FPA","FPB","FPB","FPC","FPC","FPE","FPE","FSA","FSA","FSB","FSB","FSC","FSC","FSE","FSE","MPA","MPA","MPC","MPC","MPD","MPD","MPE","MPE","MSA","MSA","MSC","MSC","MSD","MSD","MSE","MSE"))
# Note: labels FP1 until MS4 denotes year 1 samples,FPA until MSE denotes year 2 samples

y<-DGEList(counts=year,group=group)
y
target<-read.csv("C:/targetyear.csv",header=TRUE)
target
Genotype<-factor(target$Genotype)
Tissue<-factor(target$Tissue)
Timepoint<-factor(target$Timepoint)
Year<-factor(target$Year)

design <- model.matrix(~Genotype+Tissue+Timepoint+Year+Genotype:Tissue+Genotype:Timepoint+Genotype:Year+Tissue:Timepoint+Tissue:Year+Timepoint:Year+Genotype:Tissue:Timepoint+Tissue:Timepoint:Year+Genotype:Timepoint:Year+Genotype:Tissue:Timepoint:Year,data=target)
design

y<-estimateGLMCommonDisp(y,design)
y<-estimateGLMTrendedDisp(y,design)
y<-estimateGLMTagwiseDisp(y,design)
y

fit <- glmFit(y,design)
colnames(fit)

Testing for main effects:

topTags(glmLRT(fit,coef=2:5))

Testing for interaction effects:

topTags(glmLRT(fit,coef=6:16))

I have the following questions:

I would like to know if the design matrix is correct to test for significance of main and interaction effects of the four variables (Genotype, Tissue, Timepoints, and Year) in the same model.
Also, how to interpret the genewise coefficients of all these variables and their interactions after fitting the model if they are significant or not?
Is it possible to do the post hoc test like Tukey’s HSD after edgeR? If not, could you please suggest an alternate way?

Thanks for all your time and the suggestions.

Swami

edger edgeR main and interaction effects edgeR post hoc test • 4.4k views

ADD COMMENT • link updated 8.6 years ago by Gordon Smyth 50k • written 8.6 years ago by padmaswami15 • 0

score 3 · Answer 1 · 2015-10-19

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

For your first question; you're setting off on a long and difficult road with your current analysis strategy. The concept of main effects only makes sense when the interaction term is negligible. To illustrate, consider a gene that has a positive log-fold change between genotypes for one tissue (I'll ignore all the other factors, to keep it simple) and a negative log-fold change for another tissue. The main effect of genotype has no meaning for this particular gene, because the genotype effect is conditional on the tissue that you're looking at, i.e., there is a significant interaction term between the genotype and tissue factors.

Simply dropping the coefficients corresponding to each factor is not sufficient to test for a "main effect", due to the presence of interaction terms in your design matrix. Instead, you have to identify genes with non-significant interaction terms involving the genotype; prune the design matrix such that those interaction terms are removed; and then repeat the testing for the main effect of genotype. Needless to say, this is a serious pain in the... neck.

For your analysis, I would recommend a one-way layout. It sounds simple, and it is, and that's why it's so nice:

grouping <- factor(paste(Genotype, Tissue, Timepoint, Year, sep="."))
design <- model.matrix(~0 + grouping)
colnames(design) <- levels(grouping)

Each combination of factors is considered to be a group, and the fitted value of each coefficient in the design matrix represents the average log-expression level for the corresponding group. It is trivial to compare between groups with makeContrasts. If you want something akin to a "main effect", you can test whether average of all groups with one genotype is equal to the average of all groups with another genotype. For example, if you have genotypes A and B in tissues X, Y and Z, you could do something like:

con <- makeContrasts((A.X + A.Y + A.Z)/3 - (B.X + B.Y + B.Z)/3, levels=design)

This can be fed to the contrast argument in glmLRT. Alternatively, you could do something like:

con <- makeContrasts(A.X - B.X, A.Y - B.Y, A.Z - B.Z, levels=design)

You could then select genes that have the same sign of log-fold change across all three comparisons , to identify DE genes that are changing between genotypes in a consistent manner across tissues. I feel that this comparison is more useful than testing for a main effect, as it demonstrates that there is significant DE in a consistent direction across all tissues.

As for your second question; the interpretation of the coefficients in your current model is horrendously complicated. This is why I suggest using the one-way layout instead, where the interpretation of each coefficient is a lot easier. Of course, in the one-way layout, it makes no sense to test each coefficient in isolation for significance - that would be like asking whether a gene is significantly expressed, which doesn't really mean much. Rather, you need to use makeContrasts to test for significant differences between coefficients.

Finally; if I recall correctly, the null hypothesis for Tukey's HSD test states that the means for all groups in an ANOVA are equal. If that's all you want to test, you can use makeContrasts to do that for you. Re-using the example above:

con <- makeContrasts(A.X - A.Y, A.X - A.Z, A.X - B.X, A.X - B.Y, A.X - B.Z, levels=design)

This will test for any difference between any of the groups. There's no need to specify, e.g., A.Z - A.Y, because if A.X is equal to A.Y, and A.X is equal to A.Z, then A.Y and A.Z must implicitly be equal (and so on, for all other pairs of groups). I've arbitrarily chosen A.X as a reference in this contrast, but you can use any group you want.

ADD COMMENT • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

I think Tukey's HSD is about identifying which contrasts of an ANOVA are the significant ones (i.e. post-hoc testing). I believe the closest limma/edgeR equivalent is decideTests, but it doesn't work in quite the same way.

ADD REPLY • link 8.6 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Ah, that's what I was missing. Well, decideTestsDGE won't do much in that respect; the significance of each comparison in an ANOVA-like contrast is determined by the significance of the entire contrast. You'd have more possibilities with classifyTestsF in limma, but that only works on t-statistics.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you Aaron for your clear explanation with examples.

First objective was to use contrasts just like the way you showed in the example and get the top differential expressed genes. I have done comparisons between different groups and got the top differential expressed genes.
Second objective was to do the testing for main and interaction effects. From your answers, I understand that performing Tukey's HSD is not possible with these results.

I would like to clarify:

Is it a good idea to test for main and interaction effects in time course analysis? If so, would you suggest to go for limma to test for interaction effects first and then check for main effects if needed as two separate models?
Please let me know what would be the alternate method for Tukey's method for time course analysis?

Tissues and Time point are the main factors and would like to test all interactions so as to know if it is significant with other factors.

Thanks again.

Thank you Ryan for your response regarding Tukey's HSD.

ADD REPLY • link 8.6 years ago padmaswami15 • 0

1

Entering edit mode

If you want to analyse your data as a time course, that requires a major overhaul of the design matrix to include a real-valued time covariate. Now, trying to interpret main and interaction effects in a factorial model is hard enough, because of reasons I've described in my original post. Trying to do so with real-valued covariates in the mix becomes harder still - I suspect that any benefit in reporting the "main effect" will be totally offset by the difficulty of interpreting it. This will be true for either limma or edgeR, so you can't get away from the problem by switching to the former. Instead, I would recommend sticking to something close to the one-way layout, e.g., define each combination of factors (except for time) as a group, and then add a covariate term for each group that represents the effect of time in that group. You could probably do this with:

design <- model.matrix(~0 + group.wo.time + group.wo.time:time) # time is numeric.

As for the post-hoc testing; if you want to figure out which contrast in a ANOVA-like set of contrasts is significant, then the easiest way to do so would be to test all contrasts separately for each gene (via separate calls to glmLRT), pool the p-values from all of those contrasts, and apply the BH method across all pooled p-values. This treats each significant gene from each contrast as a separate discovery while controlling the FDR across the entire analysis.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you Aaron for your suggestion regarding the design matrix. I will try that one-way layout with and without time. I am not able to completely understand the following

"to test all contrasts separately for each gene (via separate calls to glmLRT), pool the p-values from all of those contrasts, and apply the BH method across all pooled p-values. This treats each significant gene from each contrast as a separate discovery while controlling the FDR across the entire analysis".

I don't understand when you mean pool the p-values from all those contrasts and apply BH method. Could you please explain this part?

Thank you very much.

ADD REPLY • link 8.6 years ago padmaswami15 • 0

1

Entering edit mode

If you assign the output of glmLRT to a result object, then you can extract the vector of p-values by calling result$table$PValue. This vector contains the p-values for each gene, for the contrast that was supplied to glmLRT. Now, if you have many different contrasts, then you end up with many p-value vectors from repeated calls of glmLRT with each contrast. You can pool them together and apply a global BH correction, by doing something like:

all.pvals <- cbind(pval1, pval2, pval3) # where each 'pval' is a vector
all.adjp <- p.adjust(all.pvals, method="BH")
dim(all.adjp) <- dim(all.pvals)

You can easily extract the adjusted p-values for each contrast, from the corresponding column of all.adjp.