Hi All,
I am using edgeR to analyze data from my RNA-Seq time course (4 time points) experiment.
This is my experimental design. It is a factorial design 2*2*4*2*2 with a total of 64 samples (16 samples with 2 replicates each so a total of 32 samples from year1 and 16 samples with 2 replicates each so a total of 32 samples from year 2) comprising of four factors. They are Genotypes with 2 levels, Tissues with 2 levels, Time points with 4 levels and Year with 2 levels. I have the following code to create a design matrix and fit negative binomial GLM and conduct likelihood ratio test.
The following is my code using EdgeR package:
library(edgeR) #The following is the count data year<-read.csv("C:/berryallsamplestotalexonscountyear.csv",row.names="Symbol") group<-factor(c("FP1","FP1","FP2","FP2","FP3","FP3","FP4","FP4","FS1","FS1","FS2","FS2","FS3","FS3","FS4","FS4","MP1","MP1","MP2","MP2","MP3","MP3","MP4","MP4","MS1","MS1","MS2","MS2","MS3","MS3","MS4","MS4","FPA","FPA","FPB","FPB","FPC","FPC","FPE","FPE","FSA","FSA","FSB","FSB","FSC","FSC","FSE","FSE","MPA","MPA","MPC","MPC","MPD","MPD","MPE","MPE","MSA","MSA","MSC","MSC","MSD","MSD","MSE","MSE")) # Note: labels FP1 until MS4 denotes year 1 samples,FPA until MSE denotes year 2 samples y<-DGEList(counts=year,group=group) y target<-read.csv("C:/targetyear.csv",header=TRUE) target Genotype<-factor(target$Genotype) Tissue<-factor(target$Tissue) Timepoint<-factor(target$Timepoint) Year<-factor(target$Year) design <- model.matrix(~Genotype+Tissue+Timepoint+Year+Genotype:Tissue+Genotype:Timepoint+Genotype:Year+Tissue:Timepoint+Tissue:Year+Timepoint:Year+Genotype:Tissue:Timepoint+Tissue:Timepoint:Year+Genotype:Timepoint:Year+Genotype:Tissue:Timepoint:Year,data=target) design y<-estimateGLMCommonDisp(y,design) y<-estimateGLMTrendedDisp(y,design) y<-estimateGLMTagwiseDisp(y,design) y fit <- glmFit(y,design) colnames(fit)
Testing for main effects:
topTags(glmLRT(fit,coef=2:5))
Testing for interaction effects:
topTags(glmLRT(fit,coef=6:16))
I have the following questions:
- I would like to know if the design matrix is correct to test for significance of main and interaction effects of the four variables (Genotype, Tissue, Timepoints, and Year) in the same model.
- Also, how to interpret the genewise coefficients of all these variables and their interactions after fitting the model if they are significant or not?
- Is it possible to do the post hoc test like Tukey’s HSD after edgeR? If not, could you please suggest an alternate way?
Thanks for all your time and the suggestions.
Swami
I think Tukey's HSD is about identifying which contrasts of an ANOVA are the significant ones (i.e. post-hoc testing). I believe the closest limma/edgeR equivalent is
decideTests
, but it doesn't work in quite the same way.Ah, that's what I was missing. Well,
decideTestsDGE
won't do much in that respect; the significance of each comparison in an ANOVA-like contrast is determined by the significance of the entire contrast. You'd have more possibilities withclassifyTestsF
inlimma
, but that only works on t-statistics.Thank you Aaron for your clear explanation with examples.
I would like to clarify:
Tissues and Time point are the main factors and would like to test all interactions so as to know if it is significant with other factors.
Thanks again.
Thank you Ryan for your response regarding Tukey's HSD.
If you want to analyse your data as a time course, that requires a major overhaul of the design matrix to include a real-valued time covariate. Now, trying to interpret main and interaction effects in a factorial model is hard enough, because of reasons I've described in my original post. Trying to do so with real-valued covariates in the mix becomes harder still - I suspect that any benefit in reporting the "main effect" will be totally offset by the difficulty of interpreting it. This will be true for either
limma
oredgeR
, so you can't get away from the problem by switching to the former. Instead, I would recommend sticking to something close to the one-way layout, e.g., define each combination of factors (except for time) as a group, and then add a covariate term for each group that represents the effect of time in that group. You could probably do this with:As for the post-hoc testing; if you want to figure out which contrast in a ANOVA-like set of contrasts is significant, then the easiest way to do so would be to test all contrasts separately for each gene (via separate calls to
glmLRT
), pool the p-values from all of those contrasts, and apply the BH method across all pooled p-values. This treats each significant gene from each contrast as a separate discovery while controlling the FDR across the entire analysis.Thank you Aaron for your suggestion regarding the design matrix. I will try that one-way layout with and without time. I am not able to completely understand the following
"to test all contrasts separately for each gene (via separate calls to
glmLRT
), pool the p-values from all of those contrasts, and apply the BH method across all pooled p-values. This treats each significant gene from each contrast as a separate discovery while controlling the FDR across the entire analysis".I don't understand when you mean pool the p-values from all those contrasts and apply BH method. Could you please explain this part?
Thank you very much.
If you assign the output of
glmLRT
to aresult
object, then you can extract the vector of p-values by callingresult$table$PValue
. This vector contains the p-values for each gene, for the contrast that was supplied toglmLRT
. Now, if you have many different contrasts, then you end up with many p-value vectors from repeated calls ofglmLRT
with each contrast. You can pool them together and apply a global BH correction, by doing something like:You can easily extract the adjusted p-values for each contrast, from the corresponding column of
all.adjp
.Thank you Aaron. That helps a lot with example code.