Question: EdgeR multiple factor design question
0
3.0 years ago by
bioinformatics10 wrote:

I have rna-seq data with three facors : treated , time, temperature

I want to get differential expressed genes for the treatment, but the design should also account for the

temperature and time.

I have created the following setup:

d <- DGEList(counts = countTable)
#~ #normalization must be done before GLM analysis.
dge <- calcNormFactors(d)
design <- model.matrix(~0+treated+time+temperature)
dge <- estimateGLMCommonDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
glmfit.dge <- glmFit(dge, design,dispersion=dge$common.dispersion) lrt.dge <- glmLRT(glmfit.dge, coef="treated") result <- topTags(lrt.dge, adjust.method="BH", sort.by="logFC") I have some doubts about the design model. I have not had any experience with design models, thus i am unsure if what i am doing is correct. Could someone explain why what i am doing here is right or wrong. And do i have to use 0+ in the formula? If i leave it out it creates an intersect, what does this do ? edger design matrix • 425 views ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by bioinformatics10 Answer: EdgeR multiple factor design question 1 3.0 years ago by Aaron Lun25k Cambridge, United Kingdom Aaron Lun25k wrote: I doubt that you have a coefficient named "treated" in the colnames of your design matrix. For example: treated <- rep(c("Y", "N"), each=4) # yes or no time <- rep(c("day1", "day2"), 4) temperature <- rep(c("hot", "hot", "cold", "cold"), 2) design <- model.matrix(~0 + treated + time + temperature) If you look at the column names here, you'll get: [1] "treatedN" "treatedY" "timeday2" "temperaturehot" The first two coefficients represent the average log-expression in the untreated and treated groups respectively. The timeday2 coefficient represents the effect of time, specifically the average log-fold change of day 2 samples over their day 1 counterparts. The temperaturehot coefficient represents the effect of temperature, i.e., the log-fold change of hot over cold. Now, assuming that you want to test for a treatment effect, you would do: con <- makeContrasts(treatedY - treatedN, levels=design) ... and supply the resulting value into the contrast argument of glmLRT, which will test for a difference in the log-expression of the treated and untreated samples. All of this is fairly well described in the edgeR user's guide - see page 30, for example. As for your other question, the ~0 just removes the intercept from the design matrix. I find that this makes the columns of the design matrix easier to interpret. If you don't include this in the model formula, you'll get instead: [1] "(Intercept)" "treatedY" "timeday2" "temperaturehot" Examination of the design matrix itself indicates that the intercept represents the log-expression of the untreated group, and treatedY represents the log-fold change upon treatment. This requires more mental gymnastics because the treatedY coefficient represents the treatment effect (i.e., Y over N), rather than the log-expression in the treated group as you might expect. But, it's a matter of taste - it's certainly easier to perform the above contrast if you have the intercept, as all you have to do is to drop coef="treatedY". ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Aaron Lun25k Answer: EdgeR multiple factor design question 0 3.0 years ago by bioinformatics10 wrote: Dear Aaron, Thank you for your quick and straight to the point answer. There are some parts where I still have some doubt. For example you propose : con <- makeContrasts(treatedY - treatedN, levels=design) And put this into the contrast of glmLRT. Does this mean that the steps : dge <- estimateGLMCommonDisp(dge, design) dge <- estimateGLMTagwiseDisp(dge, design) glmfit.dge <- glmFit(dge, design,dispersion=dge$common.dispersion)

Are obsolete ?

I understand from your reaction that changing the code I posted into :

design <- model.matrix(~ treated + time + temperature)


and changing:

lrt.dge <- glmLRT(glmfit.dge, coef="treatedY")

This will be sufficient to observe deferentially expressed genes for the treatedY right?

That's better. As for the first question, no, the commands aren't obsolete. You still need to estimate the dispersion in order to do the testing. Actually, you could just make life easier for yourself and do:

dge <- estimateDisp(dge, design)
glmfit.dge <- glmFit(dge, design)

For the second question, if you change the design matrix to not have the intercept, then the step:

lrt.dge <- glmLRT(glmfit.dge, coef="treatedY")

... will be correct if you want to detect differences due to treatment. This would do the same thing as using a design matrix without an intercept and then calling makeContrasts, as I've described above. The only difference between these two strategies is in ease of use; for a design matrix without an intercept, the call to glmLRT is simpler but the interpretation of the coefficients is harder.