I have RNA-seq data with three factors: treatment, time and temperature.
I want to get the differentially expressed genes for the treatment, but the design should also account for temperature and time.
I have created the following setup:
d <- DGEList(counts = countTable)
# Normalization must be done before GLM analysis.
dge <- calcNormFactors(d)
design <- model.matrix(~0 + treated + time + temperature)
dge <- estimateGLMCommonDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
glmfit.dge <- glmFit(dge, design, dispersion = dge$common.dispersion)
lrt.dge <- glmLRT(glmfit.dge, coef = "treated")
result <- topTags(lrt.dge, adjust.method = "BH", sort.by = "logFC")
I have some doubts about the design model. I have no experience with design matrices, so I am unsure whether what I am doing is correct.
Could someone explain why what I am doing here is right or wrong? Also, do I have to use 0+ in the formula? If I leave it out, an intercept is created; what does this do?
That's better. As for the first question, no, the commands aren't obsolete. You still need to estimate the dispersions in order to do the testing. Actually, you could just make life easier for yourself and do:

dge <- estimateDisp(dge, design)

which estimates the common, trended and tagwise dispersions in a single call, replacing the separate estimateGLMCommonDisp and estimateGLMTagwiseDisp steps.
For the second question, if you change the design matrix to have an intercept, then the step:

lrt.dge <- glmLRT(glmfit.dge, coef=2)  # second coefficient = treatment effect

... will be correct if you want to detect differences due to treatment. This would do the same thing as using a design matrix without an intercept and then calling makeContrasts, as I've described above. The only difference between these two strategies is ease of use; for a design matrix with an intercept, the call to glmLRT is simpler, but the interpretation of the coefficients is harder.
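To make the two strategies concrete, here is a sketch, assuming `dge` already has its dispersions estimated (e.g. via estimateDisp) and that `treated` is a two-level factor; the level names "ctrl" and "trt" are illustrative, so check colnames(design) for the actual coefficient names in your data:

```r
library(edgeR)  # also loads limma, which provides makeContrasts()

# Strategy 1: no intercept. Each "treated" coefficient is the mean
# log-expression of one group, so the treatment effect must be
# extracted explicitly as a contrast between the two groups.
design1 <- model.matrix(~0 + treated + time + temperature)
fit1 <- glmFit(dge, design1)
con <- makeContrasts(treatedtrt - treatedctrl, levels = design1)
lrt1 <- glmLRT(fit1, contrast = con)

# Strategy 2: with intercept. The "treatedtrt" coefficient is already
# the log-fold-change of treated versus the baseline level, so it can
# be tested directly without a contrast.
design2 <- model.matrix(~treated + time + temperature)
fit2 <- glmFit(dge, design2)
lrt2 <- glmLRT(fit2, coef = "treatedtrt")

topTags(lrt2)  # same test as topTags(lrt1)
```

Both fits block on time and temperature, so the treatment comparison is adjusted for those factors either way.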