Question

design for glm for edgeR

0

Entering edit mode

gthm ▴ 30

@gthm-8377

Last seen 5.0 years ago

spain

I would like to compare the DE genes between Treatment (T) vs Control (C). The Treatment and controls have subgroups ( which are minor variants of treatments/controls ), so I would expect some genes to show differential expression between different subgroups of treatment/control, but I am only interested in genes that show DE between T and C. I want to eliminate any genes that show DE only between subgroups. i.e If a gene is expressed in T1 and T2 but not in T3, I would like to skip those genes from DE analysis.

`Treatment`	`sub_group`
`T`	`T1`
`T`	`T1`
`T`	`T1`
`T`	`T2`
`T`	`T2`
`T`	`T2`
`T`	`T3`
`T`	`T3`
`T`	`T3`
`C`	`C1`
`C`	`C1`
`C`	`C1`
`C`	`C2`
`C`	`C2`
`C`	`C2`

What I am planning to do is the following and I would like to know if its the correct approach.

treat <- c(rep(c("T"),each=9), rep(c("C"),each=6)) sub <- c(rep(c("T1"),each=3),rep(c("T2"),each=3),rep(c("T3"),each=3),rep(c("C1"),each=3),rep(c("C2"),each=3)) design <- model.matrix(~sub+treat) y <- DGEList(counts=counts_data, group=treat) y <- calcNormFactors(y) y <- estimateGLMCommonDisp(y, design, verbose=TRUE) y <- estimateGLMTrendedDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) fit <- glmFit(y, design) lrt <- glmLRT(fit)

edgeR • 1.2k views

ADD COMMENT • link updated 8.4 years ago by Michael Love 41k • written 8.4 years ago by gthm ▴ 30

0

Entering edit mode

note: I edited away the DESeq tags because it looks like this is a question about how to do an edgeR analysis.

ADD REPLY • link 8.4 years ago Michael Love 41k

score 1 · Answer 1 · 2015-12-15

Your design matrix will not be of full column rank, as sub is nested within treat. It's impossible to estimate the last coefficient - for example, a treatT of 1 and a subT1, subT2 and subT3 of 0 would give the same fitted values as a treatT of 0 and the others equal to 1. You need to construct the design matrix using only one factor or the other. I'd probably construct it using sub if you're expecting some DE between subpopulations. Even if you're not interested in the DE between subpopulations, you should still account for it in your model, otherwise your dispersion estimates will be inflated.

Now, for the construction of the contrasts, you can do something like:

con <- makeContrasts((subT1 + subT2 + subT3)/3 - subC2/2, levels=design)

... and feed this as the contrast= argument in glmLRT, to test for differences between the average of the treated to the control subpopulations. The division by two for the control is intended, even though only one coefficient is present; this is because the C1 expression is implicitly present as the intercept, but cancels out so it doesn't show up in the above expression. Anyway, genes that are significant in the above contrast will be those that are DE between treatment and control. Note that this does not exclude genes that are DE between subpopulations, which is fine - for example, if you have a gene that was DE between T1, T2 and T3, but the expression in all treated subpopulations was upregulated relative to C1/2, then you'd want to call it as being DE between treatment and control.

That said, be careful in your interpretation, as you might get some genes where expression is being driven by a strong log-fold change in only one subpopulation. At worst, you could end up with a scenario where, e.g., T1 and T2 are downregulated relative to C1/2 but T3 is very strongly upregulated, such that the treated average is increased over the control. This would give a misleading impression of upregulation upon treatment. To protect against this, I would also do an ANOVA-like contrast by setting coef=2:5, and only look at the genes where the subT* coefficients were all positive and greater than subC2, or all negative and less than subC2. This, at least, ensures that all treated subpopulations are changing in the same direction relative to the control subpopulations.