[edgeR Usage] Constructing DGE test with fewer input variables
1
0
Entering edit mode
mousheng xu ▴ 10
@mousheng-xu-2280
Last seen 6.7 years ago

Suppose you have tumor and normal samples treated with compound A with 0, 1, 2 mM for 0, 1, 2 hours, and you want to do DGE analysis using edgeR. You want to answer the question "which genes are differentially expressed in response to compound A": after taking into account of the impact of disease condition, time of treatment, and concentration of A (noted as [A]), you want to focus on the relationship of treatment by A and gene expression, and don't care about disease condition, time of treatment, not even [A].

How do you do it in edgeR?

In R, assuming the log(expression) being normally distributed, you could simply do glm(log(expression) ~ disease_condition + [A] + time. Then you only need to check the slope of [A] and its p-value, right? But since log(expression) is not normally distributed, we cannot use glm in R.

While in edgeR, things seem to be much more complicated. First, both time and [A] cannot be considered to be continuous (correct me if I am wrong); Second, I don't know how to do the DGE analysis.

For simplicity, let's forget about time in the problem -- just consider disease condition and [A] (0 & 1 mM), each with a replicate. Your gene count file looks like the following (T for tumor, N for normal; A0 for [A] = 0, A1 for [A] = 1; column 2~9 are gene counts for 8 samples, one column for one sample):

GENE T_A0 T_A0 T_A1 T_A1 N_A0 N_A0 N_A1 N_A1
GENE_A ...
GENE_B ...
...

You will have four levels: T_A0, T_A1, N_A0, N_A1.

Your code snippet looks like the following:

y <- DGEList(counts=d, group=grp);
y <- calcNormFactors(y); # global normalization
dsgn <- model.matrix(~0+grp); # now allowing intercept
y <- estimateDisp(y, dsgn);
fit <-glmFit(y, dsgn)

The question is how to construct your test (or contrast). Assuming the levels are ordered as above, what would be your contrast? c(-0.5, 0.5, -0.5, 0.5) or something else? Or is there a better approach? Or am I terribly wrong?

edger • 891 views
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 16 hours ago
The city by the bay

You are wrong on a number of counts.

1. If log-expression were normally distributed, you should be using linear models with lm. And if you were to do that anyway, it would be better to use limma via voom.
2. edgeR is no more complicated than glm. Whatever formula you want to use in the latter, you can use in the former. The only additional work required is to perform empirical Bayes shrinkage, and that is completely necessary in the presence of limited replicates.
3. Time and concentration can be continuous. Why makes you think that they can't be?

Moving onto your simple example, the contrast between levels would indeed be:

con <- makeContrasts((T_A1 + N_A1)/2 - (T_A0 + N_A0)/2, levels=design)

... which will test if the average of A1 groups is different from the average of A0 groups. If you want to be more stringent, you might do:

con <- makeContrasts(T_A1 - T_A0, N_A1 - N_A0, levels=design)

... and select for genes that are significant and have the same sign of the log-fold change in both contrasts.

Obviously, you can also use an additive model in edgeR by changing the design matrix (~disease + A). In this case, just dropping the treated/untreated factor would tell you the effect of A. However, this involves making some additional assumptions regarding the lack of interaction between disease and treatment effects - regardless of whether you use edgeR or glm.

0
Entering edit mode

1. Need to digest the difference of glm & lm. Thought they were almost identical except that glm allows "link";

2. OK, makes sense. So if you have enough replicates (e.g. 4 replicates each for 0, 0.1, 0.5, 1mM concentration for disease samples and normal samples), then you can use glm?

3. The reason that made me think edgeR disallows continuous variable is the "group" -- you assign a group to each sample, and even more, in the user's guide section 3.3.1, there is this line:

Group <- factor(paste(targets$Treat,targets$Time,sep="."))

which gives me the impression that an exhaustive combination of "factors" have to be used.

If I want to treat [A] as continuous, what would be "group" & "level" definitions?

How if I treat all variables (disease condition, [A], and time) as continuous, I would have no groups and no levels? What would my R script be?

Thanks much!

2
Entering edit mode

For question 2, not really. You still benefit from sharing information between genes and (for QL edgeR) accounting for the uncertainty of dispersion estimation. edgeR is also faster than glm when run across thousands of genes, and it provides more robust convergence for additive models. Finally, you don't get the benefit of TMM normalization when you run glm on the raw counts - this is necessary to avoid composition biases when you have unbalanced up/downregulation.

For question 3, the group setting is only necessary for classic edgeR. For GLM edgeR, you can just set up a design matrix. Read through Section 3.4 for additive models, and read the case studies in the user's guide. The reason we parametrize experimental designs in terms of groups is that a one-way layout is mathematically equivalent to an interaction model but is much easier to interpret. If you have continuous covariates, this is not possible so you just leave it as an additive model.