Question

EdgeR: setting reference condition and design matrix headers

0

Entering edit mode

Ekarl2 ▴ 80

@ekarl2-7202

Last seen 8.1 years ago

Sweden

Hello,

My current dataset has four treatments in a typical set-up with a control, two single exposures and one double exposure. However, unlike e. g. testing potentially toxic substances, I just have two different concentrations instead of presence/absence. My four groups would then be HighA/HighB, LowA/HighB, HighA/LowB, LowA/LowB. In this context, it is the high concentrations that are considered to be the baseline/reference/control and the low concentrations the experimental challenge.

My matrix headers are:

HiAHiB-1 LoAHiB-1 HiALoB-1 LoALoB-1 HiAHiB-2 LoAHiB-2 HiALoB-2 LoALoB-2 ... ...

My factors are:

A <- factor(substring(colnames(data.set),1,3))
B <- factor(substring(colnames(data.set),4,6))
batch <- factor(substring(colnames(data.set),8,8))

Looking more closely at e. g. A:

> A

[1] HiA LoA HiA LoA HiA LoA HiA LoA ... ...
Levels: HiA LoA

When I specify my design matrix like this:

design <- model.matrix(~batch+A*B)

My headers include "ALoA" and "BLoB"

From reading the EdgeR User Guide (particularly the Arabidopsis case study, pp. 61-68), I suspect that this would indicate that HiA is set as the reference (since it is the level farthest to the left after "Levels:") and if a comparison between HiA and LoA produce a log2 fold change value that was positive, this would indicate an upregulated in LoA compared with HiA. Is this correct? Or does relevel() always have to be done? In the case study, mock is to the left in the first table in section 4.6.4, but then also set with relevel().

Does the design matrix header include the non-reference condition and if so, what does that mean? In the Arabidopsis case study, it was "Treathrcc" with hrcc as the non-reference, and for me it was e. g. "ALoA" with LoA as the non-reference.

I ask primarily because I have not yet quite figured out how to abstract from case studies to a more general approach when it comes to mathematical modelling.

edgeR reference design matrix multiple factor design • 2.2k views

ADD COMMENT • link updated 9.1 years ago by Aaron Lun ★ 28k • written 9.1 years ago by Ekarl2 ▴ 80

score 0 · Answer 1 · 2015-06-09

Using your code, I get a design matrix with five columns:

> colnames(design)
[1] "(Intercept)" "batch2"      "ALoA"        "BLoB"        "ALoA:BLoB"

The (Intercept) column represents the average expression of all HiAHiB-1 samples, as it is the only active predictor variable in the systematic component for those libraries (i.e., it is the only non-zero column for the corresponding rows). The batch2 column represents the batch effect, obviously, as it is the only variable that differs between the same samples in different batches.

The ALoA coefficient represents the log-fold change from HiAHiB to LoAHiB, averaged across both batches. This is because it is the only difference between the HiAHiB and LoAHiB rows within each batch. Similarly, the BLoB coefficient represents the log-fold change from HiAHiB to HiALoB, again averaged across batches. Dropping either of these with coef=... in glmLRT will give you the treatment effect - in particular, a positive log-FC would indicate an increase in expression for the low concentrations compared to the high "reference" sample of HiAHiB.

Finally, for completeness, the interaction term ALoA:BLoB represents the non-additive effect of double treatment with low A and B. Generally, I like to avoid these interaction models and do something like this instead:

group <- factor(substring(colnames(data.set), 1, 6))
design <- model.matrix(~0+group+batch)

Then I can use makeContrasts to compare pairs of groups directly without worrying about what the reference is.