Question

Limma Nuisance Variable in an Interaction Model

0

Entering edit mode

andrew.j.skelton73 ▴ 370

@andrewjskelton73-7074

Last seen 7 months ago

United Kingdom

Hi,

I'm trying to look for an interaction effect between sampleType and Age, with a nuisance variable present. I used the model:

~0 + SampleType:Age + Batch

Should the first level of the batch variable be absorbed in this case? - When I've been designing with model.matrix, it appears that all levels of the batch variable are present.

Thanks,

edit: a simple example

SampleType <- rep(c("A", "B", "C"), 6)
Age        <- rep(45:47, 6)
Batch      <- c(rep("X", 6),
                rep("Y", 12))

colnames(model.matrix(~0 + SampleType + Batch))
colnames(model.matrix(~0 + SampleType*Age + Batch))
colnames(model.matrix(~0 + SampleType:Age + Batch))
colnames(model.matrix(~Age + Batch))

limma model.matrix • 1.9k views

ADD COMMENT • link updated 8.9 years ago by Ryan C. Thompson ★ 7.9k • written 8.9 years ago by andrew.j.skelton73 ▴ 370

0

Entering edit mode

It would be nice to know what the different factors are for each sample. Stuff like this sometimes corresponds to linear dependencies between your factors, but it's hard to say for sure.

ADD REPLY • link 8.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

I've added a simple example to illustrate my point. The first, second and forth colnames calls all absorb the first batch term, whereas the third doesn't. I'm basically trying to figure out why the third doesn't absorb the batch variable.

ADD REPLY • link 8.9 years ago andrew.j.skelton73 ▴ 370

score 2 · Answer 1 · 2015-12-17

In your example, SampleType is totally confounded with Age, i.e., all A samples are 45, B is 46, and C is 47. You won't be able to distinguish between them in a model, because any effects on expression due to age can be recapitulated with a corresponding sample type-specific effect. To illustrate, imagine a simple example where expression is proportional to age for a given gene, i.e., age of 45 results in an expression of 45. This could be equally well explained as there being no age effect at all for this gene, and sample types A, B and C displaying group-specific average expressions of 45, 46 and 47 respectively. More concisely:

design <- model.matrix(~0 + SampleType:Age + Batch)
ncol(design) # gives 5
qr(design)$rank # gives 4

So, there are linear dependencies between your factors, as I predicted. This means that some coefficients cannot be estimated by lmFit, due to the identifiability issues I've described above. In such cases, worrying about the names of your columns is the least of your problems. You'll have to discard either SampleType or Age from your design to get a matrix with fully estimable coefficients. The Batch naming is simply symptomatic of the underlying problem, and is model.matrix's best shot in a bad situation.

score 2 · Answer 2 · 2015-12-17

Note: My answer assumes that Age and SampleType are not completely confounded. I have included the updated set-up code for such an example here:

SampleType <- rep(c("A", "B", "C"), 6)
Age        <- rep(45:47, each=6) # Note use of "each"
Batch      <- c(rep("X", 6),
                rep("Y", 12))

Your desired design, ~0 + SampleType:Age + Batch, is an exceptionally odd one, since it includes an interaction term between SampleType and Age without including the main effects of either SampleType or Age, nor including an intercept. This means that there is nothing in the design capable of absorbing the first level of Batch, so you get a design with separate coefficients for both batches. This is also a design that is unlikely to represent any kind of reality. If I'm not mistaken, this design effectively assumes that all three SampleTypes have the same expression at Age 0, which is unlikely to be a useful assumption if your operative age range is in the mid-40s. Much more likely is that you would want to include main effect coefficients for SampleType in addition to the interaction, using a formula of ~0 + SampleType + SampleType:Age + Batch. This would fit a separate Age slope and intercept to each level of SampleType, which is a very reasonable design. In such a design, the BatchX coefficient is properly absorbed, as you expect:

> colnames(model.matrix(~0 + SampleType:Age + Batch))
[1] "BatchX"          "BatchY"          "SampleTypeA:Age" "SampleTypeB:Age"
[5] "SampleTypeC:Age"
> colnames(model.matrix(~0 + SampleType + SampleType:Age + Batch))
[1] "SampleTypeA"     "SampleTypeB"     "SampleTypeC"     "BatchY"         
[5] "SampleTypeA:Age" "SampleTypeB:Age" "SampleTypeC:Age"