Limma: Continuous variable model designs
@andrewjskelton73-7074

Hi all,

I'm trying to figure out which is the best model to go with in an experiment, so I'd appreciate any advice people can give!

I have a 450K experiment with ~200 samples. Samples are split by disease type (A, B, C, and D). I'm aiming to find probes that correlate with age.

So far I have two ways I can approach this problem. I'm interested in looking at each disease type against age, as well as grouping them to look for probes that correlate in two or more disease types (A and B together, for example).

Method 1:

Subset the input matrix to just the disease types I'm interested in, i.e. A, or A and B. Use the model ~Age and use topTable to look at the second coefficient (the first being the intercept). Advantages: simple, relatively easy to understand and trace back. Disadvantage: a new model has to be fitted for each test I want to carry out.
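For concreteness, a minimal base-R sketch of what I mean by Method 1 (the sample annotations here are mock values, not from the actual experiment):

```r
# Mock annotations for illustration; in practice these come from the sample sheet
SampleType <- factor(rep(c("A", "B", "C", "D"), each = 3))
Age <- c(34, 51, 62, 45, 58, 70, 29, 40, 66, 55, 61, 48)

# Subset to the disease types of interest, then build the ~Age design
keep <- SampleType %in% c("A", "B")
design <- model.matrix(~Age, data = data.frame(Age = Age[keep]))
colnames(design)  # "(Intercept)" "Age": coefficient 2 is the age slope
```

The same subsetting would of course also be applied to the methylation matrix before fitting.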

Method 2:

Throw all samples in and use an interaction model ~SampleType:Age. I can then look at each individual SampleType's correlation with age (topTable on the relevant coefficient). Then a contrast such as (A+B+C+D)/4, for example, would give me probes that on average have similar gradients across all SampleTypes? (Or should I be using a no-intercept model for this?) Do the p-values of a regression test against a continuous variable represent the amount of variance around the fitted line?

Questions:

Are my assumptions correct?

Is there a better approach that I've missed? (If not, which method would you recommend I go with?)

limma model.matrix
Aaron Lun
@alun

If you want to look at the age response within each disease type, I'd suggest:

design <- model.matrix(~0+SampleType+SampleType:Age)

This will yield a baseline expression term for each disease type, in addition to the age response coefficient. The disease-specific baseline ensures that the linear regression against age for one disease type is not skewed by high or low expression in another disease type (whereas in your existing "method 2" model, fits for the various diseases are all interdependent due to the need to estimate a common intercept term). Dropping the age response term for any given disease will yield the age effect for that disease.
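To see what this design produces, here's a sketch with mock annotations (the names and values are hypothetical):

```r
# Mock annotations for illustration
SampleType <- factor(rep(c("A", "B", "C", "D"), each = 3))
Age <- c(34, 51, 62, 45, 58, 70, 29, 40, 66, 55, 61, 48)

design <- model.matrix(~0 + SampleType + SampleType:Age)
colnames(design)
# Four baseline columns (SampleTypeA..SampleTypeD) plus four
# disease-specific age slopes (SampleTypeA:Age .. SampleTypeD:Age)
```

Testing, say, SampleTypeA:Age against zero then gives the age effect within disease A, unaffected by the baselines of the other diseases.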

Note that the above design is equivalent to your "method 1" design after cramming all of the samples and coefficients from the various disease states into a single model. I prefer to do my fitting with all available samples at once, to get more reliable variance estimates; though, if you have 200 samples, you've probably got enough samples per disease to not need to worry too much about that. Another option is to replace Age with ns(Age, df=5), which uses splines to account for non-linear trends (though this complicates the contrasts somewhat; you'll have to drop all spline coefficients to test for a non-zero age effect in each gene).
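The spline variant can be sketched the same way with the splines package that ships with R (again with mock annotations; a real 450K fit would have far more samples than coefficients):

```r
library(splines)

# Mock annotations for illustration
SampleType <- factor(rep(c("A", "B", "C", "D"), each = 5))
Age <- rep(c(30, 40, 50, 60, 70), times = 4)

# Each disease type now gets df=5 spline coefficients instead of one slope
design <- model.matrix(~0 + SampleType + SampleType:ns(Age, df = 5))
ncol(design)  # 4 baselines + 4 * 5 spline coefficients = 24
```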

Identifying probes with similar gradients across disease types is not easy, as you're effectively trying to identify significantly non-DE genes. Taking the average gradient won't necessarily get you what you want; a large positive gradient for disease A plus a small negative gradient for disease B would yield a (still-large) average positive gradient and a significant result, but you wouldn't argue that the gradients are similar. Instead, I'd recommend just testing for DE across gradients, by doing something like:

colnames(design) <- gsub(":", "_", colnames(design)) # makeContrasts() can't parse ":"
con <- makeContrasts(SampleTypeA_Age - SampleTypeB_Age,
levels=design)

You can then use the confidence intervals in topTable to identify those probes where the difference between the disease-specific gradients is acceptably close to zero. For example, you could take all probes where the 95% confidence intervals were contained within (-1, 1), and treat these as behaving similarly between diseases A and B. It's a bit ad hoc, but it's the simplest approach.
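A sketch of that filter, using a stand-in for the topTable output (CI.L/CI.R are the columns that topTable returns with confint=0.95; the probe values below are made up):

```r
# Stand-in for: tt <- topTable(fit2, number = Inf, confint = 0.95)
tt <- data.frame(
  logFC = c(0.05, 1.40, -0.20),
  CI.L  = c(-0.30, 0.90, -0.80),
  CI.R  = c(0.40, 1.90, 0.30),
  row.names = c("cg0001", "cg0002", "cg0003")
)

# Keep probes whose 95% CI for the gradient difference lies inside (-1, 1)
similar <- tt[tt$CI.L > -1 & tt$CI.R < 1, ]
rownames(similar)  # cg0001 and cg0003 pass; cg0002's upper bound exceeds 1
```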


Hi Aaron, that's a lot of interesting points, cheers! Splines are something I've had a play with, but as with using continuous variables in limma, I'm struggling to interpret what the p-values represent: are they lower, the lower the deviation from the fitted trend line? Can you also explain why you drop the spline coefficients (is it 5 of them, derived from the 5 degrees of freedom)?

The DE test is a good idea; I'll give that a try. I'm assuming the caveat for getting that to work is using SampleType as a baseline expression term.

Cheers,


For the simple linear regression without splines, the p-value represents the evidence against the null hypothesis that the gradient is zero. Lower p-values indicate that a significant age effect is present, i.e., the actual gradient is likely to be non-zero.

For the splines, the null is a bit more complicated; it states that all of the spline coefficients are zero. This is roughly equivalent to the linear approach in that you're testing against the absence of any age effect. You need to drop all of the spline coefficients at once (e.g., 5 of them, if you set df=5) as you can't really interpret one spline coefficient separately from another.
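Selecting the spline coefficients to drop for one disease can be done by name, e.g. (a sketch; with limma, the actual F-test would then be something like topTable(fit, coef=coefA)):

```r
library(splines)

# Mock annotations for illustration
SampleType <- factor(rep(c("A", "B"), each = 6))
Age <- rep(c(30, 38, 46, 54, 62, 70), times = 2)
design <- model.matrix(~0 + SampleType + SampleType:ns(Age, df = 5))

# All five spline coefficients for disease A, to be dropped together
# in a single F-test (the baseline column has no ":" and is excluded)
coefA <- grep("^SampleTypeA:", colnames(design))
length(coefA)  # 5
```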

Finally, the DE test will work with your common intercept model, though I'd still recommend using disease-specific baselines as I've described above.