Limma: Continuous variable model designs
@andrewjskelton73-7074

Hi all,

I'm trying to figure out which is the best model to go with in an experiment, so I'd appreciate any advice people can give!

I have a 450K experiment with ~200 samples. Samples are split by disease type (A, B, C, and D). I'm aiming to find probes that correlate with age.

So far I have two ways I can approach this problem. I'm interested in looking at each disease type against age, as well as grouping them to look for probes that correlate in two or more disease types (A and B together, for example).

Method 1:

Subset the input matrix to just the disease types I'm interested in, i.e. A, or A and B. Use the model ~Age and use topTable to look at the second coefficient (the first being the intercept). Advantages: simple, relatively easy to understand and trace back. Disadvantage: a new model has to be fitted for each test I want to carry out.
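For concreteness, a minimal base-R sketch of what I mean by Method 1 (the sample annotations here are mock values, not from the actual experiment):

```r
# Mock annotations for illustration; in practice these come from the sample sheet
SampleType <- factor(rep(c("A", "B", "C", "D"), each = 3))
Age <- c(34, 51, 62, 45, 58, 70, 29, 40, 66, 55, 61, 48)

# Subset to the disease types of interest, then build the ~Age design
keep <- SampleType %in% c("A", "B")
design <- model.matrix(~Age, data = data.frame(Age = Age[keep]))
colnames(design)  # "(Intercept)" "Age": coefficient 2 is the age slope
```

The same subsetting would of course also be applied to the methylation matrix before fitting.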

Method 2:

Throw all samples in and use an interaction model ~SampleType:Age. I can then look at each individual SampleType's correlation with age (topTable on the relevant coefficient). Then a contrast such as (A+B+C+D)/4, for example, would give me probes that on average have similar gradients across all SampleTypes? (Or should I be using a no-intercept model for this?) Do the p-values of a regression test against a continuous variable represent the amount of variance around the fitted line?

Questions:

Are my assumptions correct?

Is there a better approach that I've missed? (If not, which method would you recommend I go with?)

limma model.matrix
Aaron Lun
@alun

If you want to look at the age response within each disease type, I'd suggest:

design <- model.matrix(~0+SampleType+SampleType:Age)

This will yield a baseline expression term for each disease type, in addition to the age response coefficient. The disease-specific baseline ensures that the linear regression against age for one disease type is not skewed by high or low expression in another disease type (whereas in your existing "method 2" model, fits for the various diseases are all interdependent due to the need to estimate a common intercept term). Dropping the age response term for any given disease will yield the age effect for that disease.
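To see what this design produces, here's a sketch with mock annotations (the names and values are hypothetical):

```r
# Mock annotations for illustration
SampleType <- factor(rep(c("A", "B", "C", "D"), each = 3))
Age <- c(34, 51, 62, 45, 58, 70, 29, 40, 66, 55, 61, 48)

design <- model.matrix(~0 + SampleType + SampleType:Age)
colnames(design)
# Four baseline columns (SampleTypeA..SampleTypeD) plus four
# disease-specific age slopes (SampleTypeA:Age .. SampleTypeD:Age)
```

Testing, say, SampleTypeA:Age against zero then gives the age effect within disease A, unaffected by the baselines of the other diseases.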

Note that the above design is equivalent to your "method 1" design after cramming all of the samples and coefficients from the various disease states into a single model. I prefer to do my fitting with all available samples at once, to get more reliable variance estimates; though, if you have 200 samples, you've probably got enough samples per disease to not need to worry too much about that. Another option is to replace Age with ns(Age, df=5), which uses splines to account for non-linear trends (though this complicates the contrasts somewhat; you'll have to drop all spline coefficients to test for a non-zero age effect in each gene).
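The spline variant can be sketched the same way with the splines package that ships with R (again with mock annotations; a real 450K fit would have far more samples than coefficients):

```r
library(splines)

# Mock annotations for illustration
SampleType <- factor(rep(c("A", "B", "C", "D"), each = 5))
Age <- rep(c(30, 40, 50, 60, 70), times = 4)

# Each disease type now gets df=5 spline coefficients instead of one slope
design <- model.matrix(~0 + SampleType + SampleType:ns(Age, df = 5))
ncol(design)  # 4 baselines + 4 * 5 spline coefficients = 24
```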

Identifying probes with similar gradients across disease types is not easy, as you're effectively trying to identify significantly non-DE genes. Taking the average gradient won't necessarily get you what you want; a large positive gradient for disease A plus a small negative gradient for disease B would yield a (still-large) average positive gradient and a significant result, but you wouldn't argue that the gradients are similar. Instead, I'd recommend just testing for DE across gradients, by doing something like:

colnames(design) <- gsub(":", "_", colnames(design)) # makeContrasts() can't parse ":"
con <- makeContrasts(SampleTypeA_Age - SampleTypeB_Age,
levels=design)

You can then use the confidence intervals in topTable to identify those probes where the difference between the disease-specific gradients is acceptably close to zero. For example, you could take all probes where the 95% confidence intervals were contained within (-1, 1), and treat these as behaving similarly between diseases A and B. It's a bit ad hoc, but it's the simplest approach.
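A sketch of that filter, using a stand-in for the topTable output (CI.L/CI.R are the columns that topTable returns with confint=0.95; the probe values below are made up):

```r
# Stand-in for: tt <- topTable(fit2, number = Inf, confint = 0.95)
tt <- data.frame(
  logFC = c(0.05, 1.40, -0.20),
  CI.L  = c(-0.30, 0.90, -0.80),
  CI.R  = c(0.40, 1.90, 0.30),
  row.names = c("cg0001", "cg0002", "cg0003")
)

# Keep probes whose 95% CI for the gradient difference lies inside (-1, 1)
similar <- tt[tt$CI.L > -1 & tt$CI.R < 1, ]
rownames(similar)  # cg0001 and cg0003 pass; cg0002's upper bound exceeds 1
```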


Hi Aaron, that's a lot of interesting points, cheers! Splines are something I've had a play with, but as with using continuous variables in limma, I'm struggling to interpret what the p-values represent: are they lower, the lower the deviation from the fitted trend line? Can you also explain why you drop the spline coefficients (is it 5 of them, derived from the 5 degrees of freedom)?

The DE test is a good idea; I'll give that a try. I'm assuming the caveat for getting that to work is using SampleType as a baseline expression term.

Cheers,


For the simple linear regression without splines, the p-value represents the evidence against the null hypothesis that the gradient is zero. Lower p-values indicate that a significant age effect is present, i.e., the actual gradient is likely to be non-zero.

For the splines, the null is a bit more complicated; it states that all of the spline coefficients are zero. This is roughly equivalent to the linear approach in that you're testing against the absence of any age effect. You need to drop all of the spline coefficients at once (e.g., 5 of them, if you set df=5) as you can't really interpret one spline coefficient separately from another.
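Selecting the spline coefficients to drop for one disease can be done by name, e.g. (a sketch; with limma, the actual F-test would then be something like topTable(fit, coef=coefA)):

```r
library(splines)

# Mock annotations for illustration
SampleType <- factor(rep(c("A", "B"), each = 6))
Age <- rep(c(30, 38, 46, 54, 62, 70), times = 2)
design <- model.matrix(~0 + SampleType + SampleType:ns(Age, df = 5))

# All five spline coefficients for disease A, to be dropped together
# in a single F-test (the baseline column has no ":" and is excluded)
coefA <- grep("^SampleTypeA:", colnames(design))
length(coefA)  # 5
```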

Finally, the DE test will work with your common intercept model, though I'd still recommend using disease-specific baselines as I've described above.