Hi all,
I'm trying to figure out which is the best model to go with in an experiment, so I'd appreciate any advice people can give!
I have a 450K experiment with ~200 samples. Samples are split by disease type (A, B, C, and D). I'm aiming to find probes that correlate with age.
So far I have 2 ways I can approach this problem. I'm interested in looking at each disease type against age, as well as grouping them to look for probes that correlate between two or more disease types (A and B together for example).
Method 1:
Subset the input matrix to just the disease types I'm interested in, i.e. A, or A and B. Use the model ~Age and use topTable to look at the second coefficient (the first being the intercept). Advantages: simple, relatively easy to understand and trace back. Disadvantage: a new model has to be fitted for each test I want to carry out.
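For reference, Method 1 might look roughly like the sketch below. The data are simulated stand-ins, and the object names (mvals, disease, age) are assumptions for illustration, not anything from the real experiment.

```r
library(limma)

set.seed(1)
# Simulated stand-in data (assumption): 100 probes x 40 samples of M-values.
n <- 40
disease <- factor(rep(c("A", "B", "C", "D"), each = 10))
age <- runif(n, 20, 80)
mvals <- matrix(rnorm(100 * n), nrow = 100,
                dimnames = list(paste0("probe", 1:100), paste0("s", 1:n)))

# Method 1: keep only the disease types of interest, then fit ~Age.
keep <- disease %in% c("A", "B")
age_sub <- age[keep]
design <- model.matrix(~ age_sub)

fit <- eBayes(lmFit(mvals[, keep], design))

# Coefficient 1 is the intercept; coefficient 2 is the age slope.
res <- topTable(fit, coef = 2, number = Inf)
head(res)
```

Changing the subset (for example to just "A") means rebuilding the design and refitting, which is the disadvantage noted above.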
Method 2:
Throw all samples in, and use an interaction model ~SampleType:Age. I can then look at each individual SampleType's correlation with age (topTable on the relevant coefficient). Then a contrast matrix such as (A+B+C+D)/4 would give me probes that, on average, have similar gradients across all SampleTypes? (Or should I be using a no-intercept model for this?) Do the p-values of a regression test against a continuous variable represent the minimal amount of variance around the fit line?
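A rough sketch of Method 2 is below, again on simulated data with assumed object names. It uses disease-specific intercepts (~0 + disease + disease:age) rather than the single common intercept that ~SampleType:Age gives, which is an assumption on my part about how the baselines should be set up. Note that the averaged contrast tests whether the mean slope across disease types is non-zero; it does not by itself test whether the slopes are similar to each other (a difference contrast between two slope terms would be needed for that).

```r
library(limma)

set.seed(1)
# Simulated stand-in data (assumption): 100 probes x 40 samples of M-values.
n <- 40
disease <- factor(rep(c("A", "B", "C", "D"), each = 10))
age <- runif(n, 20, 80)
mvals <- matrix(rnorm(100 * n), nrow = 100,
                dimnames = list(paste0("probe", 1:100), paste0("s", 1:n)))

# One intercept and one age slope per disease type (8 columns in total).
design <- model.matrix(~ 0 + disease + disease:age)
# makeContrasts needs syntactically valid names, so replace the ":".
colnames(design) <- gsub(":", "_", colnames(design))

fit <- lmFit(mvals, design)

# Age slope within disease type A only.
res_A <- topTable(eBayes(fit), coef = "diseaseA_age", number = Inf)

# Average age slope across all four disease types.
con <- makeContrasts(
  avg_slope = (diseaseA_age + diseaseB_age + diseaseC_age + diseaseD_age) / 4,
  levels = design
)
fit_avg <- eBayes(contrasts.fit(fit, con))
res_avg <- topTable(fit_avg, coef = "avg_slope", number = Inf)
head(res_avg)
```

Averaging only A and B would just change the contrast to (diseaseA_age + diseaseB_age)/2, with no need to refit the model.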
Questions:
Are my assumptions correct?
Is there a better approach that I've missed? (If not, which method would you recommend I go with?)
Hi Arron, that's a lot of interesting points, cheers! Splines are something that I've had a play with but, as with using continuous variables in limma, I'm struggling to interpret what the p-values represent. Are they lower the smaller the deviation from the fitted trend line? Can you explain why you drop the spline coefficients (is it the 5 derived from the 5 degrees of freedom)?
The DE test is a good idea, I'll give that a try. I'm assuming the caveat of getting that to work is using SampleType as a baseline expression term.
Cheers,
For the simple linear regression without splines, the p-value represents the evidence against the null hypothesis that the gradient is zero. Lower p-values indicate that a significant age effect is present, i.e., the actual gradient is likely to be non-zero.
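To make that concrete, here is a small sketch using plain lm() on simulated values rather than limma (limma's moderated t-statistic shrinks the variance estimates, but the logic of the test is the same). The three probes are hypothetical. The point is that the p-value reflects the size of the slope relative to the scatter around the line, so a tight fit with no slope does not get a small p-value.

```r
set.seed(42)
age <- seq(20, 80, length.out = 50)

# Three hypothetical probes (simulated, not real data):
tight_slope <- 0.02 * age + rnorm(50, sd = 0.1)  # real trend, little scatter
noisy_slope <- 0.02 * age + rnorm(50, sd = 2)    # same trend, lots of scatter
tight_flat  <- 1 + rnorm(50, sd = 0.1)           # no trend, little scatter

# p-value for the null hypothesis that the age slope is zero.
slope_p <- function(y) {
  summary(lm(y ~ age))$coefficients["age", "Pr(>|t|)"]
}

p <- c(tight_slope = slope_p(tight_slope),
       noisy_slope = slope_p(noisy_slope),
       tight_flat  = slope_p(tight_flat))
print(p)
```

The tight_slope probe should come out with by far the smallest p-value. tight_flat has just as little scatter, but there is no gradient to detect.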
For the splines, the null is a bit more complicated; it states that all of the spline coefficients are zero. This is roughly equivalent to the linear approach in that you're testing against the absence of any age effect. You need to drop all of the spline coefficients at once (e.g., 5 of them, if you set df=5) as you can't really interpret one spline coefficient separately from another.
Finally, the DE test will work with your common intercept model, though I'd still recommend using disease-specific baselines as I've described above.
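A minimal sketch of the spline test, on simulated data with assumed object names. The design here (disease-specific intercepts plus a single natural spline for age shared across disease types) is one plausible setup and not necessarily the exact model discussed earlier in the thread.

```r
library(limma)
library(splines)

set.seed(2)
# Simulated stand-in data (assumption): 100 probes x 60 samples of M-values.
n <- 60
disease <- factor(rep(c("A", "B", "C", "D"), each = 15))
age <- runif(n, 20, 80)
mvals <- matrix(rnorm(100 * n), nrow = 100,
                dimnames = list(paste0("probe", 1:100), NULL))

# Natural cubic spline basis for age; df = 5 gives 5 basis columns.
age_spline <- ns(age, df = 5)
design <- model.matrix(~ 0 + disease + age_spline)

fit <- eBayes(lmFit(mvals, design))

# Drop all 5 spline coefficients together. Passing several coefficients to
# topTable gives a moderated F-test of the null that they are all zero.
spline_cols <- grep("^age_spline", colnames(design))
res <- topTable(fit, coef = spline_cols, number = Inf)
head(res)
```

The output has an F column rather than a single logFC and t, since the test covers all of the spline coefficients jointly.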