If you are going to fit a conventional linear model, you should not constrain the intercept to zero (which is what you are doing). By doing so, you are in effect saying that the expression of all of your genes at birth (AGE = 0) should be equal to zero, which is only true for zombie babies.
Instead you should do
design <- model.matrix(~AGE, df)
and
topTable(fit, 2, <other args you like>)
And then your logFC column will be the slope of the line relating age to gene expression, i.e. the change in log expression per unit increase in age. The p-value will test the hypothesis Ho: slope == 0 vs Ha: slope != 0.
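To see why the coefficient of interest is number 2, here is a minimal base-R sketch. The data are simulated, and the names df, AGE and expr are made up for illustration; only model.matrix() and lm.fit() are used, so nothing here depends on limma.

```r
# Simulated data for one gene; df, AGE and expr are hypothetical names.
set.seed(1)
df <- data.frame(AGE = c(20, 30, 40, 50, 60, 70))
expr <- 5 + 0.03 * df$AGE + rnorm(6, sd = 0.1)  # log2 expression

# With an intercept, the design has two columns: intercept first, AGE second.
design <- model.matrix(~AGE, df)
colnames(design)

# Ordinary least squares on that design for a single gene.
fit <- lm.fit(design, expr)
slope <- fit$coefficients[["AGE"]]  # change in log2 expression per year of age
slope
```

The AGE column is the second column of the design matrix, which is why the second coefficient is the one to ask topTable() for.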
But do note that this linear model assumes that the expression is a linear function of age, which may not be true, especially if you have a wide range of ages. In that case you might want a quadratic term as well, but that can make interpretation tricky.
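If you did want to try a quadratic term, one possible way to set up the design is sketched below. The ages are invented and design2 is just an illustrative name; I() is needed so that ^ is treated as arithmetic rather than as formula syntax.

```r
# Hypothetical ages spanning a wide range.
df <- data.frame(AGE = c(1, 5, 10, 20, 40, 60, 80))

# Intercept, linear term, and squared term.
design2 <- model.matrix(~AGE + I(AGE^2), df)
colnames(design2)
design2
```

With a design like this the AGE coefficient is no longer "the" slope, since the change in expression per year now depends on the age you are standing at. That is the interpretation problem mentioned above.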
As for 'strength of correlation of size of slope', I am not sure that's a thing. You can test that the slope is not equal to zero, or you could use treat() and require that the slope be greater than some value that you think is the limit of biological meaningfulness (which seems like it shouldn't be a word, yet is). But in the microarray context, there isn't much else you can do.
I suppose you could look at individual genes using lm() and associated model diagnostics, but I would normally wait for the validation stage for that sort of thing.
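For what it's worth, a single-gene check along those lines might look like the sketch below. The data are simulated, and AGE, y and gene_fit are names invented for the example.

```r
# Simulated log2 expression for one gene across 13 hypothetical samples.
set.seed(2)
AGE <- seq(20, 80, by = 5)
y <- 7 + 0.02 * AGE + rnorm(length(AGE), sd = 0.2)

gene_fit <- lm(y ~ AGE)
coef(summary(gene_fit))  # estimate, SE, t and p-value for intercept and slope

# Residuals should scatter around zero with no obvious curve if a
# straight line is a reasonable description of this gene.
res <- residuals(gene_fit)
summary(res)

# Standard diagnostic plots (residuals vs fitted, normal Q-Q),
# drawn only when running interactively.
if (interactive()) plot(gene_fit, which = 1:2)
```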
James, this is a great answer and I do want to give you an upvote, but I'd like to just apply a bit more scrutiny to some of the more scientific aspects of it before doing so:
When you say "zombie babies", do you mean:
Furthermore, how can you speculate on the expression profiling of zombies at all? Have you run across such data? If so, was it done on microarrays (and, therefore, probably flawed) or is it RNA-Seq? If the latter, what genome assembly did you use? Are we sure that the zombie genome is more or less the same as its "host human's"?
Anyway, we did try to run some RNA-Seq on some zombies last year, but after numerous futile attempts to get total RNA we gave up. Hence my assertion that zombie babies have 0 expression.
Upvote granted for clear demonstration of scientific rigor.
Hi James, thanks for the reply, that makes more sense. For my own understanding, can you explain in a bit more detail what the effect of using ~0 + AGE is, particularly the role that '0' plays?
The main thing you want to be looking at is ?formula, which explains (in great detail) how you specify various linear models using R's formula interface. But that presupposes you know what model you are trying to fit and how to interpret it.
So to back up one step, recall the formula for a line
y = mx + b
where y are the values on the vertical axis, x are the values on the horizontal axis, m is the slope of the line (increase in y for each unit increase in x), and b is the intercept (i.e., the value of y when x == 0).
You can also specify the formula as
y = mx + Ib
where I is an indicator variable that can be either 0 or 1. You have specified I = 0, which removes the intercept term:
y = mx
So you are by definition saying that y = 0 when x = 0.
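A toy example shows what forcing the line through the origin does to the slope. The numbers are made up (roughly y = 2x + 10), and with_int and no_int are just illustrative names.

```r
# Made-up data lying close to the line y = 2x + 10.
x <- c(1, 2, 3, 4, 5)
y <- c(12.1, 13.9, 16.2, 17.8, 20.1)

with_int <- coef(lm(y ~ x))      # estimates both the intercept b and the slope m
no_int   <- coef(lm(y ~ 0 + x))  # forces b = 0, so only m is estimated

with_int  # intercept about 10.05, slope about 1.99
no_int    # slope about 4.73
```

Because the zero-intercept line has to pass through (0, 0), the slope is pulled up to more than twice its real value to get anywhere near the data. The same thing happens to every gene when the design is ~0 + AGE.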
The default in R is to fit an intercept (i.e., set I = 1), because there are vanishingly few situations when forcing R to use a zero intercept makes sense (note here that I am talking about linear regression, not ANOVA). So the code that I gave you is functionally identical to Ryan's suggestion of
And the ~1 or ~0 just specifies what value you want to use for I.
Does that make sense?
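You can check the role of the 1 and the 0 directly by building the design matrices and comparing them. This is only a sketch with invented ages; d_default, d_explicit and d_zero are names made up for the example.

```r
# Hypothetical ages.
df <- data.frame(AGE = c(25, 40, 55, 70))

d_default  <- model.matrix(~AGE, df)      # intercept implied
d_explicit <- model.matrix(~1 + AGE, df)  # intercept written out
d_zero     <- model.matrix(~0 + AGE, df)  # intercept removed

# Same columns and same values whether or not the 1 is spelled out.
same_cols <- identical(colnames(d_default), colnames(d_explicit))
same_vals <- all(d_default == d_explicit)
c(same_cols, same_vals)

# With ~0 the intercept column is gone and only AGE remains.
colnames(d_zero)
```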