Question: Correlation of a Linear Variable
gravatar for andrew.j.skelton73
4.6 years ago by
United Kingdom
andrew.j.skelton73320 wrote:


I'd like to use limma to look for probes that correlation strongly and with a large slope (positive or negative), based on a linear variable. I've scanned the Limma users guide but couldn't find such an example. 

design              <- model.matrix(~0 + df$AGE)
fit                 <- lmFit(exprs(lumi.norm), design)
fit                 <- eBayes(fit)
topTable(fit, adjust.method="BH", number=10, p.value=0.01,"M")

The above code shows what I've done so far, which is provide the design of the model with a linear variable. I was wondering if this was the correct way to do it? If so, how would i filter for strength of correlation of size of slope?

Is there a better way to do it?



limma limma design matrix • 794 views
ADD COMMENTlink modified 4.6 years ago by James W. MacDonald51k • written 4.6 years ago by andrew.j.skelton73320
Answer: Correlation of a Linear Variable
gravatar for James W. MacDonald
4.6 years ago by
United States
James W. MacDonald51k wrote:

If you are going to fit a conventional linear model, you should not constrain the intercept to zero (which is what you are doing). If you were to do such a thing, you are in effect saying that the expression for all of your genes at birth should be equal to zero, which is only true for zombie babies.

Instead you should do

design <- model.matrix(~AGE, df)


topTable(fit, 2, <other args you like>)

And then your logFC column will be the slope of the line, correlating age and gene expression. The p-value will test the hypotheses Ho: slope == 0 vs Ha: slope != 0.

But do note that this linear model assumes that the expression is a linear function of age, which may not be true, especially if you have a wide range of ages. In which case you might want a quadratic term as well, but that can make interpretation tricky.

As for 'strength of correlation of size of slope', I am not sure that's a thing. You can test that the slope is not equal to zero, or you could use treat() and require that the slope be greater than some value that you think is the limit of biological meaningfulness (which seems like it shouldn't be a word, yet is). But in the microarray context, there isn't much else you can do.

I suppose you could look at individual genes using lm() and associated model diagnostics, but I would normally wait for the validation stage for that sort of thing.


ADD COMMENTlink written 4.6 years ago by James W. MacDonald51k

James, this is a great answer and do want to give you an upvote, but I'd like to just apply a bit more scrutiny to some of the more scientific aspects of it before doing so:

When you say "zombie babies", do you mean:

  1. A zombie at time 0 of zombie life, ie. just after it woke up and "turned"?
  2. A baby that was unfortunately bitten and turned into a zombie, thusly "frozen" in a baby's body for eternity?
  3. The combination of 1 & 2?
  4. The offspring of zombies?

Furthermore, how can you speculate on the expression profiling of zombies at all. Have you run across such data? If so, was it done on microarrays (and, therefore, probably flawed) or is it RNAseq. If the latter, what genome assembly did you use? Are we sure that the zombie genome is more or less the same as its "host human's"?

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Steve Lianoglou12k
  1. Obviously... But not 'woke up and turned'. Momma has to get bit in late third trimester, and when she turns, so does the baby. Ceasarean section required.
  2. Nope. Time 0 in that case would be a real baby, and hence expressing genes. Plus it wouldn't be for eternity anyway. Michonne is bound to find it crawling about mewling plaintively (yet aggressively) and nick its head off with her sword.
  3. No. See 2.
  4. That's just silly. Zombies don't procreate. Too busy with other more important matters.

Anyway, we did try to run some RNA-Seq on some zombies last year, but after numerous futile attempts to get total RNA we gave up. Hence my assertion that zombie babies have 0 expression.

ADD REPLYlink written 4.6 years ago by James W. MacDonald51k

Upvote granted for clear demonstration of scientific rigor.

ADD REPLYlink written 4.6 years ago by Steve Lianoglou12k

Hi James, Thanks for the reply, that makes more sense. Can you explain in a bit more detail what the effect of using ~0 + AGE, particularly the role that '0' plays, for my own understanding? 

ADD REPLYlink written 4.6 years ago by andrew.j.skelton73320

The main thing you want to be looking at is ?formula, which explains (in great detail) how you specify various linear models using R's formula interface. But that presupposes you know what model you are trying to fit and how to interpret it.

So to back up one step, recall the formula for a line

y = mx + b

where y are the values on the vertical axis, x are the values on the horizontal axis, m is the slope of the line (increase in y for each unit increase in x), and b is the intercept (e.g., the value of y when x == 0).

You can also specify the formula as

y = mx + Ib 

Where I is an indicator variable that can be either 0 or 1. You have specified I = 0, which implies that when x = 0 that y = 0 as well:

y = m*0 + 0*b => y = 0

So you are by definition saying that y = 0 when x = 0.

The default in R is to fit an intercept (e.g., set I = 1), because there are vanishingly few situations when forcing R to use a zero intercept makes sense (note here that I am talking about linear regression, not ANOVA). So the code that I gave you

design <- model.matrix(~AGE, df)

is functionally identical to Ryan's suggestion of

design <- model.matrix(~ 1 + df$AGE)

And the ~1 or ~0 just specifies what value you want to use for I.

Does that make sense?


ADD REPLYlink written 4.6 years ago by James W. MacDonald51k
Answer: Correlation of a Linear Variable
gravatar for Ryan C. Thompson
4.6 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.4k wrote:

The main issue with your code is that you eliminated the necessary intercept term by putting the zero in your model. That's essentially forcing all your lines to go through the origin (i.e. zero RMA value at age zero), which makes no biological sense. You should probably just use a design of model.matrix(~1+AGE, data=df). Secondly, if you want to enrich your results for slopes that are both large and significant, then you should experiment with the treat function, which combines a fold change (i.e. slope) cutoff with a significance cutoff, instead of topTable, which simply tests whether the slope is significantly different from zero.

ADD COMMENTlink written 4.6 years ago by Ryan C. Thompson7.4k

Thanks for that Ryan! What's the particular difference between ~0+... and ~1+... ?? Does that make the assumption (at 0), that someone has 0 expression at 0 years old? Or at ~1+... that someone has a global expression at birth of 1? 

ADD REPLYlink written 4.6 years ago by andrew.j.skelton73320

James's comment explains the difference. In short, it determines whether or not the model will have an intercept term. And you almost always want an intercept term.

ADD REPLYlink written 4.6 years ago by Ryan C. Thompson7.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 349 users visited in the last hour