Question: matrix of residuals from limma, with and without the phenotype of interest
1
luna.deferrari10 wrote:

I am using limma's lmFit to find differentially methylated probes between cases and control subjects. And to correct for covariates, for example, age of the subjects.

But my question is more general, related to the lmFit object.  All the design examples I've seen contain the phenotype of interest in the design, along with a number of covariates to correct for. But how can the biological signal be conserved for further steps (such as eBayes and topTable) if the phenotype of interest is corrected away?

An example of meta-data for the design would be:

sampleName, caseOrControl, age

sample1, "Case", 39

sample2, "Control", 56

Where caseOrControl is my phenotype of interest (diseased or healthy) and age a covariate I want to correct for.

Example of my data:

probe, sample1, sample2

probeA, 0.97, 1.24

probeB, 1.45, 2.20

...

An example command to find if one of my probes is differentially methylated between cases and controls could be:

fit1 <- lmFit(object=data, design=~ caseOrControl + age) ## design contains case/control and age

bayesFit <- eBayes(fit1)

fitResults <- topTable(bayesFit, coef=caseOrControl, number=Inf, adjust.method="BH", p.value=1)

From my understanding of lmFit, the three lines above provide me with the probability of each probe being differentially methylated between cases and controls, adjusting for age.

For my work, however, I need to have (for plotting and further processing) the matrix of values just corrected for age.

What I am doing now to obtain it, is doing another fit excluding the phenotype (case or control) from the design, and saving the residuals matrix.

fit2 <- lmFit(object=data, design=~ age) ## design contains only the covariate age
correctedValues2 <- residuals.MArrayLM(object=fit2, y=data) ## values corrected for age

Is there a way to obtain the matrix above from the first fit1 object described above?  ( fit1 <- lmFit(object=data, design=~ caseOrControl + age) )

If I am not misunderstood, the lines:

fit1 <- lmFit(object=data, design=~ caseOrControl + age)

correctedValues1 <- residuals.MArrayLM(object=fit1, y=data)

would give me the values corrected for age *and* case/control, which in my mind, somehow erases the biological signal (corrects away the difference between cases and controls).

On the other hand, if the correction above (with a design including case/control status) erases or muddles the biological signal, then how could eBayes and topTable then extract that signal?...

I've been unable to find a good, clear example on the topic, so I hope that refining this rough question, and maybe receive pointers to a good example, may help other beginner users in the future.

Answer: matrix of residuals from limma, with and without the phenotype of interest
2
James W. MacDonald51k wrote:

I think you are confusing residuals and parameter estimates. When you fit a linear model, you are trying to extract information pertaining to the observed status (phenotype, age, etc) of your observational units (subjects, mice, whatever). These are your parameter estimates, and are encoded in your design matrix using model.matrix().

So if you fit a model with ~ caseOrControl + age, then you are saying that you think both of those observations may have an effect on gene expression for one or more genes, so you are fitting a model that allows for different intercepts for the cases and controls, and are also doing a linear regression on age. So the model you are fitting can be thought of as being two parallel lines. The age parameter captures any linear dependence on gene expression that is accounted for by the age of the subject, and you are allowing for different intercepts for each group.

In this case, age is a nuisance variable; something that you think may have an effect on gene expression, but is not inherently interesting to you. Since you think it may have an effect, you want to control for that effect, but that is the extent of it. In this situation, we would normally say you have 'controlled for age'.

The parameter for caseOrControl will actually be the difference in intercept between case and control subjects (unless you changed the factor ordering), and will be the parameter of interest. You will test for evidence that the parameter is different from zero, implying that there is a reproducible difference between cases and controls for one or more genes. So you wouldn't really say you were controlling for caseOrControl status, since that is the parameter of interest.

The residuals are 'what is left over' after accounting for differences in age and caseOrControl status. So if you fit a model like

Obs = Age + error

this is equivalent to

Obs - Age = error

So in this case the residuals are the data after you have corrected for age-specific differences. The same is true for the model including caseOrControl, but at that stage you are not really concerned with the residuals (aside from testing assumptions about your model fit, etc), as you have already extracted the parameter of interest that you are then going to test for significance.

In other words, eBayes() and topTable() are concerned with testing if the parameter estimate for caseOrControl is different from zero, using the residuals as a measure of intra-group variability. So caseOrControl doesn't muddle the biological signal, it is intended to capture the biological signal, removing it from the residual term.

Does that make sense?

Answer: matrix of residuals from limma, with and without the phenotype of interest
1
5.0 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

Is it possible that you are simply after removeBatchEffect? For example:

design = model.matrix( ~ caseOrControl)
y <- removeBatchEffect(data, covariate=age, design=design)

This removes the effect of age while preserving differences between cases and controls.