Question

Extracting fitted expression in limma

0

Entering edit mode

rf • 0

@rfes-12351

Last seen 3.0 years ago

Spain

I am trying to perform some linear regressions with respect to some covariates in limma so that I can predict expression values after controling for 2 covariates, Afactor and Efactor. Basically, this is my sample setup:

                Afactor        Efactor
sample1         a1             e2
sample2         a0             e1
sample3         a0             e2
sample4         a1             e0
sample5         a1             e0
sample6         a1             e1
...

Here, Afactor is a factor with 2 unique values a1 and a2, and Efactor is a factor with 3 unique values e1, e2, e3. To fit it, I'm using limma as follows:

design=model.matrix('~0 + Afactor + Efactor', my_metadata)
fit= lmFit(my_data, design)

This allows me to find the fitting coefficients with fit$coefficients, that resemble the following:

>head(fit$coefficients)


                Afactora1            Afactora2            Efactore1        Efactore2
gene1           2.3                      3.7                 -1.3            0.9
gene2           1.1                      2.1                 -2.5            2.1
gene3           1.3                      3.1                 0.2             1.3

My questions:

Note that there is no coefficient for Efactore3, but why is that so?
Does this have to do with the fact that possible factors for Efactor are e1, e2 and e3?
Shouldn't it be possible to extract the coefficient for e3 so that I could use each coefficient to estimate gene expression from the fitted model by imputing the values of Afactor and Efactor for each sample?

limma • 2.1k views

ADD COMMENT • link updated 4.2 years ago by James W. MacDonald 68k • written 4.2 years ago by rf • 0

score 2 · Accepted Answer · 2021-11-04

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

I don't see how you got just those coefficients from that input data.frame. If I do something similar, here's what I get

> my_metadata <- data.frame(Afactor = paste0("a", 0:2)[sample(1:3, 50, TRUE)], Efactor = paste0("e", 0:3)[sample(1:4, 50, TRUE)])
> head(my_metadata)
   Afactor Efactor
1       a1      e2
2       a0      e2
3       a1      e3
4       a0      e1
5       a2      e0
6       a0      e1

> colnames(model.matrix(~0 + Afactor + Efactor, my_metadata))
[1] "Afactora0" "Afactora1" "Afactora2" "Efactore1" "Efactore2" "Efactore3"

And if you don't have all combinations of Afactor and Efactor levels in your input data.frame, you won't get all those coefficients anyway, so it depends on exactly what is in your data.frame.

But even if you have all combinations of the two factors, you will notice that there are still things missing (Efactore0). That's because the model isn't identifiable if you try to fit all the coefficients. The column names are actually a bit misleading, because you might think this is a cell means model (where we compute the mean of each group), but it isn't. Fitting with an intercept makes this clearer.

> colnames(model.matrix(~ Afactor + Efactor, my_metadata))
[1] "(Intercept)" "Afactora1"   "Afactora2"   "Efactore1"   "Efactore2"  
[6] "Efactore3"

With a bit of work you can see that the intercept is estimating the gene expression level when Afactor is 0 and Efactor is 0, and the rest of the coefficients are the difference in expression between that baseline level and all the other combinations. This parameterization is difficult to work with, and you can simplify by simply combining the factor levels.

> combo <- apply(my_metadata, 1, paste, collapse = "_")
> colnames(model.matrix(~0+combo))
 [1] "comboa0_e0" "comboa0_e1" "comboa0_e2" "comboa0_e3" "comboa1_e0"
 [6] "comboa1_e1" "comboa1_e2" "comboa1_e3" "comboa2_e0" "comboa2_e1"
[11] "comboa2_e2" "comboa2_e3"

In this parameterization you get the mean of each pair of factor levels, which is probably easier for you to work with.

ADD COMMENT • link 4.2 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James, many thanks for your comments and help!

I don't see how you got just those coefficients from that input data.frame.

You are correct indeed. I had edited my post and changed the levels in Afactor to a0 and a1, and updated the post to say (mistakenly) a1 and a2. I should've had Afactora0 and Afactora1 instead.

The column names are actually a bit misleading, because you might think this is a cell means model (where we compute the mean of each group)

That had been my interpretation of the coefficients actually: that this would be a fitting coefficient, such that when replacing some given factor levels in it (e.g. sample1 a1 e2) I would be able to predict the gene expression value as given by the model, and that the coefficient value would represent the mean of all values in that group.

With a bit of work you can see that the intercept is estimating the gene expression level when Afactor is 0 and Efactor is 0, and the rest of the coefficients are the difference in expression between that baseline level and all the other combinations.

The last bit of your sentence was actually why I thought of avoiding the intercept altogether, in order to facilitate the interpretation of the coefficients. Your suggestion of combining the factor levels is quite interesting (though it may pose a challenge if I include many covariates!). Am I correct in assuming the remaining steps in retrieving the adjusted gene expression is by replacing the sample values as I mentioned above?

I had googled a bit for this question and wasn't able to find much, though I'd think this is a common question (though I did find this). I do find many controlling for covariates when contrasting effects

ADD REPLY • link 4.2 years ago rf • 0

0

Entering edit mode

It's not super clear to me exactly what you are after. Maybe you just want covariate-adjusted values? In which case removeBatchEffect will do that for you. But do note (as pointed out in the help page for that function) that it's not intended as something you do with your data prior to fitting your 'real' model, which would artificially increase your degrees of freedom.

ADD REPLY • link 4.2 years ago James W. MacDonald 68k

0

Entering edit mode

Many thanks James. My aim is exactly to adjust the values for these covariates. In this case I do not want to remove any batch effect (beyond the effect of the covariates), so would you encode all samples with the same batch id to enter in batch1=some_batch? In practice, is this the same as employing the lmFit approach above and using the residuals of the model (residuals.MArrayLM) for further analysis?

I will accept your answer above, but feel free to edit it with any information if you feel it may be useful to others.

Once again many thanks!

ADD REPLY • link 4.2 years ago rf • 0

0

Entering edit mode

Please read the help page for removeBatchEffect a bit more closely. There are more arguments than the batch arguments!

And yes, it's just fitting the model using the covariates and returning the residuals.

ADD REPLY • link 4.2 years ago James W. MacDonald 68k