How to test full design vs a reduced design in limma?
2
0
Entering edit mode
@constantin-ahlmann-eltze-20493
Last seen 8 hours ago
Germany/Heidelberg/EMBL

Hi,

I wondered if there is a way to compare a reduced model matrix against the full model matrix with limma.

Let's say for example, I have a full model: ~ cell + dex and want to compare this against the reduced model ~ dex. Can I directly compare the two models, or do I have to specify which columns in the original design matrix come from cell and set them as coef?

Here is a small example to illustrate what I mean:

library(limma)
suppressMessages(library(airway))
data(airway)

logCPM <- edgeR::cpm(airway, log = TRUE)
design <- model.matrix(~ cell + dex, data = colData(airway))
design
#>            (Intercept) cellN061011 cellN080611 cellN61311 dexuntrt
#> SRR1039508           1           0           0          1        1
#> SRR1039509           1           0           0          1        0
#> SRR1039512           1           0           0          0        1
#> SRR1039513           1           0           0          0        0
#> SRR1039516           1           0           1          0        1
#> SRR1039517           1           0           1          0        0
#> SRR1039520           1           1           0          0        1
#> SRR1039521           1           1           0          0        0
#> attr(,"assign")
#> [1] 0 1 1 1 2
#> attr(,"contrasts")
#> attr(,"contrasts")$cell
#> [1] "contr.treatment"
#> 
#> attr(,"contrasts")$dex
#> [1] "contr.treatment"
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend = TRUE)
topTable(fit, coef = 2:4, number = 3)
#>                 cellN061011 cellN080611 cellN61311    AveExpr         F
#> ENSG00000213058   -5.997784  -5.9977840  0.3935414 -0.3581851 28310.347
#> ENSG00000164308    3.780246   3.1534606  3.5315346  4.8003548 13606.080
#> ENSG00000012817   -8.330419   0.1749454  0.2208583  3.5195512  3285.961
#>                      P.Value   adj.P.Val
#> ENSG00000213058 6.767049e-08 0.004337814
#> ENSG00000164308 2.325664e-07 0.007453984
#> ENSG00000012817 2.547302e-06 0.017566474

Created on 2021-09-10 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)

If there is no other way to compare a full and reduced model matrix, what is the recommended way to conduct a likelihood ratio test for two unrelated models? Would you say that it is simply not a good idea to do something like that? I could for example imagine a scenario where I want to find out if the model using ~ cell + dex or a hypothetical patient age ~ age explains the observed data better.

model.matrix(~ age, tmp)
#>            (Intercept) age
#> SRR1039508           1  10
#> SRR1039509           1   7
#> SRR1039512           1  41
#> SRR1039513           1  25
#> SRR1039516           1   9
#> SRR1039517           1  33
#> SRR1039520           1  13
#> SRR1039521           1  19
#> attr(,"assign")
#> [1] 0 1

Given how long limma has been around, I am sure this question must have been discussed before. I searched for similar questions in the Bioconductor forum, and this was the closest hit, but it did not directly answer my question. But I am very sorry if I missed some obvious resource or forum post.

Best, Constantin

limma • 323 views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 13 minutes ago
WEHI, Melbourne, Australia

I wondered if there is a way to compare a reduced model matrix against the full model matrix with limma.

Yes of course, that's what limma does all the time. In fact any hypothesis test in statistics is equivalent to comparing a reduced model against a full model.

Can I directly compare the two models, or do I have to specify which columns in the original design matrix come from cell and set them as coef?

The two things that you describe are exactly the same thing. In limma you specify the full model matrix, then limma forms the reduced model automatically from the coefficients you specify and compares it to the full model.

limma doesn't force you to specify the null model explicitly. The limma approach is IMO much quicker and more flexible because you can specify lots of different null models simply by specifying different coefficients or contrasts to be tested. In fact limma never actually has to fit the reduced model explicitly. Normal linear model theory allows all these calculations to be done much more elegantly.

what is the recommended way to conduct a likelihood ratio test for two unrelated models?

limma is already doing a likelihood ratio test for you. That's what an anova F-test is. In fact, the limma F-test is much better than a general likelihood ratio test because the p-values are exact, whereas likelihood ratio tests for generalized linear models are approximate.

 

IMO forcing users to specify full and reduced models explicitly would be clumsy and, in addition, would force the software to check that the reduced model is indeed a reduced version of the full. However, if you did have full and reduced design matrices, it is fairly straightforward to automatically generate the limma contrast matrix that is equivalent to comparing those two matrices.

ADD COMMENT
0
Entering edit mode

Hi Gordon,

thank you very much for your insightful response.

However, if you did have full and reduced design matrices, it is fairly straightforward to automatically generate the limma contrast matrix that is equivalent to comparing those two matrices.

Following your advice, I wrote a function that takes a reduced design matrix and converts it to the corresponding contrast matrix. It also checks that the models are indeed nested. You are right, this is quite elegant.

library(limma)
suppressMessages(library(airway))
data(airway)
assay(airway, "cpm") <- edgeR::cpm(airway, log = TRUE)


annihilator <- function(x){
  diag(nrow = nrow(x)) - x %*% solve(t(x) %*% x) %*% t(x)
}

reduced_design.fit <- function(fit, reduced_design){
  full_design <- fit$design

  mapping <- lm(reduced_design ~ full_design - 1)
  if(sum(abs(residuals(mapping))) > 1e-8){
    stop("Apparently the reduced design is not nested in the full design")
  }

  cntrst <- annihilator(coef(mapping))
  colnames(cntrst) <- colnames(full_design)
  rownames(cntrst) <- colnames(full_design)
  cntrst <- cntrst[, colSums(abs(cntrst)) > 1e-8]

  contrasts.fit(fit, contrast = cntrst)
}



full_design <- model.matrix(~ cell + dex, data = colData(airway))
fit <- lmFit(assay(airway, "cpm"), full_design)
a <- contrasts.fit(fit, coef = 2:4)
topTable(eBayes(a, trend = TRUE), number = 3)
#>                 cellN061011 cellN080611 cellN61311    AveExpr         F
#> ENSG00000213058   -5.997784  -5.9977840  0.3935414 -0.3581851 28310.347
#> ENSG00000164308    3.780246   3.1534606  3.5315346  4.8003548 13606.080
#> ENSG00000012817   -8.330419   0.1749454  0.2208583  3.5195512  3285.961
#>                      P.Value   adj.P.Val
#> ENSG00000213058 6.767049e-08 0.004337814
#> ENSG00000164308 2.325664e-07 0.007453984
#> ENSG00000012817 2.547302e-06 0.017566474
red_design <- model.matrix(~ dex - 1, data = colData(airway))
b <- reduced_design.fit(fit, red_design)
topTable(eBayes(b, trend = TRUE), number = 3)
#>                 cellN061011 cellN080611 cellN61311    AveExpr         F
#> ENSG00000213058   -5.997784  -5.9977840  0.3935414 -0.3581851 28310.347
#> ENSG00000164308    3.780246   3.1534606  3.5315346  4.8003548 13606.080
#> ENSG00000012817   -8.330419   0.1749454  0.2208583  3.5195512  3285.961
#>                      P.Value   adj.P.Val
#> ENSG00000213058 6.767049e-08 0.004337814
#> ENSG00000164308 2.325664e-07 0.007453984
#> ENSG00000012817 2.547302e-06 0.017566474

Created on 2021-09-13 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)

Thank you again for your help and for maintaing this great software :)

1
Entering edit mode
@james-w-macdonald-5106
Last seen 4 hours ago
United States

It is not really common to think about modeling high-throughput data in this way. You are fitting (usually) tens of thousands of models simultaneously, so there isn't really a concept of testing to see if a model explains the observed data better. It is almost surely the case that for Gene X the best model will be different than for Gene Y, but given the number of models, it is intractable to try to specify individual gene-level models. Instead the best we can do is fit a plausible model and then see if the p-value histogram looks reasonable (mostly uniform distribution, with hopefully some inflation in the smaller p-values), which would indicate that the model isn't altogether horrible for most genes.

Usually, in my experience, the problem is a lack of explanatory variables, in which case you get deflation of the small p-values. In which case I usually add some surrogate variables using the sva package.

Also, if you are planning to fit a conventional linear model with limma, you should use the voom pipeline to convert to logCPM and estimate observation-level weights. There is a strong relationship between the mean and variance for logCPM that you would normally want to control for by using a weighted model.

0
Entering edit mode

Hi James,

thanks for your thorough answer :)

I agree that this question is not about the most common use case in high-throughput testing. I should have probably given more detail on the background of my question.

I am currently writing a small wrapper package for limma and wanted to support testing for differential expression using either using contrasts or a reduced model. This is inspired by the nbinomLRT and nbinomWaldTest functions in DESeq2.

For this purpose, I was looking for the best way to make limma work on the reduced design matrix. I guess if I posit that the full model and reduced model are nested, I could calculate the corresponding coef parameter like this which(colnames(full_design) %in% setdiff(colnames(full_design), colnames(red_design))), but that feels very hacky so I was hoping for a more robust solution.

0
Entering edit mode

Ah. I think it's easier for Mike to do that sort of stuff because DESeq2 generates the design matrix internally unlike edgeR or limma, where you have to specify it separately. Probably the best way to get around that is to write a wrapper that takes a model formula like DESeq2 does, so you will have better control.

Although you can compute an LRT for conventional linear models, it's not really a common thing to do is it? It seems to me the preferred thing to do is to test the coefficient (or contrast) directly. Isn't that test universally most powerful in the context of normal-distribution based linear modeling? Also, an LRT is usually performed to compare nested models. Is it OK to compare non-nested models using LRT? That seems not copacetic, but I don't know for sure.

Login before adding your answer.

Traffic: 356 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6