Question

Applying lmFit(), eBayes() and topTable() for modeling associations between NMR-derived metabolic measures and various traits

0

Entering edit mode

anni • 0

@23791ae8

Last seen 5 months ago

Finland

I'm aware of that limma functions were originally designed for microarray and RNA-Seq data, but I'm using them for modeling associations between metabolic measures (quantified with nuclear magnetic resonance (NMR) platform) and different traits (e.g. body mass index and overweight) here. I prefer using lmFit() and eBayes() over standard regression functions like lm() because, as far as I have understood, these limma functions account for shared structure of in high-dimensional data, optimize inference across multiple tests and moderate the variance estimates with the empirical Bayes approach.

I'm running the following chunk of code inside a loop over different traits some of which are binary and others are continuous. The aim is to test associations between each metabolic measure and trait.

# ph.dat is a data frame with individuals in rows, a single column for individual IDs, and additional columns for traits
# mtb.dat is a matrix with metabolic measures in rows and individuals (the same that are in ph.dat) in columns
# gr.labs is a named list where each element is a vector including labels (e.g. 'cases', 'controls') for a specific binary trait

stopifnot(all(ph.dat$id == colnames(mtb.dat)))

if(trait %in% bin.traits) { # if trait is binary
  gr.labs0 = gr.labs[[trait]]
  groupyes <- which(ph.dat[,trait]==gr.labs0[1])
  groupno <- which(ph.dat[,trait]==gr.labs0[2])

  design <- model.matrix(~0 + ph.dat[,trait])
  colnames(design) <- gsub("ph.dat\\[, trait\\]", "", colnames(design))
  design <- design[,c(gr.labs0[2], gr.labs0[1])]
  colnames(design) <- c("groupno", "groupyes")
  cont.matrix <- makeContrasts(yesvsno = (groupyes - groupno), levels = design)
  fit <- lmFit(mtb.dat, design)
  fit2 <- contrasts.fit(fit, cont.matrix)
} else { # if trait is not binary, it is assumed to be continuous
  design <- model.matrix(~ ph.dat[,trait])
  fit2 <- lmFit(mtb.dat, design)
}

fit.e <- eBayes(fit2)
topDM <- topTable(fit.e, number=nrow(mtb.dat))

The code runs without errors, but I want to ensure that my approach is conceptually sound. In particular, I would like to confirm the following points:

I tried to see what happens inside lmFit() and it seems that lm.series() is applied regardless of whether the trait is continuous or binary. The association is then modeled so that the trait is the predictor and 'omical' variable (here metabolic measure) is the response. Does lm.series() make some assumptions on the type or structure of the omics data? Is there some specific in lm.series() that would make it unsuitable for modeling associations with metabolic measures? What are its key differences compared to e.g. lm()? Is lm.series() just a wrapper around running e.g. lm() repeatedly for each feature (row)?
Do lmFit(), eBayes() or topTable() make any assumptions about whether the trait is continuous or categorical?
What is the purpose of the trend argument in eBayes(), and under what circumstances is it recommended to set it to TRUE versus leaving it at the default FALSE? According to the limma user guide (http://bioconductor.jp/packages/3.9/bioc/vignettes/limma/inst/doc/usersguide.pdf) setting trend = TRUE may increase statistical power but for example in the manual https://bioconductor.org/packages/release/bioc/manuals/limma/man/limma.pdf this argument is left at its default FALSE.

limma Bioconductor Metabolomics • 2.3k views

ADD COMMENT • link updated 5 months ago by Gordon Smyth 53k • written 5 months ago by anni • 0

score 0 · Answer 1 · 2025-07-02

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 10 hours ago

WEHI, Melbourne, Australia

The assumptions made by limma are explained in the documentation and in the relevant publications:

limma works with any number of binary, categorical or continuous predictors. It works exactly the same in each case.

Regarding the trend argument, try it and see. If there is clear trend then it will strengthen the analysis.

ADD COMMENT • link 5 months ago Gordon Smyth 53k

0

Entering edit mode

Ok, thanks!

I have just a couple more questions:

Based on the limma User's Guide, for binary predictors like overweight, I have interpreted that logFC represents the log2-fold-change in the response between the two groups (e.g. overweight and non-overweight). Based on this previous post, I have interpreted that when analyzing associations with continuous predictors such as body mass index (BMI), the logFC represents a log2-fold-change in the response corresponding to 1 unit change in this continuous predictor, right? However, to me the concept of 'fold change' is still a little unintuitive especially in the context of continuous variables. How do these logFC values compare to standard effect sizes derived using a basic regression model like lm()?
In general, is it recommended to standardize the molecular variables (here metabolic measure levels) to mean of 0 and standard deviation of 1? Would standardizing make the logFCs more comparable?
Regarding correction for multiple testing, my current code actually slightly deviates from what I wrote above. Given the potential for high correlations within metabolic measures, I have used poolr::meff() to estimate how many independent components ('effective' number of tests) exist within a correlated set of p-values. There are four methods ('nyholt', 'liji', 'gao', or 'galwey') to choose from to derive this estimate, all working by extracting the eigenvalues from the correlation matrix supplied as an input for poolr::meff(). This is my current code implementing multiple testing correction:

topDM <- topTable(fit.e, number=nrow(mtb.dat), adjust.method="none")
m.eff <- meff(cor(t(mtb.dat)), method = 'galwey') 
topDM$adj.P.Val = topDM$P.Value*m.eff

However, if the limma functions lmFit() and/or eBayes() somehow already account for the shared structure among the molecular variables, is this multiple testing correction still conceptually appropriate?

ADD REPLY • link 5 months ago anni • 0

0

Entering edit mode

How do these logFC values compare to standard effect sizes derived using a basic regression model like lm()?

Exactly the same. Both software tools are fitting linear models by least squares.

In general, is it recommended to standardize the molecular variables

Standardize by rows or by columns? I assume you mean by rows. No, we would not recommend that for any response variable. The purpose of statistical analysis is to detect and interpret biological signals, rather than to remove or arbitrarily equalize the biological signals.

It would be a pretty curious approach to standardize the variances and then use limma to moderate the variances. Obviously those two ideas cannot go together.

Would standardizing make the logFCs more comparable?

I think it would make the logFCs meaningless because they would no longer reflect the true biological response.

Some sort of normalization might be appropriate for your data but I don't know anything about the nature of your data or how it has been processed, so I can't make recommendations. Presumably you have already made sure the data is consistent between samples.

This is my current code implementing multiple testing correction

The FDR adjustment provided by limma already works in the presence of dependence between the molecular variables.

I have never seen a multiple testing adjustment like what you are doing, but it does not seem appropriate to me. First, the correlation structure would need, it seems to me, to take into account the design matrix in order to be meaningful. It is the correlation of the residuals that is relevant, not the naive correlation that conflates systematic effects with residual variation. Secondly, you are doing a sort of Bonferroni adjustment, which will generally be more conservative and have a different interpretation to the FDR adjustment that is the limma default.

Having said that, I am not familiar with the type of data you are analysing.

ADD REPLY • link 5 months ago Gordon Smyth 53k