I'm aware of that limma functions were originally designed for microarray and RNA-Seq data, but I'm using them for modeling associations between metabolic measures (quantified with nuclear magnetic resonance (NMR) platform) and different traits (e.g. body mass index and overweight) here. I prefer using lmFit() and eBayes() over standard regression functions like lm() because, as far as I have understood, these limma functions account for shared structure of in high-dimensional data, optimize inference across multiple tests and moderate the variance estimates with the empirical Bayes approach.
I'm running the following chunk of code inside a loop over different traits some of which are binary and others are continuous. The aim is to test associations between each metabolic measure and trait.
# ph.dat is a data frame with individuals in rows, a single column for individual IDs, and additional columns for traits
# mtb.dat is a matrix with metabolic measures in rows and individuals (the same that are in ph.dat) in columns
# gr.labs is a named list where each element is a vector including labels (e.g. 'cases', 'controls') for a specific binary trait
stopifnot(all(ph.dat$id == colnames(mtb.dat)))
if(trait %in% bin.traits) { # if trait is binary
gr.labs0 = gr.labs[[trait]]
groupyes <- which(ph.dat[,trait]==gr.labs0[1])
groupno <- which(ph.dat[,trait]==gr.labs0[2])
design <- model.matrix(~0 + ph.dat[,trait])
colnames(design) <- gsub("ph.dat\\[, trait\\]", "", colnames(design))
design <- design[,c(gr.labs0[2], gr.labs0[1])]
colnames(design) <- c("groupno", "groupyes")
cont.matrix <- makeContrasts(yesvsno = (groupyes - groupno), levels = design)
fit <- lmFit(mtb.dat, design)
fit2 <- contrasts.fit(fit, cont.matrix)
} else { # if trait is not binary, it is assumed to be continuous
design <- model.matrix(~ ph.dat[,trait])
fit2 <- lmFit(mtb.dat, design)
}
fit.e <- eBayes(fit2)
topDM <- topTable(fit.e, number=nrow(mtb.dat))
The code runs without errors, but I want to ensure that my approach is conceptually sound. In particular, I would like to confirm the following points:
- I tried to see what happens inside
lmFit()and it seems thatlm.series()is applied regardless of whether the trait is continuous or binary. The association is then modeled so that the trait is the predictor and 'omical' variable (here metabolic measure) is the response. Doeslm.series()make some assumptions on the type or structure of the omics data? Is there some specific inlm.series()that would make it unsuitable for modeling associations with metabolic measures? What are its key differences compared to e.g.lm()? Islm.series()just a wrapper around running e.g.lm()repeatedly for each feature (row)? - Do
lmFit(),eBayes()ortopTable()make any assumptions about whether the trait is continuous or categorical? - What is the purpose of the
trendargument ineBayes(), and under what circumstances is it recommended to set it toTRUEversus leaving it at the defaultFALSE? According to the limma user guide (http://bioconductor.jp/packages/3.9/bioc/vignettes/limma/inst/doc/usersguide.pdf) settingtrend = TRUEmay increase statistical power but for example in the manual https://bioconductor.org/packages/release/bioc/manuals/limma/man/limma.pdf this argument is left at its defaultFALSE.

Ok, thanks!
I have just a couple more questions:
However, if the limma functions lmFit() and/or eBayes() somehow already account for the shared structure among the molecular variables, is this multiple testing correction still conceptually appropriate?
Exactly the same. Both software tools are fitting linear models by least squares.
Standardize by rows or by columns? I assume you mean by rows. No, we would not recommend that for any response variable. The purpose of statistical analysis is to detect and interpret biological signals, rather than to remove or arbitrarily equalize the biological signals.
It would be a pretty curious approach to standardize the variances and then use limma to moderate the variances. Obviously those two ideas cannot go together.
I think it would make the logFCs meaningless because they would no longer reflect the true biological response.
Some sort of normalization might be appropriate for your data but I don't know anything about the nature of your data or how it has been processed, so I can't make recommendations. Presumably you have already made sure the data is consistent between samples.
The FDR adjustment provided by limma already works in the presence of dependence between the molecular variables.
I have never seen a multiple testing adjustment like what you are doing, but it does not seem appropriate to me. First, the correlation structure would need, it seems to me, to take into account the design matrix in order to be meaningful. It is the correlation of the residuals that is relevant, not the naive correlation that conflates systematic effects with residual variation. Secondly, you are doing a sort of Bonferroni adjustment, which will generally be more conservative and have a different interpretation to the FDR adjustment that is the limma default.
Having said that, I am not familiar with the type of data you are analysing.