13 months ago by

The Scripps Research Institute, La Jolla, CA

I doubt that limma-trend would help much here. I expect it would have the same problem as voom with low counts. If nothing else, one possibility is to manually remove the dip on the left side by replacing the part of trend line left of the maximum with a constant function equal to that maximum. This isn't completely ideal, but it should be stricly more conservative at low counts than standard voom, without throwing them away entirely. Here's how I would implement this strategy using the version of the voom function from limma 3.28.21 (Note: If you're not using this same version of limma, it would be better to make the same modification to the voom function from whichever version of limma you are using, rather than using the code provided here.):

modifiedVoom <- function (counts, design = NULL, lib.size = NULL, normalize.method = "none",
span = 0.5, plot = FALSE, save.plot = FALSE, ...)
{
out <- list()
if (is(counts, "DGEList")) {
out$genes <- counts$genes
out$targets <- counts$samples
if (is.null(design) && diff(range(as.numeric(counts$sample$group))) >
0)
design <- model.matrix(~group, data = counts$samples)
if (is.null(lib.size))
lib.size <- with(counts$samples, lib.size * norm.factors)
counts <- counts$counts
}
else {
isExpressionSet <- suppressPackageStartupMessages(is(counts,
"ExpressionSet"))
if (isExpressionSet) {
if (length(Biobase::fData(counts)))
out$genes <- Biobase::fData(counts)
if (length(Biobase::pData(counts)))
out$targets <- Biobase::pData(counts)
counts <- Biobase::exprs(counts)
}
else {
counts <- as.matrix(counts)
}
}
if (is.null(design)) {
design <- matrix(1, ncol(counts), 1)
rownames(design) <- colnames(counts)
colnames(design) <- "GrandMean"
}
if (is.null(lib.size))
lib.size <- colSums(counts)
y <- t(log2(t(counts + 0.5)/(lib.size + 1) * 1e+06))
y <- normalizeBetweenArrays(y, method = normalize.method)
fit <- lmFit(y, design, ...)
if (is.null(fit$Amean))
fit$Amean <- rowMeans(y, na.rm = TRUE)
sx <- fit$Amean + mean(log2(lib.size + 1)) - log2(1e+06)
sy <- sqrt(fit$sigma)
allzero <- rowSums(counts) == 0
if (any(allzero)) {
sx <- sx[!allzero]
sy <- sy[!allzero]
}
l <- lowess(sx, sy, f = span)
## Set all values left of the max to the max value
lmax <- which.max(l$y)
l$y[seq(1, lmax)] <- l$y[lmax]
## Continue with the rest of the algorithm
if (plot) {
plot(sx, sy, xlab = "log2( count size + 0.5 )", ylab = "Sqrt( standard deviation )",
pch = 16, cex = 0.25)
title("voom: Mean-variance trend")
lines(l, col = "red")
}
f <- approxfun(l, rule = 2)
if (fit$rank < ncol(design)) {
j <- fit$pivot[1:fit$rank]
fitted.values <- fit$coef[, j, drop = FALSE] %*% t(fit$design[,
j, drop = FALSE])
}
else {
fitted.values <- fit$coef %*% t(fit$design)
}
fitted.cpm <- 2^fitted.values
fitted.count <- 1e-06 * t(t(fitted.cpm) * (lib.size + 1))
fitted.logcount <- log2(fitted.count)
w <- 1/f(fitted.logcount)^4
dim(w) <- dim(fitted.logcount)
out$E <- y
out$weights <- w
out$design <- design
if (is.null(out$targets))
out$targets <- data.frame(lib.size = lib.size)
else out$targets$lib.size <- lib.size
if (save.plot) {
out$voom.xy <- list(x = sx, y = sy, xlab = "log2( count size + 0.5 )",
ylab = "Sqrt( standard deviation )")
out$voom.line <- l
}
new("EList", out)
}

The critical modification is marked by the comments. This will prevent the left-side dip in the voom trend, but it will not eliminate any of the other problems associated with fitting a linear model to (the logarithm of) very low count values. I cannot say whether this alone would be enough to "fix" the limma voom algorithm for low counts.

Another option, though one that would require a lot more work, is to switch to Salmon or Kallisto for gene quantification with bootstrap resampling and Sleuth for differential expression testing. Sleuth uses the bootstrap samples from Kallisto/Salmon to more directly quantify the uncertainty in expression estimates, rather than fitting a trend line as voom does, so it should not have the same problem as the voom trend at low counts (although I cannot say what *other* problems this approach might have for low counts, having not tried it before).

How many treatment groups do you have? If there are 100s of samples overall, are there also lots of samples in each distinct treatment group?

32kEach treatment group is small (duplicates or triplicates). Several treatment groups are in larger blocks to control for various batch effects.

40Have you looked into the cqn package? The package implements a correction of the expression based on GC content, and length of the seq (although a bit obscure). Using it you can normalize it and could increase the low counts to a more normal distribution.

300I don't think that will help. My understanding of the problem is that very low counts are problematic because there are very few unique count values available, so the apparent variance is artifactually low, not because the actual expression has low variability, but because small count values have very low information content. Normalization isn't going to fix this problem. The NB-based methods can deal with this because the negative binomial distribution actually models these discrete counts, so it "knows" about and accounts for this effect.

6.1k