Search
Question: DE for genes with very low counts using limma.
0
24 months ago by
maltethodberg100
UCPH
maltethodberg100 wrote:

I am analyzing a very large RNA-Seq experiment (100s of samples). I intend to use limma for the analysis, mainly due to the speed of limma compared to edgeR or DESeq2.

I specifically want to keep a large number of very lowly expressed genes in the analysis. When I apply the voom, the fitted mean-variance trend dips down at very low expression values. According to this thread: C: Filtering lowly expressed genes in voom-limma analysis, this compromises the model.

I’m looking for solutions to get around this, while keeping as many of the lowly expressed genes as possible.

I tried to filter away the most lowly expressed genes by using edgeR::aveLogCPM, until the mean-variance trend outputted by voom was monotonically decreasing. This however, leads to a loss of quite a large number of genes.

The library sizes are quite stable, with only a few samples outside the recommended 3-fold range, so I also looked at using the limma-trend approach. I was unable to find any info on the how limma-trend behaves when there are many lowly expressed genes. Does it suffer from the same limitations as voom? I note that the prior count in the log-transformation (using edgeR::cpm) can be tweaked in this pipeline, whereas voom always uses a prior count of 0.5.

modified 24 months ago by Gordon Smyth35k • written 24 months ago by maltethodberg100

How many treatment groups do you have? If there are 100s of samples overall, are there also lots of samples in each distinct treatment group?

Each treatment group is small (duplicates or triplicates). Several treatment groups are in larger blocks to control for various batch effects.

Have you looked into the cqn package? The package implements a correction of the expression based on GC content, and length of the seq (although a bit obscure). Using it you can normalize it and could increase the low counts to a more normal distribution.

I don't think that will help. My understanding of the problem is that very low counts are problematic because there are very few unique count values available, so the apparent variance is artifactually low, not because the actual expression has low variability, but because small count values have very low information content. Normalization isn't going to fix this problem. The NB-based methods can deal with this because the negative binomial distribution actually models these discrete counts, so it "knows" about and accounts for this effect.

5
24 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

Or you can just use edgeR, which will handle the low counts appropriately. I've run edgeR on a single-cell data set containing approximately 3000 cells; it took about half an hour on a desktop computer, so hundreds of samples shouldn't be a problem. Set up the analysis, start it running and go grab a coffee.

What specific approach of edgeR would you recommend using in this case, the standard or QL pipeline?

3
24 months ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.9k wrote:

I doubt that limma-trend would help much here. I expect it would have the same problem as voom with low counts. If nothing else, one possibility is to manually remove the dip on the left side by replacing the part of trend line left of the maximum with a constant function equal to that maximum. This isn't completely ideal, but it should be stricly more conservative at low counts than standard voom, without throwing them away entirely. Here's how I would implement this strategy using the version of the voom function from limma 3.28.21 (Note: If you're not using this same version of limma, it would be better to make the same modification to the voom function from whichever version of limma you are using, rather than using the code provided here.):

modifiedVoom <- function (counts, design = NULL, lib.size = NULL, normalize.method = "none",
span = 0.5, plot = FALSE, save.plot = FALSE, ...)
{
out <- list()
if (is(counts, "DGEList")) {
out$genes <- counts$genes
out$targets <- counts$samples
if (is.null(design) && diff(range(as.numeric(counts$sample$group))) >
0)
design <- model.matrix(~group, data = counts$samples) if (is.null(lib.size)) lib.size <- with(counts$samples, lib.size * norm.factors)
counts <- counts$counts } else { isExpressionSet <- suppressPackageStartupMessages(is(counts, "ExpressionSet")) if (isExpressionSet) { if (length(Biobase::fData(counts))) out$genes <- Biobase::fData(counts)
if (length(Biobase::pData(counts)))
out$targets <- Biobase::pData(counts) counts <- Biobase::exprs(counts) } else { counts <- as.matrix(counts) } } if (is.null(design)) { design <- matrix(1, ncol(counts), 1) rownames(design) <- colnames(counts) colnames(design) <- "GrandMean" } if (is.null(lib.size)) lib.size <- colSums(counts) y <- t(log2(t(counts + 0.5)/(lib.size + 1) * 1e+06)) y <- normalizeBetweenArrays(y, method = normalize.method) fit <- lmFit(y, design, ...) if (is.null(fit$Amean))
fit$Amean <- rowMeans(y, na.rm = TRUE) sx <- fit$Amean + mean(log2(lib.size + 1)) - log2(1e+06)
sy <- sqrt(fit$sigma) allzero <- rowSums(counts) == 0 if (any(allzero)) { sx <- sx[!allzero] sy <- sy[!allzero] } l <- lowess(sx, sy, f = span) ## Set all values left of the max to the max value lmax <- which.max(l$y)
l$y[seq(1, lmax)] <- l$y[lmax]
## Continue with the rest of the algorithm
if (plot) {
plot(sx, sy, xlab = "log2( count size + 0.5 )", ylab = "Sqrt( standard deviation )",
pch = 16, cex = 0.25)
title("voom: Mean-variance trend")
lines(l, col = "red")
}
f <- approxfun(l, rule = 2)
if (fit$rank < ncol(design)) { j <- fit$pivot[1:fit$rank] fitted.values <- fit$coef[, j, drop = FALSE] %*% t(fit$design[, j, drop = FALSE]) } else { fitted.values <- fit$coef %*% t(fit$design) } fitted.cpm <- 2^fitted.values fitted.count <- 1e-06 * t(t(fitted.cpm) * (lib.size + 1)) fitted.logcount <- log2(fitted.count) w <- 1/f(fitted.logcount)^4 dim(w) <- dim(fitted.logcount) out$E <- y
out$weights <- w out$design <- design
if (is.null(out$targets)) out$targets <- data.frame(lib.size = lib.size)
else out$targets$lib.size <- lib.size
if (save.plot) {
out$voom.xy <- list(x = sx, y = sy, xlab = "log2( count size + 0.5 )", ylab = "Sqrt( standard deviation )") out$voom.line <- l
}
new("EList", out)
}


The critical modification is marked by the comments. This will prevent the left-side dip in the voom trend, but it will not eliminate any of the other problems associated with fitting a linear model to (the logarithm of) very low count values. I cannot say whether this alone would be enough to "fix" the limma voom algorithm for low counts.

Another option, though one that would require a lot more work, is to switch to Salmon or Kallisto for gene quantification with bootstrap resampling and Sleuth for differential expression testing. Sleuth uses the bootstrap samples from Kallisto/Salmon to more directly quantify the uncertainty in expression estimates, rather than fitting a trend line as voom does, so it should not have the same problem as the voom trend at low counts (although I cannot say what other problems this approach might have for low counts, having not tried it before).

Well, I think that this approach (extending the peak variance to the left) will be quite conservative, probably too conservative, for low counts. It is not actually necessary that the voom curve be strictly monotonic, it only needs to be smooth enough that the loess curve is a reasonable it.

Regarding Kallisto/Salmon, as far as I can see, Kallisto's bootstrap estimates technical variability only and does not help with estimating variability between samples, which is what voom is doing. The bootstrap will have just as much problem with low counts as other methods: if there are few reads or no reads then there's nothing to bootstrap.

1
24 months ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:

I agree with Aaron. If you need to push the envelope regarding low count genes, then edgeR QL pipeline will handle the low counts more effectively than limma, and it can certainly handle a few hundred samples without any difficulty.

If want to keep trying limma, then here are a few suggestions:

1. I would use limma-trend rather than limma-voom. This has essentially the same filtering requirements as voom, but will be faster and just as correct for your data.
2. Filtering on aveLogCPM will be overly conservative for your data.I suggest you instead try keeping genes that have at least 10 reads in at least K samples, where K might perhaps be as low as 2. You will have to decide what value of K makes sense for your data.
3. Fit the linear model using eBayes(fit, robust=TRUE, trend=TRUE) and the examine plotSA(fit). If the plot looks too problematic, then you should probably move to edgeR. The plot doesn't need to be strictly monotonic, but any dip on the left should be not be so sharp that the loess curve can't follow it accurately.

That was more or less exactly what I was doing, ranking genes with aveLogCPM and then filtering out the most lowly expressed genes until the prior curve from plotSA looked decent. I will try playing around with different filtering criteria.
With limma-trend my reasoning was that you could potentially add a higher prior count to all expression values when log transforming (instead of voom's 0.5). This would not increase the number of unique expression value for each gene, but would increase the mean expression - how would this affect the loess fit?

Well, I am glad that you have tried limma-trend, as your original post did not make that clear. In other respects however my suggestions are not what you were doing. I am suggesting that you switch from aveLogCPM to a different criterion and turn on robust estimation. The filtering I am suggesting will perform very differently to what you are currently doing.

Yes, with limma-trend we always use a prior.count greater than 0.5. At least prior.count=1 but perhaps prior.count=3. Why don't you try it and see how it goes? If you have tried it, how did it work?

Played around a bit with the different thresholds using limma-trend, to see how it it affects the prior fit for lowly expressed genes. Made the following observations:
1) Filtering out genes with a substantial amount of zero counts across samples, removes most of the prior dip at lowly expressed genes with prior.count=0.5.
2) Increasing the prior.count makes the prior dip more extreme - to counteract it, more lowly expressed must be filtered out.
What's the reasoning behind limma-trend requiring a higher prior.count than voom?

limma-trend doesn't require a higher prior.count than voom, it just gives you the freedom to choose. Choosing a higher prior.count makes the trend line flatter and less important, which is good for methods like heatmaps for other plots than don't use the trend line. You can therefore use the same expression summary for all purposes.

Yes, a higher prior count does depress the variances of low count genes more severely, but this does not mean that you need to do more filtering. The trend line does have to be strictly monotonic.