Normalization methodology in DESeq/DESeq2 vs edgeR
Nik Tuzov
@nik-tuzov-8783

The reference for DESeq2's estimateDispersions() says that it uses Cox-Reid likelihood that was originally implemented in edgeR in 2010. I take it (correct me if I am wrong) that one drawback of CR vs GLM is that GLM can incorporate the effective library size as an offset, whereas CR can't. For that reason, counts have to be normalized prior to applying CR.

In DESeq2, normalized counts are obtained by dividing the raw counts by the corresponding size factors:

rawCounts <- counts(dds, normalized = FALSE)
dds <- estimateSizeFactors(dds)
normFactors <- sizeFactors(dds)
normCounts <- counts(dds, normalized = TRUE)
normCountsNew <- t(t(rawCounts) / normFactors)
# normCounts and normCountsNew are the same normalized counts.

I would like to compare this to the old ("classic") version of edgeR that didn't rely on GLMs. They could have applied a simple division as above, but instead they generated pseudo-counts using the "quantile-adjusted CML" of Robinson and Smyth (2008), "Small-sample estimation of negative binomial dispersion". It looks like the authors of DESeq/DESeq2 were well aware of the pseudo-counts approach but decided to simplify it instead. Is there any evidence that the simplified version isn't much worse?
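For reference, here is a rough sketch of the two alternatives being compared, assuming a raw count matrix `counts` and a `group` factor (both hypothetical, not from this thread); the `pseudo.counts` component name reflects current edgeR releases and is an assumption here:

library(edgeR)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Classic edgeR: quantile-adjusted pseudo-counts, produced as a side effect
# of the conditional (non-GLM) common dispersion estimation.
y <- estimateCommonDisp(y)
head(y$pseudo.counts)

# Simple scaling: divide each column by its relative library size,
# analogous to the simple division in the DESeq2 snippet above.
relLibSize <- y$samples$lib.size / mean(y$samples$lib.size)
head(t(t(counts) / relLibSize))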

deseq deseq2 edgeR normalization
Gordon Smyth
@gordon-smyth
WEHI, Melbourne, Australia

You've got the wrong end of the stick a bit. Cox-Reid isn't a competitor to GLM; rather, it is an essential part of the GLM methodology. See the paper in which Cox-Reid and GLMs were introduced:

  http://nar.oxfordjournals.org/content/40/10/4288

The so-called normCounts that you give from DESeq2 are just the same as the cpm values produced by the cpm() function in edgeR. I can't speak for DESeq2, but I doubt that they apply negative binomial modelling to cpm values.


Just to add to Gordon's answer; in general, there can be a considerable difference between simple scaling of the counts compared to the more sophisticated quantile adjustment method. Consider the simplest case, where we have a Poisson-distributed count y with mean u. Now, let's say that this count is occurring in a library that's half the size of all the other libraries. I could "normalize" for the library size difference by doubling y, which ensures that the mean of the doubled count 2y is now 2u and comparable to the counts for the other libraries.

However, by doing so, I would increase the variance of the scaled count by 4-fold. As var(y) = u in the Poisson distribution, I would end up with var(2y) = 4u. As such, my scaled count is no longer Poisson-distributed (if it were, then var(2y) should be equal to 2u instead). In contrast, quantile adjustment will preserve the mean-variance relationship, generating a Poisson-distributed pseudo-count with mean 2u and variance 2u. This might seem like a subtle point, but it is quite important; accurate modeling of the variance is critical to the DE analysis, and failure to do so will result in loss of power or loss of error control.
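A minimal simulation of this point (my own illustration, not from the original post), using an arbitrary mean u = 5 and a 2-fold scaling factor:

set.seed(1)
u <- 5
n <- 1e5

y      <- rpois(n, lambda = u)       # counts from the half-sized library
scaled <- 2 * y                      # "normalized" by simple doubling
pseudo <- rpois(n, lambda = 2 * u)   # what quantile adjustment aims to emulate

c(mean(scaled), var(scaled))   # roughly 10 and 20: variance is 4u, not 2u
c(mean(pseudo), var(pseudo))   # roughly 10 and 10: still Poisson (var = mean)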

So, basically, that's why quantile adjustment is used rather than direct scaling in "classic" edgeR. Of course, if you're using GLMs, this isn't an issue; normalization is handled naturally via the offsets without distorting the mean-variance relationship.

Simon Anders
@simon-anders-3855
Zentrum für Molekularbiologie, Universi…

"I take it (correct me if I am wrong) that one drawback of CR vs GLM is that GLM can incorporate the effective library size as an offset, whereas CR can't."

There is no "CR vs GLM". Cox-Reid adjustment is a way to reduce bias when estimating nuisance parameters, and you can use it both for the simple two-group comparison methods the early versions of edgeR and DESeq only offered, and for the full GLMs that were added later. 

DESeq does not divide by library size. That would indeed be much worse. We always stay with raw counts and use offsets, which works perfectly fine with GLMs and CR. 
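As a toy illustration of the offset idea (my own sketch with made-up numbers; a Poisson family is used for simplicity, whereas edgeR and DESeq fit negative binomial GLMs with estimated dispersions):

counts  <- c(10, 12, 30, 28)                 # raw counts for one gene
group   <- factor(c("A", "A", "B", "B"))
libSize <- c(1e6, 2e6, 1e6, 2e6)             # effective library sizes

# Library size enters as an offset on the log scale; the counts themselves
# are never divided or rescaled.
fit <- glm(counts ~ group + offset(log(libSize)), family = poisson)
summary(fit)$coefficients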

If I recall correctly, the reason why edgeR originally used this quantile-adjustment or pseudo-counts approach was that it carried out the original two-group comparison using an exact test, namely an NB equivalent of the binomial test. This test only worked for equal library sizes, which required the quantile adjustment. DESeq, as described in the 2010 paper, did something somewhat similar, but instead of transforming the data to make them fit the test (edgeR's quantile-adjusted pseudo-counts), we left the data as they were and used an approximate test.

In autumn 2010, we added GLMs, where offsets are an obvious way to incorporate library size. To get this to work, we had to replace our method-of-moments estimator of the dispersion with a likelihood-based one, and use a Cox-Reid adjustment to avoid biases. How to do that I learned from studying the edgeR 2008 paper and source code, which is why I acknowledged it so explicitly. At about the same time, the edgeR authors also realized that GLMs could easily be added and, hence, only a month or two after we had released a DESeq version supporting GLMs, they released a new version of edgeR with similar new features. I suppose the quantile adjustment was no longer used then, as it is not needed for GLMs, but I am not sure.


Simon,

I’m not sure why you found it necessary to give a history of the GLM code, but your post gives a mistaken account of the order of events and you give me no choice but to correct the record. Your post gives the impression that you invented Cox-Reid for GLMs and that GLMs in DESeq pre-dated GLMs in edgeR, neither of which is true.

I am also surprised that you feel entitled to make comments about what I and my students thought or realized at any time. You can’t read our minds and it would be better to stick to facts.

Dispersion estimation for GLMs was a major topic of my PhD thesis and I’ve been publishing papers on GLMs for 30 years. I didn’t just “realize” that Cox-Reid could be used for GLMs in 2010, I’ve been familiar with that for decades. My lab started the edgeR project back in 2004, although we didn’t publish our first paper on it until 2007. We were already thinking about GLM extensions for edgeR in 2008 and I recruited two students, Davis McCarthy and Yunshun Chen, to work on them. Davis wrote his honours thesis on edgeR in 2009 including both GLMs and Cox-Reid adjusted profile likelihood. Yunshun and I worked out an efficient computation scheme for Cox-Reid dispersion estimation in April 2010. After much stress testing, we finally released publicly a complete suite of working and fully documented GLM functions on 23 September 2010. Amongst other things, we introduced a novel method of computing the Cox-Reid adjustment from the QR decomposition of the design matrix. Here’s a key code snippet from edgeR in Oct 2010 computing the Cox-Reid correction term:

cr <- sum(log(abs(diag(fit$qr$qr)[1:fit$qr$rank])))

By contrast, there was no mention of GLMs in any of the DESeq documentation at that time. The first documentation about GLMs appeared in DESeq on 11 October 2010. Even then, the DESeq GLM code didn’t actually work. It was unable to estimate the dispersion parameter for any real GLM, not even for a simple linear regression or for an additive model with two factors.

The svn record shows that you didn’t add likelihood-based dispersion estimation or Cox-Reid adjustment to DESeq until 29 June 2011. And when you did, you used the same computational formula that edgeR introduced in Oct 2010. Here’s the line of code from DESeq computing the Cox-Reid term:

cr <- sum( log( abs( diag( qrres$qr )[ 1:qrres$rank ] ) ) )

You sent me a number of emails in 2011 explaining that you had worked from the edgeR code.

The svn record shows therefore that DESeq didn’t have working GLM code for more than 9 months after edgeR, and it was based on a reimplementation of one of edgeR’s methods. Even then it is a stretch to describe DESeq’s GLM functions as “similar” to the more sophisticated empirical Bayes moderation strategy in edgeR.

Then you removed the Cox-Reid adjustment from DESeq again on 10 October 2011, telling me in a couple of emails that you had introduced it only temporarily as an experiment.

All this is a matter of record. edgeR was the first Bioc package to implement NB GLMs for RNA-seq as far as I know. It was unambiguously the first to implement Cox-Reid likelihood adjustment and empirical Bayes dispersion estimation for GLMs. When we wrote up the edgeR GLM methodology for publication in 2011 we were actually unable to find any competitors to compare to.


Dear Gordon

I am really sorry for having stirred up this old story; it was really not my intention, and it was bad judgement on my part. I should not have written an email on such a potentially sensitive topic so late at night. What I wanted to do was simply to tell Nik that the use of pseudo-counts or quantile adjustment is not a difference between DESeq and edgeR, but that it once was one, though only a minor one, at a time when neither of the two tools offered GLM functionality. I should have stated this as succinctly as you did.

So, for the record: the crucial advance needed to bring generalized linear modelling capability to sequencing count data was the dispersion-estimation procedure devised by Gordon and his students. The line of code that Gordon posted is the essence of this idea. This is why the DESeq help page acknowledges edgeR so explicitly.

What I released in 2010 was a version of DESeq that supported GLMs, but with dispersion estimation based on our method-of-moments approach, which, frankly, was not all that useful. While I was proud of it at the time and felt that bringing GLMs into RNA-seq analysis was an achievement, Gordon's judgment ("didn't actually work") is somewhat justified. It became useful only when I later replaced the MoM estimator with the estimator used in edgeR, which they described in their 2011 NAR paper.

I often wondered why edgeR did not use GLMs from the beginning. I somehow always thought that the CR dispersion estimator was already implicit in Gordon's earlier publications on the use of CR-adjusted likelihood in linear modelling, and that it would hence have been straightforward for them to use it in the manner that edgeR now does. Until reading Gordon's post just now, I had never realized how non-obvious the part with the QR diagonal is, and how much work went into getting that last part right. With hindsight, everything looks easy.

  Simon


Seems Gordon was faster in replying succinctly, while I strained my memory to remember what the story behind all this originally was -- even though it's long ago and does not matter, and I probably also got it wrong.


Thank you all very much for replying. It doesn't look like the "normalized counts" obtainable from edgeR or DESeq2 are very meaningful, except that they can be fed to limma. Is that the only usage approved by the party line?


cpm and rpkm values are meaningful, but they are for data display and data exploration rather than for formal DE analysis.
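For example (a hedged sketch of that kind of exploratory use, assuming a raw count matrix `counts` and a `group` factor, neither of which comes from this thread):

library(edgeR)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# log-CPM values for plotting, clustering and QC, not for the DE test itself
logCPM <- cpm(y, log = TRUE, prior.count = 2)
plotMDS(logCPM)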
