Big differences in logfoldchange for some genes between edgeR and DESeq2
2
0
Entering edit mode
Matarit • 0
@matarit-9136
Last seen 5.6 years ago
France

Hi,

I am currently analyzing a rna-seq dataset for differentially expressed genes between 2 conditions (20 individuals in condition A and 6 in B, B being the reference). I ran both DESeq2 and edgeR (glmFit and glmLRT) to compare my results with both methods. I have 51 DE genes with DESeq2, 156 with edgeR, 44 are found by both. So they are globally consistent about the DE genes (good news) although edgeR find more DE genes. Nevertheless, log fold change can be quite different for some genes (not all) between the 2 methods, edgeR being generally over (absolute value) DESeq2.

Here is a "summary" table for 26 genes that are commonly DE with the 2 methods. Columns "basemean" to "padj" are from the DESeq2 analysis, those from logFC to FDR are from edgeR, and the 2 last columns give the mean cpm (from edgeR) for each conditions. I realize that this is a shortcut, but it's to keep an eye on the data.

 baseMean log2FC  lfcSE   stat    pvalue      padj   logFC logCPM    LR    PValue       FDR       cpm_A  cpm_B
a   136.28 -3.792 0.4706 -8.058 7.776e-16 1.536e-11 -12.422  4.416 45.78 1.322e-11 2.788e-08    0.4766  34.89
b   161.14 -3.064 0.4469 -6.857 7.046e-12 4.640e-08 -11.834  3.426 55.04 1.182e-13 5.482e-10    1.7046  34.05
c   357.49 -3.036 0.4474 -6.786 1.150e-11 5.678e-08 -11.768  4.836 43.17 5.017e-11 7.271e-08    0.0654  89.80
d    82.38 -3.035 0.4571 -6.639 3.161e-11 1.249e-07 -11.591  3.738 43.98 3.311e-11 5.484e-08    0.0944  20.14
e   101.27 -2.963 0.4740 -6.252 4.055e-10 8.902e-07 -11.505  5.507 44.30 2.818e-11 5.028e-08    0.6707  24.40
f   362.13 -2.819 0.4408 -6.396 1.600e-10 3.950e-07 -11.422  3.164 49.05 2.499e-12 7.244e-09    0.0483  91.16
g   187.13 -2.598 0.4341 -5.984 2.171e-09 3.574e-06 -11.023  3.473 43.36 4.563e-11 7.054e-08    0.0315  47.32
h   104.11 -2.556 0.4470 -5.718 1.075e-08 1.180e-05 -11.000  3.711 39.63 3.072e-10 3.239e-07    0.1963  25.45
i   240.92 -2.555 0.4371 -5.846 5.042e-09 6.740e-06 -10.763  2.426 39.29 3.652e-10 3.682e-07    0.1635  60.44
j   180.58 -2.103 0.4123 -5.099 3.408e-07 2.694e-04 -10.738  3.633 41.28 1.319e-10 1.699e-07    0.0055  46.13
k   176.02 -2.089 0.4401 -4.747 2.060e-06 1.187e-03 -10.715  4.413 57.65 3.130e-14 1.815e-10    3.9389  32.06
l    84.07 -2.044 0.4559 -4.483 7.343e-06 3.022e-03 -10.265  4.394 58.45 2.088e-14 1.614e-10    0.6260  19.43
m   150.23 -2.022 0.4085 -4.950 7.416e-07 5.233e-04 -10.250  3.474 51.34 7.757e-13 2.998e-09    0.0079  38.08
n   112.69 -1.995 0.4711 -4.235 2.288e-05 6.550e-03  -8.449  3.832 38.72 4.894e-10 4.729e-07    1.4762  25.48
o   777.03 -1.903 0.4020 -4.733 2.207e-06 1.211e-03  -7.653  2.312 37.19 1.072e-09 9.563e-07    0.0618 195.46
p   209.40 -1.871 0.4006 -4.671 2.992e-06 1.526e-03  -6.981  2.643 32.19 1.397e-08 8.097e-06    0.0238  52.87
q   141.22 -1.866 0.4324 -4.315 1.597e-05 5.047e-03  -6.077  3.025 58.55 1.981e-14 1.614e-10    0.7316  33.11
r   187.26 -1.850 0.3991 -4.636 3.552e-06 1.694e-03  -5.458  3.072 17.87 2.364e-05 3.916e-03    0.0153  46.99
s    90.78 -1.819 0.3895 -4.670 3.013e-06 1.526e-03  -5.064  2.607 36.67 1.401e-09 1.203e-06    2.7839  14.08
t   603.21 -1.810 0.3682 -4.916 8.854e-07 6.032e-04  -4.834  2.328 17.34 3.130e-05 5.111e-03   19.1040  85.95
u   405.26 -1.761 0.3556 -4.953 7.325e-07 5.233e-04  -4.326  3.222 49.45 2.035e-12 6.743e-09   13.3401  56.45
v   486.81 -1.716 0.3905 -4.394 1.112e-05 4.070e-03  -3.993  2.747 16.88 3.979e-05 6.193e-03    0.0280 121.90
w   225.69 -1.674 0.3878 -4.317 1.583e-05 5.047e-03  -2.934  3.376 24.97 5.830e-07 1.779e-04    0.0117  56.61
x   221.30 -1.674 0.3880 -4.313 1.609e-05 5.047e-03  -2.278  2.437 24.03 9.472e-07 2.679e-04    0.0203  55.69
y   362.95 -1.593 0.3820 -4.171 3.029e-05 7.874e-03  -2.196  5.124 27.18 1.849e-07 6.497e-05    0.0089  90.87
z    88.98 -1.588 0.3821 -4.157 3.229e-05 8.212e-03  -2.105  4.553 27.41 1.646e-07 5.873e-05    0.0065  22.38

Looking at the cpm, I first thought the difference came when the gene was at zero for most individuals in one condition and was expressed in the other one. But it is not always the case. I guess the difference come from filters and normalization, but when I manually (imprecise, just to have an idea) calculate the log FC, sometimes it's closer to the DESeq2 value, sometimes to the edgeR. So, my list of DE genes is quite consistent, but could someone explain the differences in logFC? Note: I used this filter in my edgeR code

keep <- rowSums(cpm(dge)>1) >= 2

Thanks!

edger deseq2 • 2.5k views
3
Entering edit mode
@mikelove
Last seen 1 day ago
United States

I don't have much comment on the difference in the size of the set of genes passing a threshold. I think users often expect there to be a sharp division between DE and not DE, when this is not the case in reality. Furthermore, p-values are tail probabilities, which are highly sensitive to subtle differences in how the model is built up, and the software make different choices here and there. If you look at the rank of the p-values, you'll see that the software are very consistent.

For the question about comparing LFC, note that DESeq2, by default, puts a prior distribution on log fold changes, giving a "shrunken" estimate (the maximal value of the posterior distribution). edgeR also shrinks the LFC using a small prior count, but the shrinkage is generally not as strong with this approach. You can read all about this in the DESeq2 paper:

http://www.genomebiology.com/2014/15/12/550

0
Entering edit mode

Thank you Michael for your answer. I went through your article again. I had also read the user guide for both packages. I had plot the plotMA for both shrunken and unshrunken results in DESEq2, and the effect is very obvious. I understand each package makes different choices, but I am surprised by the differences I can have for some logFC, wich can be either -3 or -12 for the same gene for example. From a biologist point of view, which one is "true"? Could I, for those genes that are DE expressed in both edgeR and DESeq2, recalculate the logFC myself in R, taking the mean of the normalized counts for the gene in each condition and then the log2? Or would that not be correct either, because it wouldn't take into account the distribution and dispersion?

1
Entering edit mode

Neither is true. When you estimate values from data, you don't get the true value, but you get an estimate. You would want the estimate to have small error: | estimate - truth |.

These are different estimators with different properties. Under the model, both kinds of estimates converge to the true value as the sample size grows larger, which is called "consistent" in statistics.

We have the default in DESeq2 to apply the moderation, because we find it useful for visualization and for using the LFC as point estimates. There are some cases when we think it's better to not moderate the LFC, for example if the idea of a common prior estimated from all genes is not appropriate. You can turn off LFC moderation with betaPrior=FALSE.

Calculating the LFC yourself to compare wouldn't really inform much here in my opinion. The unmoderated LFC is log2 of the ratio of the mean of normalized counts in each group, but that doesn't help you choose. If you want the fold changes to represent this value, then you can choose the unmoderated LFC in DESeq2 or use edgeR.

2
Entering edit mode

To add to Mike's point; imagine you have a gene with a count of zero in one sample and a non-zero count in another. If you didn't do any moderation, you'd end up with an infinite log-fold change between samples. That might sound impressive, but it's not actually that useful. For example, if the counts were 0 vs. 10 for one gene, and 1 vs. 10000 in another, the first gene would have a more extreme log-fold change than the second gene (infinite and log2(10000), respectively), despite the upregulation being clearly stronger in the latter. Similarly, if you had three genes with counts of 0 vs. 10, 0 vs. 100 and 0 vs. 1000, you wouldn't be able to tell them apart if they all had infinite log-fold changes.

In edgeR, a small prior count (0.25, I think) is added before computing log-fold changes, to provide some protection against this kind of silliness with zero counts. As Mike says, this shrunken estimate will converge to the true value as you get more counts from more samples. However, if your gene is completely silent in one condition such that it always gets zero counts for all samples in that condition, then the true log-fold change for that gene is inherently undefined. In such cases, any finite estimate is going to be wrong - but at least it's useful, which is more than I can say for a stack of Inf values.

0
Entering edit mode

Thanks again Michael and Aaron for your replies. I also indeed improperly asked my question: which logFC is the best estimator? Knowing that the sample size is set for my experiment. From you answers, I get that I cannot really answer that question.

But for the biology behind the experiment, how far can I analyze my results: would the following be appropriate?
- for estimated logFC similar via deseq2 and edgeR, I would consider that these estimates are close enough to the unkwon truth (knowing that I want an order of magnitude, is it about 2 or 100 or 1000 times up or down regulated?)
- for estimated logFC that are very different between DESeq2 and edgeR: I know my gene are differentially expressed between the 2 conditions, but I can't give a precise estimation, it's likely between those estimated logFC.

3
Entering edit mode

If the estimated log-fold changes are very different, it suggests that there's not a lot of information in the data, e.g., because the counts are too low. This means that the choice of prior for moderation/shrinkage is dominating the final estimate, with the different choices between edgeR and DESeq2 contributing to the final differences. When there isn't enough information, the true log-fold change could be anywhere - there's no guarantee that it lies between the two estimates. In short, if low counts are involved, you should take the estimated log-fold changes with a grain of salt. If you want a better answer, you should spend more money on sequencing to get bigger counts.

0
Entering edit mode

Thanks, that's very clear and makes sense for those genes. Would that be also true for genes almost not expressed in one condition and much more expressed in the other, or is it only true if you have low counts throughout all samples?

1
Entering edit mode

Both. Obviously, the log-fold change depends on the expression in both conditions, so having low counts in either condition would affect precise estimation. That said, you don't need precise estimation if all you want to say is "this log-fold change is big" for any particular gene.

3
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

The amount of shrinkage of the logFC is settable parameter in edgeR. You can get logFC that are shrunk as much as those from DESeq2 by increasing the value of the 'prior.count' argument when you run glmFit(). Or you can get unshrunk logFC by setting prior.count=0.

How much shrinkage is best depends on your purpose. By default we choose to give values that are slightly shrunk -- this avoids problems with zero counts and infinite fold changes, but also still allows you to see genes with large relative changes.

0
Entering edit mode

Thank you all for your help. I am re-running my analyses with and without shrinkage, and will compare my results.