Hi,
I am currently analyzing a rna-seq dataset for differentially expressed genes between 2 conditions (20 individuals in condition A and 6 in B, B being the reference). I ran both DESeq2 and edgeR (glmFit and glmLRT) to compare my results with both methods. I have 51 DE genes with DESeq2, 156 with edgeR, 44 are found by both. So they are globally consistent about the DE genes (good news) although edgeR find more DE genes. Nevertheless, log fold change can be quite different for some genes (not all) between the 2 methods, edgeR being generally over (absolute value) DESeq2.
Here is a "summary" table for 26 genes that are commonly DE with the 2 methods. Columns "basemean" to "padj" are from the DESeq2 analysis, those from logFC to FDR are from edgeR, and the 2 last columns give the mean cpm (from edgeR) for each conditions. I realize that this is a shortcut, but it's to keep an eye on the data.
baseMean log2FC lfcSE stat pvalue padj logFC logCPM LR PValue FDR cpm_A cpm_B
a 136.28 -3.792 0.4706 -8.058 7.776e-16 1.536e-11 -12.422 4.416 45.78 1.322e-11 2.788e-08 0.4766 34.89
b 161.14 -3.064 0.4469 -6.857 7.046e-12 4.640e-08 -11.834 3.426 55.04 1.182e-13 5.482e-10 1.7046 34.05
c 357.49 -3.036 0.4474 -6.786 1.150e-11 5.678e-08 -11.768 4.836 43.17 5.017e-11 7.271e-08 0.0654 89.80
d 82.38 -3.035 0.4571 -6.639 3.161e-11 1.249e-07 -11.591 3.738 43.98 3.311e-11 5.484e-08 0.0944 20.14
e 101.27 -2.963 0.4740 -6.252 4.055e-10 8.902e-07 -11.505 5.507 44.30 2.818e-11 5.028e-08 0.6707 24.40
f 362.13 -2.819 0.4408 -6.396 1.600e-10 3.950e-07 -11.422 3.164 49.05 2.499e-12 7.244e-09 0.0483 91.16
g 187.13 -2.598 0.4341 -5.984 2.171e-09 3.574e-06 -11.023 3.473 43.36 4.563e-11 7.054e-08 0.0315 47.32
h 104.11 -2.556 0.4470 -5.718 1.075e-08 1.180e-05 -11.000 3.711 39.63 3.072e-10 3.239e-07 0.1963 25.45
i 240.92 -2.555 0.4371 -5.846 5.042e-09 6.740e-06 -10.763 2.426 39.29 3.652e-10 3.682e-07 0.1635 60.44
j 180.58 -2.103 0.4123 -5.099 3.408e-07 2.694e-04 -10.738 3.633 41.28 1.319e-10 1.699e-07 0.0055 46.13
k 176.02 -2.089 0.4401 -4.747 2.060e-06 1.187e-03 -10.715 4.413 57.65 3.130e-14 1.815e-10 3.9389 32.06
l 84.07 -2.044 0.4559 -4.483 7.343e-06 3.022e-03 -10.265 4.394 58.45 2.088e-14 1.614e-10 0.6260 19.43
m 150.23 -2.022 0.4085 -4.950 7.416e-07 5.233e-04 -10.250 3.474 51.34 7.757e-13 2.998e-09 0.0079 38.08
n 112.69 -1.995 0.4711 -4.235 2.288e-05 6.550e-03 -8.449 3.832 38.72 4.894e-10 4.729e-07 1.4762 25.48
o 777.03 -1.903 0.4020 -4.733 2.207e-06 1.211e-03 -7.653 2.312 37.19 1.072e-09 9.563e-07 0.0618 195.46
p 209.40 -1.871 0.4006 -4.671 2.992e-06 1.526e-03 -6.981 2.643 32.19 1.397e-08 8.097e-06 0.0238 52.87
q 141.22 -1.866 0.4324 -4.315 1.597e-05 5.047e-03 -6.077 3.025 58.55 1.981e-14 1.614e-10 0.7316 33.11
r 187.26 -1.850 0.3991 -4.636 3.552e-06 1.694e-03 -5.458 3.072 17.87 2.364e-05 3.916e-03 0.0153 46.99
s 90.78 -1.819 0.3895 -4.670 3.013e-06 1.526e-03 -5.064 2.607 36.67 1.401e-09 1.203e-06 2.7839 14.08
t 603.21 -1.810 0.3682 -4.916 8.854e-07 6.032e-04 -4.834 2.328 17.34 3.130e-05 5.111e-03 19.1040 85.95
u 405.26 -1.761 0.3556 -4.953 7.325e-07 5.233e-04 -4.326 3.222 49.45 2.035e-12 6.743e-09 13.3401 56.45
v 486.81 -1.716 0.3905 -4.394 1.112e-05 4.070e-03 -3.993 2.747 16.88 3.979e-05 6.193e-03 0.0280 121.90
w 225.69 -1.674 0.3878 -4.317 1.583e-05 5.047e-03 -2.934 3.376 24.97 5.830e-07 1.779e-04 0.0117 56.61
x 221.30 -1.674 0.3880 -4.313 1.609e-05 5.047e-03 -2.278 2.437 24.03 9.472e-07 2.679e-04 0.0203 55.69
y 362.95 -1.593 0.3820 -4.171 3.029e-05 7.874e-03 -2.196 5.124 27.18 1.849e-07 6.497e-05 0.0089 90.87
z 88.98 -1.588 0.3821 -4.157 3.229e-05 8.212e-03 -2.105 4.553 27.41 1.646e-07 5.873e-05 0.0065 22.38
Looking at the cpm, I first thought the difference came when the gene was at zero for most individuals in one condition and was expressed in the other one. But it is not always the case. I guess the difference come from filters and normalization, but when I manually (imprecise, just to have an idea) calculate the log FC, sometimes it's closer to the DESeq2 value, sometimes to the edgeR. So, my list of DE genes is quite consistent, but could someone explain the differences in logFC? Note: I used this filter in my edgeR code
keep <- rowSums(cpm(dge)>1) >= 2
Thanks!
Thank you Michael for your answer. I went through your article again. I had also read the user guide for both packages. I had plot the plotMA for both shrunken and unshrunken results in DESEq2, and the effect is very obvious. I understand each package makes different choices, but I am surprised by the differences I can have for some logFC, wich can be either -3 or -12 for the same gene for example. From a biologist point of view, which one is "true"? Could I, for those genes that are DE expressed in both edgeR and DESeq2, recalculate the logFC myself in R, taking the mean of the normalized counts for the gene in each condition and then the log2? Or would that not be correct either, because it wouldn't take into account the distribution and dispersion?
Neither is true. When you estimate values from data, you don't get the true value, but you get an estimate. You would want the estimate to have small error: | estimate - truth |.
These are different estimators with different properties. Under the model, both kinds of estimates converge to the true value as the sample size grows larger, which is called "consistent" in statistics.
We have the default in DESeq2 to apply the moderation, because we find it useful for visualization and for using the LFC as point estimates. There are some cases when we think it's better to not moderate the LFC, for example if the idea of a common prior estimated from all genes is not appropriate. You can turn off LFC moderation with betaPrior=FALSE.
Calculating the LFC yourself to compare wouldn't really inform much here in my opinion. The unmoderated LFC is log2 of the ratio of the mean of normalized counts in each group, but that doesn't help you choose. If you want the fold changes to represent this value, then you can choose the unmoderated LFC in DESeq2 or use edgeR.
To add to Mike's point; imagine you have a gene with a count of zero in one sample and a non-zero count in another. If you didn't do any moderation, you'd end up with an infinite log-fold change between samples. That might sound impressive, but it's not actually that useful. For example, if the counts were 0 vs. 10 for one gene, and 1 vs. 10000 in another, the first gene would have a more extreme log-fold change than the second gene (infinite and log2(10000), respectively), despite the upregulation being clearly stronger in the latter. Similarly, if you had three genes with counts of 0 vs. 10, 0 vs. 100 and 0 vs. 1000, you wouldn't be able to tell them apart if they all had infinite log-fold changes.
In
edgeR
, a small prior count (0.25, I think) is added before computing log-fold changes, to provide some protection against this kind of silliness with zero counts. As Mike says, this shrunken estimate will converge to the true value as you get more counts from more samples. However, if your gene is completely silent in one condition such that it always gets zero counts for all samples in that condition, then the true log-fold change for that gene is inherently undefined. In such cases, any finite estimate is going to be wrong - but at least it's useful, which is more than I can say for a stack ofInf
values.Thanks again Michael and Aaron for your replies. I also indeed improperly asked my question: which logFC is the best estimator? Knowing that the sample size is set for my experiment. From you answers, I get that I cannot really answer that question.
But for the biology behind the experiment, how far can I analyze my results: would the following be appropriate?
- for estimated logFC similar via deseq2 and edgeR, I would consider that these estimates are close enough to the unkwon truth (knowing that I want an order of magnitude, is it about 2 or 100 or 1000 times up or down regulated?)
- for estimated logFC that are very different between DESeq2 and edgeR: I know my gene are differentially expressed between the 2 conditions, but I can't give a precise estimation, it's likely between those estimated logFC.
If the estimated log-fold changes are very different, it suggests that there's not a lot of information in the data, e.g., because the counts are too low. This means that the choice of prior for moderation/shrinkage is dominating the final estimate, with the different choices between
edgeR
andDESeq2
contributing to the final differences. When there isn't enough information, the true log-fold change could be anywhere - there's no guarantee that it lies between the two estimates. In short, if low counts are involved, you should take the estimated log-fold changes with a grain of salt. If you want a better answer, you should spend more money on sequencing to get bigger counts.Thanks, that's very clear and makes sense for those genes. Would that be also true for genes almost not expressed in one condition and much more expressed in the other, or is it only true if you have low counts throughout all samples?
Both. Obviously, the log-fold change depends on the expression in both conditions, so having low counts in either condition would affect precise estimation. That said, you don't need precise estimation if all you want to say is "this log-fold change is big" for any particular gene.