Wald vs LRT: different number of significant genes when testing the same coefficient
1
0
Entering edit mode
aziza ▴ 20
@aziza-10616
Last seen 7.5 years ago

Hi list,

My question is simple: what questions are each of these two tests responding to? (why do i get different results when testing for the same coefficient in the Wald vs LRT test?) The FC is the same, the p-values change because the LRT provides a p-value testing for the null hypothesis that the coefficients that are not in the reduced formula are equal to zero, the Wald test provide p-values testing for the null hypothesis that the coefficients of the full model are equal to zero. But how do you interpret this?

What are the genes that are being shown as significant for each test?

> design(ddsK)=~A*B*C
> resRed = DESeq(ddsK, test = "LRT", reduced = formula(~ A*B))
> resRed = results(resRed, name = "C_C2_vs_C1", alpha = 0.05)
> summary(resRed)

out of 28842 with nonzero total read count
LFC > 0 (up)     : 5936, 21%
LFC < 0 (down)   : 5894, 20%
outliers [1]     : 81, 0.28%
low counts [2]   : 3354, 12%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> design(ddsK)=~A*B*C
> ddsK=DESeq(ddsK)
> resB <- results(ddsK, name = "C_C2_vs_C1", alpha = 0.05)
> summary(resB)

out of 28842 with nonzero total read count
LFC > 0 (up)     : 1142, 4%
LFC < 0 (down)   : 1088, 3.8%
outliers [1]     : 81, 0.28%
low counts [2]   : 5015, 17%
(mean count < 3)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> resRed
log2 fold change (MLE): C C2 vs C1
LRT p-value: '~ A * B * C' vs '~ A * B'
DataFrame with 29085 rows and 6 columns
baseMean log2FoldChange      lfcSE      stat       pvalue         padj
<numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
comp100000_c0   5.394950     0.88129957  0.7296606 25.557624 3.885476e-05 1.842103e-04
comp100000_c1  15.236690     0.83653169  0.6352388 48.670559 6.839876e-10 8.012020e-09
comp100000_c3   7.684048    -0.02531354  0.9073333 23.174532 1.168546e-04 4.968079e-04
comp100002_c0 241.680116     1.27044149  0.3719448 52.028340 1.360758e-10 1.797857e-09
comp100003_c0  11.010333     1.44552180  0.6997581  8.617796 7.139597e-02 1.266642e-01
...                  ...            ...        ...       ...          ...          ...
comp99991_c0   1643.4060     0.06606909 0.08068866 39.811182 4.735591e-08 4.018609e-07
comp99992_c0    651.1968    -0.08546246 0.19208340  2.133650 7.111930e-01 7.639641e-01
comp99993_c0   1957.2871     0.08239842 0.13402428  5.188685 2.684794e-01 3.605124e-01
comp99994_c0   1562.8709    -0.12381266 0.14475235  1.837816 7.655570e-01 8.095270e-01
comp99999_c0    170.9605     0.15237421 0.20194552  4.054691 3.986549e-01 4.888804e-01
> resB
log2 fold change (MLE): C C2 vs C1
Wald test p-value: C C2 vs C1
DataFrame with 29085 rows and 6 columns
baseMean log2FoldChange      lfcSE        stat       pvalue       padj
<numeric>      <numeric>  <numeric>   <numeric>    <numeric>  <numeric>
comp100000_c0   5.394950     0.88129957  0.7296606  1.20782123 0.2271160328 0.47320324
comp100000_c1  15.236690     0.83653169  0.6352388  1.31687747 0.1878796988 0.42510643
comp100000_c3   7.684048    -0.02531354  0.9073333 -0.02789883 0.9777428379 0.98894584
comp100002_c0 241.680116     1.27044149  0.3719448  3.41567233 0.0006362476 0.01283631
comp100003_c0  11.010333     1.44552180  0.6997581  2.06574505 0.0388525593 0.18057827
...                  ...            ...        ...         ...          ...        ...
comp99991_c0   1643.4060     0.06606909 0.08068866   0.8188151    0.4128919  0.6519404
comp99992_c0    651.1968    -0.08546246 0.19208340  -0.4449237    0.6563749  0.8259819
comp99993_c0   1957.2871     0.08239842 0.13402428   0.6148022    0.5386854  0.7499239
comp99994_c0   1562.8709    -0.12381266 0.14475235  -0.8553413    0.3923622  0.6347520
comp99999_c0    170.9605     0.15237421 0.20194552   0.7545313    0.4505303  0.6829395
deseq2 wald lrt genes significance • 2.5k views
0
Entering edit mode
@mikelove
Last seen 4 hours ago
United States

hi aziza,

I recommend that you consult with a local statistician. There is a big difference between these tests and it's too much to explain on the support site. You have specified a model with 2nd order interactions between 3 factors and are comparing that to a model with interactions between 2 factors. If you aren't sure what these models are doing, or the meaning of the p-values that come out, you really should bring on a collaborator who can help come up with a plan for the statistical analysis.

The likelihood ratio test is testing the whether the explanatory power of the full model is more than expected by change compared to the reduced model:

> resRed
log2 fold change (MLE): C C2 vs C1
LRT p-value: '~ A * B * C' vs '~ A * B'

The Wald test that you have constructed is testing whether a single fold change in the model (C2 vs C1) is equal zero:

> resB
log2 fold change (MLE): C C2 vs C1
Wald test p-value: C C2 vs C1

0
Entering edit mode

Hi, thank you for your reply Michael. The explanation you gave me is in statistical terms and I gather that, I have the same problem when I speak with statisticians here, I was hoping for someone to orient me more on a biological interpretation.

I was expecting that the genes with sig pvalue obtained with the LRT test would be the genes that are affected by factor C and that with results I would extract the DEG between C2_vs_C1 (in only those genes affected by factor C).

Equally I thought that fitting the data with the full model (the Wald option) and extracting the DEG between C2_vs_C1 would give the same number of DEGs as in above.

Actually the fold changes are the same which means it is testing for C2_vs_C1, but I don't understand why the number of significant genes is different, or in other words, biologically, what are the genes obtained from the LRT comparison vs the ones obtained with the Wald test?

In any case, Which of the two options is best to test for differences between C2 and C1?

Thank you

0
Entering edit mode

You could say translating from biological hypotheses to statistical analyses is a whole profession and an academic discipline. It seems like you have a fairly complex experimental design, and I can't easily tell you what to do on the support forum. I would want to sit down and understand what are A, B, and C what are the relationship between these, what are the assumptions of the experiment, etc. But I don't have much time for this, which is why I recommend finding a local statistician who you would include as a collaborator on the project.