Artificial P-value caused by outlier samples
1
0
Entering edit mode
Xianjun Dong ▴ 10
@xianjun-dong-7069
Last seen 8 months ago
United States

HI,

I got the following plot for one of the DE genes. The p-value in DP cell type (the most right panel) is 7.50e-07 (FDR = 3.04e-05), with log2(fold-change) of 30. But obviously it's caused by the two outlier samples on the right corner. How would I avoid reporting genes like this? I am not sure the two outlier samples are same for other genes, so filtering them with minReplicateForReplace=Inf might not be an ideal solution.

I am using DESeq2_1.16.1.

Thanks,

-Xianjun

deseq2 • 959 views
0
Entering edit mode
@mikelove
Last seen 15 hours ago
United States

hi Xianjun,

For whatever reason, if the filtering or replacement isn't working for your dataset (and it's not nearly a perfect solution), I'd recommend to require that there are more non-zero counts per row:

keep <- rowSums(counts(dds) >= 5) >= n
dds <- dds[keep,]

where you provide a reasonable 'n' here for your experiment. You would run this before DESeq().

I will note that our newest estimator, apeglm, will give you an LFC of 0 for all these, because it's pretty good about shrinking these away while preserving large differences. apeglm can also produce posterior estimates of DE.

0
Entering edit mode

Thanks, Michael.

I will definitely add the non-zero counts requirement. But this will work for cases where most samples are at 1 (for example), and two samples are at extremely high (e.g. 100). I am wondering if it's possible (and how) to set the cooksCutoff parameter for case like this? Also, would the lfcSE be useful to filter case like this?

Thanks for referring the apeglm. It seems that lfcShrink() currently only accepts coef (not contrast) for type='apeglm'. In my case, I will have to use contrast to specify the reference level.

0
Entering edit mode

You can check the Cook's value for these genes, it's in assays(dds)[["cooks"]]

Probably using lfcShrink with type="normal" would also give LFC's of ~0 for these genes.

0
Entering edit mode

Thank you, Mike.

lfcShrink apparently does good job in shrinking the LFC, but seems not affect the p-value. e.g.

> res_after_shrink['ENSG00000257885.1',]
baseMean log2FoldChange     lfcSE      stat       pvalue         padj
<numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000257885.1  8.157185      0.2021806 0.1781094  9.210762 3.238169e-20 6.098768e-16
> res_before_shrink['ENSG00000257885.1',]
baseMean log2FoldChange     lfcSE      stat       pvalue         padj
<numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000257885.1  8.157185       27.65828  3.002822  9.210762 3.238169e-20 6.098768e-16

Here is the raw values (already require non-zero count >= 4). There is an extreme high value of 350, while most are zero.

> assay(dds)['ENSG00000257885.1',]
HC_HD5_DN_1_rep1     HC_HD5_DP_1_rep1    HC_HD5_Th1_1_rep1   HC_HD5_Th17_1_rep1     MS_PT5_DN_1_rep1     MS_PT5_DP_1_rep1
0                    0                    0                    0                    0                    0
MS_PT5_Th1_1_rep1   MS_PT5_Th17_1_rep1     HC_HD7_DN_1_rep1     HC_HD7_DP_1_rep1    HC_HD7_Th1_1_rep1   HC_HD7_Th17_1_rep1
0                    0                    0                    0                    0                    0
MS_PT7_DN_1_rep1    MS_PT7_Th1_1_rep1    HC_HD11_DN_1_rep1    HC_HD11_DP_1_rep1   HC_HD11_Th1_1_rep1  HC_HD11_Th17_1_rep1
0                    0                    0                    0                    0                    0
MS_PT11_DN_1_rep1    MS_PT11_DP_1_rep1   MS_PT11_Th1_1_rep1  MS_PT11_Th17_1_rep1    HC_I467_DN_1_rep1    HC_I467_DP_1_rep1
0                    0                    5                    6                    0                    0
HC_I467_Th1_1_rep1  HC_I467_Th17_1_rep1    MS_PT17_DN_1_rep1    MS_PT17_DP_1_rep1   MS_PT17_Th1_1_rep1  MS_PT17_Th17_1_rep1
0                    0                    0                    0                    0                    0
HC_I1752_DN_1_rep1   HC_I1752_DP_1_rep1  HC_I1752_Th1_1_rep1 HC_I1752_Th17_1_rep1    MS_PT18_DN_1_rep1    MS_PT18_DP_1_rep1
0                    0                    0                    0                    0                    8
MS_PT18_Th1_1_rep1  MS_PT18_Th17_1_rep1    HC_HD20_DN_1_rep1    HC_HD20_DP_1_rep1   HC_HD20_Th1_1_rep1  HC_HD20_Th17_1_rep1
0                    0                    0                    0                    0                    0
MS_PT20_DN_1_rep1    MS_PT20_DP_1_rep1   MS_PT20_Th1_1_rep1  MS_PT20_Th17_1_rep1     HC_HD9_DN_1_rep1     HC_HD9_DP_1_rep1
0                    0                    0                    0                    0                    0
HC_HD9_Th1_1_rep1   HC_HD9_Th17_1_rep1    MS_PT14_DN_1_rep1    MS_PT14_DP_1_rep1   MS_PT14_Th1_1_rep1  MS_PT14_Th17_1_rep1
0                    0                    0                    0                    0                    0
MS_PT12_DN_1_rep1   MS_PT12_Th1_1_rep1  MS_PT12_Th17_1_rep1   HC_I1684_DN_1_rep1   HC_I1684_DP_1_rep1  HC_I1684_Th1_1_rep1
6                    0                    0                    0                    0                    0
HC_I141_Th17_1_rep1    MS_PT8b_DN_2_rep1    MS_PT8b_DP_1_rep1   MS_PT8b_Th1_1_rep1  MS_PT8b_Th17_1_rep1     HC_HD4_DN_1_rep1
0                    0                  350                    0                    0                    0
MS_PT4_DN_1_rep1     HC_HD8_DN_1_rep1    HC_HD10_DN_1_rep1    MS_PT10_DN_1_rep1    MS_PT15_DN_1_rep1    MS_PT16_DN_1_rep1
0                    0                    0                    0                    0                    0 
How should I decide the overall cutoff?


0
Entering edit mode

"lfcShrink apparently does good job in shrinking the LFC, but seems not affect the p-value"

Yes, sorry if that wasn't clear. You can either filter on the absolute value of LFC, or you can regenerate p-values.

You can filter simply with res[abs(res\$log2FoldChange) > theta,]

lfcShrink() will only change the p-value column if you set lfcThreshold to a nonzero value (so testing that the shrunken LFC is larger than 'theta', and this only works with normal or apeglm), or if you set svalue=TRUE (works for apeglm and ashr only).

0
Entering edit mode

One more note: lfcThreshold only works with the latest release v1.20.

0
Entering edit mode

Hi Mike, I really appreciate your time and helps, but I don't think we are on the right question: My true question is why samples with one or two outliers can lead to such a fake "significant" p-value? In the case above (example gene in the DP group), most examples have zero expression and only two samples have high level in the MS group. I made a similar example here:

> b=c(rep(0,10),100,100)
> a=c(rep(0,10),0,0)
> t.test(a,b)
Welch Two Sample t-test
data:  a and b
t = -1.4832, df = 11, p-value = 0.1661
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-41.398398   8.065065
sample estimates:
mean of x mean of y
0.00000  16.66667 

I know DEseq2 uses different test (e.g LRT) with advanced tricks (e.g. considering dispersion, adjust covariates etc.), and I can apply filters like you suggested above, but intuitively why t-test gives p-value 0.166 and DEseq2 gives a very small p-value like 3.238169e-20? This does not make sense to me.

0
Entering edit mode

Hi Xianjun,

I don’t have much more for you here. The GLM pvalue detects the the two 100s vs all 0s. If it were a single large count it would likely be filtered. Shrinkage of LFC alleviate this either way.

0
Entering edit mode

Thanks, Mike. That's OK. Nothing is perfect. I can still play with the tricks you teach me and filter out those cases, and there are not that many of them. Have a nice weekend!

0
Entering edit mode

Last question on this: Would you think setting betaPrior=TRUE in DEseq(), which calculates p-values for the shrunken LFC, can help for outlier cases like this?

0
Entering edit mode

Yes, and certainly if you use lfcThreshold.