Hi,
I got the following plot for one of the DE genes. The p-value in the DP cell type (the rightmost panel) is 7.50e-07 (FDR = 3.04e-05), with a log2(fold-change) of 30. But this is obviously caused by the two outlier samples in the right corner. How would I avoid reporting genes like this? I am not sure the two outlier samples are the same for other genes, so filtering them with minReplicatesForReplace=Inf might not be an ideal solution.

I am using DESeq2_1.16.1.
Thanks,
-Xianjun

Thanks, Michael.
I will definitely add the non-zero counts requirement. But this won't work for cases where most samples are at 1 (for example) and two samples are extremely high (e.g. 100). I am wondering if it's possible (and how) to set the cooksCutoff parameter for cases like this? Also, would the lfcSE be useful for filtering cases like this?
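(The filter I have in mind is along these lines; the threshold of 4 samples is my own choice:)

# keep genes with a non-zero count in at least 4 samples
keep <- rowSums(counts(dds) > 0) >= 4
dds <- dds[keep, ]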
Thanks for the pointer to apeglm. It seems that lfcShrink() currently only accepts coef (not contrast) for type='apeglm'. In my case, I have to use contrast to specify the reference level.
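(Perhaps releveling could work around that? A sketch, assuming my condition factor is called `group`:)

# make HC the reference level so the MS-vs-HC comparison shows up as a
# coefficient in resultsNames(dds), usable with coef= in lfcShrink()
dds$group <- relevel(dds$group, ref = "HC")
dds <- nbinomWaldTest(dds)  # re-fit the Wald test; dispersions are reused
resultsNames(dds)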
You can check the Cook's distances for these genes; they are in assays(dds)[["cooks"]].
Probably using lfcShrink with type="normal" would also give LFCs of ~0 for these genes.
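For example, to pull them out for one gene (a sketch; `dds` here is your DESeqDataSet after running DESeq()):

cooks <- assays(dds)[["cooks"]]  # genes x samples matrix of Cook's distances
cooks["ENSG00000257885.1", ]
# the default outlier cutoff used by results() is the 0.99 quantile of
# F(p, m - p), with p model coefficients and m samples; a custom value
# can be supplied via results(dds, cooksCutoff=...)
p <- ncol(model.matrix(design(dds), as.data.frame(colData(dds))))
m <- ncol(dds)
qf(0.99, p, m - p)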
Thank you, Mike.
lfcShrink apparently does a good job of shrinking the LFC, but it seems not to affect the p-value, e.g.
> res_after_shrink['ENSG00000257885.1',]
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000257885.1 8.157185 0.2021806 0.1781094 9.210762 3.238169e-20 6.098768e-16
> res_before_shrink['ENSG00000257885.1',]
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000257885.1 8.157185 27.65828 3.002822 9.210762 3.238169e-20 6.098768e-16
Here are the raw values (already requiring non-zero counts in >= 4 samples). There is an extremely high value of 350, while most are zero.
> assay(dds)['ENSG00000257885.1',]
HC_HD5_DN_1_rep1 HC_HD5_DP_1_rep1 HC_HD5_Th1_1_rep1 HC_HD5_Th17_1_rep1 MS_PT5_DN_1_rep1 MS_PT5_DP_1_rep1
0 0 0 0 0 0
MS_PT5_Th1_1_rep1 MS_PT5_Th17_1_rep1 HC_HD7_DN_1_rep1 HC_HD7_DP_1_rep1 HC_HD7_Th1_1_rep1 HC_HD7_Th17_1_rep1
0 0 0 0 0 0
MS_PT7_DN_1_rep1 MS_PT7_Th1_1_rep1 HC_HD11_DN_1_rep1 HC_HD11_DP_1_rep1 HC_HD11_Th1_1_rep1 HC_HD11_Th17_1_rep1
0 0 0 0 0 0
MS_PT11_DN_1_rep1 MS_PT11_DP_1_rep1 MS_PT11_Th1_1_rep1 MS_PT11_Th17_1_rep1 HC_I467_DN_1_rep1 HC_I467_DP_1_rep1
0 0 5 6 0 0
HC_I467_Th1_1_rep1 HC_I467_Th17_1_rep1 MS_PT17_DN_1_rep1 MS_PT17_DP_1_rep1 MS_PT17_Th1_1_rep1 MS_PT17_Th17_1_rep1
0 0 0 0 0 0
HC_I1752_DN_1_rep1 HC_I1752_DP_1_rep1 HC_I1752_Th1_1_rep1 HC_I1752_Th17_1_rep1 MS_PT18_DN_1_rep1 MS_PT18_DP_1_rep1
0 0 0 0 0 8
MS_PT18_Th1_1_rep1 MS_PT18_Th17_1_rep1 HC_HD20_DN_1_rep1 HC_HD20_DP_1_rep1 HC_HD20_Th1_1_rep1 HC_HD20_Th17_1_rep1
0 0 0 0 0 0
MS_PT20_DN_1_rep1 MS_PT20_DP_1_rep1 MS_PT20_Th1_1_rep1 MS_PT20_Th17_1_rep1 HC_HD9_DN_1_rep1 HC_HD9_DP_1_rep1
0 0 0 0 0 0
HC_HD9_Th1_1_rep1 HC_HD9_Th17_1_rep1 MS_PT14_DN_1_rep1 MS_PT14_DP_1_rep1 MS_PT14_Th1_1_rep1 MS_PT14_Th17_1_rep1
0 0 0 0 0 0
MS_PT12_DN_1_rep1 MS_PT12_Th1_1_rep1 MS_PT12_Th17_1_rep1 HC_I1684_DN_1_rep1 HC_I1684_DP_1_rep1 HC_I1684_Th1_1_rep1
6 0 0 0 0 0
HC_I141_Th17_1_rep1 MS_PT8b_DN_2_rep1 MS_PT8b_DP_1_rep1 MS_PT8b_Th1_1_rep1 MS_PT8b_Th17_1_rep1 HC_HD4_DN_1_rep1
0 0 350 0 0 0
MS_PT4_DN_1_rep1 HC_HD8_DN_1_rep1 HC_HD10_DN_1_rep1 MS_PT10_DN_1_rep1 MS_PT15_DN_1_rep1 MS_PT16_DN_1_rep1
0 0 0 0 0 0

"lfcShrink apparently does a good job of shrinking the LFC, but it seems not to affect the p-value"
Yes, sorry if that wasn't clear. You can either filter on the absolute value of LFC, or you can regenerate p-values.
You can filter simply with

res[abs(res$log2FoldChange) > theta,]

lfcShrink() will only change the p-value column if you set lfcThreshold to a nonzero value (so testing that the shrunken LFC is larger than theta; this only works with normal or apeglm), or if you set svalue=TRUE (which works for apeglm and ashr only).
One more note: lfcThreshold in lfcShrink() only works with the latest release, v1.20.
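For example (a sketch; the coefficient name is a placeholder, check resultsNames(dds) for yours):

resultsNames(dds)
# with type="apeglm" plus a nonzero lfcThreshold, the returned table
# reports s-values for the test that |LFC| > 1 on the shrunken scale
res_thr <- lfcShrink(dds, coef = "group_MS_vs_HC",
                     type = "apeglm", lfcThreshold = 1)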
Hi Mike, I really appreciate your time and help, but I don't think we are on the same question. My real question is: why can one or two outlier samples lead to such a spurious "significant" p-value? In the case above (the example gene in the DP group), most samples have zero expression and only two samples, in the MS group, have high levels. I made a similar example here:
> b=c(rep(0,10),100,100)
> a=c(rep(0,10),0,0)
> t.test(a,b)

	Welch Two Sample t-test

data:  a and b
t = -1.4832, df = 11, p-value = 0.1661
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -41.398398   8.065065
sample estimates:
mean of x mean of y
  0.00000  16.66667

I know DESeq2 uses a different test (e.g. LRT) with advanced tricks (e.g. considering dispersion, adjusting for covariates, etc.), and I can apply the filters you suggested above, but intuitively, why does the t-test give a p-value of 0.166 while DESeq2 gives a very small p-value like 3.238169e-20? This does not make sense to me.
Please help me understand; I might have misunderstood something here.
Hi Xianjun,
I don’t have much more for you here. The GLM p-value detects the two 100s vs. all 0s. If it were a single large count, it would likely be filtered. Shrinkage of the LFC alleviates this either way.
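If you want to flag such genes yourself, a sketch (the sort is just for inspection):

cooks <- assays(dds)[["cooks"]]
res$maxCooks <- apply(cooks[rownames(res), ], 1, max)  # worst sample per gene
head(res[order(res$maxCooks, decreasing = TRUE), ])    # genes driven by one sample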
Thanks, Mike. That's OK. Nothing is perfect. I can still play with the tricks you taught me and filter out those cases, and there are not that many of them. Have a nice weekend!
Last question on this: Do you think setting betaPrior=TRUE in DESeq(), which calculates p-values for the shrunken LFC, could help with outlier cases like this?
Yes, and certainly if you use lfcThreshold.
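Roughly like this (a sketch; the contrast levels are placeholders from your design):

dds <- DESeq(dds, betaPrior = TRUE)
# with betaPrior=TRUE the Wald statistic is computed on the shrunken LFC,
# and lfcThreshold=1 tests |LFC| > 1 rather than |LFC| > 0
res <- results(dds, contrast = c("group", "MS", "HC"), lfcThreshold = 1)
summary(res)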