Dear all,
Since I am using DESeq2 version 1.18.1 I observe a weird behaviour with the significant genes. As you can see on the heatmap below, the top 50 most significant genes are plotted (FDR<5%) and it is full of outliers. I am running DESeq2 with default options. I also tried to pre-filter the gene list with no success (filtering lowly expressed genes for a certain number of samples). I have 3 different groups with 3 samples each and the filtering I added is having a minimum of >10 normalized reads in at least 3 samples. I have to compute pair-wise comparisons, here a pair (2 groups):
This was not happening with the older version 1.10.1 of DESeq2. What did it change? Why is the software not able to deal with these odd situations by default?
Michael, it is what I did and the filtering did not solve the problem ( I have explained it above).
Maybe you can help me a bit to give better advice: what are the normalized counts represented above when you have the yellowish green for 5 samples and then one sample with bright red.
Michael, these are normalized counts (scale="row"). What happens here is that I have three different groups and I have to perform pairwise comparisons and I run the DESEq2 just once and extract the desired comparisons. Your filter only works if I run DESeq2 every time for the desired comparison. Each time the filter is applied a different set of genes is filtered for each comparison. How can I solve the filtering if I want to run DESeq2 just once to save time?
hi aec,
I made a little example to show how you can de-prioritize these, in case they are not being caught by outlier detection.
Here I make a simulated dataset, and add a gene with counts [0,0,0] vs [0,0,100] vs [110,120,130]. So it passes the filter because it has four samples with high enough counts (although 100 is arguably an outlier with respect to the second group).
If you run DESeq() and results(), in this case, the gene does get filtered (p-value is NA).
But we can turn off filtering to see what would happen if the filtering didn't work. Now we get a very big LFC (because we are comparing the mean of the "B" group which is ~33 to the mean of the "A" group which is 0). And it has a significant p-value.
To remove these genes -- in case they are not caught by the outlier filtering -- you can add LFC shrinkage to re-estimate the LFC. This will not change the test statistics or p-values, though. But now the LFC is close to 0, because the amount of shrinkage takes into account the uncertainty, in a way that the Wald test doesn't catch. The shrinkage methods (all three of them) will give you an accurate LFC estimate here:
You can then apply a basic LFC filter to the results table, e.g.
Note: in the next version of DESeq2 (v1.20), you will be able to use
lfcThreshold
directly within lfcShrink() instead of this subset() call, which is preferred by me, but I didn't have time to implement this in v1.18. Operating on shrunken LFC and doing inference on the posterior distribution helps with these cases where the MLE LFC is large.Thanks a lot for the appropriate reply Michael.