Hi,
I am analysing an RNA-seq experiment with interaction terms that shows extreme log2FC values for low-count genes. As expected, these genes get shrunk to a FC ≈ 0 with lfcShrink.
Here are some images to illustrate the context: the MA-plot without and with apeglm shrinkage (note the change of the y-axis limits), and volcano plots without and with apeglm shrinkage.
As I understand it, all the genes that are now shrunk to a FC of ≈ 0 are genes that did not have enough count information to reliably estimate a FC, and therefore get "nullified", but they are still significant (as lfcShrink does not recalculate significance).
However, I am now unsure whether I should include these genes in downstream analyses (such as enrichment, pathway, etc.), as I am not sure what the biological meaning of their significance is.
So here are my specific questions:
Should I filter these genes out of the enrichment analysis, as we don't have enough information about them?
Shouldn't these genes be picked up by the independent filtering of DESeq2?
Would I, in this case, benefit from a pre-filter to directly eliminate all these genes from the dataset?
Note: I am running DESeq2 as:
library(DESeq2)
# build the dataset from tximport quantifications and fit the interaction model
ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = sampleGrouping,
                                   design = ~ genotype + treatment + stress + treatment:stress)
dds <- DESeq(ddsTxi, fitType = "local")
# shrink the interaction coefficient (coef = 6) with apeglm
reslfcSh <- lfcShrink(dds, coef = 6, type = "apeglm")
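Here coef = 6 is intended to pick the treatment:stress interaction term; as a quick sanity check (assuming the coefficient order follows the design above), it can be matched against resultsNames():
# list the fitted coefficients; the one selected by coef = 6 should be
# the treatment:stress interaction (exact names depend on the factor levels)
resultsNames(dds)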
Thank you very much in advance for your time.
Thanks Michael for your input. I agree with you on the first section, and I trust the results from the apeglm shrinkage more.
However, I am having a hard time with the rest. As one would expect, when applying a certain lfcThreshold in the results() function (e.g. 0.5 for a 1.5 FC, as an extreme case, although I would use 0.2 or 0.01 in this case), most of the low log2FC genes disappear while the "extreme" values (the ones that later get shrunk by apeglm) are kept. I would even say that these are, in fact, the genes we are not interested in.
The rest, the ones with a log2FC between 0.1 and 0.6, are not heavily shrunk by apeglm, are heavily enriched in processes that we know are affected, and we even have qPCR data with similar FCs that provide more evidence that they are, in fact, the signal we are looking for.
To be clear, the genes in the rectangles here and here are the ones we have qPCR data for (some of them, at least) and are heavily enriched in specific processes, while the rest seem to be "noise".
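As a rough sketch of what I mean by applying the threshold (the 0.2 value here is just illustrative; I understand apeglm can also test directly against a threshold and then reports s-values):
# test against |log2FC| > 0.2 rather than against 0
resThr <- results(dds, name = resultsNames(dds)[6], lfcThreshold = 0.2)
# apeglm shrinkage testing against the same threshold (reports s-values)
shrThr <- lfcShrink(dds, coef = 6, type = "apeglm", lfcThreshold = 0.2)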
On the other hand, trusting s-values over p-values completely solves the problem. In the correlation of padj vs s-values (here) there are clearly two categories. The ones that correlate in their high significance (zoom) are the ones that we can validate and that had lower FCs in the volcano plot.
Again, thanks for helping me. I really don't know what conclusions would have been drawn from this analysis without your suggestions, or if it had been analysed a few years ago.
Ok. So it sounds like you'll go with the s-values then?
I guess the very large LFCs that are not supported by external data could be filtered via mechanisms such as:
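One possible form of such a filter, keeping only genes with a count of at least x in at least y samples:
# keep genes with a count of at least x in at least y samples
keep <- rowSums(counts(dds) >= x) >= y
dds <- dds[keep, ]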
For reasonable x and y, if you wanted to avoid s-values and instead use p-values and adjusted p-values.
Yes, I think I will go with s-values, as I can relate them to the data that we have and expect, as long as I am able to convince my colleagues (and reviewers) that the approach is valid in our experimental models.
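In practice, that would mean asking lfcShrink for s-values directly and selecting on them (the 0.005 cutoff below is just an example, not a recommendation):
# return s-values instead of p-values / adjusted p-values
resS <- lfcShrink(dds, coef = 6, type = "apeglm", svalue = TRUE)
# example selection at an illustrative s-value cutoff
sig <- resS[which(resS$svalue < 0.005), ]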
I previously tried to pre-filter these genes, and could get relatively good results with an arbitrary rule of removing all genes with no expression in 12 of the 16 samples.
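Roughly, and assuming the filter is applied to the raw counts before running DESeq(), that rule looks like:
# remove genes that have zero counts in 12 or more of the 16 samples
keep <- rowSums(counts(ddsTxi) == 0) < 12
ddsTxi <- ddsTxi[keep, ]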
However, I understand that I could be missing some genes with very low basal expression (e.g. immune genes that we are interested in), and it also dramatically decreased the significance of several genes. I also prefer not to make biased and arbitrary decisions that I cannot fully justify, so trusting s-values seems a better decision.