Question: What is the biological meaning of genes with FC = 0 after shrinkage
0
4 months ago by
polyptwo0
polyptwo0 wrote:

Hi,

I am analysing an rnaseq with interaction terms that shows extreme values on low count genes. As expected, these genes get shrinked to a FC ≈ 0 with lfcShrink.

Here are some images of the MA-plot without and with apeglm shrinkage (note the change of the y-axis limits); and volcano plots without and with apeglm shrinkage to illustrate the context.

As I understand, all the genes that are now shrinked to a FC of ≈ 0 are genes that did not have enough count information to reliably predict a FC, and therefore get "nullified", but they are still significance (as lfcShrink does not recalculate significance).

However, I am now dubitative in if I should include these genes in downstream analyses (such as enrichment, pathway, etc.) as I am not sure of what is the biological meaning of their significance.

So here are specific questions:

1. Should I filter these genes from enrichment analysis, as we don't have enough infromation about them?

2. Shouldn't these genes be picked by the independent filtering of DESeq2?

3. Would in this case benefit from a pre-filter to directly eliminate all these genes from the dataset?

Note: I am running DESeq2 as:

ddsTxi <- DESeqDataSetFromTximport(txi,
colData = sampleGrouping,
design = ~ genotype + treatment + stress + treatment:stress)
dds <- DESeq(ddsTxi, fitType = "local")
reslfcSh <- lfcShrink(dds, coef = 6, type = 'apeglm')



deseq2 apeglm • 175 views
modified 4 months ago by Michael Love24k • written 4 months ago by polyptwo0
Answer: What is the biological meaning of genes with FC = 0 after shrinkage
2
4 months ago by
Michael Love24k
United States
Michael Love24k wrote:

I see a few points with low mean count with adjusted p < 0.1 which are shrunk to zero by apeglm. apeglm is more conservative than the original DESeq2 NB GLM null hypothesis test when it comes to deciding what is consistent with the null. I would tend to trust apeglm. And one way to obtain what I find are more reliably significance tests (we say this in the DESeq2 paper as well) is to use lfcThreshold to find genes where the estimated LFC is above a positive threshold. You can use:

res <- results(dds, lfcThreshold=0.5)
res <- lfcShrink(dds, coef=6, type="apeglm", res=res)


Another option would be:

res <- lfcShrink(dds, coef=6, type="apeglm", svalue=TRUE)


And use the s-values instead of adjusted p-values. s-values define sets of genes (like adjusted p-values) which have a bounded aggregate error rate (the error being the chance of a false sign). Because the s-values are computed using the shrunken estimate, they will always agree.

Thanks Michael for your input. I agree with you in the first section, and I trust more the results from the apeglm shrinkage.

However, I am having a hard time with the rest. As one would expect, when applying a certain lfcTreshold on the results() function (e.g. 0.5 for a 1.5 FC, as an extreme case, although I would use 0.2 or 0.01 in this case) most of the low log2FC genes dissapear while the "extreme" values (that later get shrinked on apeglm) are kept. I would even say that these are, in fact, the genes that we are not interested in.

The rest, the ones that have a log2FC between 0.1 to 0.6, are not heavily shrinked during apeglm, are heavily enriched in processes that we know are affected, and even have some qPCR data with similar FC that provides more evidence that they are, in fact, the signal that we are looking for.

To be clear, the genes in the rectangles here and here are the ones that we have qPCR data on (for some of them, at least) and are heavily enriched in specific processes, while the rest seem to be "noise".

On the other side, trusting S-values over p-values completely solves the problem. In the correlation padjvssvalues (here) there are clearly two categories. The ones that correlate in their high significance (zoom) are the ones that we can validate and had lower FCs in the Volcano plot.

Again, thanks for helping me. I really don't know which results would have been derived from this analysis without taking into account your suggestions, or if it had been analysed a few years ago.

Ok. So it sounds like you'll go with the s-values then?

I guess the very large LFCs that are not supported by external data, those could be filtered via mechanisms such as:

keep <- rowSums(counts(dds)  >= x) >= y
dds <- dds[keep,]


For reasonable x and y, if you wanted to avoid s-values and instead use p-values and adjusted p-values.

Yes, I think I will go with s-values as I can relate it to the data that we have and expect, as long as I am able to convince my colleagues (and reviewers) that the approach is a valid in our experimental models.

I previously tried to pre-filter these genes, and could get relatively good results with an arbitrary rule of removing all genes with no expression on 12 of the 16 samples as:

idx <- rowSums( counts(ddsTxiGroupedPF, normalized=TRUE) > 0 ) >= 12
ddsTxiGroupedPF <- ddsTxiGroupedPF[idx,]
ddsGroupedPF <- DESeq(ddsTxiGroupedPF, fitType = "local")


However, I understand that I could be missing some genes with very low basal expression (e.g. immune genes that we are interested in), and it also dramatically decreased the significance of several genes. I also prefer not to take biased and arbitrary decisions that I cannot fully argument, so trusting s-values seems a better decision.