Question

What is the biological meaning of genes with FC = 0 after shrinkage

0

Entering edit mode

polyptwo • 0

@polyptwo-19338

Last seen 4.5 years ago

Hi,

I am analysing an rnaseq with interaction terms that shows extreme values on low count genes. As expected, these genes get shrinked to a FC ≈ 0 with lfcShrink.

Here are some images of the MA-plot without and with apeglm shrinkage (note the change of the y-axis limits); and volcano plots without and with apeglm shrinkage to illustrate the context.

As I understand, all the genes that are now shrinked to a FC of ≈ 0 are genes that did not have enough count information to reliably predict a FC, and therefore get "nullified", but they are still significance (as lfcShrink does not recalculate significance).

However, I am now dubitative in if I should include these genes in downstream analyses (such as enrichment, pathway, etc.) as I am not sure of what is the biological meaning of their significance.

So here are specific questions:

Should I filter these genes from enrichment analysis, as we don't have enough infromation about them?
Shouldn't these genes be picked by the independent filtering of DESeq2?
Would in this case benefit from a pre-filter to directly eliminate all these genes from the dataset?

Note: I am running DESeq2 as:

ddsTxi <- DESeqDataSetFromTximport(txi,
                                   colData = sampleGrouping,
                                   design = ~ genotype + treatment + stress + treatment:stress)
dds <- DESeq(ddsTxi, fitType = "local")
reslfcSh <- lfcShrink(dds, coef = 6, type = 'apeglm')

Thank you very much for your time in advance.

deseq2 apeglm • 1.3k views

ADD COMMENT • link updated 5.1 years ago by Michael Love 41k • written 5.1 years ago by polyptwo • 0

score 2 · Accepted Answer · 2019-03-20

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 9 hours ago

United States

I see a few points with low mean count with adjusted p < 0.1 which are shrunk to zero by apeglm. apeglm is more conservative than the original DESeq2 NB GLM null hypothesis test when it comes to deciding what is consistent with the null. I would tend to trust apeglm. And one way to obtain what I find are more reliably significance tests (we say this in the DESeq2 paper as well) is to use lfcThreshold to find genes where the estimated LFC is above a positive threshold. You can use:

res <- results(dds, lfcThreshold=0.5)
res <- lfcShrink(dds, coef=6, type="apeglm", res=res)

Another option would be:

res <- lfcShrink(dds, coef=6, type="apeglm", svalue=TRUE)

And use the s-values instead of adjusted p-values. s-values define sets of genes (like adjusted p-values) which have a bounded aggregate error rate (the error being the chance of a false sign). Because the s-values are computed using the shrunken estimate, they will always agree.

ADD COMMENT • link 5.1 years ago Michael Love 41k

0

Entering edit mode

Thanks Michael for your input. I agree with you in the first section, and I trust more the results from the apeglm shrinkage.

However, I am having a hard time with the rest. As one would expect, when applying a certain lfcTreshold on the results() function (e.g. 0.5 for a 1.5 FC, as an extreme case, although I would use 0.2 or 0.01 in this case) most of the low log2FC genes dissapear while the "extreme" values (that later get shrinked on apeglm) are kept. I would even say that these are, in fact, the genes that we are not interested in.

The rest, the ones that have a log2FC between 0.1 to 0.6, are not heavily shrinked during apeglm, are heavily enriched in processes that we know are affected, and even have some qPCR data with similar FC that provides more evidence that they are, in fact, the signal that we are looking for.

To be clear, the genes in the rectangles here and here are the ones that we have qPCR data on (for some of them, at least) and are heavily enriched in specific processes, while the rest seem to be "noise".

On the other side, trusting S-values over p-values completely solves the problem. In the correlation padjvssvalues (here) there are clearly two categories. The ones that correlate in their high significance (zoom) are the ones that we can validate and had lower FCs in the Volcano plot.

Again, thanks for helping me. I really don't know which results would have been derived from this analysis without taking into account your suggestions, or if it had been analysed a few years ago.

ADD REPLY • link 5.1 years ago polyptwo • 0

0

Entering edit mode

Ok. So it sounds like you'll go with the s-values then?

I guess the very large LFCs that are not supported by external data, those could be filtered via mechanisms such as:

keep <- rowSums(counts(dds)  >= x) >= y
dds <- dds[keep,]

For reasonable x and y, if you wanted to avoid s-values and instead use p-values and adjusted p-values.

ADD REPLY • link 5.1 years ago Michael Love 41k

0

Entering edit mode

Yes, I think I will go with s-values as I can relate it to the data that we have and expect, as long as I am able to convince my colleagues (and reviewers) that the approach is a valid in our experimental models.

I previously tried to pre-filter these genes, and could get relatively good results with an arbitrary rule of removing all genes with no expression on 12 of the 16 samples as:

idx <- rowSums( counts(ddsTxiGroupedPF, normalized=TRUE) > 0 ) >= 12
ddsTxiGroupedPF <- ddsTxiGroupedPF[idx,]
ddsGroupedPF <- DESeq(ddsTxiGroupedPF, fitType = "local")

However, I understand that I could be missing some genes with very low basal expression (e.g. immune genes that we are interested in), and it also dramatically decreased the significance of several genes. I also prefer not to take biased and arbitrary decisions that I cannot fully argument, so trusting s-values seems a better decision.

ADD REPLY • link 5.1 years ago polyptwo • 0