Question

Using different LFC shrinkage estimators for spike-in normalised RNA-seq data

1

Entering edit mode

nadezda.fursova ▴ 10

@nadezdafursova-23998

Last seen 3.7 years ago

Hi,

I am seeking advice on what LFC shrinkage estimator to use when analysing spike-in normalised RNA-seq data using DESeq2.

We are spiking in Drosophila cells for internal normalisation of our RNA-seq experiments and then using read counts in Drosophila genes to calculate sizeFactors. We then supply these sizeFactors into the DESeq2 analysis of mouse gene expression, similar to what has been described in Taruttis et al., 2017 (https://pubmed.ncbi.nlm.nih.gov/28193148/). The spike-in normalisation allows us to measure global changes in gene expression, i.e when the expression of the majority of genes is affected. Since I have recently updated to DESeq2_1.26.0 where apeglm is now a default LFC shrinkage estimator, I wanted to compare the results of DESeq2 analysis for spike-in normalised RNA-seq data with normal and apeglm LFC shrinking.

As explained in the newest DESeq2 manual, I do see that in general apeglm is much better in preserving large LFC values comparing to normal (Figure A, although in this particular case the difference between the two shrinkage approaches is not that big). However, I also noticed that apeglm much more aggressively shrinks LFC values towards 0 (Figure B), which in case of the spike-in normalised RNA-seq data leads to a reduction in the detected global effect on gene expression and a slightly odd bimodal distribution of LFC values (Figures C and D). In addition, when comparing to non-shrunk LFC values, normal LFC shrinkage results in LFC values which correlate slightly better and show overall a more similar distribution to raw LFC values than apeglm-shrunk LFC (Figures B and C). Given these observations, it appears that in the case of spike-in normalised RNA-seq, apeglm potentially distorts the distribution of LFC values due to a very stringent shrinking of LFC values for genes with low expression or high variance towards 0. This makes me think that maybe in this case it is better to stick with the older normal shrinkage approach.

I would really appreciate if people could share their experience of using apeglm or normal shrinkage for spike-in normalised data analysis or any advice from the statistics guru on whether the choice of normal shrinkage is justified when analysing global gene expression changes using spike-in normalised RNA-seq.

Thank you very much in advance!

deseq2 apeglm rna-seq spike-in • 1.7k views

ADD COMMENT • link updated 3.7 years ago by Michael Love 41k • written 3.7 years ago by nadezda.fursova ▴ 10

score 0 · Answer 1 · 2020-08-14

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 21 minutes ago

United States

Great question and thanks for supplying all the figures.

In apeglm we do have a zero centered prior, whereas here the prior might better be shifted to some value between 0.5 and 1. We can't do that in apeglm directly, but I think you could accomplish this in practice by dividing the "treated" samples by 2^(mode of the LFC distribution) before running DESeq() and lfcShrink(), and then adding back (mode of LFC distribution) to the LFC estimates. Does that make sense? Would you mind trying that to see if it resolves the estimates?

ADD COMMENT • link 3.7 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thank you so much for your reply!

I tried your suggestion to resolve the apeglm LFC estimate issue and summarised the results in the figure below. I did it both for apeglm and normal LFC shrinkage just for comparison. Overall, I can see that correcting for the non-zero mode of LFC distribution prior to running DESeq() and lfcShrink() results in a unimodal distribution of apeglm-shrunk LFC values which now looks more similar to the distribution of non-shrunk LFC values (Figures C and D). However, if I now use padj values derived from the mode-corrected results, this probably not surprisingly results in a much smaller number of significantly changing genes (padj < 0.05): total number of genes for which padj < 0.05 goes down from ~11,500 to ~3,500. I don't know if it would make sense to keep padj values from the initial analysis and use LFC estimates from the mode-corrected analysis?

enter image description here

ADD REPLY • link 3.7 years ago nadezda.fursova ▴ 10

0

Entering edit mode

The new LFC distribution looks a lot better. I would say the p-values to use depend on the null that you want to specify. Do you want to compare against LFC=0 (in these new plots) or against LFC deviating from the mode?

ADD REPLY • link 3.7 years ago Michael Love 41k

0

Entering edit mode

True, this makes sense now. If I want to compare against LFC=0 (to find all significant gene expression changes), I would use p-values from the original testing. If I want to compare against LFC=mode (to find gene expression changes significantly deviating from the mode), I would use p-values from the mode-corrected results.

Many thanks again for all your help!

ADD REPLY • link 3.7 years ago nadezda.fursova ▴ 10