I am a big fan of DESeq2 and I use it to analyze my RNA-seq data. I am interested in looking at the genes that are marked as dispersion outliers.
I find in the Love et al., 2014 paper that a gene is considered as a dispersion outlier "if the residual from the trend fit is more than two standard deviations of logarithmic residuals".
When I look at my data I am surprised to find genes that have high variability in expression that are not marked as dispersion outliers while genes with a variability that seems a little less are marked as dispersion outliers. I understand from the description in the paper that this cutoff is made according to the standard deviation. I do not fully understand how is the calculation of shrinking the dispersion toward the trended prior mean is made.
I would appreciate to get some intuition to why some genes are not marked as outliers.
For example, here are the normalized reads of genes that have a high variability but are not marked as dispersion outliers (I have 12 replicates of the same condition):
1424, 1306, 1127, 618, 796, 768, 455, 466, 428, 590, 375, 306
2158, 1758, 1466, 784, 1021, 1061, 520, 494, 573, 720, 515, 438
2162, 1721, 2067, 244, 2501, 2630, 178, 272, 217, 2439, 349, 304
Here are the normalized reads of genes that are marked as a dispersion outliers:
3732, 3180, 2967, 2468, 956, 1742, 1597, 1738, 792, 573, 654, 611
3378, 3080, 2190, 1491, 1001, 1470, 1199, 1288, 874, 794, 515, 783
The genes in both categories have comparable baseMean expression. Why should not the genes in the first group also be considered as outliers? Under specific circumstances, is it correct to change the cutoff of what is considered a dispersion outlier? Is there a parameter that controls it?
Thanks you very much for the helpful forum.
All the best, Raya
Thank you very much for your answer. I now understand better the prior calculation. A technical question, how can I add gene names to the dispersion plot using the code that you mentioned?
Thanks again, Raya
I hope I can ask another question regarding the dispersion minimal values. You kindly previously explained to me that: For the MLE estimate, estimate at the minimal value means that the data is essentially consistent with Poisson (the variance happened to be at or below the mean).
How to get the dispersion values before fitting?
When I manually calculate the dispersion of each gene according to (variance - mean)/mean^2 , I find that the group with negative values of dispersion is only part of the group that get the minimal estimate values of e-8 by the DESeq2 algorithm. I have two questions:
Thank you very much again, Raya