Hello everybody,
I have a question regarding the use of DESeq2 for candidate analysis, i.e. the targeted investigation of a pre-specified set of genes.
In particular, I have whole-genome RNA-seq data from an animal model that I would like to use to support structural imaging findings made in the same model. To get at the cellular/molecular basis of the changes in my imaging markers, I would like to investigate the differential expression of a list of genes that I specified in advance.
My idea to approach this would be the following:
1) Run DESeq2 on the whole-genome data, thereby using the information from all genes to get more accurate p-values and effect sizes after normalization and dispersion shrinkage. As far as I understand the 2014 paper by Love et al., the dispersion shrinkage does not uniformly reduce the variance estimates, but in some cases also increases them (see Fig. 1 in the paper). In other words, this is not a method for making your p-values look as good as possible, but for making them as accurate as possible using all the information you have.
2) However, since the hypotheses that I actually want to test concern only a small fraction of all genes (around 30 out of >8000), the FDR correction built into the FindMarkers function (which uses the whole genome for the Benjamini-Hochberg correction) would be unreasonably strict. Thus my idea would be to extract the uncorrected p-values from DESeq2 for my candidate genes and FDR-correct them for 30 tests.
This way I would have more accurate p-values and log fold changes thanks to DESeq2's elaborate shrinkage procedures, and a reasonable multiple-comparison correction that accounts for the number of hypotheses I am actually interested in. Again, the list of genes is pre-specified and, obviously, not based on the p-values.
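To make step 2 concrete, here is a minimal Python sketch of the correction I have in mind. It is an illustration only, not DESeq2 code: all p-values are made up, the names `candidate_p` and `background_p` are hypothetical, and `benjamini_hochberg` is a plain reimplementation of the standard BH adjustment rather than anything taken from a package.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for 30 candidate genes (illustration only).
candidate_p = [0.0004, 0.003, 0.01, 0.04] + [0.2 + 0.8 * i / 26 for i in range(26)]

# Made-up, evenly spread p-values standing in for the remaining ~7970 genes.
background_p = [(i + 0.5) / 7970 for i in range(7970)]

# Correction over the 30 candidate tests only.
q_subset = benjamini_hochberg(candidate_p)

# Correction over all 8000 tests, then keep the candidate genes.
q_genome = benjamini_hochberg(candidate_p + background_p)[:len(candidate_p)]

for p, qs, qg in zip(candidate_p[:4], q_subset, q_genome):
    print(f"p={p:.4f}  BH over 30 tests: {qs:.3f}  BH over 8000 tests: {qg:.3f}")
```

The only thing that differs between the two calls is the set of p-values handed to the correction; the raw p-values themselves are the same.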
The alternative would be to just use the raw count data without normalization and run Wilcoxon rank-sum tests on those, at a higher risk of false positives and false negatives given the high dispersion that comes with my typically small sample size. The subsequent multiple-comparison correction on these tests would likewise use only the number of genes in my set; the only thing I would lose is the normalization and dispersion shrinkage that should make my inferences more robust.
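For comparison, this is roughly what that alternative would look like for a single gene. Again a minimal sketch under assumptions: the counts and the group sizes (4 vs. 4) are made up, `midranks` and `exact_rank_sum_p` are hypothetical helper names, and the p-value is computed exactly by enumerating all possible group assignments, which is only feasible for small samples.

```python
from itertools import combinations


def midranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def exact_rank_sum_p(group_a, group_b):
    """Two-sided exact Wilcoxon rank-sum p-value via full enumeration."""
    ranks = midranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    # Mean of the rank sum of group A under the null hypothesis.
    expected = n_a * (len(ranks) + 1) / 2
    observed = abs(sum(ranks[:n_a]) - expected)
    hits = 0
    total = 0
    # Count relabelings whose rank sum is at least as extreme as observed.
    for combo in combinations(range(len(ranks)), n_a):
        total += 1
        if abs(sum(ranks[i] for i in combo) - expected) >= observed - 1e-9:
            hits += 1
    return hits / total


# Made-up raw counts for one gene in two groups of four animals each.
control_counts = [120, 95, 130, 110]
treated_counts = [310, 280, 295, 400]

p_value = exact_rank_sum_p(control_counts, treated_counts)
print(f"exact two-sided rank-sum p-value: {p_value:.4f}")
```

With 4 vs. 4 animals there are only 70 possible group assignments, so even perfectly separated counts like these cannot give a two-sided p-value below 2/70; with 3 vs. 3 the floor is 2/20.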
This DESeq2 idea was outlined in a post on another forum (I replied to the answers there as well), but the commenters there trashed the idea. I have to say I don't find the arguments there convincing, and I would love some insight from somebody who understands more about bioinformatics than I do. https://www.biostars.org/p/461442/#9615058
Help is very much appreciated!
Best regards Piffelpaff