Hello,
I have a question about the genes that get padj = NA.
In my experiment I compare 3 control samples with 3 treated samples. Out of 4500 genes, about 1800 get padj = NA. I would like to understand how to treat these genes: should I count them as unchanged, or exclude them from my analysis? Since I want to run a Fisher test on the data, it is important for me to know for each gene whether it changed, did not change, or is undetermined.
As I understand from the vignette this happens because of the automatic independent filtering. I read in section 3.8 that this is an optimization of the FDR correction (optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha).
I also read that it is possible to turn off the independent filtering by setting independentFiltering=FALSE in the results function.
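To make the idea of independent filtering concrete, here is a minimal Python sketch of Benjamini-Hochberg adjustment with and without a filter on mean count. The p-values, mean counts, and the threshold of 10 are made up for illustration, and DESeq2 chooses its threshold automatically, so this shows only the principle and not the package's actual procedure.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up data: 4 well-expressed genes and 6 genes with very low counts.
base_mean = [500, 320, 210, 150, 4, 3, 2, 2, 1, 1]
pvals = [0.001, 0.008, 0.02, 0.04, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
alpha = 0.05
threshold = 10  # hand-picked here; an assumption, not DESeq2's choice

# No filtering: every gene enters the multiple-testing correction.
padj_all = bh_adjust(pvals)

# Filtering: low-count genes are left out and reported as None (NA).
keep = [i for i, mean in enumerate(base_mean) if mean >= threshold]
padj_filtered = [None] * len(pvals)
for i, adj in zip(keep, bh_adjust([pvals[i] for i in keep])):
    padj_filtered[i] = adj

n_all = sum(p < alpha for p in padj_all)
n_filtered = sum(p is not None and p < alpha for p in padj_filtered)
print(n_all, n_filtered)  # prints: 2 4
```

In this toy example the filtered genes keep their raw p-value but get no adjusted one, and two more of the remaining genes pass the cutoff because the correction is spread over fewer tests.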
My question is how to treat these padj = NA genes, and what do I lose if I run DESeq2 without the independent filtering?
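For the Fisher test, one option is to keep the three states apart explicitly. The sketch below is in Python with made-up padj values, a hypothetical gene set, and an assumed cutoff of 0.1; it only builds the 2x2 table from genes that have a padj and counts the undetermined ones separately. Whether excluding them is the right choice is exactly the question above, so treat this as one possible bookkeeping, not a recommendation.

```python
import math


def classify(padj, alpha=0.1):
    """Return 'changed', 'not changed' or 'undetermined' for one gene."""
    if padj is None or math.isnan(padj):
        return "undetermined"
    return "changed" if padj < alpha else "not changed"


# Made-up adjusted p-values; None and NaN both stand in for NA.
padj = {
    "g1": 0.01, "g2": 0.5, "g3": None, "g4": 0.03,
    "g5": float("nan"), "g6": 0.8, "g7": 0.2, "g8": 0.04,
}
gene_set = {"g1", "g3", "g4", "g6"}  # hypothetical set of interest

# Rows: in gene set / not in gene set. Columns: changed / not changed.
table = [[0, 0], [0, 0]]
undetermined = 0
for gene, value in padj.items():
    state = classify(value)
    if state == "undetermined":
        undetermined += 1
        continue
    row = 0 if gene in gene_set else 1
    col = 0 if state == "changed" else 1
    table[row][col] += 1

print(table, undetermined)  # prints: [[2, 1], [1, 2]] 2
```

The resulting table can then be passed to any Fisher exact test routine, and the count of undetermined genes shows how much of the data the test did not see.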
Thank you very much,
Raya Romm, PhD student
The Hebrew University of Jerusalem
Hi Michael!
I am jumping into this conversation because I have a similar issue with my p-values, and you could probably help me understand it. I am doing a gene expression analysis using DESeq2. In total I have 5534 genes, and around 230 of them show NA for both the adjusted and the unadjusted p-values. Maybe this is not that important, since it is a rather low proportion of the whole gene set, but I would like to understand why. I read about the reasons why NA can be generated, but when I check the data set, the counts for those genes seem quite OK and not very extreme or different from the others.
For example, the gene below is one of those that give NA for both kinds of p-values. The first 3 numbers are the replicates from the first treatment and the last 3 numbers are the replicates from the second treatment:
PP_2663: 4106 30886 4353 1297 6438 7720 (raw, non-normalized counts)
PP_2663: 3701.2 115446.2 3025.1 1942.6 3665 3689.9 (DESeq2-normalized counts)
This is what a **normal gene (no NA p values)** looks like:
PP_4980: 8896 5882 9057 5371 11917 13615 (raw, non-normalized counts)
PP_4980: 8019 21985.8 6294.2 8044.6 6784.2 6507.5 (DESeq2-normalized counts)
This weirdo, in contrast, does get a number and not NA for both the adjusted and unadjusted p-values:
PP_5640: 0 0 1 0 3 2 (raw, non-normalized counts)
PP_5640: 0 0 0.6 0 1.7 0.9 (DESeq2-normalized counts)
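A quick check on the numbers above, sketched in Python: dividing raw by normalized counts recovers the per-sample size factors, and comparing each gene's largest normalized count with its median gives a rough sense of how far one replicate stands out. DESeq2 sets both p-values to NA when a sample is flagged as a count outlier by Cook's distance; the max/median ratio used here is only a crude stand-in for that diagnostic, and whether this is what happened to PP_2663 is an assumption.

```python
from statistics import median

# Counts copied from the post (3 replicates per treatment).
raw = {
    "PP_2663": [4106, 30886, 4353, 1297, 6438, 7720],
    "PP_4980": [8896, 5882, 9057, 5371, 11917, 13615],
}
normalized = {
    "PP_2663": [3701.2, 115446.2, 3025.1, 1942.6, 3665.0, 3689.9],
    "PP_4980": [8019.0, 21985.8, 6294.2, 8044.6, 6784.2, 6507.5],
}

# Size factor per sample = raw count / normalized count.
size_factors = {
    gene: [r / n for r, n in zip(raw[gene], normalized[gene])]
    for gene in raw
}

# Crude outlier indicator (NOT Cook's distance): largest normalized
# count relative to the gene's median normalized count.
max_over_median = {
    gene: max(values) / median(values)
    for gene, values in normalized.items()
}

for gene in raw:
    rounded = [round(s, 2) for s in size_factors[gene]]
    print(gene, rounded, round(max_over_median[gene], 1))
```

Both genes give the same size factors, with the second sample at roughly 0.27, so that library is much smaller than the others. After normalization, the second replicate of PP_2663 is about 31 times the gene's median, versus about 3 times for PP_4980.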
So what is going on here? Am I doing something wrong? The pipeline and commands are quite straightforward: I just provide my count matrix and run DESeq2 on it.
As I said, maybe it is not that important, but it feels like this analysis is not correct. I do not think that filtering the low-count genes would affect the results much, as only three genes have a row sum lower than 10. The other genes have much higher counts (at least 300).
I want to get to the bottom of this because I am failing to find differences in gene expression even between conditions that should show differences. The number of replicates is low, I know, and there is variation between the replicates of each treatment, which I suspect comes from the library preparation (the proportion of coding vs. non-coding RNA is highly variable between replicates). Maybe that is not connected at all to my question above, but I am trying to connect the dots and give all the information about the peculiarities of this data set. It might help.
Thank you very much in advance!!
Regards