Dealing with genes that have Padj=NA
3
5
Entering edit mode
raya.fai ▴ 60
@rayafai-9396
Last seen 13 months ago
Israel

Hello,

I have a question about  the genes that get Padj = NA.

In my experiment I compare 3 control samples and 3 treated samples. Out of 4500 genes, about 1800 get Padj = NA. I wish to understand how to treat these genes: as not changed genes or to exclude them from my analysis. Since I want to do a Fisher test on the data it is important for me to know for each gene if it changed, did not change or undetermined.

As I understand from the vignette this happens because of the automatic independent filtering. I read in section 3.8 that this is an optimization of the FDR correction (optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha).

I also read that it is possible to remove the independent filtering by writing independentFiltering=FALSE in the results function.

My question is how to treat these Padj=NA genes and what do I lose if I run DEseq2 without the independent filtering?

Thank you very much,

Raya Romm, PhD student

The Hebrew University of Jerusalem

deseq2 • 22k views
8
Entering edit mode
@mikelove
Last seen 1 day ago
United States

The genes with adjusted p-value of NA have less mean normalized counts than the optimal threshold. You can make the same plot as in the vignette to see how the power increases when the threshold increases.

There's no right answer of exactly what filter to use, it's a sliding scale of "counts high enough to have good power to detect differential expression". One choice is to optimize the power, as detailed in the independent filtering reference by Bourgon, which is what you get by default.

If you want to include more genes (so have less NA adjusted p-values) you can pick a lower threshold using this plot, and then:

results(dds, independentFiltering=FALSE)

res$pvalue[res$baseMean < x] <- NA

res$padj <- p.adjust(res$pvalue, method="BH")
0
Entering edit mode

Hi Michael!

I jumped in in this conversation because I have a similar issue concerning my p values and you could probably help me understand. I am doing a gene expression analysis using deseq2. In total I have 5534 genes and out of them around 230 genes showed NA for both adjusted and non-adjusted. Maybe this is not that important since is a rather low proportion of the whole gene set but I would like to understand why. I read about the reasons why NA can be generated but when I check the data set those gene counts seem to be quite ok and not very extreme or different from the others.

For example this gene below is one of those that gives NA for both kinds of p-values. the 3 first numbers are the replicates from the first treatment and the second 3 numbers are the replicates from the second treatment:

PP_2663: 4106 30886 4353 1297 6438 7720 these are the not normalized counts

PP_2663: 3701.2 115446.2 3025.1 1942.6 3665 3689.9 these are the deseq-normalized counts

This is how a **normal gene (no NA p values)** looks like:

PP_4980: 8896 5882 9057 5371 11917 13615 not normalized

PP_4980: 8019 21985.8 6294.2 8044.6 6784.2 6507.5 normalized

This weirdo also does give a number and not a NA for the adjusted and not-adjusted p-values:

PP_5640: 0 0 1 0 3 2 not normalized

PP_5640: 0 0 0.6 0 1.7 0.9 normalized

Soo what is going on here? am I doing something wrong? the pipeline and commands are quite straightforward. I just provide my count files matrix and DESeq it.

As I said maybe is not that important but it feels that these analysis are not correct. I do not think that filtering the low count genes would affect the results much as only three genes have a row sum lower than 10. The other genes have much higher counts (at least 300).

I wanna get to the bottom of this because I am failing to find differences in gene expression even between conditions that should give differences. The replicate number is low, I know, and there is variation between the replicates of the treatments which I suspect come from the library preparation (the proportion of coding-RNA vs non-coding like RNA is highly variable between replicates). Maybe that is not connected at all with my question above but I am just trying to connect the dots and give all the info of the peculiarities of this data set. It might help.

Thank you very much in advance!!

Regards

1
Entering edit mode
Fuqi Xu ▴ 10
@fuqi-xu-16585
Last seen 6.0 years ago

I came across the same problem when analyzing my data. This is how I dealt with it. If P value = NA, there is an extreme count, which is defined by Cook's distance. So the simplest way is to delete the abnormal observations until it returns a valid p-value.

Also, we need to make sure those observations are deletable. This observation doesn't have special biological meaning and deleting those observations doesn't change much of the p-value of other genes.

0
Entering edit mode

can you explain with more detail what should i do?

0
Entering edit mode
raya.fai ▴ 60
@rayafai-9396
Last seen 13 months ago
Israel

Hi Michael,

Raya