Hi Im doing DGE analysis with DESeq2 to test difference in gene expression under 2 conditions. As other people reported in this foro I got a bunch of genes with NA. However in my case the NA value is in their majority for padj. (see table below) In other words I have a bunch of genes with very low pvalue but got a NA por padj.
I read in the vignette that, at least for p-values the NA represent genes with outliers count in one or more groups. I set the cooksCutoff=FALSE in the results() function but I still got the same results.
When I plot counts for one of these genes:
plotCounts(ddsIAAinCol0,"AT1G11090", intgroup = "treatment")
it does not seems to be an outlier
Any suggestion?
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
AT5G47435 | 185.6364162 | 2.115139359 | 0.366914049 | 5.764672585 | 8.18E-09 | 1.37E-05 |
AT5G66055 | 185.3092117 | 1.644995185 | 0.293873843 | 5.597623681 | 2.17E-08 | 1.81E-05 |
AT1G11090 | 47.78777684 | 1.94133198 | 0.364829883 | 5.321197823 | 1.03E-07 | NA |
AT2G29100 | 15.17214648 | 1.666409849 | 0.450228259 | 3.701255565 | 0.000214535 | NA |
AT1G72600 | 13.04997183 | -1.767013802 | 0.479422186 | -3.685715542 | 0.000228061 | NA |
AT5G57290 | 75.34066235 | -1.224003948 | 0.332385985 | -3.682477613 | 0.000230978 | NA |
AT2G39460 | 33.69752389 | -1.517438704 | 0.412598155 | -3.677764148 | 0.000235287 | NA |
AT3G07050 | 62.78888747 | -1.318916065 | 0.359422 | -3.669547401 | 0.00024298 | NA |
AT1G10400 | 8.787243217 | -1.793799153 | 0.489431764 | -3.665064845 | 0.000247276 | NA |
AT1G48920 | 128.8451058 | -1.121043683 | 0.306312462 | -3.659804357 | 0.000252408 | 0.011708921 |
AT5G51690 | 56.90610637 | 1.385338028 | 0.380859561 | 3.637398583 | 0.000275406 | NA |
AT2G33060 | 82.5823932 | -1.080766384 | 0.297955375 | -3.627276014 | 0.000286427 | 0.012927924 |
AT4G16630 | 71.83981053 | -1.231184086 | 0.341456276 | -3.605685917 | 0.00031133 | NA |
AT5G55730 | 27.34342486 | -1.483301796 | 0.412955145 | -3.591919884 | 0.000328251 | NA |
AT5G65360 | 37.24544013 | -1.324569311 | 0.369607195 | -3.583721667 | 0.000338733 | NA |
AT1G09210 | 67.97424741 | -1.169578499 | 0.327204788 | -3.574454109 | 0.000350959 | NA |
AT2G18270 | 2119.114644 | 0.967748299 | 0.271131214 | 3.569298738 | 0.000357938 | 0.015522311 |
AT1G01160 | 153.078235 | 1.016068318 | 0.285036805 | 3.564691653 | 0.000364284 | 0.015522311 |
AT2G34480 | 103.1868193 | -1.066463342 | 0.299624136 | -3.559337231 | 0.000371792 | 0.015522311 |
AT4G14890 | 47.15691661 | 1.259644836 | 0.354034502 | 3.557971978 | 0.000373729 | NA |
AT4G30280 | 46.33922434 | 1.270915217 | 0.358566567 | 3.544433127 | 0.000393458 | NA |
AT1G06430 | 5.003252608 | 1.744025568 | 0.493321579 | 3.535271195 | 0.000407357 | NA |
AT2G18290 | 544.7909982 | 0.891251585 | 0.253231577 | 3.519512043 | 0.000432341 | 0.017610004 |
AT4G37910 | 42.46839645 | -1.389330143 | 0.396446696 | -3.504456353 | 0.000457541 | NA |
AT5G23420 | 22.05896603 | -1.579334138 | 0.450931699 | -3.502379943 | 0.000461122 | NA |
AT4G14320 | 54.99615495 | -1.230414257 | 0.351421993 | -3.501244323 | 0.000463091 | NA |
AT1G29310 | 10.29744763 | -1.675254517 | 0.479076434 | -3.496841832 | 0.000470801 | NA |
AT3G15960 | 26.15442415 | -1.598265985 | 0.45751882 | -3.493333858 | 0.00047703 | NA |
AT3G62600 | 51.08063801 | -1.271109756 | 0.364746638 | -3.48491151 | 0.0004923 | NA |
Dear Michael, according to the vignette the automatic independent filtering is setting the adjusted p values to NA because it considerer that those genes have a low mean normalized count. I found that only 1600 genes were considered to have a good number of counts. When I look at the data I found that 3000 genes with counts between 20 to 70 are considered low. It is that right? is a value of 20 a low mean normalized count?
I understand that if I remove the if I remove the independent filtering (independentFiltering=FALSE)) I have not more NA, but then I'm worry about where differences found in genes with mean normalized count between 20 to 70 can be trusted. Can you give me some insight?
I will appreciate your help
The threshold is chosen using the genefilter package. You can read more about the strategy in the citation for the genefilter package, or the section of the DESeq2 paper which discusses it.
Briefly: it is not that the genes with mean normalized count less than the data-driven threshold cannot be trusted.
"Trusted" implies that there would be a problem with specificity / false positives / Type I error if we included them. That is not the case. You can filter or not, and this won't affect the specificity (we have a plot showing the uniformity of p-values for various thresholds in the DESeq2 paper).
The reason for filtering low count genes in DESeq2 is to increase sensitivity for the dataset as a whole (i.e. reduces false negatives / Type II error).
However, it's up to you in the end as the data analyst. You can use the filtering or you can set independentFiltering=FALSE and remove the genes with low counts, or keep all the genes with at least a single count per row, etc.