Question

DESeq2: Too many "NA" for adjPvalue

0

sagharib • 0

@51826540

Last seen 2.1 years ago

United States

When running DESeq2 version 1.28.1, many adjusted p-values are "NA" even though the sample transcript counts contain neither obvious outliers nor multiple zeroes.

I did not get any errors running DESeq2, except for a note on replacement of outliers:

dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 54 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

Example transcript with a significant p-value and robust counts, but "NA" for padj:

geneID           ENSG00000005889
GeneName         ZFX
GeneDescription  zinc_finger_protein_X-linked
GeneType         protein_coding
log2FoldChange   -0.656100894
lfcSE            0.136681642
stat             -4.800212273
pvalue           1.58E-06
padj             NA

Counts per sample:

HP0002C  112
HP0003C  156
HP0009C  185
HP0015C  590
HP0121C  526
HP0122C  519
HP0171C  522
HP1002C  574
HP1047C  208
HP1049C  529
HP0172C  642
HP0207C  718
HP0217C  509
HP0220C  1129
HP1012C  763
HP1022C  676
HP1035C  697
HP1039C  1049
HP1041C  436

DESeq2 • 1.7k views

ADD COMMENT • link 3.1 years ago • updated 2.1 years ago sagharib • 0

score 0 · Answer 1 · 2021-03-10

0

Michael Love 41k

@mikelove

Last seen 14 hours ago

United States

The vignette has a section that describes the origin of NA values in the adjusted p-values, which can also come from independent filtering (it walks through the various ways you would see an NA).

Try independentFiltering=FALSE in results().
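A minimal sketch of that suggestion, for anyone landing here later. It runs on simulated data from makeExampleDESeqDataSet() as a stand-in for the poster's dataset (an assumption, as are the object names), and cross-tabulates the NA patterns. Per the vignette, a gene removed by independent filtering keeps its p-value and only has padj set to NA, whereas a gene flagged by Cook's distance has both set to NA.

```r
library(DESeq2)

# Simulated counts stand in for the real data (assumption)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds <- DESeq(dds, quiet = TRUE)

res_default  <- results(dds)                                 # independent filtering on
res_nofilter <- results(dds, independentFiltering = FALSE)   # filtering off

# Which genes have NA p-values versus NA adjusted p-values?
table(pvalue_NA = is.na(res_default$pvalue),
      padj_NA   = is.na(res_default$padj))

# Mean normalized count threshold chosen by the filter
metadata(res_default)$filterThreshold

# Number of NA adjusted p-values with and without filtering
sum(is.na(res_default$padj))
sum(is.na(res_nofilter$padj))
```

With filtering off, padj should be NA only where the p-value itself is NA.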

ADD COMMENT • link 3.1 years ago Michael Love 41k

0

Thank you Michael for your quick response. I have noted this filtering option, but could there be a more fundamental issue with the way the NA values are assigned? I appreciate the benefit of excluding outliers and samples with many "0" values, but in my case many of the transcripts that were flagged NA had high counts and may be of biological interest. Can some of the parameters used by DESeq2 be modified to make the independent filtering less aggressive? Thanks again for your hard work on this excellent package!

ADD REPLY • link 3.1 years ago sagharib • 0

1

The filtering is not perfect (it is a greedy optimization), so feel free to turn it off.
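To see what the filter actually chose before turning it off, the results object carries the filtering diagnostics in its metadata. The sketch below again uses simulated data (an assumption); the alpha argument of results() is the significance cutoff the filter optimizes for, so it is worth setting it to the adjusted p-value cutoff you intend to use.

```r
library(DESeq2)

# Simulated counts with some true differences (assumption, not the poster's data)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds <- DESeq(dds, quiet = TRUE)

# alpha should match the padj cutoff used downstream
res <- results(dds, alpha = 0.05)

# Number of rejections at each candidate quantile of the filter statistic
filter_curve <- metadata(res)$filterNumRej
head(filter_curve)

# Quantile and mean-count threshold that were selected
metadata(res)$filterTheta
metadata(res)$filterThreshold

# Visual check of the selected quantile against the rejection curve
plot(filter_curve, type = "b",
     xlab = "quantiles of filter", ylab = "number of rejections")
abline(v = metadata(res)$filterTheta)
```

If the selected threshold looks too high for your data, that is a reasonable sign to switch the filter off or supply your own filter.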

ADD REPLY • link 3.1 years ago Michael Love 41k

0

Following up on this issue: I noticed that I get "NA" adjusted p-values, even when the counts in a given comparison have no outlier values, when I analyze a subset of a larger dataset. In that case one sample does have an outlier count, but that sample is not part of the sub-analysis. It seems the flagging is applied across all samples: if a transcript has an outlier value in any sample, it is flagged and its adjusted p-value set to "NA", whether or not the outlier sample is included in the sub-analysis.

ADD REPLY • link 2.1 years ago sagharib • 0

0

When I repeated the analysis from the beginning, restricted to the subset comparison by loading only the relevant count data and coldata, the issue was resolved. However, I was under the impression that the recommendation is to build the dds model using all data and then perform subset analyses.
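For reference, the subset can also be taken from an existing dds object rather than re-importing the files. This is only a sketch on simulated data: the group column and its levels "A", "B", "C" are hypothetical names, not taken from the dataset above.

```r
library(DESeq2)

# Simulated data with a hypothetical three-level grouping (assumption)
set.seed(1)
dds_full <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds_full$group <- factor(rep(c("A", "B", "C"), each = 4))
design(dds_full) <- ~ group

# Keep only the samples in the comparison of interest
keep <- dds_full$group %in% c("A", "B")
dds_sub <- dds_full[, keep]

# Drop the now-empty factor level so the design matrix is full rank
dds_sub$group <- droplevels(dds_sub$group)

# Refit on the subset: dispersions and Cook's distances now use these samples only
dds_sub <- DESeq(dds_sub, quiet = TRUE)
res_sub <- results(dds_sub, contrast = c("group", "B", "A"))
summary(res_sub)
```

The trade-off is that dispersion estimates from the subset rest on fewer samples than those from the full model.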

ADD REPLY • link 2.1 years ago sagharib • 0

0

Since the outlier affects dispersion estimation, the conservative choice is to flag the gene. But of course you can always turn off the automatic flagging with cooksCutoff=FALSE in results() and just examine the Cook's distances manually. If you have subsets with outliers and other subsets without, this is probably the better approach.
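A rough sketch of that approach, again on simulated data (an assumption, as are the variable names): turn off the flagging in results(), then pull the per-sample Cook's distances from the dds object and check which sample drives the largest value for each gene.

```r
library(DESeq2)

# Simulated counts stand in for the real data (assumption)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds <- DESeq(dds, quiet = TRUE)

# Results without Cook's-distance flagging
res_noflag <- results(dds, cooksCutoff = FALSE)

# Genes x samples matrix of Cook's distances; drop rows with missing values
cooks <- assays(dds)[["cooks"]]
cooks <- cooks[complete.cases(cooks), , drop = FALSE]

# Largest Cook's distance per gene, top ten genes
max_cooks <- apply(cooks, 1, max)
top <- head(sort(max_cooks, decreasing = TRUE), 10)

# Sample responsible for the maximum in each of those genes
worst_sample <- colnames(cooks)[apply(cooks[names(top), , drop = FALSE], 1, which.max)]
data.frame(gene = names(top), max_cooks = unname(top), sample = worst_sample)

# Per-sample distribution of Cook's distances, as in the vignette
boxplot(log10(cooks), range = 0, las = 2)
```

If the sample named for a gene is not part of the comparison you care about, that is the situation described above, and the gene can be judged on its own merits.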