DESeq2: too many NA for pvalues?
Brian Smith ▴ 120


I was trying to use DESeq2 to find differential abundances between two sets of microbiome (16S) data. About 10% of the p-values come back as NA. Here is my code and some results:

> myds

class: DESeqDataSet 
dim: 598 50 
assays(1): counts
rownames(708): 192963 4465907 ... 189592 580008
rowRanges metadata column names(0):
colnames(50): Sample135 Sample246 ... Sample25 Sample11
colData names(15): Lab_ID Stool_Num ... amount_mg

> myres <- DESeq(myds)
> res <- results(myres)
> head(res)

log2 fold change (MAP): phenotype case vs control 
Wald test p-value: phenotype case vs control 
DataFrame with 6 rows and 6 columns
         baseMean log2FoldChange     lfcSE       stat       pvalue        padj
        <numeric>      <numeric> <numeric>  <numeric>    <numeric>   <numeric>
189592   8.434254      1.0976058 0.5416153  2.0265414 4.270934e-02 0.269364834
4465907  9.357253     -0.8675216 0.5354922 -1.6200452 1.052226e-01 0.384802311
177310  61.323202      1.8968030 0.8038772  2.3595680 1.829623e-02 0.173950686
4364222 35.717973     -0.1593546 0.8631786 -0.1846137           NA          NA
194107   9.983001      1.4331188 0.5869212  2.4417565 1.461600e-02 0.152629370
189110  43.982965      4.1196519 1.0192709  4.0417634 5.305074e-05 0.005019356

Why am I getting so many NAs? Should I be using different/additional arguments for the DESeq function?


deseq2 deseq • 1.7k views

If you read the vignette, it explains that these NA p-values are assigned to genes with an outlier count present in one or more groups. Setting the p-value (and the adjusted p-value) to NA is just the default behavior when there is an outlier and there are too few samples for us to feel that replacing the outlier count is appropriate.
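To see how many NAs fall into which case, here is a minimal base-R sketch. It relies on the vignette's note on p-values set to NA: a row with all-zero counts has baseMean 0 and NA everywhere, an outlier row has NA for both pvalue and padj, and a row removed by independent filtering has NA for padj only. `classify_na` is a hypothetical helper, not a DESeq2 function, and the toy table is made up for illustration (only its first two rows are rounded from the output above).

```r
# Hypothetical helper (not part of DESeq2): label each row of a results
# table by the most likely reason its pvalue or padj is NA.
classify_na <- function(res) {
  res <- as.data.frame(res)
  reason <- rep("tested", nrow(res))
  # padj is NA but pvalue is not: removed by independent filtering
  reason[!is.na(res$pvalue) & is.na(res$padj)] <- "independent filtering"
  # pvalue is NA although the row has counts: flagged as an outlier
  reason[is.na(res$pvalue) & res$baseMean > 0] <- "outlier (Cook's distance)"
  # no counts in any sample: nothing to test
  reason[res$baseMean == 0] <- "all counts zero"
  reason
}

# Toy table for illustration; rows 3 and 4 are invented.
toy <- data.frame(
  baseMean = c(8.43, 35.72, 0, 1.2),
  pvalue   = c(0.0427, NA, NA, 0.5),
  padj     = c(0.269, NA, NA, NA)
)
print(table(classify_na(toy)))
```

On the real data, `table(classify_na(res))` should give the same kind of breakdown, assuming `res` is the object returned by `results()`.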

You can turn off the outlier filtering with cooksCutoff=FALSE when you call results(), and then inspect the genes yourself to see whether you feel there are outliers. It could be that a single sample consistently has extreme counts, and this sample might best be removed. See the vignette for details on all of this: the section on outliers and Cook's distance has a paragraph in bold, "Note on many outliers".
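A rough sketch of that inspection follows. So that it runs on its own, it uses simulated data from makeExampleDESeqDataSet() as a stand-in; with the data from the question you would use `myres` in place of `dds`. The boxplot of Cook's distances per sample follows the approach shown in the vignette, and the variable names are my own.

```r
library(DESeq2)

# Stand-in data so the sketch is self-contained (an assumption, not the
# poster's data); with real data use the fitted object `myres` instead.
set.seed(1)
dds <- DESeq(makeExampleDESeqDataSet(n = 500, m = 12))

res_default  <- results(dds)                       # Cook's filter on (default)
res_nofilter <- results(dds, cooksCutoff = FALSE)  # Cook's filter off

# Rows whose p-value is NA only because of the Cook's distance filter.
flagged <- rownames(res_default)[
  is.na(res_default$pvalue) & !is.na(res_nofilter$pvalue)
]
cat("rows flagged by Cook's distance:", length(flagged), "\n")

# Cook's distances per sample: a sample whose box sits well above the
# others is a candidate for closer inspection.
cooks <- assays(dds)[["cooks"]]
boxplot(log10(cooks), range = 0, las = 2, ylab = "log10 Cook's distance")

# Normalized counts for the first few flagged rows, to eyeball the outliers.
if (length(flagged) > 0) {
  print(round(counts(dds, normalized = TRUE)[head(flagged), , drop = FALSE]))
}
```

If the same sample carries the extreme count in most of the flagged rows, that points to the single-sample situation described above.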

You can also detect problematic samples through EDA plots such as PCA (the vignette shows how to do this as well). Just to be clear, you want to make sure to only remove samples for which the experiment may have failed or the quality is too low. You don't want to remove samples simply because they deviate slightly from the others, as this could represent normal biological variation. This is a somewhat subjective decision which the bioinformatic analyst should make and be able to defend.
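For the PCA, something along these lines should work. Again this is a sketch on simulated stand-in data; for the data in the question you would pass `myds` and, judging from the results header, intgroup = "phenotype". I use varianceStabilizingTransformation() rather than vst() here because vst() by default fits on a subset of 1000 rows, which is more than this example (or the dataset in the question) has.

```r
library(DESeq2)

# Stand-in data (an assumption); replace `dds` with your own DESeqDataSet.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 12)

# Transform counts so that highly abundant rows do not dominate the PCA.
# blind = TRUE ignores the design, which suits quality assessment.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# Coordinates of each sample on the first two principal components.
pca_df <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
print(head(pca_df))

# The plot itself; a sample far from all others is worth a closer look.
print(plotPCA(vsd, intgroup = "condition"))
```

A sample that stands apart here and also dominates the Cook's distances is the kind of case where removal may be defensible.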

