Question

DESeq2: too many NA for pvalues?

Brian Smith ▴ 120

@brian-smith-6197

United States

Hi,

I am using DESeq2 to test for differential abundance between two sets of microbiome (16S) data. About 10% of the p-values come back as NA. Here is my code and some of the results:

> myds

class: DESeqDataSet
dim: 598 50
metadata(0):
assays(1): counts
rownames(708): 192963 4465907 ... 189592 580008
rowRanges metadata column names(0):
colnames(50): Sample135 Sample246 ... Sample25 Sample11
colData names(15): Lab_ID Stool_Num ... amount_mg

> myres <- DESeq(myds)
> res <- results(myres)
> head(res)

log2 fold change (MAP): phenotype case vs control
Wald test p-value: phenotype case vs control
DataFrame with 6 rows and 6 columns
         baseMean log2FoldChange     lfcSE       stat       pvalue        padj
        <numeric>      <numeric> <numeric>  <numeric>    <numeric>   <numeric>
189592   8.434254      1.0976058 0.5416153  2.0265414 4.270934e-02 0.269364834
4465907  9.357253     -0.8675216 0.5354922 -1.6200452 1.052226e-01 0.384802311
177310  61.323202      1.8968030 0.8038772  2.3595680 1.829623e-02 0.173950686
4364222 35.717973     -0.1593546 0.8631786 -0.1846137           NA          NA
194107   9.983001      1.4331188 0.5869212  2.4417565 1.461600e-02 0.152629370
189110  43.982965      4.1196519 1.0192709  4.0417634 5.305074e-05 0.005019356

Why am I getting so many NAs? Should I be using different/additional arguments for the DESeq function?

thanks!!

deseq2 deseq • 3.3k views

updated 8.4 years ago by Michael Love 41k • written 8.4 years ago by Brian Smith ▴ 120

score 0 · Answer 1 · 2015-11-24

As the vignette explains, a gene (here, a taxon) gets an NA p-value when one or more of its samples has a count flagged as an outlier by Cook's distance. Setting the p-value to NA is just the default behavior when an outlier is detected and there are not enough samples for us to feel that replacing the outlier value is appropriate.
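To see where the NA values come from, it can help to tabulate them. The sketch below is only an illustration: it uses a simulated data set from makeExampleDESeqDataSet() as a stand-in for the real object, and the planted extreme count and the names dds, res and na_p are assumptions, not part of the original post.

```r
library(DESeq2)

# Simulated stand-in for the poster's data (hypothetical; substitute your own object).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 600, m = 12)

# Plant one extreme count so that at least one row is likely to be flagged.
counts(dds)[1, 1] <- 100000L

dds <- DESeq(dds)
res <- results(dds)

# Rows with no p-value at all.
na_p <- is.na(res$pvalue)
sum(na_p)

# Among those rows, baseMean == 0 marks all-zero rows;
# the rest had their p-value withheld by the Cook's distance flag.
table(zero_counts = res$baseMean[na_p] == 0)

# Rows with a p-value but no adjusted p-value were removed by independent filtering.
sum(!na_p & is.na(res$padj))
```

On the real data, the same three summaries applied to res should show how many of the NA p-values come from the outlier flag rather than from all-zero rows.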

You can turn off the outlier filtering by passing cooksCutoff=FALSE to results(), and then inspect the counts for those genes yourself to judge whether they really contain outliers; the vignette covers all of this in detail. It could be that a single sample consistently has extreme counts, in which case that sample might best be removed. In the vignette's section on outliers and Cook's distance, see the paragraph in bold, "Note on many outliers".
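A minimal sketch of that inspection is below. It again runs on simulated stand-in data with one planted extreme count (both are assumptions for illustration), and the names res_default, res_nofilter and flagged are hypothetical. The per-sample boxplot of Cook's distances follows the approach shown in the vignette.

```r
library(DESeq2)

# Simulated stand-in for the poster's data (hypothetical; substitute your own object).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 600, m = 12)
counts(dds)[1, 1] <- 100000L   # planted extreme count, for illustration only
dds <- DESeq(dds)

# Default results versus results with the Cook's distance filter switched off.
res_default  <- results(dds)
res_nofilter <- results(dds, cooksCutoff = FALSE)

# Rows whose p-value was withheld only because of the Cook's distance flag.
flagged <- which(is.na(res_default$pvalue) & !is.na(res_nofilter$pvalue))

# Normalized counts for the flagged rows, to judge by eye whether they hold outliers.
counts(dds, normalized = TRUE)[flagged, , drop = FALSE]

# Cook's distances per sample; one sample sitting well above the rest
# suggests that sample is driving many of the flags.
boxplot(log10(assays(dds)[["cooks"]]), range = 0, las = 2)
```

If one sample stands out in the boxplot, that is the case where removing the sample may be more sensible than disabling the filter.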

You can also detect problematic samples with exploratory data analysis (EDA) plots such as PCA (the vignette shows how to make these as well). Just to be clear, you should only remove samples for which the experiment may have failed or the quality is too low. Do not remove samples simply because they deviate slightly from the others, as this could represent normal biological variation. This is a somewhat subjective decision that the analyst should make and be able to defend.
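For the PCA, something along these lines should work. This is a sketch on simulated stand-in data, where the grouping column is called condition; with the data from the question, intgroup would presumably be "phenotype". The full varianceStabilizingTransformation() is used instead of vst() because vst() by default subsamples 1000 rows, which is more than a table of this size has.

```r
library(DESeq2)

# Simulated stand-in for the poster's data (hypothetical; substitute your own object).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 600, m = 12)

# Variance-stabilizing transformation on all rows, blind to the design.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# PCA plot colored by the grouping variable ("condition" in the simulated data).
plotPCA(vsd, intgroup = "condition")

# The same coordinates as a data frame, for picking out individual samples by name.
pca_data <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
head(pca_data)
```

A sample that sits far from every other sample on the first components is a candidate for a closer look, not automatically a candidate for removal.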