Question

DESeq2 error due to zeros

ariel ▴ 20

@ariel-16886

I'm getting the following error:

estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
  every gene contains at least one zero, cannot compute log geometric means

However, I was under the impression that DESeq2 could handle zeros without padding.
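For context, the message comes from the default median-of-ratios size factor step, which needs a log geometric mean per gene. A minimal base-R sketch of why zeros break that step, using a made-up toy matrix (the data and variable names are hypothetical, and this only mimics the idea rather than reproducing DESeq2's internal code):

```r
# Toy count matrix (hypothetical data): 3 taxa x 4 samples.
# Every row contains at least one zero, as in the filtered 16S table.
counts <- matrix(
  c(10,  0, 30, 40,
     0, 20, 15,  5,
     7,  9,  0, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("taxon", 1:3), paste0("sample", 1:4))
)

# The row-wise mean of log counts is the log geometric mean.
# log(0) is -Inf, so a single zero makes the whole row -Inf.
log_geo_means <- rowMeans(log(counts))

# Only rows with a finite log geometric mean can contribute
# to a median-of-ratios size factor.
usable <- is.finite(log_geo_means)
n_usable <- sum(usable)

print(log_geo_means)
print(n_usable)
```

When no row is usable, there is nothing to take a median over, which is the situation the error describes.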

I am doing 16S sequencing analysis. Prior to running DESeq2, I filter the counts as follows:

1. If a taxon (feature, equivalent to a gene) has a relative abundance below 0.002 in a sample, set that count to zero. That is, assume very low counts are noise.
2. Remove any taxon that appears in fewer than 10% of all samples, since such a taxon might be a contaminant in just a few samples. This is very different from RNA-Seq, where a rare gene could be important; in contrast, we all have mostly the same array of gut bacteria, just in different amounts.
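The two filtering steps above could be sketched in base R roughly as follows. The function name `filter_taxa`, its argument names, and the toy data are hypothetical; the sketch assumes a taxa-by-samples count matrix in which no sample has a total count of zero.

```r
# Hypothetical helper; thresholds are the ones described in the post.
filter_taxa <- function(counts, min_rel_abund = 0.002, min_prevalence = 0.10) {
  # Relative abundance of each taxon within each sample (columns sum to 1).
  rel_abund <- sweep(counts, 2, colSums(counts), "/")
  # Step 1: treat very low relative abundances as noise and zero them out.
  counts[rel_abund < min_rel_abund] <- 0
  # Step 2: keep taxa that are non-zero in at least 10% of samples.
  prevalence <- rowMeans(counts > 0)
  counts[prevalence >= min_prevalence, , drop = FALSE]
}

# Toy data (hypothetical): 4 taxa x 20 samples.
counts <- rbind(
  taxonA = rep(1000, 20),              # abundant everywhere
  taxonB = rep(1, 20),                 # always below 0.002 relative abundance
  taxonC = c(500, rep(0, 19)),         # present in 1 of 20 samples (5%)
  taxonD = c(rep(500, 3), rep(0, 17))  # present in 3 of 20 samples (15%)
)

filtered <- filter_taxa(counts)
print(rownames(filtered))
```

Note that step 1 introduces additional zeros before step 2 is applied, so the order of the two steps matters.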

After this filtering, I have the following counts for number of zeros in each row:

> genus_taxa_counts %>% select(num_zero, rowmed, rowmax, rowiqr)
   num_zero  rowmed rowmax   rowiqr
1        73     0.0   5943   776.50
2        50   415.5   4782  1106.00
3       120     0.0   1315     0.00
4         2 19626.0  85948 16946.75
5        77     0.0   6190   797.25
6       117     0.0   1537     0.00
7        99     0.0   2241   128.25
8        75     0.0   4950   342.00
9       101     0.0   7263     0.00
10      119     0.0  18411     0.00
11      114     0.0  59479     0.00
12        8  1672.0  18372  1948.00
13       24   994.0  68267  1410.00
14       75     0.0   9171   414.25
15       50   415.5  10773  1174.50
16       95     0.0   2103   203.00
17      103     0.0   2264     0.00
18       16  1211.0  10253  2198.25
19       16  1172.5  17777  2097.00
20        1  4025.5  22392  3910.75
21      112     0.0   1398     0.00
22      103     0.0    888     0.00
23      108     0.0   3710     0.00
24       69     0.0   1937   364.50
25       33   540.0  15660   946.50
26      114     0.0   1470     0.00
27       35   615.5   4251  1240.25
28       16   712.0  10736  1028.00
29       25   986.5   7542  1564.50
30      120     0.0    424     0.00
31       65   143.0   1850   417.25
32       37   417.0   9197  1078.00
33      108     0.0   1130     0.00
34      105     0.0   1461     0.00
35      115     0.0    392     0.00
36      111     0.0    664     0.00
37       25   657.5  13927  1391.25
38      105     0.0   2017     0.00
39      100     0.0   2489   124.50
40      106     0.0   1156     0.00
41       81     0.0   1224   248.00
42      120     0.0   2782     0.00
43       13  4224.5  25059  5618.00
44       88     0.0   2355   254.25
45       80     0.0   1221   233.00
46       86     0.0   1450   256.50
47       54   238.0   1923   517.25
48      112     0.0   1687     0.00
49       34   414.0   5126   750.50
50       73     0.0  11691   838.00
51       45   241.5   2160   458.75
52       56   270.5   3683   535.50
53       35  1643.0  18562  3046.25
54       83     0.0   3640   288.75
55       44   365.0   4675   851.75
56       92     0.0   2664   210.00
57       69     0.0   1056   326.25
58       68     0.0   8789  1162.25
59       40   600.0   5797  1462.50
60       43  1114.5  12190  2504.50
61       26  1167.0  10098  2202.75
62      106     0.0   4431     0.00
63      116     0.0   6652     0.00
64      107     0.0   1298     0.00
65       90     0.0   2525   287.75
66      113     0.0   3232     0.00
67       43   765.0   5903  1820.00
68       96     0.0   4890   204.00
69      116     0.0   1388     0.00
70       80     0.0   1250   296.25
71      107     0.0   3529     0.00
72       87     0.0   2554   447.75
73       83     0.0   3832   463.50
74       65   195.0 142574  4498.25
75      111     0.0  20603     0.00
76       46  1054.5  28857  4191.25

There are 134 samples in total, so the taxon with the most zeros (120) appears in only 14 of 134 samples (10.4%).

Is this just too many zeros?

https://support.bioconductor.org/p/89067/#89075

ADD COMMENT • link updated 4.7 years ago by Michael Love 41k • written 4.7 years ago by ariel ▴ 20

score 2 · Accepted Answer · 2019-08-26

Michael Love 41k

@mikelove

I'm not so certain that DESeq2 is the best method for 16S analysis in general, as I'm not sure the assumptions are always met. Here for example, how can there be sufficient genes for estimating the size factors if there is not a single gene with a positive count across all samples? I haven't done any testing or development of DESeq2 for microbiome data and haven't analyzed this type of data myself.

Despite these caveats, I will mention that size factor estimation with type="poscounts" can handle the problem you describe above. Take a look at the man page for estimateSizeFactors().
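As a rough illustration of the "poscounts" idea, here is a base-R sketch based on the man page's description of a modified geometric mean taken over the non-zero counts. The toy matrix and helper name `geo_mean_nz` are hypothetical, and this is an approximation of the documented behavior rather than DESeq2's actual implementation.

```r
# Toy data (hypothetical): every taxon has at least one zero.
counts <- matrix(
  c(10,  0, 30, 40,
     0, 20, 15,  5,
     7,  9,  0, 12),
  nrow = 3, byrow = TRUE
)

# Modified geometric mean: n-th root of the product of the non-zero counts,
# where n is the total number of samples (zeros still count toward n).
geo_mean_nz <- function(x) {
  if (all(x == 0)) return(0)
  exp(sum(log(x[x > 0])) / length(x))
}
geo_means <- apply(counts, 1, geo_mean_nz)

# Per-sample size factor: median ratio of count to geometric mean,
# using only taxa with a positive count in that sample.
size_factors <- apply(counts, 2, function(cnts) {
  keep <- cnts > 0 & geo_means > 0
  exp(median(log(cnts[keep]) - log(geo_means[keep])))
})

# Rescale so the size factors have a geometric mean of 1.
size_factors <- size_factors / exp(mean(log(size_factors)))
print(size_factors)

# With DESeq2 installed and a DESeqDataSet named `dds` (assumed), the call is:
# dds <- estimateSizeFactors(dds, type = "poscounts")
```

Because the zeros are skipped rather than logged, every sample with at least one positive count gets a finite size factor.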

ADD COMMENT • link 4.7 years ago Michael Love 41k

For the record, I 100% agree with you about 16S. However, DESeq2 is being used for 16S data in publications, and I'm an analyst, not the PI, so DESeq2 it is for the time being.

ADD REPLY • link 4.7 years ago ariel ▴ 20