I'm getting the following error:
estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, :
  every gene contains at least one zero, cannot compute log geometric means
However, I was under the impression that DESeq2 could handle zeros without padding.
I am doing 16S sequencing analysis. Prior to running DESeq2, I filter the counts as follows:
- If a taxon (feature, equivalent to a gene) has a relative abundance below 0.002 in a sample, set its count in that sample to zero. That is, assume very low counts are noise.
- Remove any taxon that appears in fewer than 10% of all samples, since such a taxon might be a contaminant in just a few samples. This is very different from RNA-Seq, where a rare gene could be important. With gut bacteria, we all have mostly the same array of taxa, just in different amounts.
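For reference, the two filtering steps can be sketched roughly as below. This is a minimal sketch, not my exact pipeline: `filter_taxa`, `counts_mat`, and the toy data are hypothetical names and values, and it assumes a taxa-by-samples count matrix where every sample has a non-zero total and relative abundance is computed against the unfiltered per-sample total.

```r
# Hypothetical helper illustrating the two filtering steps described above.
# counts_mat: numeric matrix, rows = taxa, columns = samples (assumed layout).
filter_taxa <- function(counts_mat, min_rel_abund = 0.002, min_prevalence = 0.10) {
  # Relative abundance of each taxon within each sample (each column sums to 1).
  rel_abund <- sweep(counts_mat, 2, colSums(counts_mat), "/")

  # Step 1: treat very low relative abundances as noise and zero those counts.
  counts_mat[rel_abund < min_rel_abund] <- 0

  # Step 2: keep only taxa that are non-zero in at least min_prevalence of samples.
  prevalence <- rowMeans(counts_mat > 0)
  counts_mat[prevalence >= min_prevalence, , drop = FALSE]
}

# Toy data (made up): 3 taxa, 10 samples.
toy <- rbind(
  taxonA = rep(1000, 10),            # abundant everywhere
  taxonB = rep(1, 10),               # always below the 0.002 threshold
  taxonC = c(500, 500, rep(0, 8))    # present in 2 of 10 samples
)

filtered <- filter_taxa(toy)
rownames(filtered)
```

In this toy case taxonB is zeroed out in every sample by the first step and then dropped by the second, while taxonC survives because it is present in 20% of samples.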
After this filtering, I have the following per-row (per-taxon) summary, including the number of zeros in each row:
> genus_taxa_counts %>% select(num_zero, rowmed, rowmax, rowiqr)
num_zero rowmed rowmax rowiqr
1 73 0.0 5943 776.50
2 50 415.5 4782 1106.00
3 120 0.0 1315 0.00
4 2 19626.0 85948 16946.75
5 77 0.0 6190 797.25
6 117 0.0 1537 0.00
7 99 0.0 2241 128.25
8 75 0.0 4950 342.00
9 101 0.0 7263 0.00
10 119 0.0 18411 0.00
11 114 0.0 59479 0.00
12 8 1672.0 18372 1948.00
13 24 994.0 68267 1410.00
14 75 0.0 9171 414.25
15 50 415.5 10773 1174.50
16 95 0.0 2103 203.00
17 103 0.0 2264 0.00
18 16 1211.0 10253 2198.25
19 16 1172.5 17777 2097.00
20 1 4025.5 22392 3910.75
21 112 0.0 1398 0.00
22 103 0.0 888 0.00
23 108 0.0 3710 0.00
24 69 0.0 1937 364.50
25 33 540.0 15660 946.50
26 114 0.0 1470 0.00
27 35 615.5 4251 1240.25
28 16 712.0 10736 1028.00
29 25 986.5 7542 1564.50
30 120 0.0 424 0.00
31 65 143.0 1850 417.25
32 37 417.0 9197 1078.00
33 108 0.0 1130 0.00
34 105 0.0 1461 0.00
35 115 0.0 392 0.00
36 111 0.0 664 0.00
37 25 657.5 13927 1391.25
38 105 0.0 2017 0.00
39 100 0.0 2489 124.50
40 106 0.0 1156 0.00
41 81 0.0 1224 248.00
42 120 0.0 2782 0.00
43 13 4224.5 25059 5618.00
44 88 0.0 2355 254.25
45 80 0.0 1221 233.00
46 86 0.0 1450 256.50
47 54 238.0 1923 517.25
48 112 0.0 1687 0.00
49 34 414.0 5126 750.50
50 73 0.0 11691 838.00
51 45 241.5 2160 458.75
52 56 270.5 3683 535.50
53 35 1643.0 18562 3046.25
54 83 0.0 3640 288.75
55 44 365.0 4675 851.75
56 92 0.0 2664 210.00
57 69 0.0 1056 326.25
58 68 0.0 8789 1162.25
59 40 600.0 5797 1462.50
60 43 1114.5 12190 2504.50
61 26 1167.0 10098 2202.75
62 106 0.0 4431 0.00
63 116 0.0 6652 0.00
64 107 0.0 1298 0.00
65 90 0.0 2525 287.75
66 113 0.0 3232 0.00
67 43 765.0 5903 1820.00
68 96 0.0 4890 204.00
69 116 0.0 1388 0.00
70 80 0.0 1250 296.25
71 107 0.0 3529 0.00
72 87 0.0 2554 447.75
73 83 0.0 3832 463.50
74 65 195.0 142574 4498.25
75 111 0.0 20603 0.00
76 46 1054.5 28857 4191.25
There are 134 total samples, so the taxon with the most zeros (120) appears in only 14 out of 134 samples, or 10.4%.
Is this just too many zeros?
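To make the question concrete, here is a small base-R sketch of what I understand the error condition to be: the log geometric mean of a row is `-Inf` as soon as that row contains a single zero, so the check fails when no row is zero-free. The toy matrix and variable names are hypothetical; only the 134 and 120 come from my data. In the table above the smallest `num_zero` is 1 (row 20), so no taxon is completely free of zeros.

```r
# Toy taxa-by-samples matrix (made-up values); every row has at least one zero.
toy <- rbind(
  taxonA = c(0, 12, 30, 8),
  taxonB = c(5, 0, 0, 9),
  taxonC = c(7, 3, 0, 0)
)

# Number of zeros per taxon, analogous to the num_zero column above.
num_zero <- rowSums(toy == 0)

# Log geometric mean per taxon: log(0) is -Inf, so any zero makes the row mean -Inf.
log_geo_means <- rowMeans(log(toy))

# TRUE here means there is no zero-free taxon to base the size factors on.
every_taxon_has_zero <- all(is.infinite(log_geo_means))

# Prevalence of the sparsest taxon in my data: 120 zeros out of 134 samples.
n_samples <- 134
max_num_zero <- 120
min_prevalence_pct <- 100 * (n_samples - max_num_zero) / n_samples
```

If that reading is right, the problem is less the overall sparsity than the absence of any zero-free taxon. I believe `estimateSizeFactors()` in DESeq2 accepts `type = "poscounts"` for data like this, but I have not yet checked how it behaves on these counts.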
For the record, I 100% agree with you about 16S. However, it is being used in publications, and I'm an analyst, not the PI, so DESeq2 it is for the time being.