Question: DropletUtils' emptyDrops() fails with non-integer counts
0
8 weeks ago by
jmanning0
jmanning0 wrote:

Hi!

I'm using the emptyDrops() function from DropletUtils in a possibly atypical manner, in that I'm running Alevin with relaxed parameters and post-filtering with emptyDrops() (Alevin's own empty drop detection not being all that robust).

Unfortunately the emptyDrops() function (DropletUtils v1.4.2 tested) does not do well with non integer counts as produced by Alevin and friends:

empty <- emptyDrops(counts(single_cell_experiment))
There were 36 warnings (use warnings() to see them)


You get warnings like:

1: In optimize(LOGLIK, interval = interval, maximum = TRUE) :
NA/Inf replaced by maximum positive value
2: In optimize(LOGLIK, interval = interval, maximum = TRUE) :
NA/Inf replaced by maximum positive value
3: In optimize(LOGLIK, interval = interval, maximum = TRUE) :
NA/Inf replaced by maximum positive value


... which come from the .estimate_alpha() function due to the presence of NA values, which come from running edgeR::goodTuringProportions() on non-integer values. The errors mean that you don't get a good alpha, and you don't get sensible p values/ FDRs:

> head(empty)
DataFrame with 6 rows and 5 columns
Total   LogProb    PValue   Limited       FDR
<integer> <numeric> <numeric> <logical> <numeric>
1       204        NA         1     FALSE         1
2       281        NA         1     FALSE         1
3       390        NA         1     FALSE         1
4       547        NA         1     FALSE         1
5       283        NA         1     FALSE         1
6      3098        NA         1     FALSE         0


Rounding the values seems to address the issue:

testmat <- counts(single_cell_experiment)
testmat@x <- round(testmat@x)
empty <- emptyDrops(testmat)


... and the outputs then look more sensible:

> head(empty)
DataFrame with 6 rows and 5 columns
Total           LogProb            PValue   Limited                  FDR
<integer>         <numeric>         <numeric> <logical>            <numeric>
1       195 -929.398506649979 9.99900009999e-05      TRUE 0.000112840447556976
2       271  -1290.7519266567 9.99900009999e-05      TRUE 0.000112840447556976
3       377 -1700.71801792383 9.99900009999e-05      TRUE 0.000112840447556976
4       526  -2338.8659368385 9.99900009999e-05      TRUE 0.000112840447556976
5       266 -1159.62654569938 9.99900009999e-05      TRUE 0.000112840447556976
6      3007 -15306.8305754847 9.99900009999e-05      TRUE 0.000112840447556976


Could the integer check maybe be made an integral part of the testEmptyDrops() function? Or is there a more sensible workaround?

Thanks,

Jon

dropletutils • 124 views
modified 8 weeks ago by Aaron Lun25k • written 8 weeks ago by jmanning0
Answer: DropletUtils' emptyDrops() fails with non-integer counts
0
8 weeks ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

This is an interesting question, and it made me think about how emptyDrops() might integrate with alevin.

tl;dr Someone, somewhere will have to coerce the data into discrete counts.

The Good-Turing algorithm depends on counts. It literally counts the frequency of counts, so if the counts aren't discrete integers, it gets confused in both theory and implementation. One could evade this somewhat by rounding the ambient profile after summing across all low-count droplets, so that that whatever goes into goodTuringProportions() is still a count vector. This will work (i.e., not throw an error) but I don't know whether it is correct.

This leads to the second problem in that the multinomial calculations also depend on counts. Perhaps the likelihood calculations could be generalized to non-count data, but I don't know whether that would be right or not. And even then... how do I do Monte Carlo simulations for a multinomial distribution that has a fractional total count? I have to simulate data from a multinomial distribution - how do I simulate data for a fractional library size?

The best I can do is to coerce things to integers inside emptyDrops() (perhaps it does that already, I can't remember). But there must be coercion, as it won't work otherwise.

Thanks for that detailed response Aaron, and for the work you've done on this more generally.

I think that integer coercion inside emptyDrops() or testEmptyDrops() would be a good immediate solution. Right now, the 'clever' bits of the empty droplet detection fail without triggering an error (just those warnings), leaving you just with the non-empty results from knee detection or providing a 'retain'. I don't think it would be obvious to many users that 1) this has happened or 2) that it's because they supplied non-integer counts.

The Bioc-devel version (1.5.6) has been updated to round values at the very start of emptyDrops(). In the end, I decided that rounding everything at the start was the safest thing to do, as this would guarantee it would play nice (in practice and theory) with the remaining operations. I'm open to being convinced otherwise but it had better be good.