Question

goseq: Wallenius Vs Fisher's Noncentral Hypergeometric Distribution

0

Entering edit mode

Steve Pederson ▴ 170

@steve-pederson-3490

Last seen 24 months ago

Australia

I've found myself in the situation where goseq fails with some relatively extreme numbers (derived from ChIP-Seq data with extreme changes in the IP target) and have tracked it down to the calls to pWNCHypergeo()

library(BiasedUrn)
pWNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001)

Which yields

Error in pWNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001) : Inconsistency. mean = 643, lower limit = 637, upper limit = 642

Applying Fisher's Noncentral Hypergeometric returns the answer I'd naively expect

pFNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001)

[1] 1

I must confess my ignorance and lack of expertise with Noncentral Hypergeometric Distributions, however after my very shallow reading of the differences on Wikipedia, Fisher's Distribution makes conceptual sense in the context of gene testing, i.e. draws are not consecutive/dependent.

I'm in the process of implementing a fairly complex automated pipeline and given this kind of edge case has just appeared and caused a pipeline failure, my question is simply why the authors chose Wallenius' above Fisher's distribution, and would they caution against using Fisher's distribution?

goseq • 831 views

ADD COMMENT • link 24 months ago Steve Pederson ▴ 170

score 0 · Answer 1 · 2022-04-29

Thanks for isolating the error to a simple use case. A few comments:

The error in your example isn't an issue with the Wallenius distribution per se, rather it seems to a programming error in the pWNCHypergeo() function.
If we substituted Fisher's NCH distribution for Wallenius in goseq, there is no guarantee that it would not also fail for some input values.
I don't remember our reasoning 15 years ago when we created the goseq package, but the Wallenius distribution still seems more natural to me. We posit a fixed number of DE genes (draws) and the genes are ordered by p-value, so considering them to be drawn in sequence is perfectly natural. By contrast, the description of Fisher's NCH distribution doesn't entirely make sense to me.
I doubt that there is much difference between the two distributions in practice, but goseq has worked successfully for thousands of analyses over the past 10 years and changing it's basic statistical method at this stage would not be ideal.
The scenario of your example truly is very extreme. Your calculation assumes that 99.955% of all genes are differentially bound. For most types of analyses, that sort of scenario is impossible. If it did arise in practice, then you would already know in advance that any gene ontology analysis is pointless.
ChIP-seq analyses are not usually affected by gene length biases. I don't know the context of your analysis, but have you considered whether this is an appropriate application for goseq?
I have written to the BiasedUrn author to see if he can fix this error. If not, I imagine that we can make a fix to goseq to intercept the error.