I've found myself in the situation where goseq fails with some relatively extreme numbers (derived from ChIP-Seq data with extreme changes in the IP target) and have tracked it down to the calls to pWNCHypergeo()
library(BiasedUrn)
pWNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001)
Which yields
Error in pWNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001) :
Inconsistency. mean = 643, lower limit = 637, upper limit = 642
Applying Fisher's Noncentral Hypergeometric returns the answer I'd naively expect
pFNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001)
[1] 1
I must confess my ignorance and lack of expertise with Noncentral Hypergeometric Distributions, however after my very shallow reading of the differences on Wikipedia, Fisher's Distribution makes conceptual sense in the context of gene testing, i.e. draws are not consecutive/dependent.
I'm in the process of implementing a fairly complex automated pipeline and given this kind of edge case has just appeared and caused a pipeline failure, my question is simply why the authors chose Wallenius' above Fisher's distribution, and would they caution against using Fisher's distribution?
Thanks for the swift and informative reply Gordon. Taking a step back with your clarity was most helpful. I'd been wrangling more with "How do I fix this right here & now so I can have results immediately" than the more appropriate larger questions you've addressed.
For context, this is a standardised workflow comparing multiple ChIP targets, one of which is mostly cytoplasmic in the control condition, then mostly nuclear after treatment. Having a robust & standardised workflow I find best for integration across multiple targets. In reality these results would be ignored for this specific target, but would be generally informative for the other ChIP targets. However given the standardised nature of the pipeline, all targets are processed in the same way, and the possibility of failure during extreme cases exists. The more informative steps for this particular target are when assessing the combined behaviour with other targets, hence it's inclusion during standardised processing.
Your comments regarding gene length bias are also most helpful. I'll give that some thought and will dig more deeply through these datasets. Probability weight functions in early explorations I did showed this to be prudent decision, but I may need to revisit.
Pleased to hear there may be a fix too.
Thanks again