goseq: Wallenius Vs Fisher's Noncentral Hypergeometric Distribution
1
0
Entering edit mode
@steve-pederson-3490
Last seen 24 months ago
Australia

I've found myself in the situation where goseq fails with some relatively extreme numbers (derived from ChIP-Seq data with extreme changes in the IP target) and have tracked it down to the calls to pWNCHypergeo()

library(BiasedUrn)
pWNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001)

Which yields

Error in pWNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001) : Inconsistency. mean = 643, lower limit = 637, upper limit = 642

Applying Fisher's Noncentral Hypergeometric returns the answer I'd naively expect

pFNCHypergeo(x = 643, m1 = 643, m2 = 16976, n = 17611, odds = 1.00001)

[1] 1

I must confess my ignorance and lack of expertise with Noncentral Hypergeometric Distributions, however after my very shallow reading of the differences on Wikipedia, Fisher's Distribution makes conceptual sense in the context of gene testing, i.e. draws are not consecutive/dependent.

I'm in the process of implementing a fairly complex automated pipeline and given this kind of edge case has just appeared and caused a pipeline failure, my question is simply why the authors chose Wallenius' above Fisher's distribution, and would they caution against using Fisher's distribution?

goseq • 831 views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

Thanks for isolating the error to a simple use case. A few comments:

  • The error in your example isn't an issue with the Wallenius distribution per se, rather it seems to a programming error in the pWNCHypergeo() function.
  • If we substituted Fisher's NCH distribution for Wallenius in goseq, there is no guarantee that it would not also fail for some input values.
  • I don't remember our reasoning 15 years ago when we created the goseq package, but the Wallenius distribution still seems more natural to me. We posit a fixed number of DE genes (draws) and the genes are ordered by p-value, so considering them to be drawn in sequence is perfectly natural. By contrast, the description of Fisher's NCH distribution doesn't entirely make sense to me.
  • I doubt that there is much difference between the two distributions in practice, but goseq has worked successfully for thousands of analyses over the past 10 years and changing it's basic statistical method at this stage would not be ideal.
  • The scenario of your example truly is very extreme. Your calculation assumes that 99.955% of all genes are differentially bound. For most types of analyses, that sort of scenario is impossible. If it did arise in practice, then you would already know in advance that any gene ontology analysis is pointless.
  • ChIP-seq analyses are not usually affected by gene length biases. I don't know the context of your analysis, but have you considered whether this is an appropriate application for goseq?
  • I have written to the BiasedUrn author to see if he can fix this error. If not, I imagine that we can make a fix to goseq to intercept the error.
ADD COMMENT
0
Entering edit mode

Thanks for the swift and informative reply Gordon. Taking a step back with your clarity was most helpful. I'd been wrangling more with "How do I fix this right here & now so I can have results immediately" than the more appropriate larger questions you've addressed.

For context, this is a standardised workflow comparing multiple ChIP targets, one of which is mostly cytoplasmic in the control condition, then mostly nuclear after treatment. Having a robust & standardised workflow I find best for integration across multiple targets. In reality these results would be ignored for this specific target, but would be generally informative for the other ChIP targets. However given the standardised nature of the pipeline, all targets are processed in the same way, and the possibility of failure during extreme cases exists. The more informative steps for this particular target are when assessing the combined behaviour with other targets, hence it's inclusion during standardised processing.

Your comments regarding gene length bias are also most helpful. I'll give that some thought and will dig more deeply through these datasets. Probability weight functions in early explorations I did showed this to be prudent decision, but I may need to revisit.

Pleased to hear there may be a fix too.

Thanks again

ADD REPLY

Login before adding your answer.

Traffic: 970 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6