Properly constructing a hypergeometric test

I'm having some conceptual challenges ensuring that I have properly constructed a hypergeometric test in R. I would appreciate some feedback.

For some background, we have carried out transcriptomic analyses and determined a set of differentially expressed genes (DEGs) between our experimental conditions. We wish to determine whether genes associated with a particular syndrome are overrepresented in the set of DEGs. We've obtained a set of curated genes associated with the syndrome of interest.

I have the following:

  • Set of differentially expressed gene IDs = DEGs (character vector)
  • Gene IDs for all genes detected in the study (i.e., those that are DEGs + those that are not DEGs) = universe (character vector)
  • Set of gene IDs associated with the syndrome of interest, filtered to only include those detected in the universe set = syndrome_genes (character vector)
  • Overlap between DEGs and syndrome_genes

I see the hypergeometric test being set up as follows:

### Formulation 1 ###
overlap <- intersect(DEGs, syndrome_genes)
result <- phyper(q = length(overlap) - 1,
                 m = length(syndrome_genes),
                 n = length(universe) - length(syndrome_genes),
                 k = length(DEGs),
                 lower.tail = FALSE)

I've formulated the test this way because, in the classical urn terminology of phyper, I see the DEGs as the balls sampled from the urn, the syndrome_genes as the white balls in the urn, the overlap as the white balls drawn when sampling the DEGs, and the genes in the universe set that are not associated with the syndrome of interest as the black balls in the urn.
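As a sanity check on this urn mapping, here is a minimal sketch with made-up gene IDs. The IDs and set sizes are hypothetical and chosen only for illustration. It compares the phyper() call with a one-sided Fisher's exact test on the corresponding 2x2 table, which should give the same p-value under these assumptions.

```r
# Toy gene sets (hypothetical IDs and sizes, for illustration only)
universe       <- paste0("gene", 1:1000)
DEGs           <- universe[1:100]
syndrome_genes <- universe[c(91:100, 501:540)]  # 50 genes, 10 of them DEGs

overlap <- intersect(DEGs, syndrome_genes)

# Urn mapping: white balls = syndrome genes, black balls = all other genes,
# balls drawn = DEGs, white balls drawn = overlap.
p_phyper <- phyper(q = length(overlap) - 1,
                   m = length(syndrome_genes),
                   n = length(universe) - length(syndrome_genes),
                   k = length(DEGs),
                   lower.tail = FALSE)

# The same question posed as a one-sided Fisher's exact test
tab <- table(DEG      = universe %in% DEGs,
             syndrome = universe %in% syndrome_genes)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Should be TRUE: both compute P[X >= observed overlap]
all.equal(p_phyper, p_fisher)
```

If the two values disagree on real data, the usual suspects are gene IDs missing from the universe or an off-by-one in q.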

However, a collaborator formulates the test differently:

### Formulation 2 ###
overlap <- intersect(DEGs, syndrome_genes)
result <- phyper(q = length(overlap),
                 m = length(DEGs),
                 n = length(universe) - length(DEGs),
                 k = length(syndrome_genes),
                 lower.tail = FALSE)

Which of these formulations is correct? Thanks in advance.

ATpoint ★ 4.2k

Here is an answer from (I think) the clusterProfiler author that I used as a guideline:

The key point is to define the background properly. In my opinion, that would be all genes in your analysis that have any annotation in the database you are testing for enrichment against.
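One way this could look in practice is sketched below. All object names (detected_genes, annotated_genes, DEGs_raw, syndrome_raw) and their contents are hypothetical. Restricting both gene sets to the background before counting is my reading of the advice, not something taken from a specific package.

```r
# Hypothetical inputs, for illustration only
detected_genes  <- paste0("gene", 1:20)            # all genes detected in the study
annotated_genes <- paste0("gene", seq(2, 30, 2))   # genes with any annotation in the database
DEGs_raw        <- paste0("gene", c(1, 2, 4, 7, 8))
syndrome_raw    <- paste0("gene", c(2, 8, 10, 26))

# Background: detected genes that also carry at least one annotation
universe <- intersect(detected_genes, annotated_genes)

# Restrict both gene sets to that background before counting overlaps
DEGs           <- intersect(DEGs_raw, universe)
syndrome_genes <- intersect(syndrome_raw, universe)
overlap        <- intersect(DEGs, syndrome_genes)
```

With the sets filtered this way, every count passed to phyper() refers to the same background.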

Robert Castelo ★ 3.3k
Barcelona/Universitat Pompeu Fabra

To me, yours is correct; your collaborator's version does not adhere to the urn model used to specify the parameters in the help page of the phyper() function. For instance, to get the one-tailed probability you need to set lower.tail=FALSE, but according to the help page (parameter lower.tail) this returns P[X > x]. Because you actually want P[X >= x], you need to pass length(overlap)-1 as the first parameter, as you rightly do.
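To make the off-by-one concrete, here is a small sketch with hypothetical counts (N, m, k and x are made up). It checks the phyper() tail against an explicit sum of dhyper() point probabilities. The last lines also illustrate a general property of the hypergeometric distribution: swapping which gene set plays the "white balls" and which the "balls drawn" should not change the value on its own, so in this toy setting the numerical difference comes from the first argument.

```r
# Hypothetical counts, chosen only for illustration
N <- 1000  # genes in the universe
m <- 50    # syndrome genes (white balls)
k <- 100   # DEGs (balls drawn)
x <- 10    # observed overlap (white balls drawn)

# lower.tail = FALSE returns P[X > q], so q = x leaves out the observed value
p_gt  <- phyper(x,     m, N - m, k, lower.tail = FALSE)
# q = x - 1 gives P[X >= x], the over-representation p-value
p_geq <- phyper(x - 1, m, N - m, k, lower.tail = FALSE)

# Cross-check against an explicit sum of point probabilities
p_sum <- sum(dhyper(x:min(m, k), m, N - m, k))
all.equal(p_geq, p_sum)

# The two tails differ by exactly P[X = x]
all.equal(p_geq - p_gt, dhyper(x, m, N - m, k))

# Swapping the roles of the two gene sets leaves the tail probability unchanged
p_swapped <- phyper(x - 1, k, N - k, m, lower.tail = FALSE)
all.equal(p_geq, p_swapped)
```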



