Properly constructing a hypergeometric test
@charlesfoster-17652
Australia

I'm having some conceptual challenges ensuring that I have properly constructed a hypergeometric test in R. I would appreciate some feedback.

For some background, we have carried out transcriptomic analyses and determined a set of differentially expressed genes (DEGs) between our experimental conditions. We wish to determine whether genes associated with a particular syndrome are overrepresented in the set of DEGs. We've obtained a curated set of genes associated with the syndrome of interest.

I have the following:

  • Set of differentially expressed gene IDs = DEGs (character vector)
  • Gene IDs for all genes detected in the study (i.e., those that are DEGs + those that are not DEGs) = universe (character vector)
  • Set of gene IDs associated with the syndrome of interest, filtered to only include those detected in the universe set = syndrome_genes (character vector)
  • Overlap between DEGs and syndrome_genes

I see the hypergeometric test being set up as follows:

### Formulation 1 ###
overlap <- intersect(DEGs, syndrome_genes)
result <- phyper(q = length(overlap) - 1,
                 m = length(syndrome_genes),
                 n = length(universe) - length(syndrome_genes),
                 k = length(DEGs),
                 lower.tail = FALSE)

I've formulated the test in this way because of how I map the problem onto the classical urn terminology of phyper. The DEGs are the balls sampled from the urn, the syndrome_genes are the white balls in the urn, and the overlap is the number of white balls drawn when sampling the DEGs. The black balls in the urn are the genes in the universe set that are not associated with the syndrome of interest.
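To make the urn mapping concrete, here is a minimal sketch with made-up gene IDs (the toy universe, syndrome_genes and DEGs below are assumptions for illustration, not data from the study). It computes the Formulation 1 p-value and compares it with a one-sided Fisher's exact test on the equivalent 2x2 table, which should agree if the mapping is as described.

```r
# Toy data: every gene ID here is invented for illustration only.
universe       <- paste0("gene", 1:1000)
syndrome_genes <- universe[1:50]                        # white balls in the urn
DEGs           <- c(universe[1:10], universe[101:190])  # 100 balls drawn, 10 of them white

overlap <- intersect(DEGs, syndrome_genes)

# P[X >= length(overlap)] under the urn model of Formulation 1
p_hyper <- phyper(q = length(overlap) - 1,
                  m = length(syndrome_genes),
                  n = length(universe) - length(syndrome_genes),
                  k = length(DEGs),
                  lower.tail = FALSE)

# The same question as a 2x2 contingency table (DEG status x syndrome status)
in_deg      <- factor(universe %in% DEGs,           levels = c(TRUE, FALSE))
in_syndrome <- factor(universe %in% syndrome_genes, levels = c(TRUE, FALSE))
tab <- table(DEG = in_deg, syndrome = in_syndrome)

# One-sided Fisher's exact test for enrichment
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# TRUE if the two approaches give the same p-value
print(all.equal(p_hyper, p_fisher))
```

Agreement between the two p-values is only a consistency check on the bookkeeping; it says nothing about whether the universe itself was chosen sensibly.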

However, a collaborator formulates the test differently:

### Formulation 2 ###
overlap <- intersect(DEGs, syndrome_genes)
result <- phyper(q = length(overlap),
                 m = length(DEGs),
                 n = length(universe) - length(DEGs),
                 k = length(syndrome_genes),
                 lower.tail = FALSE)

Which of these formulations is correct? Thanks in advance.

overrepresentation overlap R hypergeometric • 2.9k views
Robert Castelo ★ 3.4k
@rcastelo
Barcelona/Universitat Pompeu Fabra

To me, yours is correct; your collaborator's does not adhere to the urn model used to specify the parameters in the help page of the phyper() function. For instance, to get the one-tailed probability you need to set lower.tail=FALSE, but according to the help page (parameter lower.tail) that gives you P[X > x]. Because you actually want P[X >= x], you need to pass length(overlap)-1 as the first parameter, as you rightly do.
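A small numerical sketch of the tail convention, using illustrative numbers that are not from the study (N, m, k and x are assumed values). It checks that q = x - 1 with lower.tail=FALSE gives P[X >= x] by summing dhyper() directly. As a side note, the hypergeometric distribution is symmetric in m and k, so swapping the roles of the two gene sets should not change the result as long as q is handled the same way.

```r
# Illustrative numbers only (assumptions, not study data)
N <- 1000   # genes in the universe
m <- 50     # syndrome genes (white balls)
k <- 100    # DEGs (balls drawn)
x <- 10     # observed overlap (white balls drawn)

# lower.tail = FALSE returns P[X > q], so q = x - 1 yields P[X >= x]
p_ge <- phyper(x - 1, m, N - m, k, lower.tail = FALSE)
p_gt <- phyper(x,     m, N - m, k, lower.tail = FALSE)

# Cross-check: sum the point probabilities from x up to the largest possible overlap
p_ge_sum <- sum(dhyper(x:min(m, k), m, N - m, k))
print(all.equal(p_ge, p_ge_sum))

# The two tails differ by exactly P[X = x]
print(all.equal(p_ge - p_gt, dhyper(x, m, N - m, k)))

# Swapping which set plays "white balls" and which plays "balls drawn"
p_ge_swapped <- phyper(x - 1, k, N - k, m, lower.tail = FALSE)
print(all.equal(p_ge, p_ge_swapped))
```

Omitting the -1 drops the P[X = x] term and so makes the p-value smaller than it should be for an "at least x" test.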

cheers,

robert.

ATpoint ★ 4.6k
@atpoint-13662
Germany

Here is an answer from (I think) the clusterProfiler author that I used as a guideline: https://www.biostars.org/p/485827/#9483835

The key point is to define the background properly. In my opinion, that would be all genes in your analysis that have any annotation in the database you enrich against.
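A rough sketch of what that could look like in R. The vectors detected_genes and annotated_genes are hypothetical names, and all gene IDs are made up; the idea is just to intersect everything with the background before counting.

```r
# Hypothetical inputs: all IDs are invented for illustration.
detected_genes  <- paste0("gene", 1:1000)            # genes detected in the study
annotated_genes <- paste0("gene", seq(1, 1500, 2))   # genes with any annotation in the database
DEGs            <- paste0("gene", 1:100)
syndrome_genes  <- paste0("gene", seq(1, 400, 8))

# Background = detected genes that also have at least one annotation
universe <- intersect(detected_genes, annotated_genes)

# Restrict both gene sets to the background before counting anything
DEGs_bg     <- intersect(DEGs, universe)
syndrome_bg <- intersect(syndrome_genes, universe)
overlap     <- intersect(DEGs_bg, syndrome_bg)

# Enrichment p-value, P[X >= length(overlap)], on the restricted sets
p <- phyper(q = length(overlap) - 1,
            m = length(syndrome_bg),
            n = length(universe) - length(syndrome_bg),
            k = length(DEGs_bg),
            lower.tail = FALSE)
print(p)
```

Note that restricting the universe changes both n and k, so the p-value can move in either direction compared with using all detected genes.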
