I'm having some conceptual challenges ensuring that I have properly constructed a hypergeometric test in R. I would appreciate some feedback.
For some background, we have carried out transcriptomic analyses and determined a set of differentially expressed genes (DEGs) between our experimental conditions. We wish to determine whether genes associated with a particular syndrome are overrepresented in the set of DEGs. We've obtained a set of curated genes associated with the syndrome of interest.
I have the following:
- Set of differentially expressed gene IDs = `DEGs` (character vector)
- Gene IDs for all genes detected in the study (i.e., those that are DEGs + those that are not DEGs) = `universe` (character vector)
- Set of gene IDs associated with the syndrome of interest, filtered to only include those detected in the `universe` set = `syndrome_genes` (character vector)
- Overlap between `DEGs` and `syndrome_genes`
I see the hypergeometric test being set up as follows:
    ### Formulation 1 ###
    overlap <- intersect(DEGs, syndrome_genes)
    result <- phyper(q = length(overlap) - 1,
                     m = length(syndrome_genes),
                     n = length(universe) - length(syndrome_genes),
                     k = length(DEGs),
                     lower.tail = FALSE)
I've formulated the test in this way because, using the classical urn terminology of `phyper`, I see the `DEGs` as being the number of balls sampled from the urn, the number of white balls in the urn being the `syndrome_genes`, the `overlap` being the number of white balls drawn during the sampling of `DEGs`, and the number of black balls in the urn being the genes in the `universe` set that are not genes associated with the syndrome of interest.
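To make the urn mapping concrete, here is a minimal sketch on made-up data. The gene IDs and set sizes are hypothetical and not from our study. It computes the Formulation 1 p-value and cross-checks it against a one-sided Fisher's exact test on the equivalent 2x2 table, which I would expect to give the same tail probability.

```r
# Toy data: hypothetical gene IDs, chosen only to illustrate the mapping
universe       <- paste0("gene", 1:1000)   # all detected genes
DEGs           <- universe[1:100]          # 100 "DEGs" (balls drawn)
syndrome_genes <- universe[91:140]         # 50 "syndrome genes" (white balls)
overlap        <- intersect(DEGs, syndrome_genes)  # white balls drawn: 10

# Urn view: P(drawing at least length(overlap) white balls)
p_urn <- phyper(q = length(overlap) - 1,
                m = length(syndrome_genes),
                n = length(universe) - length(syndrome_genes),
                k = length(DEGs),
                lower.tail = FALSE)

# Same question as a 2x2 contingency table
n11 <- length(overlap)                     # DEG and syndrome gene
n12 <- length(DEGs) - n11                  # DEG, not syndrome gene
n21 <- length(syndrome_genes) - n11        # syndrome gene, not DEG
n22 <- length(universe) - n11 - n12 - n21  # neither
tab <- matrix(c(n11, n12, n21, n22), nrow = 2)

p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(urn = p_urn, fisher = p_fisher))
```

If the two values agree on the real data as well, that would be some reassurance that the urn roles are assigned consistently.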
However, a collaborator formulates the test differently:
    ### Formulation 2 ###
    overlap <- intersect(DEGs, syndrome_genes)
    result <- phyper(q = length(overlap),
                     m = length(DEGs),
                     n = length(universe) - length(DEGs),
                     k = length(syndrome_genes),
                     lower.tail = FALSE)
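As a quick numerical comparison of the two formulations, here is a sketch using made-up counts (`N`, `n_deg`, `n_syn`, and `x` are hypothetical values, not our data). As far as I can tell, swapping which set plays the role of `m` and which plays `k` does not change the probability, so the two formulations seem to differ only in `q`: with `lower.tail = FALSE`, `phyper` returns P(X > q), so `q = overlap - 1` gives P(X >= overlap) while `q = overlap` gives P(X > overlap).

```r
# Hypothetical counts, for illustration only
N     <- 1000  # genes in the universe
n_deg <- 100   # number of DEGs
n_syn <- 50    # number of syndrome genes
x     <- 10    # size of the overlap

# Formulation 1: P(X >= x)
f1 <- phyper(x - 1, n_syn, N - n_syn, n_deg, lower.tail = FALSE)

# Formulation 2 as written: P(X > x)
f2 <- phyper(x, n_deg, N - n_deg, n_syn, lower.tail = FALSE)

# Formulation 2 with q shifted to x - 1: P(X >= x), roles of m and k swapped
f2_shifted <- phyper(x - 1, n_deg, N - n_deg, n_syn, lower.tail = FALSE)

# Probability of observing exactly x in the overlap
p_at_x <- dhyper(x, n_syn, N - n_syn, n_deg)

print(all.equal(f1, f2_shifted))   # swapping m and k alone changes nothing
print(all.equal(f1 - f2, p_at_x))  # the gap between f1 and f2 is P(X = x)
```

If that reading is right, Formulation 2 excludes the observed overlap from the tail and so yields a smaller p-value than Formulation 1.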
Which of these formulations is correct? Thanks in advance.