I'm having trouble confirming that I have set up a hypergeometric test correctly in R, and I would appreciate some feedback.
For some background, we have carried out transcriptomic analyses and determined a set of differentially expressed genes (DEGs) between our experimental conditions. We wish to determine whether genes associated with a particular syndrome are overrepresented in the set of DEGs. We've obtained a set of curated genes associated with the syndrome of interest.
I have the following:
- Set of differentially expressed gene IDs = DEGs (character vector)
- Gene IDs for all genes detected in the study (i.e., those that are DEGs + those that are not DEGs) = universe (character vector)
- Set of gene IDs associated with the syndrome of interest, filtered to only include those detected in the universe set = syndrome_genes (character vector)
- Overlap between DEGs and syndrome_genes
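To make the setup concrete, here is a minimal toy sketch of these inputs. The gene IDs, the set sizes, and the name curated (standing in for the unfiltered curated list) are made up for illustration and are not our real data.

```r
# Hypothetical toy data: gene IDs and set sizes are invented for illustration.
set.seed(1)
universe <- paste0("gene", 1:1000)            # all genes detected in the study
DEGs     <- sample(universe, 100)             # differentially expressed subset

# 'curated' is a hypothetical unfiltered list; "geneX" was not detected.
curated        <- c(sample(universe, 50), "geneX")
syndrome_genes <- intersect(curated, universe)  # keep only detected genes

overlap <- intersect(DEGs, syndrome_genes)

# Both sets must be subsets of the universe for the counts to be consistent.
stopifnot(all(DEGs %in% universe), all(syndrome_genes %in% universe))
```

The filtering step matters because any syndrome gene outside the universe would inflate the count passed as m without ever being drawable.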
I have set up the hypergeometric test as follows:
### Formulation 1 ###
overlap <- intersect(DEGs, syndrome_genes)
result <- phyper(q = length(overlap) - 1,
                 m = length(syndrome_genes),
                 n = length(universe) - length(syndrome_genes),
                 k = length(DEGs),
                 lower.tail = FALSE)
I've formulated the test in this way because, using the classical urn terminology of phyper, I see the mapping as follows. The white balls in the urn are the syndrome_genes, and the black balls are the genes in the universe set that are not associated with the syndrome of interest. The balls sampled from the urn are the DEGs, and the white balls drawn during that sampling are the overlap.
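As a sanity check on this urn mapping, here is a minimal sketch that compares the phyper call against a one-sided Fisher's exact test on the equivalent 2x2 table. The counts are hypothetical and chosen only for illustration.

```r
# Hypothetical counts, made up for illustration.
n_universe <- 1000   # all detected genes
n_syndrome <- 50     # white balls: syndrome genes in the universe
n_DEGs     <- 100    # number of balls drawn
n_overlap  <- 12     # white balls drawn

# Urn view: P(X >= n_overlap), since lower.tail = FALSE gives P(X > q).
p_urn <- phyper(q = n_overlap - 1,
                m = n_syndrome,
                n = n_universe - n_syndrome,
                k = n_DEGs,
                lower.tail = FALSE)

# Same counts as a 2x2 table (rows: syndrome yes/no, columns: DEG yes/no).
tab <- matrix(c(n_overlap,
                n_DEGs - n_overlap,
                n_syndrome - n_overlap,
                n_universe - n_syndrome - n_DEGs + n_overlap),
              nrow = 2,
              dimnames = list(syndrome = c("yes", "no"),
                              DEG      = c("yes", "no")))

p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The two p-values are expected to agree if the mapping is consistent.
all.equal(p_urn, p_fisher)
```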
However, a collaborator formulates the test differently:
### Formulation 2 ###
overlap <- intersect(DEGs, syndrome_genes)
result <- phyper(q = length(overlap),
                 m = length(DEGs),
                 n = length(universe) - length(DEGs),
                 k = length(syndrome_genes),
                 lower.tail = FALSE)
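To narrow down where the two formulations actually differ, here is a rough numerical comparison on hypothetical counts (not our data). It relies on two things: with lower.tail = FALSE, phyper returns P(X > q), and the hypergeometric distribution is unchanged when the roles of the white balls and the drawn balls are swapped.

```r
# Hypothetical counts, made up for illustration.
n_universe <- 1000
n_syndrome <- 50
n_DEGs     <- 100
n_overlap  <- 12

# Formulation 1: syndrome genes as white balls, q = overlap - 1.
f1 <- phyper(n_overlap - 1, n_syndrome, n_universe - n_syndrome, n_DEGs,
             lower.tail = FALSE)

# Formulation 2: DEGs as white balls, q = overlap.
f2 <- phyper(n_overlap, n_DEGs, n_universe - n_DEGs, n_syndrome,
             lower.tail = FALSE)

# Formulation 2 with q shifted to overlap - 1, to isolate the effect of q.
f2_shifted <- phyper(n_overlap - 1, n_DEGs, n_universe - n_DEGs, n_syndrome,
                     lower.tail = FALSE)

all.equal(f1, f2_shifted)  # swapping m and k alone should not change the result

# The remaining gap is the probability of observing exactly n_overlap.
all.equal(f1 - f2,
          dhyper(n_overlap, n_syndrome, n_universe - n_syndrome, n_DEGs))
```

On these toy counts, the only substantive difference appears to be the choice of q, i.e. whether the observed overlap itself is included in the tail: P(X >= overlap) versus P(X > overlap).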
Which of these formulations is correct? Thanks in advance.
