Question

Hypergeometric (gene set enrichment) test using the "Category" package


josmantorres ▴ 10

@josmantorres-23988


Argentina

Dear Bioconductor community,

We have performed a differential gene expression analysis in an insect and found that some genes involved in detoxification processes are differentially expressed. I am now trying to perform a gene set enrichment analysis based on PFAM domains, because we want to see whether specific families related to detoxification (cytochromes, GST, etc.) are enriched in our dataset. We are using the "Category" package and its "hyperg" function to do this. Would you suggest another type of analysis within "Category" for this objective?

I have some problems with the input files to perform a Hypergeometric (gene set enrichment) test. As far as I understand, I need three files:

assayed - I included all gene ids (first column) with the corresponding pfam domain codes (second column and separated by ;)
significant - IDs of differentially expressed genes
universe - IDs of all genes

When I used the function:

result <- hyperg(assayed, sigsets, universe)

The following error appears:

Error in .local(assayed, significant, universe, representation, ...) :
  some 'assayed' genes not in 'universe'

As the "assayed" and "universe" files were generated from the same file, I think the problem is that my "assayed" file has an incorrect format. What would be the correct format for the "assayed" file? I have also tested PFAM domains separated by tabs, and it gives the same error.

Thanks in advance for your time and help,

Best wishes,

Jose

R session info:

```
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=es_AR.UTF-8 LC_NUMERIC=C LC_TIME=es_AR.UTF-8 LC_COLLATE=es_AR.UTF-8 LC_MONETARY=es_AR.UTF-8
 [6] LC_MESSAGES=es_AR.UTF-8 LC_PAPER=es_AR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] Category_2.54.0 Matrix_1.3-3 AnnotationDbi_1.50.3 IRanges_2.24.1 S4Vectors_0.28.1 Biobase_2.50.0
[7] BiocGenerics_0.36.0 edgeR_3.30.3 limma_3.44.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6 pillar_1.6.1 compiler_4.0.5 bitops_1.0-7 tools_4.0.5 bit_4.0.4 tibble_3.1.2
 [8] lifecycle_1.0.0 annotate_1.66.0 RSQLite_2.2.7 memoise_2.0.0 lattice_0.20-44 pkgconfig_2.0.3 rlang_0.4.11
[15] graph_1.66.0 DBI_1.1.1 fastmap_1.1.0 genefilter_1.70.0 hms_1.1.0 vctrs_0.3.8 locfit_1.5-9.4
[22] bit64_4.0.5 grid_4.0.5 GSEABase_1.50.1 R6_2.5.0 fansi_0.4.2 XML_3.99-0.6 RBGL_1.64.0
[29] survival_3.2-11 magrittr_2.0.1 readr_1.4.0 blob_1.2.1 ellipsis_0.3.2 splines_4.0.5 xtable_1.8-4
[36] utf8_1.2.1 RCurl_1.98-1.3 cachem_1.0.5 crayon_1.4.1
```


score 2 · Accepted Answer · 2021-05-19

The hyperg function doesn't use 'files'; it uses R objects. What those objects should be is described in the help page (?hyperg).

Arguments:

 assayed: A vector of assayed genes (or other identifiers). 'assayed'
          may be a character vector (defining a single gene set) or
          list of character vectors (defining a collection of gene
          sets).

significant: A vector of assayed genes that were differentially
          expressed. If 'assayed' is a character vector, then
          'significant' must also be a character vector; likewise when
          'assayed' is a 'list'.

universe: A character vector defining the universe of genes.

So you can pass in two lists and a character vector, or three character vectors, depending on what you are doing. For a single PFAM domain, you could have a vector of gene IDs that carry that domain, a vector of the IDs from that same set that are significant, and a vector of IDs defining the entire set of genes that were tested.

If you want to test multiple PFAM domains, you do the same, except that the 'assayed' object is a list with one vector of IDs per PFAM domain, the 'significant' object is a list containing, for each domain, the IDs from the 'assayed' object that were significant, and the universe is still a single character vector of all the IDs that were tested.
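As a rough illustration of the multi-domain case, here is a minimal sketch that builds the three objects from a table like the one described in the question (one gene ID per row, PFAM codes separated by ";"). The data frame `annot`, the vector `de_genes`, and all gene and PFAM values are made up for the example; only `hyperg(assayed, significant, universe)` comes from the help page, and the call is skipped if the Category package is not installed.

```r
# Hypothetical toy annotation: one row per gene, PFAM codes separated by ";"
annot <- data.frame(
  gene = c("g1", "g2", "g3", "g4", "g5", "g6"),
  pfam = c("PF00067", "PF00067;PF02798", "PF02798",
           "PF00043", "PF00067", "PF00043;PF02798"),
  stringsAsFactors = FALSE
)
# Hypothetical differentially expressed genes
de_genes <- c("g1", "g2", "g4")

# universe: every gene that was tested, as a plain character vector
universe <- unique(annot$gene)

# Split the ";"-separated codes, then invert the mapping to PFAM -> gene IDs
domains <- strsplit(annot$pfam, ";", fixed = TRUE)
gene_rep <- rep(annot$gene, lengths(domains))
assayed <- lapply(split(gene_rep, unlist(domains)), unique)

# sigsets: for each PFAM set, the members that are differentially expressed
sigsets <- lapply(assayed, intersect, de_genes)

# Run the test only if the Category package is available
if (requireNamespace("Category", quietly = TRUE)) {
  result <- Category::hyperg(assayed, sigsets, universe)
  print(result)
}
```

Because `assayed` is derived from the same gene column as `universe`, every assayed ID is in the universe by construction, and `sigsets` has the same names and order as `assayed`.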

And if you get an error saying there are things in either the significant object or the assayed object that aren't in the universe, well, it's because there are things in one of those objects that aren't in the universe. R wouldn't lie to you about that, would it?
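To find out which IDs are responsible, it can help to compare the objects directly before calling hyperg. The sketch below uses made-up inputs in which one ID carries trailing whitespace, which is only one possible cause; passing whole unparsed lines or a data frame instead of ID vectors would show up in the same way.

```r
# Hypothetical inputs reproducing the problem: "g2 " has a trailing space
universe <- c("g1", "g2", "g3", "g4")
assayed <- list(PF00067 = c("g1", "g2 "), PF02798 = c("g3", "g4"))

# IDs present in any assayed set but absent from the universe
missing_ids <- setdiff(unlist(assayed, use.names = FALSE), universe)
print(missing_ids)

# One possible fix for this particular cause: trim whitespace from every ID
assayed_clean <- lapply(assayed, trimws)
still_missing <- setdiff(unlist(assayed_clean, use.names = FALSE), universe)
print(length(still_missing))
```

The same `setdiff` check can be applied to the 'significant' object against 'assayed'.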