Part of the input for Gene Set Enrichment Analysis with fgseaMultilevel is a list of GO terms and genes assigned to it. E.g.:
library(fgsea) data(examplePathways) examplePathways[1:3] $`1221633_Meiotic_Synapsis`  "12189" "13006" "15077" "15078"... $`1368092_Rora_activates_gene_expression`  "11865" "12753" "12894" "18143" "19017" "19883" "20787" "217166" "328572" $`1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression`  "11865" "11998" "12753" "12952" "12953"...
My question is: Am I right that if a gene is annotated with a specific GO term, then all the ancestors of that GO term should also be assigned to that gene? (This is because if you are annotated with a specific term then you are autamtically annotated also with the less specific terms, right?)
Assuming I'm correct, ensembl/biomart often assigns a specific GO term to a gene but it doesn't assign all of the ancestor terms. This means that data retrieved from biomart is not immediately usable for
fgsea. Here's an example:
At first impression it would appear that no human gene is annotated with term
library(biomaRt) library(GO.db) ensembl <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 105) getBM(filters=c('go'), attributes=c('ensembl_gene_id'), values=list('GO:0044815'), mart=ensembl)  ensembl_gene_id <0 rows> (or 0-length row.names)
However, several genes are annotated with the offspring of
offspring <- as.list(GOCCOFFSPRING['GO:0044815']) offspring $`GO:0044815`  "GO:0000786" "GO:0000796" "GO:0043505" "GO:0140573" getBM(filters=c('go'), attributes=c('ensembl_gene_id'), values=list(offspring[]), mart=ensembl) ensembl_gene_id 1 ENSG00000281813 2 ENSG00000285435 3 ENSG00000284841 4 ENSG00000285449 ... 122 ENSG00000113810 123 ENSG00000151503
This means that if you prepare the data for
fgsea straight from ensembl/biomart, the term
GO:0044815 is not tested at all even if there are many genes effectively annotated with it. So, the data from biomart should be "augmented" to add to each gene the ancestors of each GO term assigned to that gene. This augmentation is not too difficult (I think) but I haven't seen it mentioned anywhere so I am wondering if I'm misunderstanding something here.
sessionInfo() R version 4.1.1 (2021-08-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/20220405_chris_cerebral_malaria/lib/libopenblasp-r0.3.18.so locale:  LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C  LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages:  stats4 stats graphics grDevices utils datasets methods base other attached packages:  fgsea_1.20.0 GO.db_3.14.0 AnnotationDbi_1.56.1 IRanges_2.28.0 S4Vectors_0.32.3 Biobase_2.54.0 BiocGenerics_0.40.0 biomaRt_2.50.0 loaded via a namespace (and not attached):  Rcpp_18.104.22.168 lattice_0.20-45 prettyunits_1.1.1 png_0.1-7 Biostrings_2.62.0 assertthat_0.2.1 digest_0.6.29 utf8_1.2.2 BiocFileCache_2.2.0 R6_2.5.1 GenomeInfoDb_1.30.0  RSQLite_2.2.8 httr_1.4.2 ggplot2_3.3.3 pillar_1.7.0 zlibbioc_1.40.0 rlang_1.0.2 progress_1.2.2 curl_4.3.2 data.table_1.14.0 blob_1.2.2 Matrix_1.4-1  BiocParallel_1.28.3 stringr_1.4.1 RCurl_1.98-1.6 bit_4.0.4 munsell_0.5.0 compiler_4.1.1 pkgconfig_2.0.3 tidyselect_1.1.2 KEGGREST_1.34.0 gridExtra_2.3 tibble_3.1.6  GenomeInfoDbData_1.2.7 XML_3.99-0.9 fansi_1.0.3 crayon_1.5.1 dplyr_1.0.9 dbplyr_2.2.1 bitops_1.0-7 rappdirs_0.3.3 grid_4.1.1 gtable_0.3.0 lifecycle_1.0.1  DBI_1.1.2 magrittr_2.0.3 scales_1.1.1 cli_3.3.0 stringi_1.7.6 cachem_1.0.6 XVector_0.34.0 xml2_1.3.3 ellipsis_0.3.2 filelock_1.0.2 generics_0.1.3  vctrs_0.4.1 fastmatch_1.1-3 tools_4.1.1 bit64_4.0.5 glue_1.6.2 purrr_0.3.4 hms_1.1.2 parallel_4.1.1 fastmap_1.1.0 colorspace_2.0-3 memoise_2.0.1