Part of the input for Gene Set Enrichment Analysis with fgseaMultilevel is a list of GO terms and genes assigned to it. E.g.:
library(fgsea)
data(examplePathways)
examplePathways[1:3]
$`1221633_Meiotic_Synapsis`
[1] "12189" "13006" "15077" "15078"...
$`1368092_Rora_activates_gene_expression`
[1] "11865" "12753" "12894" "18143" "19017" "19883" "20787" "217166" "328572"
$`1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression`
[1] "11865" "11998" "12753" "12952" "12953"...
My question is: Am I right that if a gene is annotated with a specific GO term, then all the ancestors of that GO term should also be assigned to that gene? (This is because if you are annotated with a specific term then you are autamtically annotated also with the less specific terms, right?)
Assuming I'm correct, ensembl/biomart often assigns a specific GO term to a gene but it doesn't assign all of the ancestor terms. This means that data retrieved from biomart is not immediately usable for fgsea
. Here's an example:
At first impression it would appear that no human gene is annotated with term GO:0044815
:
library(biomaRt)
library(GO.db)
ensembl <- useEnsembl(biomart = 'genes',
dataset = 'hsapiens_gene_ensembl',
version = 105)
getBM(filters=c('go'), attributes=c('ensembl_gene_id'), values=list('GO:0044815'), mart=ensembl)
[1] ensembl_gene_id
<0 rows> (or 0-length row.names)
However, several genes are annotated with the offspring of GO:0044815
:
offspring <- as.list(GOCCOFFSPRING['GO:0044815'])
offspring
$`GO:0044815`
[1] "GO:0000786" "GO:0000796" "GO:0043505" "GO:0140573"
getBM(filters=c('go'), attributes=c('ensembl_gene_id'), values=list(offspring[[1]]), mart=ensembl)
ensembl_gene_id
1 ENSG00000281813
2 ENSG00000285435
3 ENSG00000284841
4 ENSG00000285449
...
122 ENSG00000113810
123 ENSG00000151503
This means that if you prepare the data for fgsea
straight from ensembl/biomart, the term GO:0044815
is not tested at all even if there are many genes effectively annotated with it. So, the data from biomart should be "augmented" to add to each gene the ancestors of each GO term assigned to that gene. This augmentation is not too difficult (I think) but I haven't seen it mentioned anywhere so I am wondering if I'm misunderstanding something here.
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/20220405_chris_cerebral_malaria/lib/libopenblasp-r0.3.18.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] fgsea_1.20.0 GO.db_3.14.0 AnnotationDbi_1.56.1 IRanges_2.28.0 S4Vectors_0.32.3 Biobase_2.54.0 BiocGenerics_0.40.0 biomaRt_2.50.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 lattice_0.20-45 prettyunits_1.1.1 png_0.1-7 Biostrings_2.62.0 assertthat_0.2.1 digest_0.6.29 utf8_1.2.2 BiocFileCache_2.2.0 R6_2.5.1 GenomeInfoDb_1.30.0
[12] RSQLite_2.2.8 httr_1.4.2 ggplot2_3.3.3 pillar_1.7.0 zlibbioc_1.40.0 rlang_1.0.2 progress_1.2.2 curl_4.3.2 data.table_1.14.0 blob_1.2.2 Matrix_1.4-1
[23] BiocParallel_1.28.3 stringr_1.4.1 RCurl_1.98-1.6 bit_4.0.4 munsell_0.5.0 compiler_4.1.1 pkgconfig_2.0.3 tidyselect_1.1.2 KEGGREST_1.34.0 gridExtra_2.3 tibble_3.1.6
[34] GenomeInfoDbData_1.2.7 XML_3.99-0.9 fansi_1.0.3 crayon_1.5.1 dplyr_1.0.9 dbplyr_2.2.1 bitops_1.0-7 rappdirs_0.3.3 grid_4.1.1 gtable_0.3.0 lifecycle_1.0.1
[45] DBI_1.1.2 magrittr_2.0.3 scales_1.1.1 cli_3.3.0 stringi_1.7.6 cachem_1.0.6 XVector_0.34.0 xml2_1.3.3 ellipsis_0.3.2 filelock_1.0.2 generics_0.1.3
[56] vctrs_0.4.1 fastmatch_1.1-3 tools_4.1.1 bit64_4.0.5 glue_1.6.2 purrr_0.3.4 hms_1.1.2 parallel_4.1.1 fastmap_1.1.0 colorspace_2.0-3 memoise_2.0.1
Thanks, but I'm afraid this doesn't resolve my issue. You can see from your output that biomart assigns
GO:0000786
toENSG00000281813
. GO:0000786 is an offspring ofGO:0044815
. However,GO:0044815
is not listed as a term forENSG00000281813
. This is my concern: Why ensembl/biomart omits parent terms? And should I augment the GO terms forENSG00000281813
(and any other gene) with the parent terms before GSEA?Your issue has more to do with what the Biomart server returns than an actual Bioconductor question. As far as I can tell,
biomaRt
is faithfully returning results. To be honest, Biomart doesn't really say what GO terms they are meant to be returning, so it's hard to say if they are doing the right thing or not.Anyway, if I am planning to do any GO type analyses, I always use NCBI Gene IDs rather than Ensembl, as I know exactly how the GO.db is created, and am sure of its accuracy.
And do note that the
GO.db
package even includes the weird cross-ontology mapping from CC to BP that is shown in your QuickGO DAG.Thanks again. I realize my question is not really about Bioconductor. I crossposted on biostars.