Question

GO data for fgsea: should ensembl data be augmented to contain parent terms?

0

Entering edit mode

dario.beraldi • 0

@darioberaldi-6991

Last seen 17 months ago

United Kingdom

Part of the input for Gene Set Enrichment Analysis with fgseaMultilevel is a list of GO terms and genes assigned to it. E.g.:

library(fgsea)

data(examplePathways)
examplePathways[1:3]
$`1221633_Meiotic_Synapsis`
 [1] "12189"     "13006"     "15077"     "15078"...

$`1368092_Rora_activates_gene_expression`
[1] "11865"  "12753"  "12894"  "18143"  "19017"  "19883"  "20787"  "217166" "328572"

$`1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression`
 [1] "11865"  "11998"  "12753"  "12952"  "12953"...

My question is: Am I right that if a gene is annotated with a specific GO term, then all the ancestors of that GO term should also be assigned to that gene? (This is because if you are annotated with a specific term then you are autamtically annotated also with the less specific terms, right?)

Assuming I'm correct, ensembl/biomart often assigns a specific GO term to a gene but it doesn't assign all of the ancestor terms. This means that data retrieved from biomart is not immediately usable for fgsea. Here's an example:

At first impression it would appear that no human gene is annotated with term GO:0044815:

library(biomaRt)
library(GO.db)

ensembl <- useEnsembl(biomart = 'genes',
  dataset = 'hsapiens_gene_ensembl',
  version = 105)

getBM(filters=c('go'), attributes=c('ensembl_gene_id'), values=list('GO:0044815'), mart=ensembl)
[1] ensembl_gene_id
<0 rows> (or 0-length row.names)

However, several genes are annotated with the offspring of GO:0044815:

offspring <- as.list(GOCCOFFSPRING['GO:0044815'])
offspring
$`GO:0044815`
[1] "GO:0000786" "GO:0000796" "GO:0043505" "GO:0140573"

getBM(filters=c('go'), attributes=c('ensembl_gene_id'), values=list(offspring[[1]]), mart=ensembl)
    ensembl_gene_id
1   ENSG00000281813
2   ENSG00000285435
3   ENSG00000284841
4   ENSG00000285449
...
122 ENSG00000113810
123 ENSG00000151503

This means that if you prepare the data for fgsea straight from ensembl/biomart, the term GO:0044815 is not tested at all even if there are many genes effectively annotated with it. So, the data from biomart should be "augmented" to add to each gene the ancestors of each GO term assigned to that gene. This augmentation is not too difficult (I think) but I haven't seen it mentioned anywhere so I am wondering if I'm misunderstanding something here.

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /export/III-data/wcmp_bioinformatics/db291g/miniconda3/envs/20220405_chris_cerebral_malaria/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fgsea_1.20.0         GO.db_3.14.0         AnnotationDbi_1.56.1 IRanges_2.28.0       S4Vectors_0.32.3     Biobase_2.54.0       BiocGenerics_0.40.0  biomaRt_2.50.0      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3           lattice_0.20-45        prettyunits_1.1.1      png_0.1-7              Biostrings_2.62.0      assertthat_0.2.1       digest_0.6.29          utf8_1.2.2             BiocFileCache_2.2.0    R6_2.5.1               GenomeInfoDb_1.30.0   
[12] RSQLite_2.2.8          httr_1.4.2             ggplot2_3.3.3          pillar_1.7.0           zlibbioc_1.40.0        rlang_1.0.2            progress_1.2.2         curl_4.3.2             data.table_1.14.0      blob_1.2.2             Matrix_1.4-1          
[23] BiocParallel_1.28.3    stringr_1.4.1          RCurl_1.98-1.6         bit_4.0.4              munsell_0.5.0          compiler_4.1.1         pkgconfig_2.0.3        tidyselect_1.1.2       KEGGREST_1.34.0        gridExtra_2.3          tibble_3.1.6          
[34] GenomeInfoDbData_1.2.7 XML_3.99-0.9           fansi_1.0.3            crayon_1.5.1           dplyr_1.0.9            dbplyr_2.2.1           bitops_1.0-7           rappdirs_0.3.3         grid_4.1.1             gtable_0.3.0           lifecycle_1.0.1       
[45] DBI_1.1.2              magrittr_2.0.3         scales_1.1.1           cli_3.3.0              stringi_1.7.6          cachem_1.0.6           XVector_0.34.0         xml2_1.3.3             ellipsis_0.3.2         filelock_1.0.2         generics_0.1.3        
[56] vctrs_0.4.1            fastmatch_1.1-3        tools_4.1.1            bit64_4.0.5            glue_1.6.2             purrr_0.3.4            hms_1.1.2              parallel_4.1.1         fastmap_1.1.0          colorspace_2.0-3       memoise_2.0.1

biomaRt GO ensembldb fgsea • 1.1k views

ADD COMMENT • link 19 months ago dario.beraldi • 0

score 0 · Answer 1 · 2022-09-29

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 6 hours ago

United States

You should include the filter as an attribute. I would also reverse the query, asking instead for the GO terms for each gene. But anyway.

> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> z <- getBM(filters=c('go'), attributes=c('go_id','ensembl_gene_id'), values=list('GO:0044815'), mart=mart)
> z
        go_id ensembl_gene_id
1  GO:0000781 ENSG00000146047
2  GO:0005634 ENSG00000146047
3  GO:0003677 ENSG00000146047
4  GO:0030527 ENSG00000146047
5  GO:0046982 ENSG00000146047
6  GO:0006334 ENSG00000146047
7  GO:0000786 ENSG00000146047
8  GO:0005694 ENSG00000146047
9  GO:0005654 ENSG00000146047
10 GO:0003674 ENSG00000146047
11 GO:0035093 ENSG00000146047
12 GO:0006337 ENSG00000146047
13 GO:0042393 ENSG00000146047
14 GO:0006325 ENSG00000146047
15 GO:0006954 ENSG00000146047
16 GO:0031639 ENSG00000146047
17 GO:0051099 ENSG00000146047
18 GO:0051276 ENSG00000146047
19 GO:0071674 ENSG00000146047
20 GO:0001674 ENSG00000146047
21 GO:0019897 ENSG00000146047
22 GO:0044815 ENSG00000146047

> gos <- getBM(c("go_id","ensembl_gene_id"), "ensembl_gene_id", "ENSG00000281813", mart)
> gos
        go_id ensembl_gene_id
1             ENSG00000281813
2  GO:0005634 ENSG00000281813
3  GO:0046872 ENSG00000281813
4  GO:0016740 ENSG00000281813
5  GO:0006355 ENSG00000281813
6  GO:0003677 ENSG00000281813
7  GO:0006325 ENSG00000281813
8  GO:0016746 ENSG00000281813
9  GO:0006334 ENSG00000281813
10 GO:0000786 ENSG00000281813
11 GO:0043966 ENSG00000281813
12 GO:0000123 ENSG00000281813
13 GO:0045893 ENSG00000281813
14 GO:0004402 ENSG00000281813
15 GO:0016573 ENSG00000281813
16 GO:0005515 ENSG00000281813
17 GO:0042393 ENSG00000281813
18 GO:0045892 ENSG00000281813
19 GO:0061629 ENSG00000281813
20 GO:0005654 ENSG00000281813
21 GO:0045944 ENSG00000281813
22 GO:0003712 ENSG00000281813
23 GO:0070776 ENSG00000281813
24 GO:0050793 ENSG00000281813
25 GO:1903706 ENSG00000281813
26 GO:0016407 ENSG00000281813

This is based on the current version, not 105.

ADD COMMENT • link 19 months ago James W. MacDonald 65k

0

Entering edit mode

Thanks, but I'm afraid this doesn't resolve my issue. You can see from your output that biomart assigns GO:0000786 to ENSG00000281813. GO:0000786 is an offspring of GO:0044815. However, GO:0044815 is not listed as a term for ENSG00000281813. This is my concern: Why ensembl/biomart omits parent terms? And should I augment the GO terms for ENSG00000281813 (and any other gene) with the parent terms before GSEA?

ADD REPLY • link 19 months ago dario.beraldi • 0

0

Entering edit mode

Your issue has more to do with what the Biomart server returns than an actual Bioconductor question. As far as I can tell, biomaRt is faithfully returning results. To be honest, Biomart doesn't really say what GO terms they are meant to be returning, so it's hard to say if they are doing the right thing or not.

Anyway, if I am planning to do any GO type analyses, I always use NCBI Gene IDs rather than Ensembl, as I know exactly how the GO.db is created, and am sure of its accuracy.

> library(org.Hs.eg.db)
## map to Gene ID - not recommended in general, but...
> select(org.Hs.eg.db, "ENSG00000281813", "ENTREZID","ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
          ENSEMBL ENTREZID
1 ENSG00000281813    23522
> z <- select(org.Hs.eg.db, "23522", "GOALL")
'select()' returned 1:many mapping between keys and columns
> zz <- subset(z, ONTOLOGYALL == "CC")
> zz <- zz[!duplicated(zz[,2]),]
> zz
    ENTREZID      GOALL EVIDENCEALL ONTOLOGYALL
1      23522 GO:0000123         IBA          CC
3      23522 GO:0000785         NAS          CC
4      23522 GO:0000786         NAS          CC
23     23522 GO:0005575         IBA          CC
27     23522 GO:0005622         IBA          CC
31     23522 GO:0005634         IBA          CC
34     23522 GO:0005654         IBA          CC
37     23522 GO:0005694         NAS          CC
152    23522 GO:0031248         IBA          CC
169    23522 GO:0031974         IBA          CC
172    23522 GO:0031981         IBA          CC
178    23522 GO:0032991         IBA          CC
181    23522 GO:0032993         NAS          CC
199    23522 GO:0043226         IBA          CC
203    23522 GO:0043227         IBA          CC
206    23522 GO:0043228         NAS          CC
207    23522 GO:0043229         IBA          CC
211    23522 GO:0043231         IBA          CC
214    23522 GO:0043232         NAS          CC
215    23522 GO:0043233         IBA          CC
242    23522 GO:0044815         NAS          CC
296    23522 GO:0070013         IBA          CC
299    23522 GO:0070775         IBA          CC
301    23522 GO:0070776         IBA          CC
318    23522 GO:0110165         IBA          CC
326    23522 GO:0140513         IBA          CC
328    23522 GO:0140535         IBA          CC
342    23522 GO:1902493         IBA          CC
344    23522 GO:1902494         IBA          CC
357    23522 GO:1990234         IBA          CC

And do note that the GO.db package even includes the weird cross-ontology mapping from CC to BP that is shown in your QuickGO DAG.

> subset(z, GOALL %in% paste0("GO:", sprintf("%07.0f", c(8150,9987,71840,16043,6996,51276))))
    ENTREZID      GOALL EVIDENCEALL ONTOLOGYALL
64     23522 GO:0006996         NAS          BP
68     23522 GO:0008150         IBA          BP
69     23522 GO:0008150         IDA          BP
70     23522 GO:0008150         NAS          BP
91     23522 GO:0009987         IBA          BP
92     23522 GO:0009987         IDA          BP
93     23522 GO:0009987         NAS          BP
111    23522 GO:0016043         NAS          BP
284    23522 GO:0051276         NAS          BP
307    23522 GO:0071840         NAS          BP

ADD REPLY • link 19 months ago James W. MacDonald 65k

0

Entering edit mode

Thanks again. I realize my question is not really about Bioconductor. I crossposted on biostars.

ADD REPLY • link 19 months ago dario.beraldi • 0