Updating Gene Annotation in GeneSetCollection {GSEABase}
1
1
Entering edit mode
rocanja ▴ 60
@rocanja-5374
Last seen 15 months ago

I am using the object types 'GeneSet' and 'GeneSetCollection' from the {GSEABase} package for my gene set enrichment analyses. One of the GeneSetCollections I am working heavily with I have setup from the downloaded gmt file containing the latest release of the Molecular Signatures Database MSigDB.v.7.1 (gene symbol annotation based on Gencode release 33). I am wondering whether there was an easy way to change the gene annotation in the entire GeneSetCollection (~23k gene sets) to a different Gencode release version ... or even change to Ensembl.IDs instead of using gene symbols? The only solution I could come up with so far is to extract all gene symbols from the gmt file, match them to their corresponding gene symbols or Ensembl.ID in the Gencode release I am after, using {biomaRt}, then assemble those back into a gmt format and create a new GeneSetCollection from that. However, that appears quite cumbersome and error prone. Thus, I was wondering whether there was a more elegant way to do the same on the level of the GeneSetCollection? Any suggestions are greatly appreciated!

GSEABase annotation gsea gsva GeneSetCollection • 598 views
1
Entering edit mode
Robert Castelo ★ 2.7k
@rcastelo
Last seen 12 weeks ago
Barcelona/Universitat Pompeu Fabra

hi,

you can use the function mapIdentifiers() from the GSEABase package as follows:

library(GSEABase)

## illustrated with the 'c2.cp.kegg.v7.1.symbols.gmt' file from MSigDB 7.1
keggsym <- getGmt("c2.cp.kegg.v7.1.symbols.gmt",  geneIdType=SymbolIdentifier())
keggsym
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
unique identifiers: ACSS2, GCK, ..., EIF4G1 (5242 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: NullCollection (1 total)
keggens <- mapIdentifiers(keggsym, ENSEMBLIdentifier("org.Hs.eg.db"))
keggens
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
unique identifiers: ENSG00000131069, ENSG00000106633, ..., ENSG00000114867 (6333 total)
types in collection:
geneIdType: ENSEMBLIdentifier (1 total)
collectionType: NullCollection (1 total)


cheers,

robert.

0
Entering edit mode

That's exactly the function I was looking for. Thank you so much for your reply. In addition to converting from e.g. gene symbols to Ensembl.gene.IDs, is it also possible to specify the exact database release(s) and convert across two different releases? Conceptually, something like this ...

keggens <- mapIdentifiers(what=keggsym, to=ENSEMBLIdentifier(annotation='biomaRt', host='oct2014.archive.ensembl.org'), from=SymbolIdentifier(annotation='biomaRt', host='jan2020.archive.ensembl.org'))


... for converting Ensembl.99 gene.symbols (Jan-2020) to Ensembl.77 Ensembl.gene.IDs (Oct-2014) ...

Looking up the help on 'GeneIdentifierType', it gives the option to specify the Bioconductor package from which the annotations are drawn, but I can't quite figure out whether or how I can specify a particular database release. E.g with the {biomaRt} package, I can access all Ensembl archive databases and specify with the 'host=' parameter which version to use for converting my annotations.

1
Entering edit mode

You cannot use 'biomaRt' as an annotation package because it is not an annotation package, is a software package that provides an API for accessing biomaRt programmatically online.

If what you need is to convert a 'GeneSetCollection' object based on symbols to a Ensembl identifiers from a specific release, then it should be possible to do that using the appropriate Ensembl annotation package. So, theoretically, I would have expected that this would work:

mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
Error in eval(parse(text = pkg)) :
traceback()
17: eval(parse(text = pkg))
16: eval(parse(text = pkg))
15: eval(parse(text = pkg))
14: getAnnMap(toupper(geneIdType(from)), annotation(from))
13: .mapIdentifiers_selectMaps(from, to)
12: .mapIdentifiers_map(ids, type[[1]], type[[2]], verbose)
11: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
10: mapIdentifiers(what, to, from = geneIdType(what), ..., verbose = verbose)
9: eval(call, parent.frame())
8: eval(call, parent.frame())
7: callGeneric(what, to, from = geneIdType(what), ..., verbose = verbose)
6: FUN(X[[i]], ...)
5: FUN(X[[i]], ...)
4: lapply(what, mapIdentifiers, to, ..., verbose = verbose)
3: GeneSetCollection(lapply(what, mapIdentifiers, to, ..., verbose = verbose))
2: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))
1: mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75"))


unfortunately, as you can see it's not working and the problem seems to be upstream with the function 'getAnnMap()' from the package 'Annotate'. The BioC core team is maintaining 'GSEABase' and 'annotate', so hopefully they can tell you whether there is some way to use an Ensembl annotation package to do the mapping to a specific Ensembl release with the function 'mapIdentifiers()'.

that said, if your ultimate goal is to map identifiers between different Ensembl releases, then I'm not sure this is the right route to follow and I would advice you to post a question with that specific goal made clear.

1
Entering edit mode

Ok, so ?getAnnMap states that the supported database types are "db" and "env". Combined with the error message 'object 'EnsDb.Hsapiens.v75.db' not found', I had a hunch that it was looking for the .db extension. Thus, I just tried renaming the object to contain the .db extension:

EnsDb.Hsapiens.v75.db<-EnsDb.Hsapiens.v75
mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v75.db"))
Error in .select(x = x, keys = keys, columns = columns, keytype = keytype,  :
Argument keytype is mandatory if keys is a character vector!


That got me one step further, but now running into the next error, which has been described before here: https://www.biostars.org/p/300662/ ... 2 years ago ... but unfortunately no replies and no solution!

2
Entering edit mode

Interesting workaround. The error with the keytype comes from ensembldb (keytype had no default value). I fixed that by using as default the Ensembl gene ID keytype for this parameter in the select,EnsDb method:

> library(GSEABase)
> keggsym <- getGmt("c2.cp.kegg.v7.0.symbols.gmt",  geneIdType=SymbolIdentifier())
> library(EnsDb.Hsapiens.v86)
> EnsDb.Hsapiens.v86.db <- EnsDb.Hsapiens.v86
> mapIdentifiers(keggsym, ENSEMBLIdentifier("EnsDb.Hsapiens.v86"))
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
unique identifiers: ENSG00000131069, ENSG00000106633, ..., ENSG00000114867 (6524 total)
types in collection:
geneIdType: ENSEMBLIdentifier (1 total)
collectionType: NullCollection (1 total)


This is fixed in the current release version (for Bioconductor 3.11) as well as in the developmental version and will be available as an update via BiocManager::install() in the coming days. Alternatively, you could install it already now with devtools::install_github("jorainer/ensembldb", ref = "RELEASE_3_11").

cheers, jo

1
Entering edit mode

Thank you so much for fixing this issue! I have upgraded to R.4.0 and Bioconductor 3.11, updated all required packages and got the fixed version of {ensembldb} from your github ... and it's now working! Now I need to dive a bit deeper into {ensembldb} to generate annotation databases for the specific Ensembl releases I need.

That still leaves me with the challenge of how to convert between two database releases, but since the Ensembl.IDs are much more 'stable' over time than gene.symbols, I may just get away with converting all my gene.symbol based GeneSetCollections to Ensembl.IDs based on their respective Ensembl release(s) and then run my GSEA analyses on those.

0
Entering edit mode

Actually, I am building EnsDb databases for all species and every Ensembl release. These are available through AnnotationHub (if you query AnnotationHub for "EnsDb" or "ensembldb" they should get all listed; have a look at the ensembldb package vignette for more info). The only issue with that is that they are not distributed as packages but the actual SQLite database is provided. What you would need to do is then:

• retrieve an EnsDb of your interest from AnnotationHub.
• locate the downloaded file and rename that into something like EnsDb.Hsapiens.v98.sqlite (in that format, v98 stands for the Ensembl release).
• use the ensembldb::makeEnsembldbPackage function to create a package.
0
Entering edit mode

Thank you, that's fantastic! I have come across AnnotationHub before and it currently has 14 Homo sapiens EnsDb releases - from Ensembl.87 to 99 and 79. The main ones I need are 97, 88 and 77 (ancient, I know!). Thus, I'll definitely go ahead with your instructions on making packages for those databases available in AnnotationHub and I'll try my luck with making a database for 77. Thanks again so much for your help, it's greatly appreciated!

0
Entering edit mode

I'm afraid there's no immediate solution at the moment, i've opened an issue at the GSEABase GitHub repo and hopefully there'll be a fix sometime soon. I'll let you know in this thread when the issue is resolved.

0
Entering edit mode

Thank you so much for taking the time to look into this and for the additional pointers (e.g. the Ensembl annotation packages), greatly appreciated! I'll keep working on a solution and maybe I'll get some more suggestions from the community, too. My ultimate goal is to find a function or process that allows to ('easily') convert GeneSetCollections between different gene identifiers as well as between different database releases. The MSigDB GeneSetCollection is only one of many I am working with and currently, they are all based on different annotations in terms of identifiers and releases. Thus, I would like to unify them all and bring them in-line with the annotation used in my data sets, so I won't have genes excluded from the GSEA analyses just because of a clash in the annotation, which in turn falsifies the results.

0
Entering edit mode

You're welcome, in the end Johannes fixed it really quickly and as you've seen the issue is resolved and closed. There's still the issue about renaming the Ensembl package and i guess this will get also fixed at some point. Regarding what you comment about your ultimate goal, a few months ago we discussed about this in this thread, where one option is to use gene symbols as "bridge" to switch between annotations, although my take on this is to stick as much as possible to the annotations used to generate the expression profiles and attempt to do the mapping at gene set level since there you don't have to worry about 1-to-many or many-to-1 mappings, which i think this is what you're saying you want to do.