I'm a little rusty on this, so there might be a better way. To get some data to share, I ran (some of) the
goSlim> myIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0008265",
goSlim+ "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
goSlim+ "GO:0006412", "GO:0015979", "GO:0006457", "GO:0005618",
goSlim+ "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311")
goSlim> myCollection <- GOCollection(myIds)
goSlim> fl <- system.file("extdata", "goslim_plant.obo", package="GSEABase")
goSlim> slim <- getOBOCollection(fl)
goSlim> goSlim(myCollection, slim, "MF")
and captured the output of the last line as
slimdf <- goSlim(myCollection, slim, "MF")
In the example, it looks like we're looking at the 'MF' (molecular function) ontology. We need to know the 'offspring' of each term in the ontology, and this is given by the data in
I subset this object to contain just the identifiers in
slimdf, and coerce the resulting object to a list
gomap <- as.list(GOMFOFFSPRING[rownames(slimdf)])
This is a named list, with each element corresponding to a GO terms that are 'offspring' of the name of the list element
List of 24
$ GO:0000166: chr [1:56] "GO:0000062" "GO:0000774" "GO:0002134" "GO:0002135" ...
$ GO:0003674: chr [1:11147] "GO:0044183" "GO:0051082" "GO:0000006" "GO:0000007" ...
$ GO:0003676: chr [1:333] "GO:0000049" "GO:0000182" "GO:0000217" "GO:0000332" ...
For each element of the list, I want to find the intersection of the offspring with the identifiers in
> mapped <- lapply(gomap, intersect, ids(myCollection))
List of 24
$ GO:0000166: chr(0)
$ GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...
$ GO:0003676: chr "GO:0003677"
$ GO:0003677: chr(0)
which is the information requested; this could be added to
slimdf$ids <- lapply(gomap, intersect, ids(myCollection))
A helper function might be written as
function(df, collection, OFFSPRING)
map <- as.list(OFFSPRING[rownames(df)])
mapped <- lapply(map, intersect, ids(collection))
df[["mapped"]] <- unname(mapped)
and used as
mappedIds(slimdf, myCollection, GOMFOFFSPRING)
Whoa! Thanks so much for the help and quick response!
I've updated the function to convert the "mapped" column output to:
Convert from a list to character. This makes the dataframe writeable to csv.
;delimiter for the GO terms column to simplify downstream parsing.
Update the column name (just a personal preference).
Looks good! A highly technical and irrelevant comment is that
OFFSPRINGis an environment serving as a hash, and in principle a hash is unordered -- the keys can be arranged arbitrarily, unlike, e.g., a list. So it would be more correct (but less efficient, but correctness is important) to coerce the environment to a list and then subset
What approach would one take to map back to gene IDs?
e.g. input is something like (input layout isn't really important, this is just an example of GO terms mapped to multiple gene IDs):
Is there a way to load this type of info into a geneset in GSEABase and map to GOslims while retaining the gene ID info associated with each GO term that maps to a GOslim?
Those don't look like standard gene ids, but maybe trinity de novo transcript identifiers? That probably deserves a new question -- 'how to map from ...?'
Probably, the solution involves the
GeneSetCollectionclass and constructor. This code is adapted from the example of its help page:
If you can use standard gene identifiers such as Entrez, then you would add the argument shown in the help page
geneIdType=EntrezIdentifier()to the call to
GeneSet. If these are novel transcripts assembled with trinity, as your sample input suggests, then you are not in that situation.
Doing a bit of a deeper dive into this recently and noticed a potential issue that I'm not certain how to address.
Using the example provided by Martin Morgan (GSEABase - View/extract GO terms mapped to each GOslim), I recently noticed that the "mapping" step misses GO IDs that were part of the initial vector of GO IDs (
Here's evidence of that. Please note if you want to reproduce, I'm using Biological Process (
BP) for my analysis - the solution/example provided above used Molecular Function (
Does not map
GO:0006355to itself, as evidenced here:
Seeing how the code works using
intersect, it's obvious why this is missed.
However, I'm not sure how to modify things so that if a GO ID in the
myIDsvector matches a GOslim ID, the GO ID will get also get "mapped" to itself. It's possible that the solution is more strictly an understanding of how to use R, rather than a specific GSEAbase concern. Maybe there's a way to instruct the
intersect()function to include the parent GO term and not just the offspring?
Alrighty, figured this out! Here's the full code, including the initial mapping step described in the accepted answer above, and subsequent steps to get GO slims to map to themselves.
The resulting output file looks like this: