I'm a little rusty on this, so there might be a better way. To get some data to share, I ran (some of) the goSlim()
example
> example(goSlim)
goSlim> myIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0008265",
goSlim+ "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
goSlim+ "GO:0006412", "GO:0015979", "GO:0006457", "GO:0005618",
goSlim+ "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311")
goSlim> myCollection <- GOCollection(myIds)
goSlim> fl <- system.file("extdata", "goslim_plant.obo", package="GSEABase")
goSlim> slim <- getOBOCollection(fl)
goSlim> goSlim(myCollection, slim, "MF")
...
and captured the output of the last line as
slimdf <- goSlim(myCollection, slim, "MF")
In the example, it looks like we're looking at the 'MF' (molecular function) ontology. We need to know the 'offspring' of each term in the ontology, and this is given by the data in
GO.db::GOMFOFFSPRING
I subset this object to contain just the identifiers in slimdf
, and coerce the resulting object to a list
gomap <- as.list(GOMFOFFSPRING[rownames(slimdf)])
This is a named list, with each element corresponding to a GO terms that are 'offspring' of the name of the list element
> str(gomap)
List of 24
$ GO:0000166: chr [1:56] "GO:0000062" "GO:0000774" "GO:0002134" "GO:0002135" ...
$ GO:0003674: chr [1:11147] "GO:0044183" "GO:0051082" "GO:0000006" "GO:0000007" ...
$ GO:0003676: chr [1:333] "GO:0000049" "GO:0000182" "GO:0000217" "GO:0000332" ...
...
For each element of the list, I want to find the intersection of the offspring with the identifiers in myCollection
> mapped <- lapply(gomap, intersect, ids(myCollection))
> str(mapped)
List of 24
$ GO:0000166: chr(0)
$ GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...
$ GO:0003676: chr "GO:0003677"
$ GO:0003677: chr(0)
which is the information requested; this could be added to slimdf
slimdf$ids <- lapply(gomap, intersect, ids(myCollection))
A helper function might be written as
mappedIds <-
function(df, collection, OFFSPRING)
{
map <- as.list(OFFSPRING[rownames(df)])
mapped <- lapply(map, intersect, ids(collection))
df[["mapped"]] <- unname(mapped)
df
}
and used as
mappedIds(slimdf, myCollection, GOMFOFFSPRING)
Whoa! Thanks so much for the help and quick response!
I've updated the function to convert the "mapped" column output to:
Convert from a list to character. This makes the dataframe writeable to csv.
Set a
;
delimiter for the GO terms column to simplify downstream parsing.Update the column name (just a personal preference).
Looks good! A highly technical and irrelevant comment is that
OFFSPRING
is an environment serving as a hash, and in principle a hash is unordered -- the keys can be arranged arbitrarily, unlike, e.g., a list. So it would be more correct (but less efficient, but correctness is important) to coerce the environment to a list and then subsetas.list(OFFSPRING)[rownames(df)]
.What approach would one take to map back to gene IDs?
e.g. input is something like (input layout isn't really important, this is just an example of GO terms mapped to multiple gene IDs):
Is there a way to load this type of info into a geneset in GSEABase and map to GOslims while retaining the gene ID info associated with each GO term that maps to a GOslim?
Those don't look like standard gene ids, but maybe trinity de novo transcript identifiers? That probably deserves a new question -- 'how to map from ...?'
Probably, the solution involves the
GeneSetCollection
class and constructor. This code is adapted from the example of its help page:If you can use standard gene identifiers such as Entrez, then you would add the argument shown in the help page
geneIdType=EntrezIdentifier()
to the call toGeneSet
. If these are novel transcripts assembled with trinity, as your sample input suggests, then you are not in that situation.Doing a bit of a deeper dive into this recently and noticed a potential issue that I'm not certain how to address.
Using the example provided by Martin Morgan (GSEABase - View/extract GO terms mapped to each GOslim), I recently noticed that the "mapping" step misses GO IDs that were part of the initial vector of GO IDs (
myIDs
).Here's evidence of that. Please note if you want to reproduce, I'm using Biological Process (
BP
) for my analysis - the solution/example provided above used Molecular Function (MF
).This step
Does not map
GO:0006355
to itself, as evidenced here:Seeing how the code works using
intersect
, it's obvious why this is missed.However, I'm not sure how to modify things so that if a GO ID in the
myIDs
vector matches a GOslim ID, the GO ID will get also get "mapped" to itself. It's possible that the solution is more strictly an understanding of how to use R, rather than a specific GSEAbase concern. Maybe there's a way to instruct theintersect()
function to include the parent GO term and not just the offspring?Alrighty, figured this out! Here's the full code, including the initial mapping step described in the accepted answer above, and subsequent steps to get GO slims to map to themselves.
The resulting output file looks like this: