I'm a little rusty on this, so there might be a better way. To get some data to share, I ran (some of) the goSlim()
example
> example(goSlim)
goSlim> myIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0008265",
goSlim+ "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
goSlim+ "GO:0006412", "GO:0015979", "GO:0006457", "GO:0005618",
goSlim+ "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311")
goSlim> myCollection <- GOCollection(myIds)
goSlim> fl <- system.file("extdata", "goslim_plant.obo", package="GSEABase")
goSlim> slim <- getOBOCollection(fl)
goSlim> goSlim(myCollection, slim, "MF")
...
and captured the output of the last line as
slimdf <- goSlim(myCollection, slim, "MF")
In the example, it looks like we're looking at the 'MF' (molecular function) ontology. We need to know the 'offspring' of each term in the ontology, and this is given by the data in
GO.db::GOMFOFFSPRING
I subset this object to contain just the identifiers in slimdf
, and coerce the resulting object to a list
gomap <- as.list(GOMFOFFSPRING[rownames(slimdf)])
This is a named list, with each element corresponding to a GO terms that are 'offspring' of the name of the list element
> str(gomap)
List of 24
$ GO:0000166: chr [1:56] "GO:0000062" "GO:0000774" "GO:0002134" "GO:0002135" ...
$ GO:0003674: chr [1:11147] "GO:0044183" "GO:0051082" "GO:0000006" "GO:0000007" ...
$ GO:0003676: chr [1:333] "GO:0000049" "GO:0000182" "GO:0000217" "GO:0000332" ...
...
For each element of the list, I want to find the intersection of the offspring with the identifiers in myCollection
> mapped <- lapply(gomap, intersect, ids(myCollection))
> str(mapped)
List of 24
$ GO:0000166: chr(0)
$ GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...
$ GO:0003676: chr "GO:0003677"
$ GO:0003677: chr(0)
which is the information requested; this could be added to slimdf
slimdf$ids <- lapply(gomap, intersect, ids(myCollection))
A helper function might be written as
mappedIds <-
function(df, collection, OFFSPRING)
{
map <- as.list(OFFSPRING[rownames(df)])
mapped <- lapply(map, intersect, ids(collection))
df[["mapped"]] <- unname(mapped)
df
}
and used as
mappedIds(slimdf, myCollection, GOMFOFFSPRING)
Whoa! Thanks so much for the help and quick response!
I've updated the function to convert the "mapped" column output to:
Convert from a list to character. This makes the dataframe writeable to csv.
Set a
;
delimiter for the GO terms column to simplify downstream parsing.Update the column name (just a personal preference).
Looks good! A highly technical and irrelevant comment is that
OFFSPRING
is an environment serving as a hash, and in principle a hash is unordered -- the keys can be arranged arbitrarily, unlike, e.g., a list. So it would be more correct (but less efficient, but correctness is important) to coerce the environment to a list and then subsetas.list(OFFSPRING)[rownames(df)]
.What approach would one take to map back to gene IDs?
e.g. input is something like (input layout isn't really important, this is just an example of GO terms mapped to multiple gene IDs):
Is there a way to load this type of info into a geneset in GSEABase and map to GOslims while retaining the gene ID info associated with each GO term that maps to a GOslim?
Those don't look like standard gene ids, but maybe trinity de novo transcript identifiers? That probably deserves a new question -- 'how to map from ...?'
Probably, the solution involves the
GeneSetCollection
class and constructor. This code is adapted from the example of its help page:If you can use standard gene identifiers such as Entrez, then you would add the argument shown in the help page
geneIdType=EntrezIdentifier()
to the call toGeneSet
. If these are novel transcripts assembled with trinity, as your sample input suggests, then you are not in that situation.