GSEABase - View/extract GO terms mapped to each GOslim
1
1
Entering edit mode
samwhite ▴ 20
@samwhite-22943
Last seen 2.4 years ago

GSEABase has the ability to take a list of GO terms and map them to a set of GOslims. The function goSlim() function outputs a nice summary table with counts and percentage of the input GO terms that mapped to each GOslim.

However, I'd like to be able to find out which GO terms have been assigned to each of the GOslims.

The bigger pictures is that I'd really like to see which genes in a gene set are getting assigned to any given GOslim.

Is there a way to get this information?

go GSEABase GOslim gene ontology • 770 views
3
Entering edit mode
@martin-morgan-1513
Last seen 5 days ago
United States

I'm a little rusty on this, so there might be a better way. To get some data to share, I ran (some of) the goSlim() example

> example(goSlim)

goSlim> myIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0008265",
goSlim+            "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
goSlim+            "GO:0006412", "GO:0015979", "GO:0006457", "GO:0005618",
goSlim+            "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311")
goSlim> myCollection <- GOCollection(myIds)
goSlim> fl <- system.file("extdata", "goslim_plant.obo", package="GSEABase")
goSlim> slim <- getOBOCollection(fl)
goSlim> goSlim(myCollection, slim, "MF")
...


and captured the output of the last line as

slimdf <- goSlim(myCollection, slim, "MF")


In the example, it looks like we're looking at the 'MF' (molecular function) ontology. We need to know the 'offspring' of each term in the ontology, and this is given by the data in

GO.db::GOMFOFFSPRING


I subset this object to contain just the identifiers in slimdf, and coerce the resulting object to a list

gomap <- as.list(GOMFOFFSPRING[rownames(slimdf)])


This is a named list, with each element corresponding to a GO terms that are 'offspring' of the name of the list element

> str(gomap)
List of 24
$GO:0000166: chr [1:56] "GO:0000062" "GO:0000774" "GO:0002134" "GO:0002135" ...$ GO:0003674: chr [1:11147] "GO:0044183" "GO:0051082" "GO:0000006" "GO:0000007" ...
$GO:0003676: chr [1:333] "GO:0000049" "GO:0000182" "GO:0000217" "GO:0000332" ... ...  For each element of the list, I want to find the intersection of the offspring with the identifiers in myCollection > mapped <- lapply(gomap, intersect, ids(myCollection)) > str(mapped) List of 24$ GO:0000166: chr(0)
$GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...$ GO:0003676: chr "GO:0003677"
$GO:0003677: chr(0)  which is the information requested; this could be added to slimdf slimdf$ids <- lapply(gomap, intersect, ids(myCollection))


A helper function might be written as

mappedIds <-
function(df, collection, OFFSPRING)
{
map <- as.list(OFFSPRING[rownames(df)])
mapped <- lapply(map, intersect, ids(collection))
df[["mapped"]] <- unname(mapped)
df
}


and used as

mappedIds(slimdf, myCollection, GOMFOFFSPRING)

0
Entering edit mode

Whoa! Thanks so much for the help and quick response!

1
Entering edit mode

I've updated the function to convert the "mapped" column output to:

1. Convert from a list to character. This makes the dataframe writeable to csv.

2. Set a ; delimiter for the GO terms column to simplify downstream parsing.

3. Update the column name (just a personal preference).

mappedIds <-
function(df, collection, OFFSPRING)
{
map <- as.list(OFFSPRING[rownames(df)])
mapped <- lapply(map, intersect, ids(collection))
df[["go_terms"]] <- vapply(unname(mapped), paste, collapse = ";", character(1L))
df
}

0
Entering edit mode

Looks good! A highly technical and irrelevant comment is that OFFSPRING is an environment serving as a hash, and in principle a hash is unordered -- the keys can be arranged arbitrarily, unlike, e.g., a list. So it would be more correct (but less efficient, but correctness is important) to coerce the environment to a list and then subset as.list(OFFSPRING)[rownames(df)].

0
Entering edit mode

What approach would one take to map back to gene IDs?

e.g. input is something like (input layout isn't really important, this is just an example of GO terms mapped to multiple gene IDs):

GO:term1,TRINITY_g1,TRINITY_g2
GO:term2,TRINITY_g2,TRINITY_g5


Is there a way to load this type of info into a geneset in GSEABase and map to GOslims while retaining the gene ID info associated with each GO term that maps to a GOslim?

1
Entering edit mode

Those don't look like standard gene ids, but maybe trinity de novo transcript identifiers? That probably deserves a new question -- 'how to map from ...?'

1
Entering edit mode

Probably, the solution involves the GeneSetCollection class and constructor. This code is adapted from the example of its help page:

lst <- list(GOterm1=c("TRINITY_g1", "TRINITY_g2"),
GOterm2=c("TRINITY_g2", "TRINITY_g5"))
gsc <- GeneSetCollection(mapply(function(geneIds, GOID) {
GeneSet(geneIds, collectionType=GOCollection(GOID), setName=GOID)
}, lst, names(lst)))
gsc
GeneSetCollection
names: GOterm1, GOterm2 (2 total)
unique identifiers: TRINITY_g1, TRINITY_g2, TRINITY_g5 (3 total)
types in collection:
geneIdType: NullIdentifier (1 total)
collectionType: GOCollection (1 total)


If you can use standard gene identifiers such as Entrez, then you would add the argument shown in the help page geneIdType=EntrezIdentifier() to the call to GeneSet. If these are novel transcripts assembled with trinity, as your sample input suggests, then you are not in that situation.