Question

GSEABase - View/extract GO terms mapped to each GOslim

1

Entering edit mode

samwhite ▴ 20

@samwhite-22943

Last seen 13 months ago

United States

GSEABase has the ability to take a list of GO terms and map them to a set of GOslims. The function goSlim() function outputs a nice summary table with counts and percentage of the input GO terms that mapped to each GOslim.

However, I'd like to be able to find out which GO terms have been assigned to each of the GOslims.

The bigger pictures is that I'd really like to see which genes in a gene set are getting assigned to any given GOslim.

Is there a way to get this information?

go GSEABase GOslim gene ontology • 3.0k views

ADD COMMENT • link 4.2 years ago • updated 13 months ago samwhite ▴ 20

score 3 · Accepted Answer · 2020-02-18

3

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 2 days ago

United States

I'm a little rusty on this, so there might be a better way. To get some data to share, I ran (some of) the goSlim() example

> example(goSlim)

goSlim> myIds <- c("GO:0016564", "GO:0003677", "GO:0004345", "GO:0008265",
goSlim+            "GO:0003841", "GO:0030151", "GO:0006355", "GO:0009664",
goSlim+            "GO:0006412", "GO:0015979", "GO:0006457", "GO:0005618",
goSlim+            "GO:0005622", "GO:0005840", "GO:0015935", "GO:0000311")
goSlim> myCollection <- GOCollection(myIds)
goSlim> fl <- system.file("extdata", "goslim_plant.obo", package="GSEABase")
goSlim> slim <- getOBOCollection(fl)
goSlim> goSlim(myCollection, slim, "MF")
...

and captured the output of the last line as

slimdf <- goSlim(myCollection, slim, "MF")

In the example, it looks like we're looking at the 'MF' (molecular function) ontology. We need to know the 'offspring' of each term in the ontology, and this is given by the data in

GO.db::GOMFOFFSPRING

I subset this object to contain just the identifiers in slimdf, and coerce the resulting object to a list

gomap <- as.list(GOMFOFFSPRING[rownames(slimdf)])

This is a named list, with each element corresponding to a GO terms that are 'offspring' of the name of the list element

> str(gomap)
List of 24
 $ GO:0000166: chr [1:56] "GO:0000062" "GO:0000774" "GO:0002134" "GO:0002135" ...
 $ GO:0003674: chr [1:11147] "GO:0044183" "GO:0051082" "GO:0000006" "GO:0000007" ...
 $ GO:0003676: chr [1:333] "GO:0000049" "GO:0000182" "GO:0000217" "GO:0000332" ...
...

For each element of the list, I want to find the intersection of the offspring with the identifiers in myCollection

> mapped <- lapply(gomap, intersect, ids(myCollection))
> str(mapped)
List of 24
 $ GO:0000166: chr(0) 
 $ GO:0003674: chr [1:5] "GO:0003677" "GO:0003841" "GO:0004345" "GO:0008265" ...
 $ GO:0003676: chr "GO:0003677"
 $ GO:0003677: chr(0)

which is the information requested; this could be added to slimdf

slimdf$ids <- lapply(gomap, intersect, ids(myCollection))

A helper function might be written as

mappedIds <-
    function(df, collection, OFFSPRING)
{
    map <- as.list(OFFSPRING[rownames(df)])
    mapped <- lapply(map, intersect, ids(collection))
    df[["mapped"]] <- unname(mapped)
    df
}

and used as

mappedIds(slimdf, myCollection, GOMFOFFSPRING)

ADD COMMENT • link 4.2 years ago Martin Morgan 25k

0

Entering edit mode

Whoa! Thanks so much for the help and quick response!

ADD REPLY • link 4.2 years ago samwhite ▴ 20

1

Entering edit mode

I've updated the function to convert the "mapped" column output to:

Convert from a list to character. This makes the dataframe writeable to csv.
Set a ; delimiter for the GO terms column to simplify downstream parsing.
Update the column name (just a personal preference).

mappedIds <-
  function(df, collection, OFFSPRING)
  {
    map <- as.list(OFFSPRING[rownames(df)])
    mapped <- lapply(map, intersect, ids(collection))
    df[["go_terms"]] <- vapply(unname(mapped), paste, collapse = ";", character(1L))
    df
  }

ADD REPLY • link 4.2 years ago samwhite ▴ 20

0

Entering edit mode

Looks good! A highly technical and irrelevant comment is that OFFSPRING is an environment serving as a hash, and in principle a hash is unordered -- the keys can be arranged arbitrarily, unlike, e.g., a list. So it would be more correct (but less efficient, but correctness is important) to coerce the environment to a list and then subset as.list(OFFSPRING)[rownames(df)].

ADD REPLY • link 4.2 years ago Martin Morgan 25k

0

Entering edit mode

What approach would one take to map back to gene IDs?

e.g. input is something like (input layout isn't really important, this is just an example of GO terms mapped to multiple gene IDs):

GO:term1,TRINITY_g1,TRINITY_g2
GO:term2,TRINITY_g2,TRINITY_g5

Is there a way to load this type of info into a geneset in GSEABase and map to GOslims while retaining the gene ID info associated with each GO term that maps to a GOslim?

ADD REPLY • link 4.2 years ago samwhite ▴ 20

1

Entering edit mode

Those don't look like standard gene ids, but maybe trinity de novo transcript identifiers? That probably deserves a new question -- 'how to map from ...?'

ADD REPLY • link 4.2 years ago Martin Morgan 25k

1

Entering edit mode

Probably, the solution involves the GeneSetCollection class and constructor. This code is adapted from the example of its help page:

lst <- list(GOterm1=c("TRINITY_g1", "TRINITY_g2"),
            GOterm2=c("TRINITY_g2", "TRINITY_g5"))
gsc <- GeneSetCollection(mapply(function(geneIds, GOID) {
         GeneSet(geneIds, collectionType=GOCollection(GOID), setName=GOID)
       }, lst, names(lst)))
gsc
GeneSetCollection
  names: GOterm1, GOterm2 (2 total)
  unique identifiers: TRINITY_g1, TRINITY_g2, TRINITY_g5 (3 total)
  types in collection:
    geneIdType: NullIdentifier (1 total)
    collectionType: GOCollection (1 total)

If you can use standard gene identifiers such as Entrez, then you would add the argument shown in the help page geneIdType=EntrezIdentifier() to the call to GeneSet. If these are novel transcripts assembled with trinity, as your sample input suggests, then you are not in that situation.

ADD REPLY • link 4.2 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

Doing a bit of a deeper dive into this recently and noticed a potential issue that I'm not certain how to address.

Using the example provided by Martin Morgan (GSEABase - View/extract GO terms mapped to each GOslim), I recently noticed that the "mapping" step misses GO IDs that were part of the initial vector of GO IDs (myIDs).

Here's evidence of that. Please note if you want to reproduce, I'm using Biological Process (BP) for my analysis - the solution/example provided above used Molecular Function (MF).

This step

mapped <- lapply(gomap, intersect, ids(myCollection))

Does not map GO:0006355 to itself, as evidenced here:

View of "mapped" list contents with red arrow showing that GO term "GO:0006355" doesn't map to itself.

Seeing how the code works using intersect, it's obvious why this is missed.

However, I'm not sure how to modify things so that if a GO ID in the myIDs vector matches a GOslim ID, the GO ID will get also get "mapped" to itself. It's possible that the solution is more strictly an understanding of how to use R, rather than a specific GSEAbase concern. Maybe there's a way to instruct the intersect() function to include the parent GO term and not just the offspring?

ADD REPLY • link 13 months ago samwhite ▴ 20

0

Entering edit mode

Alrighty, figured this out! Here's the full code, including the initial mapping step described in the accepted answer above, and subsequent steps to get GO slims to map to themselves.

# Vector of GO IDs
go_ids <- pgen_GO_gene_mappings_table_cleaned_separated_grouped_df$all_GO_ids

# Create GSEAbase GOCollection using `go_ids`
myCollection <- GOCollection(go_ids)

# Retrieve GOslims from GO OBO file set
slim <- getOBOCollection(gseabase_files)

# Retrieve Biological Process (BP) GOslims
slimdf <- goSlim(myCollection, slim, "BP", verbose)

# List of GOslims and all GO IDs from `go_ids`
gomap <- as.list(GOBPOFFSPRING[rownames(slimdf)])

# Maps `go_ids` to matching GOslims
mapped <- lapply(gomap, intersect, ids(myCollection))

# Append all mapped GO IDs to `slimdf`
# `sapply` needed to apply paste() to create semi-colon delimited values
slimdf$ids <- sapply(lapply(gomap, intersect, ids(myCollection)), paste, collapse=";")

# Remove "character(0) string from "ids" column
slimdf$ids[slimdf$ids == "character(0)"] <- ""

# Add self-matching GOIDs to "ids" column, if not present
for (go_id in go_ids) {
  # Check if the go_id is present in the row names
  if (go_id %in% rownames(updated_slimdf)) {
    # Check if the go_id is not present in the ids column
    # Also removes white space "trimws()" and converts all to upper case to handle
    # any weird, "invisible" formatting issues.
    if (!go_id %in% trimws(toupper(strsplit(updated_slimdf[go_id, "ids"], ";")[[1]]))) {
      # Append the go_id to the ids column with a semi-colon separator
      if (length(updated_slimdf$ids) > 0 && nchar(updated_slimdf$ids[nrow(updated_slimdf)]) > 0) {
        updated_slimdf[go_id, "ids"] <- paste0(updated_slimdf[go_id, "ids"], "; ", go_id)
      } else {
        updated_slimdf[go_id, "ids"] <- go_id
      }
    }
  }
}

The resulting output file looks like this:

Screenshot showing formatted output from R code above: Rownames (goslims), Count, Percent, Term, semi-colon delimited GO Ids

ADD REPLY • link 13 months ago samwhite ▴ 20