Question

GSEABase - Map gene/transcript IDs to GOslims?

0

Entering edit mode

samwhite ▴ 20

@samwhite-22943

Last seen 12 months ago

United States

I recently posted a question regarding identifying GO terms that were mapped to GOslims and received a great answer on how to do this.

However, in the grand scheme of things, I really want to find out which genes/transcripts are mapped to a set of GOslims. Since GSEABase is able to map GO terms to GOslims, it seems like identifying corresponding gene/transcript IDs should be feasible. Unfortunately, I have no idea how to proceed.

If I have an input file with GO terms and corresponding gene/transcript IDs (note: actual format/layout is unimportant - I'm confident I can change things around to match needed input format for use in GSEABase):

GO:0016874  TRINITY_DN8634_c0_g1
GO:0016874  TRINITY_DN9386_c0_g1
GO:0019693  TRINITY_DN10297_c1_g1
GO:0019693  TRINITY_DN12835_c0_g1
GO:0019693  TRINITY_DN1421_c0_g1

I know how to map the GO terms to GOslims, as well as identify which GO terms map to GOslims.

Now, I am hoping there's a "simple" solution built-in to GSEABase that will allow me to subset gene/transcript IDs from any given GOslim.

Would anyone have any suggestions as to how to approach this within GSEABase? It seems like creating a GeneSetCollection (or, a GOCollection?) would be the first approach, but I'm not familiar enough with R and GSEABase to figure out where to go from there. Heck, I might not even be able to figure out how to properly construct a GeneSetCollection properly with the information provided above...

I can probably wrangle some command line stuff to accomplish this task, but it would be nicer (and probably easier?) to just use GSEABase to accomplish this.

Oh, one last thing to add, all of our work is with non-model organisms with limited/non-existent genomic resources. So, I do have UniProt IDs associated with these gene/transcript IDs, but won't be able to tap into the standard ENTREZ/PMID/etc databases.

gseabase go gene ontology goslim • 1.5k views

ADD COMMENT • link 4.2 years ago • updated 13 months ago samwhite ▴ 20

score 0 · Answer 1 · 2020-02-19

0

Entering edit mode

Robert Castelo ★ 3.2k

@rcastelo

Last seen 4 days ago

Barcelona/Universitat Pompeu Fabra

hi,

one possible way to follow using the GSEABase infrastructure is to build a GeneSetCollection object with the GO annotations. Let's say we have the GO annotations you show as an example in a text file called goannot.txt, where the first columns has the GO identifier and the second column, separated by blank spaces from the first, has the gene identifier. We can build this GeneSetCollection object of these annotations as follows:

library(GSEABase)
library(GO.db)

## read GO annotations from a text file with one row per annotation
goannot <- read.table("goannot.txt", sep="", quote="", colClasses="character")

## organize GO annotations by GO term
genesbygo <- split(goannot[, 2], goannot[, 1])
genesbygo
$`GO:0016874`
[1] "TRINITY_DN8634_c0_g1" "TRINITY_DN9386_c0_g1"

$`GO:0019693`
[1] "TRINITY_DN10297_c1_g1" "TRINITY_DN12835_c0_g1" "TRINITY_DN1421_c0_g1" 

## fetch GO terms into a named vector
goterms <- select(GO.db, keys=names(genesbygo), columns="TERM")
goterms <- setNames(goterms$TERM, goterms$GOID)

## build GeneSetCollection object
gsc <- do.call("GeneSetCollection",
               mapply(function(geneSetID, geneIDs, goterms) {
                        GeneSet(NullIdentifier(),
                                geneIds=geneIDs,
                                setName=geneSetID,
                                shortDescription=goterms[geneSetID])
                     }, names(genesbygo), genesbygo,
                     MoreArgs=list(goterms=goterms), USE.NAMES=FALSE))
gsc
GeneSetCollection
  names: GO:0016874, GO:0019693 (2 total)
  unique identifiers: TRINITY_DN8634_c0_g1, TRINITY_DN9386_c0_g1, ..., TRINITY_DN1421_c0_g1 (5 total)
  types in collection:
    geneIdType: NullIdentifier (1 total)
    collectionType: NullCollection (1 total)

If you have UnitProt IDs in the GO annotations, then you could replace the constructor NullIdentifier() by UniprotIdentifier(), which will help to map with other identifiers, whenever those mappings exist. In principle, you should be able to use the solution given in your previous question to do the mapping you're looking for.

cheers,

robert.

ADD COMMENT • link 4.2 years ago Robert Castelo ★ 3.2k

0

Entering edit mode

Thanks so much for this! This is very useful.

Unfortunately, I still need some hand-holding to figure out how to work with a GeneSetCollection and pass the necessary info to the goSlim function (and, in turn, get the GO terms and/or gene/transcript IDs pulled out of the GOslim mappings).

ADD REPLY • link 4.2 years ago samwhite ▴ 20

3

Entering edit mode

If I'm interpreting all this correctly, you could modify your mappedIds() function from the other post as follows:

mappedIds <- function(df, collection, OFFSPRING, goannotGSC) {
    map <- as.list(OFFSPRING[rownames(df)])
    mapped <- lapply(map, intersect, ids(collection))
    df[["go_terms"]] <- vapply(unname(mapped), paste, collapse = ";", character(1L))
    mt <- match(rownames(df), names(goannotGSC))
    df[["Genes"]] <- rep(NA_character_, nrow(df))
    df[["Genes"]][!is.na(mt)] <- vapply(geneIds(goannotGSC)[mt[!is.na(mt)]], 
paste, collapse=";", character(1L))
    df
}

Modifying the example of GO annotations you gave above, as follows:

GO:0000166  TRINITY_DN8634_c0_g1
GO:0000166  TRINITY_DN9386_c0_g1
GO:0003674  TRINITY_DN10297_c1_g1
GO:0003674  TRINITY_DN12835_c0_g1
GO:0003674  TRINITY_DN1421_c0_g1

We can follow up the example you started in your previous post, as follows:

mids <- mappedIds(slimdf, myCollection, GOMFOFFSPRING, gsc)
> head(mids, 3)
           Count   Percent                 Term
GO:0000166     0  0.000000   nucleotide binding
GO:0003674     5 35.714286   molecular_function
GO:0003676     1  7.142857 nucleic acid binding
                                                         go_terms
GO:0000166                                                       
GO:0003674 GO:0003677;GO:0003841;GO:0004345;GO:0008265;GO:0030151
GO:0003676                                             GO:0003677
                                                                      Genes
GO:0000166                        TRINITY_DN8634_c0_g1;TRINITY_DN9386_c0_g1
GO:0003674 TRINITY_DN10297_c1_g1;TRINITY_DN12835_c0_g1;TRINITY_DN1421_c0_g1
GO:0003676                                                             <NA>

Note that i'm using the gsc object with the annotations built in the way I described in my answer from the updated GO annotations in this comment.

ADD REPLY • link 4.2 years ago Robert Castelo ★ 3.2k

0

Entering edit mode

I have no idea how I missed this response, but it's amazing! Thank you so much!!

ADD REPLY • link 13 months ago samwhite ▴ 20