Question

Extract the relevant information of the genes sets from the Bioconductor annotation package

0

Entering edit mode

jose.wo • 0

@josewo-23070

Last seen 2.5 years ago

Spain

Hello Bioconductor comunity.

I'm new with Bioconductor so probably my question is very basic. I have a list of gene names from a proteomic analysis we've run in the lab. I need to obtain the GO annotations of the genes (from Rattus norvergicus) I have in my list.

The rownames of my expression set object is:

> Gene names
head(rownames(AverageDataSet))
[1] "Emc2"   "GLTP"   "Nceh1"  "Erlin2" "Atp9b"  "Tep1"

I download Bioconductor annotations as follows:

gsets = AnnotationDbi::as.list(org.Rn.egGO2ALLEGS)
gsets[[1]]
      IEA      ISO      IEA      ISO      IEA      ISO      IBA      TAS
  "24842"  "24842"  "25591"  "25591"  "29414"  "29414"  "54304"  "83474"
      IEA      ISO      IBA      ISO      ISO      ISS      ISO      IEA
  "83516"  "83516"  "85472"  "85472" "171116" "171116" "291824" "294388"
      ISO      IBA      IBA      IEA      ISO      ISO      ISS      ISO
 "294388" "295237" "296200" "296200" "296200" "298203" "298203" "298933"
      IEA      ISO      IEA      ISO      IEA      ISO      IBA      IEA
 "299976" "299976" "303185" "303185" "303194" "303194" "303369" "303369"
      ISO      IBA      IEA      ISO      IEA      ISO      IEA      ISO
 "303369" "303612" "303612" "303612" "309441" "309441" "309957" "309957"
      ISO      IMP      IEA      ISO      IEA      ISO      IEA      ISO
 "312559" "312647" "315219" "315219" "360463" "360463" "360481" "360481"
      IBA      IEA      ISO      IEA      IEA      IEA      ISO      IEA
 "361147" "361147" "361147" "361836" "367645" "501039" "501039" "502988"
      ISO      IEA      ISO
 "502988" "691431" "691431"

Now, when I try to filter for the genes that I have in my list I get no matches.

> unlist(sapply(gsets, function(gset) intersect(rownames(AverageDataSet), unique(as.character(gset)))))
 character(0)

I'm truly lost here. I'm not even sure how data is stored in the gsets list nor even if the genes names are in that list for that matter. I'm sure I'm missing something very basic here but can't see what. Could anyone give me a hand with this issue??

Best regards, Jose.

org.Rn.eg.db annotation go • 833 views

ADD COMMENT • link updated 4.1 years ago by James W. MacDonald 65k • written 4.1 years ago by jose.wo • 0

score 1 · Answer 1 · 2020-03-10

First you should note that the rownames of your AverageDataSet are gene symbols, and the values in your 'gsets' object are Gene IDs. So there is no reason that you should think that there is any intersection between the two! I mean, your example shows that one is a set of things like 'Emc2` and the other has quoted numbers like "24842". So trying to get the intersection seems fruitless?

Also, converting things to a list like that is not really how you should do things. People do that, but it's really just a holdover from like the middle of the last decade. These days you should use either select or mapIds. In this situation I would use the latter

> z <- c("Emc2",   "GLTP",   "Nceh1",  "Erlin2", "Atp9b",  "Tep1")
> zz <- mapIds(org.Rn.eg.db, z, "GOALL", "SYMBOL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> sapply(zz, length)
  Emc2   GLTP  Nceh1 Erlin2  Atp9b   Tep1 
   101      1    113    434    140    285 
> lapply(zz, head)
$Emc2
[1] "GO:0005575" "GO:0005575" "GO:0005575" "GO:0005575" "GO:0005622"
[6] "GO:0005622"

$GLTP
[1] NA

$Nceh1
[1] "GO:0003674" "GO:0003674" "GO:0003824" "GO:0003824" "GO:0005488"
[6] "GO:0005488"

$Erlin2
[1] "GO:0003674" "GO:0003674" "GO:0003674" "GO:0005488" "GO:0005488"
[6] "GO:0005488"

$Atp9b
[1] "GO:0000166" "GO:0000287" "GO:0003674" "GO:0005488" "GO:0005524"
[6] "GO:0005575"

$Tep1
[1] "GO:0000166" "GO:0000722" "GO:0000722" "GO:0000722" "GO:0000723"
[6] "GO:0000723"

So now I have a list, where the list names are the gene symbols, and the contents are all the GO terms that are directly appended to that gene, plus all offspring terms. Given the nature of the GO DAG, any offspring term is also associated with the gene. But you might want just the direct terms.

> zzz <- mapIds(org.Rn.eg.db, z, "GO", "SYMBOL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> sapply(zzz, length)
  Emc2   GLTP  Nceh1 Erlin2  Atp9b   Tep1 
     9      1     15     24     14     29 
> zzz
$Emc2
[1] "GO:0005634" "GO:0005737" "GO:0005737" "GO:0005783" "GO:0005783"
[6] "GO:0034975" "GO:0072546" "GO:0072546" "GO:0072546"

$GLTP
[1] NA

$Nceh1
 [1] "GO:0005783" "GO:0006470" "GO:0006470" "GO:0006805" "GO:0006805"
 [6] "GO:0016020" "GO:0016021" "GO:0016042" "GO:0017171" "GO:0017171"
[11] "GO:0042301" "GO:0042301" "GO:0052689" "GO:0060395" "GO:0060395"

$Erlin2
 [1] "GO:0005783" "GO:0005789" "GO:0005789" "GO:0005829" "GO:0005829"
 [6] "GO:0005886" "GO:0005886" "GO:0008203" "GO:0015485" "GO:0016021"
[11] "GO:0030433" "GO:0030433" "GO:0031625" "GO:0031625" "GO:0032933"
[16] "GO:0032933" "GO:0032991" "GO:0032991" "GO:0045121" "GO:0045121"
[21] "GO:0045541" "GO:0045541" "GO:0045717" "GO:0045717"

$Atp9b
 [1] "GO:0000287" "GO:0005524" "GO:0005768" "GO:0005802" "GO:0005802"
 [6] "GO:0005802" "GO:0005886" "GO:0006890" "GO:0006897" "GO:0016021"
[11] "GO:0045332" "GO:0045332" "GO:0048471" "GO:0048471"

$Tep1
 [1] "GO:0000722" "GO:0000722" "GO:0000722" "GO:0000781" "GO:0002039"
 [6] "GO:0002039" "GO:0003720" "GO:0003720" "GO:0003720" "GO:0003720"
[11] "GO:0003723" "GO:0003723" "GO:0005524" "GO:0005697" "GO:0005697"
[16] "GO:0005697" "GO:0005697" "GO:0005737" "GO:0005737" "GO:0006278"
[21] "GO:0006278" "GO:0016363" "GO:0016363" "GO:0019899" "GO:0019899"
[26] "GO:0070034" "GO:0070034" "GO:0070034" "GO:1990904"

You can also get information about those GO terms

> library(GO.db)

> select(GO.db, zzz[[1]], "TERM")
'select()' returned many:1 mapping between keys and columns
        GOID                                     TERM
1 GO:0005634                                  nucleus
2 GO:0005737                                cytoplasm
3 GO:0005737                                cytoplasm
4 GO:0005783                    endoplasmic reticulum
5 GO:0005783                    endoplasmic reticulum
6 GO:0034975 protein folding in endoplasmic reticulum
7 GO:0072546              ER membrane protein complex
8 GO:0072546              ER membrane protein complex
9 GO:0072546              ER membrane protein complex

See the vignette for AnnotationDbi for more information.