Extract the relevant information of the genes sets from the Bioconductor annotation package
1
0
Entering edit mode
jose.wo • 0
@josewo-23070
Last seen 3.1 years ago
Spain

Hello Bioconductor comunity.

I'm new with Bioconductor so probably my question is very basic. I have a list of gene names from a proteomic analysis we've run in the lab. I need to obtain the GO annotations of the genes (from Rattus norvergicus) I have in my list.

The rownames of my expression set object is:

> Gene names
head(rownames(AverageDataSet))
[1] "Emc2"   "GLTP"   "Nceh1"  "Erlin2" "Atp9b"  "Tep1"

I download Bioconductor annotations as follows:

gsets = AnnotationDbi::as.list(org.Rn.egGO2ALLEGS)
gsets[[1]]
      IEA      ISO      IEA      ISO      IEA      ISO      IBA      TAS
  "24842"  "24842"  "25591"  "25591"  "29414"  "29414"  "54304"  "83474"
      IEA      ISO      IBA      ISO      ISO      ISS      ISO      IEA
  "83516"  "83516"  "85472"  "85472" "171116" "171116" "291824" "294388"
      ISO      IBA      IBA      IEA      ISO      ISO      ISS      ISO
 "294388" "295237" "296200" "296200" "296200" "298203" "298203" "298933"
      IEA      ISO      IEA      ISO      IEA      ISO      IBA      IEA
 "299976" "299976" "303185" "303185" "303194" "303194" "303369" "303369"
      ISO      IBA      IEA      ISO      IEA      ISO      IEA      ISO
 "303369" "303612" "303612" "303612" "309441" "309441" "309957" "309957"
      ISO      IMP      IEA      ISO      IEA      ISO      IEA      ISO
 "312559" "312647" "315219" "315219" "360463" "360463" "360481" "360481"
      IBA      IEA      ISO      IEA      IEA      IEA      ISO      IEA
 "361147" "361147" "361147" "361836" "367645" "501039" "501039" "502988"
      ISO      IEA      ISO
 "502988" "691431" "691431"

Now, when I try to filter for the genes that I have in my list I get no matches.

> unlist(sapply(gsets, function(gset) intersect(rownames(AverageDataSet), unique(as.character(gset)))))
 character(0)

I'm truly lost here. I'm not even sure how data is stored in the gsets list nor even if the genes names are in that list for that matter. I'm sure I'm missing something very basic here but can't see what. Could anyone give me a hand with this issue??

Best regards, Jose.

org.Rn.eg.db annotation go • 1.0k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States

First you should note that the rownames of your AverageDataSet are gene symbols, and the values in your 'gsets' object are Gene IDs. So there is no reason that you should think that there is any intersection between the two! I mean, your example shows that one is a set of things like 'Emc2` and the other has quoted numbers like "24842". So trying to get the intersection seems fruitless?

Also, converting things to a list like that is not really how you should do things. People do that, but it's really just a holdover from like the middle of the last decade. These days you should use either select or mapIds. In this situation I would use the latter

> z <- c("Emc2",   "GLTP",   "Nceh1",  "Erlin2", "Atp9b",  "Tep1")
> zz <- mapIds(org.Rn.eg.db, z, "GOALL", "SYMBOL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> sapply(zz, length)
  Emc2   GLTP  Nceh1 Erlin2  Atp9b   Tep1 
   101      1    113    434    140    285 
> lapply(zz, head)
$Emc2
[1] "GO:0005575" "GO:0005575" "GO:0005575" "GO:0005575" "GO:0005622"
[6] "GO:0005622"

$GLTP
[1] NA

$Nceh1
[1] "GO:0003674" "GO:0003674" "GO:0003824" "GO:0003824" "GO:0005488"
[6] "GO:0005488"

$Erlin2
[1] "GO:0003674" "GO:0003674" "GO:0003674" "GO:0005488" "GO:0005488"
[6] "GO:0005488"

$Atp9b
[1] "GO:0000166" "GO:0000287" "GO:0003674" "GO:0005488" "GO:0005524"
[6] "GO:0005575"

$Tep1
[1] "GO:0000166" "GO:0000722" "GO:0000722" "GO:0000722" "GO:0000723"
[6] "GO:0000723"

So now I have a list, where the list names are the gene symbols, and the contents are all the GO terms that are directly appended to that gene, plus all offspring terms. Given the nature of the GO DAG, any offspring term is also associated with the gene. But you might want just the direct terms.

> zzz <- mapIds(org.Rn.eg.db, z, "GO", "SYMBOL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> sapply(zzz, length)
  Emc2   GLTP  Nceh1 Erlin2  Atp9b   Tep1 
     9      1     15     24     14     29 
> zzz
$Emc2
[1] "GO:0005634" "GO:0005737" "GO:0005737" "GO:0005783" "GO:0005783"
[6] "GO:0034975" "GO:0072546" "GO:0072546" "GO:0072546"

$GLTP
[1] NA

$Nceh1
 [1] "GO:0005783" "GO:0006470" "GO:0006470" "GO:0006805" "GO:0006805"
 [6] "GO:0016020" "GO:0016021" "GO:0016042" "GO:0017171" "GO:0017171"
[11] "GO:0042301" "GO:0042301" "GO:0052689" "GO:0060395" "GO:0060395"

$Erlin2
 [1] "GO:0005783" "GO:0005789" "GO:0005789" "GO:0005829" "GO:0005829"
 [6] "GO:0005886" "GO:0005886" "GO:0008203" "GO:0015485" "GO:0016021"
[11] "GO:0030433" "GO:0030433" "GO:0031625" "GO:0031625" "GO:0032933"
[16] "GO:0032933" "GO:0032991" "GO:0032991" "GO:0045121" "GO:0045121"
[21] "GO:0045541" "GO:0045541" "GO:0045717" "GO:0045717"

$Atp9b
 [1] "GO:0000287" "GO:0005524" "GO:0005768" "GO:0005802" "GO:0005802"
 [6] "GO:0005802" "GO:0005886" "GO:0006890" "GO:0006897" "GO:0016021"
[11] "GO:0045332" "GO:0045332" "GO:0048471" "GO:0048471"

$Tep1
 [1] "GO:0000722" "GO:0000722" "GO:0000722" "GO:0000781" "GO:0002039"
 [6] "GO:0002039" "GO:0003720" "GO:0003720" "GO:0003720" "GO:0003720"
[11] "GO:0003723" "GO:0003723" "GO:0005524" "GO:0005697" "GO:0005697"
[16] "GO:0005697" "GO:0005697" "GO:0005737" "GO:0005737" "GO:0006278"
[21] "GO:0006278" "GO:0016363" "GO:0016363" "GO:0019899" "GO:0019899"
[26] "GO:0070034" "GO:0070034" "GO:0070034" "GO:1990904"

You can also get information about those GO terms

> library(GO.db)

> select(GO.db, zzz[[1]], "TERM")
'select()' returned many:1 mapping between keys and columns
        GOID                                     TERM
1 GO:0005634                                  nucleus
2 GO:0005737                                cytoplasm
3 GO:0005737                                cytoplasm
4 GO:0005783                    endoplasmic reticulum
5 GO:0005783                    endoplasmic reticulum
6 GO:0034975 protein folding in endoplasmic reticulum
7 GO:0072546              ER membrane protein complex
8 GO:0072546              ER membrane protein complex
9 GO:0072546              ER membrane protein complex

See the vignette for AnnotationDbi for more information.

ADD COMMENT
0
Entering edit mode

Also, do note that GLTP isn't a gene symbol for rat, but is a gene symbol for human. You need to convert to the correct format for things to work correctly (and shouldn't rely on gene symbols anyway, but people tend to do so...)

> z2 <- sapply(strsplit(z, ""), function(x) paste0(x[1], paste(tolower(x[-1]), collapse = "")))
> z2
[1] "Emc2"   "Gltp"   "Nceh1"  "Erlin2" "Atp9b"  "Tep1"  
> zz <- mapIds(org.Rn.eg.db, z2, "GOALL", "SYMBOL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> sapply(zz, length)
  Emc2   Gltp  Nceh1 Erlin2  Atp9b   Tep1 
   101     98    113    434    140    285 
ADD REPLY
0
Entering edit mode

Thank you very much James!! Your answer indeed solved my problem and I also got to understand the logic of it. Best, Jose.

ADD REPLY

Login before adding your answer.

Traffic: 729 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6