Question

How to use topGO with gene symbols extracted from Illumina probes

Hi all, I have a named vector of p-values, with gene symbols as the names, extracted previously from an Illumina microarray. I'm wondering how to create the topGOdata object with the proper annotation call. So far I have something like this:

glist <- ko_pk[,4] # these are p-values
names(glist) <- row.names(ko_pk)

sum(topDiffGenes(glist))

sampleGOdata <- new("topGOdata",
description = "Simple session", ontology = "BP",
allGenes = glist, geneSel = topDiffGenes,
nodeSize = 10,
?? annot = annFUN.org, ??)

thanks in advance.

Ahdee


Written 8.5 years ago by Ahdee • updated 8.5 years ago by James W. MacDonald

score 2 · Answer 1 · 2015-10-09

You are making things more difficult for yourself. Rather than coming up with a vector of p-values with HUGO gene symbols as the names, you should be using the Illumina IDs as names, and using annFUN.db, just like in the vignette. That way you can just follow along with code that makes sense.
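For reference, here is a minimal sketch of the probe-ID route, using the vignette's own data so that it runs as-is. For Illumina data you would swap in your own probe-named p-value vector and the annotation package for your array. The package name mentioned in the comment (illuminaHumanv4.db) is only an assumed example, and I am assuming it behaves with annFUN.db the same way the Affymetrix package does.

```r
library(topGO)
library(hgu95av2.db)

# Vignette data: p-values named by hgu95av2 probe IDs.
# data(geneList) also loads the selection function topDiffGenes.
data(geneList)

# ASSUMPTION: for an Illumina array, replace this with the matching
# annotation package (e.g. "illuminaHumanv4.db") and use a p-value
# vector whose names are the Illumina probe IDs.
chip_pkg <- "hgu95av2.db"

GOdata <- new("topGOdata",
              description = "probe IDs as names",
              ontology = "BP",
              allGenes = geneList,       # named numeric vector of p-values
              geneSel = topDiffGenes,    # marks which genes count as significant
              nodeSize = 10,
              annot = annFUN.db,         # annotation via the chip package
              affyLib = chip_pkg)
```

The only pieces that change between platforms are the names on the p-value vector and the value passed as affyLib.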

You could use the vector you have, but the help page for annFUN.org() is not very helpful. So I can show you how to use your vector, but without explaining how or why I know this is the way to do it. I will use the data from the vignette as an example.

## load stuff

> library(topGO)
> data(geneList)
> library(hgu95av2.db)

## we need a vector like yours, so do some stuff
> z <- select(hgu95av2.db, names(geneList), "SYMBOL")
'select()' returned 1:many mapping between keys and columns
> z <- z[!duplicated(z[,1]),]
> geneList2 <- geneList
> names(geneList2) <- z[,2]

## the original geneList
> head(geneList)
1095_s_at   1130_at   1196_at 1329_s_at 1340_s_at 1342_g_at
1.0000000 1.0000000 0.6223795 0.5412240 1.0000000 1.0000000

## something similar to what you have
> head(geneList2)
      HGF    MAP2K1      RCC1     TERF1       HGF     TERF1
1.0000000 1.0000000 0.6223795 0.5412240 1.0000000 1.0000000

## build the topGOdata object from the symbol-named vector
> sampleGOdata <- new("topGOdata", description = "whatevs", ontology = "BP",
+                     allGenes = geneList2, geneSel = topDiffGenes, nodeSize = 10,
+                     annot = annFUN.org, ID = "alias", mapping = "org.Hs.eg")

Building most specific GOs .....    ( 1566 GO terms found. )

Build GO DAG topology ..........    ( 4215 GO terms and 9916 relations. )

Annotating nodes ...............    ( 225 genes annotated to the GO terms. )

> resultFisher <- runTest(sampleGOdata, "classic","fisher")

             -- Classic Algorithm --

         the algorithm is scoring 776 nontrivial nodes
         parameters:
             test statistic:  fisher
> resultFisher

Description: whatevs
Ontology: BP
'classic' algorithm with the 'fisher' test
797 GO terms scored: 11 terms with p < 0.01
Annotation data:
    Annotated genes: 310
    Significant genes: 46
    Min. no. of genes annotated to a GO: 10
    Nontrivial nodes: 776
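
In the new() call above, ID and mapping are extra arguments that get handed on to the annotation function. One way to see what they do is to call annFUN.org() directly, which is what this sketch does. It also tries ID = "symbol", which the annFUN.org() help page lists alongside "alias"; whether that gives a better match for your own symbols is something you would have to check. The four symbols are just the ones from head(geneList2).

```r
library(topGO)
library(org.Hs.eg.db)

# annFUN.org() returns a list keyed by GO ID; each element holds the
# gene identifiers annotated to that term, in the ID type requested.
go2alias  <- annFUN.org("BP", mapping = "org.Hs.eg.db", ID = "alias")
go2symbol <- annFUN.org("BP", mapping = "org.Hs.eg.db", ID = "symbol")

# Flip the mapping to gene -> GO terms so we can look up our own names.
symbol2go <- inverseList(go2symbol)

# Example symbols taken from head(geneList2) in the transcript above.
my_symbols <- c("HGF", "MAP2K1", "RCC1", "TERF1")

# TRUE where a symbol has at least one BP annotation under ID = "symbol".
my_symbols %in% names(symbol2go)
```

Any of your names that come back FALSE here will be dropped when the topGOdata object is built.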

Note that I get fewer GO terms this way (compare to the results on page 4 of the vignette), which is probably because gene symbols are really not useful for most data analysis: they are not unique (note the repeated HGF and TERF1 names in head(geneList2) above). If you want to do things 'the right way', you will instead rely on actual IDs like the Illumina IDs, or Entrez Gene or Ensembl IDs, which are more likely to be unique.
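
If you are stuck with a symbol-named vector anyway, it is worth at least checking it for repeated names before handing it to topGO. The sketch below uses a toy vector copied from the head(geneList2) output and collapses duplicates by keeping the smallest p-value per symbol. That collapsing rule is my assumption for illustration, not something topGO requires or the vignette recommends; other rules (e.g. the median) are equally defensible.

```r
# Toy vector mirroring head(geneList2): two symbols appear twice.
pv <- c(HGF = 1, MAP2K1 = 1, RCC1 = 0.6223795,
        TERF1 = 0.5412240, HGF = 1, TERF1 = 1)

# Which symbols occur more than once?
dup_symbols <- unique(names(pv)[duplicated(names(pv))])
dup_symbols

# ASSUMPTION: keep the minimum p-value for each symbol.
# split() groups the values by name; vapply() returns a named numeric vector.
collapsed <- vapply(split(pv, names(pv)), min, numeric(1))

# Names are now unique, so each gene is counted once.
anyDuplicated(names(collapsed)) == 0
```

The collapsed vector can then be passed as allGenes in place of the original one.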