Question

topGO annotations issue

0

Entering edit mode

willmacnair • 0

@willmacnair-7054

Last seen 8.1 years ago

Switzerland

Hi
I am trying to run a GO enrichment test on genes identified in experiments on an
affymetrix hgu133plus2 chip. I get zero annotations, even though within hgu133plus2.db
I can find annotations for the genes of interest. I've spent a while trying to get it to
work but with no success. What (no doubt very basic thing) am I doing wrong?

Best
Will

library_calls <- function() {
    library('hgu133plus2.db')
    library('topGO')
}

get_gene_universe <- function(affy_library) {

    # get all entrez ids associated with this chip
    gene_universe     = as.integer(keys(eval(parse(text = affy_library)), keytype='ENTREZID'))

    return(gene_universe)
}

probesets_to_entrez <- function(probeset_keys, affy_library) {

    # get entrez ids for each of these 
    entrez_ids_df     = select(eval(parse(text = affy_library)), probeset_keys, 'ENTREZID', 'PROBEID')

    # return unique
    entrez_ids         = unique(entrez_ids_df$ENTREZID)

    # remove any NAs
    entrez_ids         = as.integer(entrez_ids[!is.na(entrez_ids)])

    return(entrez_ids)
}

find_GO_terms_for_probesets <- function(probeset_keys, affy_library) {
    # get GO ids for each of these 
    GO_ids_df         = select(eval(parse(text = affy_library)), probeset_keys, 'GO','PROBEID')

    # restrict to BP
    GO_ids             = unique(GO_ids_df[GO_ids_df$ONTOLOGY == 'BP', 'GO'])

    # check for annotations
    select(GO.db, keys = GO_ids, columns = 'DEFINITION', keytype = 'GOID')
}

# admin
library_calls()

# define library
affy_library     = 'hgu133plus2.db'
# define test keys
probeset_keys = c('201060_x_at', '213558_at', '223294_at', '201061_s_at', '202976_s_at', '224177_s_at')

# get gene universe in entrez ids
gene_universe     = get_gene_universe(affy_library)

# get genes of interest in entrez ids
gene_list         = probesets_to_entrez(probeset_keys, affy_library)

# define inputs to topGOdata creator
gene_vector         = factor(as.integer(gene_universe %in% gene_list))
names(gene_vector)     = gene_universe

# check that GO terms exist
find_GO_terms_for_probesets(probeset_keys, affy_library)
#         GOID
# 1  GO:0051260  The process of creating protein oligomers, compounds composed of a small number, usually between three and ten, of identical component monomers. Oligomers may be formed by the polymerization of a number of monomers or the depolymerization of a large protein polymer.
# 2  GO:1901585                                                                                                                                                                              Any process that modulates the frequency, rate or extent of acid-sensing ion channel activity.
# 3  GO:0007010                                                                                                             A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of cytoskeletal structures.
# ...

# create new topGOdata object
GOdata_BP = new("topGOdata", 
    ontology         = 'BP', 
    allGenes         = gene_vector, 
    annotationFun     = annFUN.db, 
    affyLib         = affy_library
    )

# Fails:
# Building most specific GOs .....    ( 0 GO terms found. )

# Build GO DAG topology ..........    ( 0 GO terms and 0 relations. )
# Error in if (is.na(index) || index < 0 || index > length(nd)) stop("vertex is not in graph: ",  : 
#   missing value where TRUE/FALSE needed

sessionInfo()
# R version 3.1.2 (2014-10-31)
# Platform: x86_64-w64-mingw32/x64 (64-bit)

# locale:
# [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252    LC_MONETARY=German_Switzerland.1252
# [4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.1252    

# attached base packages:
# [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
#  [1] topGO_2.18.0         SparseM_1.05         GO.db_3.0.0          graph_1.44.0         hgu133plus2.db_3.0.0
#  [6] org.Hs.eg.db_3.0.0   RSQLite_1.0.0        DBI_0.3.1            AnnotationDbi_1.28.1 GenomeInfoDb_1.2.2  
# [11] IRanges_2.0.0        S4Vectors_0.4.0      Biobase_2.26.0       BiocGenerics_0.12.0 

# loaded via a namespace (and not attached):
# [1] grid_3.1.2      lattice_0.20-29 tools_3.1.2

topgo annotationdbi • 2.9k views

ADD COMMENT • link 9.2 years ago willmacnair • 0

score 0 · Answer 1 · 2015-02-18

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

I believe you are missing the genSel argument when you are creating your topGO object (which IIRC should be your gene_vector, and allGenes should be your gene_universe).

ADD COMMENT • link 9.2 years ago James W. MacDonald 65k

0

Entering edit mode

So I have now tried

GOdata_BP = new("topGOdata",
    ontology         = 'BP',
    allGenes         = gene_universe,
    geneSel         = gene_vector,
    annotationFun     = annFUN.db,
    affyLib         = affy_library
    )

however I get the following error:

Error in .local(.Object, ...) : allGenes must be a named vector

I tried to correct for this with

names(gene_universe) = gene_universe

but I then get a different error:

Error in checkAtAssignment("topGOdata", "geneSelectionFun", "factor") :
  assignment of an object of class “factor” is not valid for @‘geneSelectionFun’ in an object of class “topGOdata”; is(value, "function") is not TRUE

I would love to check all of this with the topGO documentation, but the reference manual is sadly
so sparsely filled in that there is effectively no more detail than you get in a vignette.

ADD REPLY • link 9.2 years ago willmacnair • 0

0

Entering edit mode

I think I have set you wrong. The geneSel argument actually appears to be some function used to decide which of your allGenes are the selected genes. In other words

genSel(gene_universe)

should return a boolean vector, where TRUE indicates that a particular gene is in the selected gene set, and FALSE. As an example,

> library(topGO)

> data(geneList)

> topDiffGenes
function (allScore)
{
return(allScore < 0.01)
}

> head(geneList)
1095_s_at 1130_at 1196_at 1329_s_at 1340_s_at 1342_g_at
1.0000000 1.0000000 0.6223795 0.5412240 1.0000000 1.0000000

So in the example in the vignette, you pass in the geneList vector as the allGenes argument, and topDiffGenes() as the geneSel argument.

Sorry to have mixed you up - I in general use GOstats for this sort of analysis, as I find it to be much less confusing (but I have made several contributions to that package, so go figure...).

ADD REPLY • link 9.2 years ago James W. MacDonald 65k

score 0 · Answer 2 · 2015-02-19

0

Entering edit mode

willmacnair • 0

@willmacnair-7054

Last seen 8.1 years ago

Switzerland

OK, thanks for trying to help! I think what you've now suggested was what I was originally doing in the example I posted - I'd previously had success using gene_vector as a boolean vector, converted to a factor, but with a different chip.

Thanks for the suggestion of GOstats. I'll try anything which seems likely to work...

ADD COMMENT • link 9.2 years ago willmacnair • 0