topGO - Incorrect mapping of genes to GO terms
lehallib • 0
@lehallib-14965

I use topGO a lot for GO analysis, and I'm worried about the accuracy of its gene-to-GO-term mapping.

As an example, 2600+ genes are associated with a given GO term (e.g. "GO:1903561", extracellular vesicle) when using the topGO annFUN.org mapping with the org.Mm.eg.db database.

When I search the org.Mm.eg.db database manually for the same term, I get only 49 genes, which is far fewer.

What could explain these differences?

Thanks in advance.

set.seed(1234)
library(org.Mm.eg.db)
library(DBI)
library(topGO)

# Select a random list of gene symbols
x <- unique(unlist(as.list(org.Mm.egSYMBOL)))
names(x) <- x
genesOfInterest <- sample(x, 2000, replace = FALSE)

# Format this list for topGO: a named factor (1 = gene of interest, 0 = background)
geneList <- factor(as.integer(x %in% genesOfInterest))
names(geneList) <- x
table(geneList)

# Create the topGO object
GOdata_CC <-
  new(
    "topGOdata",
    ontology = "CC",
    allGenes = geneList,
    description = "Test",
    annot = annFUN.org,
    mapping = "org.Mm.eg.db",
    ID = "SYMBOL"
  )

# Number of genes topGO associates with the "extracellular vesicle" GO term, GO:1903561
length(genesInTerm(GOdata_CC, "GO:1903561")[[1]])

# Comparison with a manual search in the org.Mm.eg.db package
anno <- AnnotationDbi::select(org.Mm.eg.db,
                              keys = "GO:1903561",
                              columns = c("SYMBOL", "GO"),
                              keytype = "GO")
unique(anno$GO)
dim(anno)  # number of rows returned (not necessarily unique genes)

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] topGO_2.28.0         SparseM_1.77         GO.db_3.4.1          graph_1.54.0        
 [5] DBI_0.7              org.Mm.eg.db_3.4.1   AnnotationDbi_1.38.2 IRanges_2.10.5      
 [9] S4Vectors_0.14.7     Biobase_2.36.2       BiocGenerics_0.22.1 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15       bit_1.1-12         lattice_0.20-35    rlang_0.1.6        blob_1.1.0        
 [6] tools_3.4.1        grid_3.4.1         matrixStats_0.53.0 bit64_0.9-7        digest_0.6.15     
[11] tibble_1.4.2       memoise_1.1.0      RSQLite_2.0        compiler_3.4.1     pillar_1.1.0      
[16] pkgconfig_2.0.1   

@james-w-macdonald-5106

You are looking at the genes that have a direct mapping to that GO term, whereas topGO counts all genes that map directly to that term as well as all genes that map to any of its descendant (child) terms.

Put another way, topGO in effect uses GOALL, whereas you are using GO:

> nrow(select(org.Mm.eg.db, "GO:1903561", "ENTREZID", "GOALL"))
'select()' returned 1:many mapping between keys and columns
[1] 2636
> nrow(select(org.Mm.eg.db, "GO:1903561", "ENTREZID", "GO"))
'select()' returned 1:many mapping between keys and columns
[1] 53
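
To make the direct-versus-inherited distinction concrete, here is a minimal base-R sketch on a toy hierarchy. The term IDs, gene names, and helper functions (descendants, genes_in_term) are all made up for illustration. They are not part of topGO or org.Mm.eg.db, and this is not how topGO is implemented internally. It only shows why counting a term together with its descendants gives a larger gene set than counting the term alone.

```r
# Toy illustration only: term IDs and gene names are hypothetical.
# Each entry lists the parent term(s) of a term in a tiny GO-like hierarchy.
parents <- list(
  "T:vesicle"       = character(0),
  "T:exosome"       = "T:vesicle",
  "T:microvesicle"  = "T:vesicle",
  "T:exosome_lumen" = "T:exosome"
)

# Direct annotations: genes annotated to exactly that term (analogous to GO).
direct <- list(
  "T:vesicle"       = c("geneA"),
  "T:exosome"       = c("geneB", "geneC"),
  "T:microvesicle"  = c("geneC", "geneD"),
  "T:exosome_lumen" = c("geneE")
)

# All descendants of a term: children, grandchildren, and so on.
descendants <- function(term, parents) {
  is_child <- vapply(parents, function(p) term %in% p, logical(1))
  children <- names(parents)[is_child]
  unique(c(children, unlist(lapply(children, descendants, parents = parents))))
}

# Genes annotated to the term or to any descendant (analogous to GOALL).
genes_in_term <- function(term, direct, parents) {
  terms <- c(term, descendants(term, parents))
  sort(unique(unlist(direct[terms], use.names = FALSE)))
}

n_direct <- length(direct[["T:vesicle"]])
n_all    <- length(genes_in_term("T:vesicle", direct, parents))
cat("direct:", n_direct, " including descendants:", n_all, "\n")
```

In this toy case the parent term has one direct gene but five once its descendants are included, which mirrors the 53 versus 2636 rows above. Note that the nrow() calls above count rows, so for a count of distinct genes it may be safer to wrap the ENTREZID column in unique() first.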