Question

How to get GOSemSim to work with custom packages built with AnnotationForge


Cei • 0

@cei-23383

Last seen 4.0 years ago

Langebio - Mexico

Dear Guangchuang,

I would really like to use your GOSemSim package to calculate similarities between GO terms for several custom genome annotations I work with (parasitic nematodes, butterflies, etc.). It sounded possible: just use AnnotationForge to build an org.Xxx.eg.db package, and then it should work. However, GOSemSim seems to expect ENTREZID to be the central ID for all the annotation, while AnnotationForge, to keep things generic, uses something called GID. Even adding ENTREZID as an extra field does not seem to get GOSemSim to work.

I have described the problem in another post, thinking perhaps the problem was with AnnotationForge: [Question: AnnotationForge not working for building custom org packages][1]

A quick recap of a worked example that highlights the problem: build the example package for makeOrgPackage, install it, then call godata() on the new package, which fails. Code follows:

library(AnnotationForge)
example(makeOrgPackage)
install.packages("./org.Tguttata.eg.db",type = "source", repos = NULL)
library(org.Tguttata.eg.db)
library(GOSemSim)
tgGO <- godata('org.Tguttata.eg.db', ont="BP")

Error in testForValidKeytype(x, keytype) : Invalid keytype: ENTREZID. Please use the keytypes method to see a listing of valid arguments.

Since I seem to have reached a dead end, I'm opening a new question to you. Hopefully there might be a way of accepting a user specified key instead of ENTREZID?

Many thanks,

Cei

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GOSemSim_2.12.1        org.Tguttata.eg.db_0.1 AnnotationForge_1.28.0 AnnotationDbi_1.48.0   IRanges_2.20.2        
[6] S4Vectors_0.24.4       Biobase_2.46.0         BiocGenerics_0.32.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4      GO.db_3.10.0    XML_3.99-0.3    digest_0.6.25   bitops_1.0-6    DBI_1.1.0       RSQLite_2.2.0  
 [8] rlang_0.4.5     blob_1.2.1      vctrs_0.2.4     tools_3.6.2     bit64_0.9-7     RCurl_1.98-1.2  bit_1.1-15.2   
[15] compiler_3.6.2  pkgconfig_2.0.3 memoise_1.1.0  

  [1]: https://support.bioconductor.org/p/130160/

Guangchuang Yu GOSemSim AnnotationForge • 1.3k views

Updated 4.0 years ago by Kevin Blighe ★ 3.9k • written 4.0 years ago by Cei • 0

score 1 · Answer 1 · 2020-04-21

Edit April 27, 2020: another workaround was posted here:

https://support.bioconductor.org/p/130441/#130452

------------

Hey, this is an unfortunate circumstance where two different packages are not working in harmony.

You can attempt to get around the initial error by passing the keytype explicitly, but that throws a different error:

tgGO <- godata('org.Tguttata.eg.db', ont="BP", keytype = 'GID')

preparing gene to GO mapping data...
Error in FUN(X[[i]], ...) : 2
  Two fields in the source DB have the same name.

That new error is being thrown by AnnotationDbi as a result of a call from inside the GOSemSim::godata() function. Here are the lines causing the error, in order of when they are called:

You can get around it manually, to some degree. The keys in the org.Tguttata.eg.db database that are causing the errors are GO, ONTOLOGY, and EVIDENCE, as each is a prefix of another key (GOALL, ONTOLOGYALL, and EVIDENCEALL).

keytypes(org.Tguttata.eg.db)
 [1] "CHROMOSOME"  "EVIDENCE"    "EVIDENCEALL" "GENENAME"    "GID"        
 [6] "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "SYMBOL"
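The overlap between these names can be checked with a short base-R sketch. It uses only the keytypes listed above; the variable names (`kt`, `longer_matches`, `clashing`) are made up for illustration. It demonstrates the name overlap only, and does not show that this is what AnnotationDbi trips over internally.

```r
# Keytypes copied from the keytypes(org.Tguttata.eg.db) output above
kt <- c("CHROMOSOME", "EVIDENCE", "EVIDENCEALL", "GENENAME", "GID",
        "GO", "GOALL", "ONTOLOGY", "ONTOLOGYALL", "SYMBOL")

# For each keytype, list the other keytypes whose names start with it
longer_matches <- lapply(setNames(kt, kt), function(k) {
  setdiff(kt[startsWith(kt, k)], k)
})

# Keep only the keytypes that are a prefix of at least one other keytype
clashing <- Filter(length, longer_matches)
print(clashing)
```

On this list, the only keytypes that are prefixes of another keytype are EVIDENCE, GO, and ONTOLOGY, each paired with its ...ALL counterpart.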

We can get it working by not selecting those keys:

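# load_OrgDb() is a GOSemSim helper that returns the OrgDb object
OrgDb <- load_OrgDb(org.Tguttata.eg.db)

# All gene identifiers, using GID as the central key
kk <- keys(OrgDb, keytype = 'GID')

# Select every column except the ones that trigger the error
head(select(OrgDb,
  keys = kk,
  keytype = 'GID',
  columns = setdiff(columns(OrgDb), c('GO', 'GOALL', 'ONTOLOGY', 'EVIDENCE'))))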

'select()' returned 1:many mapping between keys and columns
     GID CHROMOSOME EVIDENCEALL
1 751582          4         IEA
2 751582          4         IEA
3 751582          4         IEA
4 751583          2         IEA
5 751584          5         IEA
6 751584          5         IEA
                                                  GENENAME ONTOLOGYALL SYMBOL
1 synuclein, alpha (non A4 component of amyloid precursor)          BP   SNCA
2 synuclein, alpha (non A4 component of amyloid precursor)          CC   SNCA
3 synuclein, alpha (non A4 component of amyloid precursor)          MF   SNCA
4                                        neurocalcin delta          MF  NCALD
5                        brain-derived neurotrophic factor          BP   BDNF
6                        brain-derived neurotrophic factor          CC   BDNF

This looks like something that the author(s) will have to fix.

Kevin