hi Vince,
this key generator would be a good way to provide an easy solution for
the user who doesn't know how to use string-matching tools such as
grep, especially if the function were smart enough to generate some
sensible partial-matching strings. i understand that we would still
need to access the GO terms as keys in GO.db.

in the meantime, note that the approximate matching problem is just
the same as with, let's say, GENENAME in org.Hs.eg.db. in
org.Hs.eg.db, however, GENENAME is a key, and therefore it is
currently up to the regular-expression skills of the user to find a
way to match the desired string. with GENENAME as key i can, for
instance, quickly interrogate which genes are nuclear receptors using
select:

allkeys <- keys(org.Hs.eg.db, keytype="GENENAME")
select(org.Hs.eg.db, keys=allkeys[grep("nuclear receptor", allkeys)],
       cols="SYMBOL", keytype="GENENAME")

but i cannot do the same to pull, let's say, "RNA binding" genes
using GO.
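
for illustration, here is a minimal sketch of what such a key generator
could look like. the name got2i comes from Vince's message below, but
its arguments (terms, exact, ignore.case) and the toy term table are
assumptions made up for this example, not an existing AnnotationDbi or
GO.db function:

```r
# Hypothetical helper: map a search string to GO IDs.
# `terms` is assumed to be a named character vector whose names are
# GO IDs and whose values are the corresponding term names.
got2i <- function(pattern, terms, exact = FALSE, ignore.case = TRUE) {
  if (exact) {
    # whole-term comparison, optionally ignoring case
    hits <- if (ignore.case) {
      tolower(terms) == tolower(pattern)
    } else {
      terms == pattern
    }
  } else {
    # partial (regular-expression) matching anywhere in the term
    hits <- grepl(pattern, terms, ignore.case = ignore.case)
  }
  names(terms)[hits]
}

# Toy stand-in for the GO term table, so the sketch runs without GO.db
toy_terms <- c("GO:0003677" = "DNA binding",
               "GO:0003723" = "RNA binding",
               "GO:0003729" = "mRNA binding")

partial_ids <- got2i("rna binding", toy_terms)               # contains
exact_ids   <- got2i("RNA binding", toy_terms, exact = TRUE) # matches
```

with GO.db loaded, the term table could presumably be obtained with
Term(GOTERM), and the resulting IDs passed as keys to select with
keytype="GOID", as in Vince's example.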
cheers,
robert.
On 4/30/13 7:25 PM, Vincent Carey wrote:
> i made a similar suggestion privately some time ago. perhaps it will
> be taken up, but it might be better if we left select alone and
> created a key generator for GO terms, to feed into select. part of
> the resistance to taking the terms on is, i believe, the need for any
> practically useful solution to deal with approximate matching, which
> is a sort of scope creep for select.
>
> so you'd have select(..., keytype="GOID", keys=got2i("RNA binding"), ...
>
> and you can define how got2i maps from strings, say, to GOids
>
> On Tue, Apr 30, 2013 at 11:50 AM, Robert Castelo
> <robert.castelo@upf.edu> wrote:
>
> hi,
>
> i was about to fetch GO identifiers (IDs) matching certain GO
> terms using the GO.db package, but i've found out that GO.db only
> considers GO IDs as possible keys:
>
> suppressStartupMessages(library(GO.db))
>
> keytypes(GO.db)
> [1] "GOID"
>
> in section 0.4 of the AnnotationDbi vignette on "Using select with
> GO.db" an example is given using GO IDs as keys, but i think it
> would be handy to also interrogate which GO IDs match or contain a
> particular term such as "RNA binding", for example, doing either:
>
> * for matching
>
> select(GO.db, keys="RNA binding", cols="GOID", keytype="TERM")
>
> * for containing
>
> allTerms <- keys(GO.db, keytype="TERM")
> rnabindingterms <- allTerms[grep("RNA binding", allTerms)]
> select(GO.db, keys=rnabindingterms, cols="GOID", keytype="TERM")
>
> once you have the GO IDs you can interrogate which genes have such a
> GO term annotated to them.
>
> currently this is not possible because the only key allowed is GOID:
>
> head(keys(GO.db, keytype="TERM"))
> [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000006" "GO:0000007"
> [6] "GO:0000009"
> head(keys(GO.db, keytype="DEFINITION"))
> [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000006" "GO:0000007"
> [6] "GO:0000009"
> head(keys(GO.db, keytype="ONTOLOGY"))
> [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000006" "GO:0000007"
> [6] "GO:0000009"
>
> while in other packages, such as org.Hs.eg.db, basically all
> columns of information can be used as keys:
>
> library(org.Hs.eg.db)
> keytypes(org.Hs.eg.db)
>  [1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"      "ACCNUM"
>  [6] "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"
> [11] "MAP"          "PATH"         "PMID"         "REFSEQ"       "SYMBOL"
> [16] "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"
> [21] "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"     "GOALL"
> [26] "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"         "UCSCKG"
>
>
> i'm also aware that GO.db defines several hash tables, among them
> GOTERM, which can be used in the following way for my purpose:
>
> goterms <- unlist(eapply(GOTERM, function(x) x@Term))
> which(goterms == "RNA binding")
> GO:0003723
> 2714
>
> but the first step is much slower than using the 'select' method
> and i would prefer to use a more homogeneous way to pull all data
> in GO.db
>
>
> i look forward to your comments on this.
>
>
>
> best regards,
>
> robert.
> ps: sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
> [5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] org.Hs.eg.db_2.9.0 GO.db_2.9.0 RSQLite_0.11.3
> [4] DBI_0.2-6 AnnotationDbi_1.22.3 Biobase_2.20.0
> [7] BiocGenerics_0.6.0 vimcom_0.9-8 setwidth_1.0-3
> [10] colorout_1.0-0
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.18.0 stats4_3.0.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>