Question

A look inside annotation packages

0

Entering edit mode

Ed Siefker ▴ 230

@ed-siefker-5136

Last seen 2.0 years ago

United States

Let's say I'm looking at EnsDb.Hsapiens.v86 and want to pull out gene IDs.

If I load the object and display it, I get this:

> EnsDb.Hsapiens.v86
An object of class "EnsDb"
Slot "ensdb":
<SQLiteConnection>
  Path: C:\Users\edward.siefker\AppData\Local\Programs\R\R-4.2.2\library\EnsDb.Hsapiens.v86\extdata\EnsDb.Hsapiens.v86.sqlite
  Extensions: TRUE

Slot "tables":
$chromosome
[1] "seq_name"    "seq_length"  "is_circular"

$entrezgene
[1] "gene_id"  "entrezid"

{etc...}

Great, so the object has a slot with tables, and there is a table $entrezgene that contains gene_id.

Let's get that table

> ens_tables<-EnsDb.Hsapiens.v86@tables
> ens_tables
$chromosome
[1] "seq_name"    "seq_length"  "is_circular"

$entrezgene
[1] "gene_id"  "entrezid"

{etc...}

So far so good.

> ens_tables$entrezgene
[1] "gene_id"  "entrezid"
> class(ens_tables$entrezgene)
[1] "character"
> ens_tables$entrezgene[1]
[1] "gene_id"
> ens_tables$entrezgene[[1]]
[1] "gene_id"

Wait. So the $entrezgene "table" is really just a character vector? There's no table of gene_id in there?

Where are the actual gene ids?

Note that I'm not asking how to get the gene IDs. I'm aware of the 'genes()' method.

I'm asking for a conceptual explanation of where that data is stored in the object, and why I can't pull it out with the usual methods for selecting data inside data structures. Why are the objects in the "tables" slot character vectors and not tables? What is the 'genes()' method doing?

Is there documentation that discusses the concepts going on here? It seems like all the documentation is very practical and doesn't discuss the theory of operation.

Annotation • 1.5k views

ADD COMMENT • link 2.9 years ago Ed Siefker ▴ 230

score 0 · Answer 1 · 2022-12-09

You should pretty much never need to use the @ accessor to extract any data from any S4 object. There is no documentation on the internal workings because it's not important for any end users to know that sort of thing (what you are doing is not 'the usual method(s) for selecting data inside data structures', but instead you are ignoring the existing accessors and poking around.

All of the annotation packages have a select method that you can use to get things if you like, and they all respect any of the commands you can find in the AnnotationDbi vignette. For example

> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("homo","ensdb"))
AnnotationHub with 22 records
# snapshotDate(): 2022-10-26
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH89426  | Ensembl 103 EnsDb for Homo sapiens
  AH95744  | Ensembl 104 EnsDb for Homo sapiens
  AH98047  | Ensembl 105 EnsDb for Homo sapiens
  AH100643 | Ensembl 106 EnsDb for Homo sapiens
  AH104864 | Ensembl 107 EnsDb for Homo sapiens
> ensdb <- hub[["AH104864"]]
loading from cache
require("ensembldb")
> keytypes(ensdb)
 [1] "ENTREZID"            "EXONID"              "GENEBIOTYPE"        
 [4] "GENEID"              "GENENAME"            "PROTDOMID"          
 [7] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
[10] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
[13] "TXBIOTYPE"           "TXID"                "TXNAME"             
[16] "UNIPROTID"          
> columns(ensdb)
 [1] "CANONICALTRANSCRIPT" "DESCRIPTION"         "ENTREZID"           
 [4] "EXONID"              "EXONIDX"             "EXONSEQEND"         
 [7] "EXONSEQSTART"        "GCCONTENT"           "GENEBIOTYPE"        
[10] "GENEID"              "GENEIDVERSION"       "GENENAME"           
[13] "GENESEQEND"          "GENESEQSTART"        "INTERPROACCESSION"  
[16] "ISCIRCULAR"          "PROTDOMEND"          "PROTDOMSTART"       
[19] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
[22] "PROTEINSEQUENCE"     "SEQCOORDSYSTEM"      "SEQLENGTH"          
[25] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
[28] "TXBIOTYPE"           "TXCDSSEQEND"         "TXCDSSEQSTART"      
[31] "TXEXTERNALNAME"      "TXID"                "TXIDVERSION"        
[34] "TXISCANONICAL"       "TXNAME"              "TXSEQEND"           
[37] "TXSEQSTART"          "TXSUPPORTLEVEL"      "UNIPROTDB"          
[40] "UNIPROTID"           "UNIPROTMAPPINGTYPE" 
> head(keys(ensdb, "GENEID"))
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
> head(keys(ensdb, "ENTREZID"))
[1]  1  2  3  9 10 12
> select(ensdb, head(keys(ensdb, "GENEID")), "ENTREZID","GENEID")
           GENEID ENTREZID
1 ENSG00000000003     7105
2 ENSG00000000005    64102
3 ENSG00000000419     8813
4 ENSG00000000457    57147
5 ENSG00000000460    55732
6 ENSG00000000938     2268