A look inside annotation packages
1
0
Entering edit mode
Ed Siefker ▴ 230
@ed-siefker-5136
Last seen 12 months ago
United States

Let's say I'm looking at EnsDb.Hsapiens.v86 and want to pull out gene IDs.

If I load the object and display it, I get this:

> EnsDb.Hsapiens.v86
An object of class "EnsDb"
Slot "ensdb":
<SQLiteConnection>
  Path: C:\Users\edward.siefker\AppData\Local\Programs\R\R-4.2.2\library\EnsDb.Hsapiens.v86\extdata\EnsDb.Hsapiens.v86.sqlite
  Extensions: TRUE

Slot "tables":
$chromosome
[1] "seq_name"    "seq_length"  "is_circular"

$entrezgene
[1] "gene_id"  "entrezid"

{etc...}

Great, so the object has a slot with tables, and there is a table $entrezgene that contains gene_id.

Let's get that table

> ens_tables<-EnsDb.Hsapiens.v86@tables
> ens_tables
$chromosome
[1] "seq_name"    "seq_length"  "is_circular"

$entrezgene
[1] "gene_id"  "entrezid"

{etc...}

So far so good.

> ens_tables$entrezgene
[1] "gene_id"  "entrezid"
> class(ens_tables$entrezgene)
[1] "character"
> ens_tables$entrezgene[1]
[1] "gene_id"
> ens_tables$entrezgene[[1]]
[1] "gene_id"

Wait. So the $entrezgene "table" is really just a character vector? There's no table of gene_id in there?

Where are the actual gene ids?

Note that I'm not asking how to get the gene IDs. I'm aware of the 'genes()' method.

I'm asking for a conceptual explanation of where that data is stored in the object, and why I can't pull it out with the usual methods for selecting data inside data structures. Why are the objects in the "tables" slot character vectors and not tables? What is the 'genes()' method doing?

Is there documentation that discusses the concepts going on here? It seems like all the documentation is very practical and doesn't discuss the theory of operation.

Annotation • 1.1k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

You should pretty much never need to use the @ accessor to extract any data from any S4 object. There is no documentation on the internal workings because it's not important for any end users to know that sort of thing (what you are doing is not 'the usual method(s) for selecting data inside data structures', but instead you are ignoring the existing accessors and poking around.

All of the annotation packages have a select method that you can use to get things if you like, and they all respect any of the commands you can find in the AnnotationDbi vignette. For example

> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("homo","ensdb"))
AnnotationHub with 22 records
# snapshotDate(): 2022-10-26
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH89426  | Ensembl 103 EnsDb for Homo sapiens
  AH95744  | Ensembl 104 EnsDb for Homo sapiens
  AH98047  | Ensembl 105 EnsDb for Homo sapiens
  AH100643 | Ensembl 106 EnsDb for Homo sapiens
  AH104864 | Ensembl 107 EnsDb for Homo sapiens
> ensdb <- hub[["AH104864"]]
loading from cache
require("ensembldb")
> keytypes(ensdb)
 [1] "ENTREZID"            "EXONID"              "GENEBIOTYPE"        
 [4] "GENEID"              "GENENAME"            "PROTDOMID"          
 [7] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
[10] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
[13] "TXBIOTYPE"           "TXID"                "TXNAME"             
[16] "UNIPROTID"          
> columns(ensdb)
 [1] "CANONICALTRANSCRIPT" "DESCRIPTION"         "ENTREZID"           
 [4] "EXONID"              "EXONIDX"             "EXONSEQEND"         
 [7] "EXONSEQSTART"        "GCCONTENT"           "GENEBIOTYPE"        
[10] "GENEID"              "GENEIDVERSION"       "GENENAME"           
[13] "GENESEQEND"          "GENESEQSTART"        "INTERPROACCESSION"  
[16] "ISCIRCULAR"          "PROTDOMEND"          "PROTDOMSTART"       
[19] "PROTEINDOMAINID"     "PROTEINDOMAINSOURCE" "PROTEINID"          
[22] "PROTEINSEQUENCE"     "SEQCOORDSYSTEM"      "SEQLENGTH"          
[25] "SEQNAME"             "SEQSTRAND"           "SYMBOL"             
[28] "TXBIOTYPE"           "TXCDSSEQEND"         "TXCDSSEQSTART"      
[31] "TXEXTERNALNAME"      "TXID"                "TXIDVERSION"        
[34] "TXISCANONICAL"       "TXNAME"              "TXSEQEND"           
[37] "TXSEQSTART"          "TXSUPPORTLEVEL"      "UNIPROTDB"          
[40] "UNIPROTID"           "UNIPROTMAPPINGTYPE" 
> head(keys(ensdb, "GENEID"))
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
> head(keys(ensdb, "ENTREZID"))
[1]  1  2  3  9 10 12
> select(ensdb, head(keys(ensdb, "GENEID")), "ENTREZID","GENEID")
           GENEID ENTREZID
1 ENSG00000000003     7105
2 ENSG00000000005    64102
3 ENSG00000000419     8813
4 ENSG00000000457    57147
5 ENSG00000000460    55732
6 ENSG00000000938     2268
ADD COMMENT
0
Entering edit mode

If you are really enthused about poking around, note that all of the annotation packages are just simple wrappers around a SQLite database, and if you want to see what's what, you can just do SQL queries.

> con <- dbconn(ensdb)
> library(DBI)
> dbListTables(con)
 [1] "chromosome"     "entrezgene"     "exon"           "gene"          
 [5] "metadata"       "protein"        "protein_domain" "tx"            
 [9] "tx2exon"        "uniprot"       
> dbGetQuery(con, "select * from entrezgene limit 5;")
          gene_id entrezid
1 ENSG00000000003     7105
2 ENSG00000000005    64102
3 ENSG00000000419     8813
4 ENSG00000000457    57147
5 ENSG00000000460    55732
> dbGetQuery(con, "select * from tx limit 5;")
            tx_id           tx_biotype tx_seq_start tx_seq_end tx_cds_seq_start
1 ENST00000373020       protein_coding    100627108  100636806        100630798
2 ENST00000612152       protein_coding    100627109  100637104        100630798
3 ENST00000614008       protein_coding    100632063  100637104        100632063
4 ENST00000496771 processed_transcript    100632541  100636689               NA
5 ENST00000494424 processed_transcript    100633442  100639991               NA
  tx_cds_seq_end         gene_id tx_support_level     tx_id_version gc_content
1      100636694 ENSG00000000003                1 ENST00000373020.9   38.50849
2      100635569 ENSG00000000003                5 ENST00000612152.4   38.56691
3      100635569 ENSG00000000003                5 ENST00000614008.4   43.77778
4             NA ENSG00000000003                3 ENST00000496771.5   50.34146
5             NA ENSG00000000003                2 ENST00000494424.1   44.26829
  tx_external_name tx_is_canonical
1       TSPAN6-201               1
2       TSPAN6-204               0
3       TSPAN6-205               0
4       TSPAN6-203               0
5       TSPAN6-202               0

But the existing accessors create the correct SQL queries on the fly, and generate the correct output data format, so unless you A) really know what you are doing and B) cannot do the things you want to do using the existing accessors (highly unlikely), it's easier to just use the packages as intended.

ADD REPLY
0
Entering edit mode

This is really cool and helps me wrap my head around what's going on internally. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6