Question

TxDb.Hsapiens.UCSC.hg38.ensGene doesn't exist

0

Entering edit mode

Aditya ▴ 160

@aditya-7667

Last seen 22 months ago

Germany

For mouse, BioC offers both entrezg- and ensembl-centric TxDbs:

    # Entrezg
        txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene::TxDb.Mmusculus.UCSC.mm10.knownGene
        GenomicFeatures::genes(txdb)[c('100009600', '99889', '99982')]

    # Ensembl
        txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
        GenomicFeatures::genes(txdb)[c('ENSMUSG00000000001', 'ENSMUSG00000000003')]

For human, BioC has entrezg- but not ensembl-centric TxDbs, why is this?

   # Entrezg
        txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene::TxDb.Hsapiens.UCSC.hg38.knownGene
        GenomicFeatures::genes(txdb)[c('1', '10', '100')]

   # Ensembl
       # No TxDb.Hsapiens.UCSC.hg38.ensGene::TxDb.Hsapiens.UCSC.hg38.ensGene

TxDb.Hsapiens.UCSC.hg38.knownGene TxDb.Mmusculus.UCSC.mm10.ensGene • 2.7k views

ADD COMMENT • link updated 4.5 years ago by James W. MacDonald 65k • written 4.5 years ago by Aditya ▴ 160

score 2 · Accepted Answer · 2019-10-15

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

There isn't an ensGene table for hg38. If you hit the Track dropdown, you won't see an Ensembl Track. If you change to hg37, it's there.

And you should probably not use UCSC data for Ensembl anyway. Although both UCSC and Ensembl start with the GRCh38 genome, they use different pipelines to say what is and isn't a gene, and their ending products have many discrepancies. If you were to use the ensGene table (if it existed), you would be in essence asking for just those genes that both Ensembl and NCBI/UCSC agree are the same thing. And anything for which they disagree gets dropped.

You are better off using Johannes Rainier's EnsDb packages.

> library(AnnotationHub)

> hub <- AnnotationHub()
snapshotDate(): 2019-10-08
> query(hub, c("Homo sapiens","EnsDb"))
AnnotationHub with 13 records
# snapshotDate(): 2019-10-08 
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

            title                            
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens
  AH56681 | Ensembl 89 EnsDb for Homo Sapiens
  AH57757 | Ensembl 90 EnsDb for Homo Sapiens
  AH60773 | Ensembl 91 EnsDb for Homo Sapiens
  ...       ...                              
  AH67950 | Ensembl 95 EnsDb for Homo sapiens
  AH69187 | Ensembl 96 EnsDb for Homo sapiens
  AH73881 | Ensembl 97 EnsDb for Homo sapiens
  AH73986 | Ensembl 79 EnsDb for Homo sapiens
  AH75011 | Ensembl 98 EnsDb for Homo sapiens
> ensdb <- hub[["AH75011"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> ensdb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.4
|Creation time: Mon Sep 30 20:28:27 2019
|ensembl_version: 98
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.1
| No. of genes: 67946.
| No. of transcripts: 250194.
|Protein data available.

ADD COMMENT • link 4.5 years ago James W. MacDonald 65k

0

Entering edit mode

Thank you James - very enlightening!

I found an alternative in GenomicFeatures::makeTxDbFromEnsembl(organism = 'Homo sapiens'), but that doesn't get the seqinfo right (I wonder why):

txdb <- GenomicFeatures::makeTxDbFromEnsembl(organism = 'Homo sapiens')
GenomeInfoDb::seqinfo(txdb)
      Seqinfo object with 1498 sequences (1 circular) from an unspecified genome:
      seqnames         seqlengths isCircular genome
      CHR_HG107_PATCH   135088590       <NA>   <NA>
      CHR_HG109_PATCH    58617934       <NA>   <NA>
      ...
      LRG_998              120962       <NA>   <NA>
      LRG_999               34260       <NA>   <NA>

Your suggestion to use Johannes Rainer's Annotation package indeed proves well:

hub <- AnnotationHub::AnnotationHub()
ensdb <- hub[["AH75011"]]
GenomeInfoDb::seqinfo(ensdb)
     Seqinfo object with 454 sequences from GRCh38 genome:
     seqnames seqlengths isCircular genome
         1         248956422      FALSE GRCh38
        10         133797422      FALSE GRCh38
        ...             ...        ...    ...
         X         156040895      FALSE GRCh38
         Y          57227415      FALSE GRCh38

ADD REPLY • link 4.5 years ago Aditya ▴ 160

0

Entering edit mode

Two separate questions

I've noticed that BioC is moving away from annnotation packages (old style) to the AnnotationHub interface (new style). Is anyone willing to comment on this evolution?
UCSC uses the NCBI gene models, right (or does it have its own, third set, besides Ensembl and NCBI?)

ADD REPLY • link 4.5 years ago Aditya ▴ 160

0

Entering edit mode

What exactly are you asking? Comment how?
There is a FAQ.

ADD REPLY • link 4.5 years ago James W. MacDonald 65k

0

Entering edit mode

Thankyou James.

I was wondering about the design concept as to what is packaged and what is hubbed. I notice, e.g. that the UCSC TxDbs are packaged, whereas the Ensembl EnsDbs are hubbed.

From this I can infer two alternative design rules, and I am wondering which one is correct:

BioC packages NCBI/UCSC resources, but hubs Ensembl resources
BioC packages what was historically packaged, but hubs new data

ADD REPLY • link 4.5 years ago Aditya ▴ 160

0

Entering edit mode

hubs == new data; in some ideal world we'd like the package-based TxDb transitioned to the hubs, too...

ADD REPLY • link 4.5 years ago Martin Morgan 25k

0

Entering edit mode

Owkies, clear now, thx Martin!

ADD REPLY • link 4.5 years ago Aditya ▴ 160

0

Entering edit mode

It makes dependency management a bit trickier, though, since a Hub resource cannot be imported in a package (can it?). The price to pay for the increased volumes and sizes of data? Or has a new workflow paradigm superseeded the need for data packages alltogether?

ADD REPLY • link 4.5 years ago Aditya ▴ 160

0

Entering edit mode

Several packages make use of AnnotationHub or ExperimentHub resources in a way that is transparent to the user, e.g., including a function like

myfun <- function() {
    AnnotationHub()[["AH123"]]
}

See the manual Creating an AnnotationHub Package (But ask questions about package development on the bioc-devel mailing list!)