Question: TxDb.Hsapiens.UCSC.hg38.ensGene doesn't exist
0
gravatar for Aditya
5 weeks ago by
Aditya120
Germany
Aditya120 wrote:

For mouse, BioC offers both entrezg- and ensembl-centric TxDbs:

    # Entrezg
        txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene::TxDb.Mmusculus.UCSC.mm10.knownGene
        GenomicFeatures::genes(txdb)[c('100009600', '99889', '99982')]

    # Ensembl
        txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
        GenomicFeatures::genes(txdb)[c('ENSMUSG00000000001', 'ENSMUSG00000000003')]

For human, BioC has entrezg- but not ensembl-centric TxDbs, why is this?

   # Entrezg
        txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene::TxDb.Hsapiens.UCSC.hg38.knownGene
        GenomicFeatures::genes(txdb)[c('1', '10', '100')]

   # Ensembl
       # No TxDb.Hsapiens.UCSC.hg38.ensGene::TxDb.Hsapiens.UCSC.hg38.ensGene
ADD COMMENTlink modified 5 weeks ago by James W. MacDonald51k • written 5 weeks ago by Aditya120
Answer: TxDb.Hsapiens.UCSC.hg38.ensGene doesn't exist
2
gravatar for James W. MacDonald
5 weeks ago by
United States
James W. MacDonald51k wrote:

There isn't an ensGene table for hg38. If you hit the Track dropdown, you won't see an Ensembl Track. If you change to hg37, it's there.

And you should probably not use UCSC data for Ensembl anyway. Although both UCSC and Ensembl start with the GRCh38 genome, they use different pipelines to say what is and isn't a gene, and their ending products have many discrepancies. If you were to use the ensGene table (if it existed), you would be in essence asking for just those genes that both Ensembl and NCBI/UCSC agree are the same thing. And anything for which they disagree gets dropped.

You are better off using Johannes Rainier's EnsDb packages.

> library(AnnotationHub)

> hub <- AnnotationHub()
snapshotDate(): 2019-10-08
> query(hub, c("Homo sapiens","EnsDb"))
AnnotationHub with 13 records
# snapshotDate(): 2019-10-08 
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

            title                            
  AH53211 | Ensembl 87 EnsDb for Homo Sapiens
  AH53715 | Ensembl 88 EnsDb for Homo Sapiens
  AH56681 | Ensembl 89 EnsDb for Homo Sapiens
  AH57757 | Ensembl 90 EnsDb for Homo Sapiens
  AH60773 | Ensembl 91 EnsDb for Homo Sapiens
  ...       ...                              
  AH67950 | Ensembl 95 EnsDb for Homo sapiens
  AH69187 | Ensembl 96 EnsDb for Homo sapiens
  AH73881 | Ensembl 97 EnsDb for Homo sapiens
  AH73986 | Ensembl 79 EnsDb for Homo sapiens
  AH75011 | Ensembl 98 EnsDb for Homo sapiens
> ensdb <- hub[["AH75011"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> ensdb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.4
|Creation time: Mon Sep 30 20:28:27 2019
|ensembl_version: 98
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.1
| No. of genes: 67946.
| No. of transcripts: 250194.
|Protein data available.
ADD COMMENTlink written 5 weeks ago by James W. MacDonald51k

Thank you James - very enlightening!

I found an alternative in GenomicFeatures::makeTxDbFromEnsembl(organism = 'Homo sapiens'), but that doesn't get the seqinfo right (I wonder why):

txdb <- GenomicFeatures::makeTxDbFromEnsembl(organism = 'Homo sapiens')
GenomeInfoDb::seqinfo(txdb)
      Seqinfo object with 1498 sequences (1 circular) from an unspecified genome:
      seqnames         seqlengths isCircular genome
      CHR_HG107_PATCH   135088590       <NA>   <NA>
      CHR_HG109_PATCH    58617934       <NA>   <NA>
      ...
      LRG_998              120962       <NA>   <NA>
      LRG_999               34260       <NA>   <NA>

Your suggestion to use Johannes Rainer's Annotation package indeed proves well:

hub <- AnnotationHub::AnnotationHub()
ensdb <- hub[["AH75011"]]
GenomeInfoDb::seqinfo(ensdb)
     Seqinfo object with 454 sequences from GRCh38 genome:
     seqnames seqlengths isCircular genome
         1         248956422      FALSE GRCh38
        10         133797422      FALSE GRCh38
        ...             ...        ...    ...
         X         156040895      FALSE GRCh38
         Y          57227415      FALSE GRCh38
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Aditya120

Two separate questions

  1. I've noticed that BioC is moving away from annnotation packages (old style) to the AnnotationHub interface (new style). Is anyone willing to comment on this evolution?
  2. UCSC uses the NCBI gene models, right (or does it have its own, third set, besides Ensembl and NCBI?)
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Aditya120
  1. What exactly are you asking? Comment how?
  2. There is a FAQ.
ADD REPLYlink written 4 weeks ago by James W. MacDonald51k

Thankyou James.

I was wondering about the design concept as to what is packaged and what is hubbed. I notice, e.g. that the UCSC TxDbs are packaged, whereas the Ensembl EnsDbs are hubbed.

From this I can infer two alternative design rules, and I am wondering which one is correct:

  1. BioC packages NCBI/UCSC resources, but hubs Ensembl resources
  2. BioC packages what was historically packaged, but hubs new data
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Aditya120

hubs == new data; in some ideal world we'd like the package-based TxDb transitioned to the hubs, too...

ADD REPLYlink written 4 weeks ago by Martin Morgan ♦♦ 24k

Owkies, clear now, thx Martin!

ADD REPLYlink written 4 weeks ago by Aditya120

It makes dependency management a bit trickier, though, since a Hub resource cannot be imported in a package (can it?). The price to pay for the increased volumes and sizes of data? Or has a new workflow paradigm superseeded the need for data packages alltogether?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Aditya120

Several packages make use of AnnotationHub or ExperimentHub resources in a way that is transparent to the user, e.g., including a function like

myfun <- function() {
    AnnotationHub()[["AH123"]]
}

See the manual Creating an AnnotationHub Package (But ask questions about package development on the bioc-devel mailing list!)

ADD REPLYlink written 4 weeks ago by Martin Morgan ♦♦ 24k

Aha, so that's the new paradigm - thx - will use that!

ADD REPLYlink written 4 weeks ago by Aditya120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 437 users visited in the last hour