ensembldb support for Ensembl's transcript_name column
@mikelove
United States

I've been looking at mouse transcripts from Ensembl, e.g.:

> edb <- ah[["AH89211"]]
> txps <- transcripts(edb)
> txps[82575,]
GRanges object with 1 range and 9 metadata columns:
                     seqnames              ranges strand |              tx_id
                        <Rle>           <IRanges>  <Rle> |        <character>
  ENSMUST00000029812        3 135584655-135691546      - | ENSMUST00000029812
                         tx_biotype tx_cds_seq_start tx_cds_seq_end
                        <character>        <integer>      <integer>
  ENSMUST00000029812 protein_coding        135585355      135667785
                                gene_id tx_support_level         tx_id_version
                            <character>        <integer>           <character>
  ENSMUST00000029812 ENSMUSG00000028163                1 ENSMUST00000029812.13
                     gc_content            tx_name
                      <numeric>        <character>
  ENSMUST00000029812    53.0969 ENSMUST00000029812
  -------
  seqinfo: 118 sequences from GRCm38 genome

Is it possible to obtain the transcript_name from the GTF? The one that has the gene symbol + dash + a number. E.g. this Nfkb1-201 from the GTF. They seem to prioritize this on the Ensembl gene viewer so it's convenient to use for cross-referencing.

3   havana  transcript  135584655   135691546   .   -   .   gene_id "ENSMUSG00000028163"; gene_version "17"; 
transcript_id "ENSMUST00000029812"; transcript_version "13"; gene_name "Nfkb1"; gene_source "ensembl_havana"; 
gene_biotype "protein_coding"; havana_gene "OTTMUSG00000016668"; havana_gene_version "5"; transcript_name "Nfkb1-201"; 
transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS17858"; 
havana_transcript "OTTMUST00000040338"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

I looked over the vignette and man pages, but I may have missed something.

Johannes Rainer ★ 2.1k
@johannes-rainer-6987
Italy

Transcript names were previously not available in EnsDb databases: the column "tx_name" (and the TxNameFilter) contains the Ensembl transcript IDs. This keeps EnsDb compliant with GenomicFeatures TxDb databases, which hold an internal identifier in the column "tx_id" and the actual transcript ID in the "tx_name" column.

ensembldb version 2.18.1 introduces the possibility to create EnsDb databases with an additional column "tx_external_name" that contains the "transcript_name" field from GTF files or the transcript's external name from Ensembl core databases. EnsDb databases created with that version therefore support transcript names, and a TxExternalNameFilter can be used to filter the database on the new column.

I'll also update all EnsDb databases for Ensembl release 104 on AnnotationHub with the new versions that provide the tx external name. These databases are fully backward compatible and the additional column "tx_external_name" will also be retrieved by older versions of the ensembldb package.
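
A minimal sketch of how the new column might be queried, assuming ensembldb >= 2.18.1 and an EnsDb object `edb` created with that version (for example one of the updated release 104 databases from AnnotationHub). The helper name `transcripts_by_external_name` is made up for illustration; `listColumns()`, `transcripts()` and `TxExternalNameFilter()` are the ensembldb functions it relies on.

```r
# Hypothetical helper (not part of ensembldb): look up transcripts by their
# Ensembl transcript name, e.g. "Nfkb1-201".
transcripts_by_external_name <- function(edb, name) {
    # Older EnsDb databases lack the column, so check before querying.
    if (!"tx_external_name" %in% ensembldb::listColumns(edb)) {
        stop("this EnsDb has no 'tx_external_name' column; ",
             "use a database created with ensembldb >= 2.18.1")
    }
    ensembldb::transcripts(
        edb,
        filter = ensembldb::TxExternalNameFilter(name),
        columns = c("tx_id", "tx_external_name", "gene_name"))
}

# Usage, assuming `edb` was loaded from AnnotationHub:
# transcripts_by_external_name(edb, "Nfkb1-201")
```

The column check is there because databases built before 2.18.1 do not carry transcript names, so the helper fails with a clear message instead of an SQL error.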


This is great! thanks Johannes. This will really help when connecting results to gene models from the Ensembl genome browser.


Oops, it looks like my request may have caused some issues in release and devel. tximeta fails to build because this resource cannot be loaded (edit: I've dealt with the error for now by commenting out the part that pulls the resource from AnnotationHub).

> ah <- AnnotationHub()
snapshotDate(): 2021-10-20
> ah[["AH74985"]]
loading from cache
require(“ensembldb”)
Error: failed to load resource
  name: AH74985
  title: Ensembl 98 EnsDb for Drosophila melanogaster
  reason: Table gene is missing required columns canonical_transcript!
> ?AnnotationHub
> removeCache(ah)
remove cache and 16 resource(s)? (yes/no): yes
[1] TRUE
Warning message:
call dbDisconnect() when finished working with a connection 
> ah <- AnnotationHub()
/home/love/.cache/R/AnnotationHub
  does not exist, create directory? (yes/no): yes
  |======================================================================| 100%

snapshotDate(): 2021-10-20
> ah[["AH74985"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
Error: failed to load resource
  name: AH74985
  title: Ensembl 98 EnsDb for Drosophila melanogaster
  reason: Table gene is missing required columns canonical_transcript!

My session info:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] ensembldb_2.18.1        AnnotationFilter_1.18.0 GenomicFeatures_1.46.1 
 [4] AnnotationDbi_1.56.1    Biobase_2.54.0          GenomicRanges_1.46.0   
 [7] GenomeInfoDb_1.30.0     IRanges_2.28.0          S4Vectors_0.32.1       
[10] AnnotationHub_3.2.0     BiocFileCache_2.2.0     dbplyr_2.1.1           
[13] BiocGenerics_0.40.0     rmarkdown_2.11          testthat_3.1.0         
[16] devtools_2.4.2          usethis_2.1.3          

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.26.0           matrixStats_0.61.0           
 [3] bitops_1.0-7                  fs_1.5.0                     
 [5] bit64_4.0.5                   filelock_1.0.2               
 [7] progress_1.2.2                httr_1.4.2                   
 [9] rprojroot_2.0.2               tools_4.1.2                  
[11] utf8_1.2.2                    R6_2.5.1                     
[13] lazyeval_0.2.2                DBI_1.1.1                    
[15] withr_2.4.2                   tidyselect_1.1.1             
[17] prettyunits_1.1.1             processx_3.5.2               
[19] bit_4.0.4                     curl_4.3.2                   
[21] compiler_4.1.2                cli_3.1.0                    
[23] xml2_1.3.2                    DelayedArray_0.20.0          
[25] desc_1.4.0                    rtracklayer_1.54.0           
[27] callr_3.7.0                   rappdirs_0.3.3               
[29] stringr_1.4.0                 digest_0.6.28                
[31] Rsamtools_2.10.0              XVector_0.34.0               
[33] pkgconfig_2.0.3               htmltools_0.5.2              
[35] sessioninfo_1.2.1             MatrixGenerics_1.6.0         
[37] fastmap_1.1.0                 rlang_0.4.12                 
[39] RSQLite_2.2.8                 shiny_1.7.1                  
[41] BiocIO_1.4.0                  generics_0.1.1               
[43] BiocParallel_1.28.0           dplyr_1.0.7                  
[45] RCurl_1.98-1.5                magrittr_2.0.1               
[47] GenomeInfoDbData_1.2.7        Matrix_1.3-4                 
[49] Rcpp_1.0.7                    fansi_0.5.0                  
[51] lifecycle_1.0.1               stringi_1.7.5                
[53] yaml_2.2.1                    SummarizedExperiment_1.24.0  
[55] zlibbioc_1.40.0               pkgbuild_1.2.0               
[57] grid_4.1.2                    blob_1.2.2                   
[59] parallel_4.1.2                promises_1.2.0.1             
[61] crayon_1.4.2                  lattice_0.20-45              
[63] Biostrings_2.62.0             hms_1.1.1                    
[65] KEGGREST_1.34.0               knitr_1.36                   
[67] ps_1.6.0                      pillar_1.6.4                 
[69] rjson_0.2.20                  biomaRt_2.50.0               
[71] pkgload_1.2.3                 XML_3.99-0.8                 
[73] glue_1.4.2                    BiocVersion_3.14.0           
[75] evaluate_0.14                 remotes_2.4.1                
[77] BiocManager_1.30.16           png_0.1-7                    
[79] vctrs_0.3.8                   httpuv_1.6.3                 
[81] purrr_0.3.4                   assertthat_0.2.1             
[83] cachem_1.0.6                  xfun_0.28                    
[85] mime_0.12                     xtable_1.8-4                 
[87] restfulr_0.0.13               later_1.3.0                  
[89] tibble_3.1.5                  GenomicAlignments_1.30.0     
[91] memoise_2.0.0                 ellipsis_0.3.2               
[93] interactiveDisplayBase_1.32.0

Thanks Mike for reporting. I've opened an issue in ensembldb and will have a look at it.


For what it's worth, the same problem (package build error) applies to my package satuRn, with the exact same error message as flagged by Michael Love above.

Error: failed to load resource
  name: AH74985
  title: Ensembl 98 EnsDb for Drosophila melanogaster
  reason: Table gene is missing required columns canonical_transcript!

(https://master.bioconductor.org/checkResults/3.14/bioc-LATEST/satuRn/nebbiolo2-buildsrc.html)


Thanks for reporting. I submitted the fix yesterday - so I hope all is fine again after the next build round.


Thanks for fixing Johannes!


I am still having this issue with the tximeta package. I have updated. I am very new to this, so I am not sure what "next build round" means or when I should check whether this has been resolved.


This is resolved for me with 2.18.2 from Bioconductor:

ah[["AH74985"]]
loading from cache
require(“ensembldb”)
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.5
|Creation time: Tue Nov 19 08:33:44 2019
|ensembl_version: 98
|ensembl_host: localhost
|Organism: Drosophila melanogaster
|taxonomy_id: 7227
|genome_build: BDGP6.22
|DBSCHEMAVERSION: 2.1
| No. of genes: 17753.
| No. of transcripts: 34802.
|Protein data available.

You will have to wait until his git commit propagates to the Bioconductor builds that you download with BiocManager::install(). Usually the commits are taken up at 5pm US East and show up on the website in the afternoon on the following day. You can track by following here:

https://bioconductor.org/packages/release/bioc/html/ensembldb.html

once the version there is greater than 2.18.1, and also here:

https://bioconductor.org/packages/release/bioc/news/ensembldb/NEWS
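
A small sketch of how the version check could be done from R rather than by watching the web page. The helper `has_ensembldb_fix` is hypothetical, and it only encodes the rule from this thread (release-branch versions newer than 2.18.1 load the resource again).

```r
# Hypothetical helper: TRUE once the given version is newer than 2.18.1,
# the release version that failed to load older EnsDb resources in this thread.
has_ensembldb_fix <- function(version) {
    package_version(version) > package_version("2.18.1")
}

has_ensembldb_fix("2.18.1")  # FALSE
has_ensembldb_fix("2.18.2")  # TRUE

# For the installed package (only meaningful if ensembldb is installed):
# has_ensembldb_fix(utils::packageVersion("ensembldb"))
```

If the installed version is still 2.18.1, re-running `BiocManager::install("ensembldb")` after the next build should pick up the newer one.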


Ah I see, thank you!


Thx Johannes. FYI, this also caused problems for the OSCA book:

Hopefully this will clear out on the next book build report tomorrow.

H.


Hi Herve,

seems the second book was fixed. The first one still has an error, but that does not seem to be related to ensembldb or the recent changes. But please let me know if it is in fact due to ensembldb and I'll investigate/fix.

cheers, jo


Good! Seems to be a different error. Hard to tell at first sight if this new error is still related to ensembldb. Since it took more than 1 hour for R CMD build OSCA.multisample to reach the point of failure, troubleshooting this is probably not going to be easy. I pass ;-)

H.


Also fixed for satuRn, thank you!

Johannes Rainer ★ 2.1k
@johannes-rainer-6987
Italy

Hi Mike,

the EnsDbs you get from e.g. AnnotationHub are not created from GTF files but from the Ensembl databases using the Ensembl Perl API. Creating EnsDbs from GTF will not result in the same amount of information/annotation because certain fields are not present in all of the GTF dialects.

What you want (please correct me if I misunderstood) is to get transcript names such as "Nfkb1-201" instead of (as currently reported) the transcript ID in the official EnsDbs (i.e. the ones distributed through AnnotationHub)?


Thanks Johannes, I see.

I'd actually like to have access to both (here referring to the key names from the GTF) the transcript_id "ENSMUST00000029812" and the transcript_name "Nfkb1-201". For me, the column names in the EnsDb don't matter so much; I'd just like to bring those extra identifiers along for matching with the genome viewer on their website.

Is it possible to obtain these extra names with additional arguments to ensDbFromGtf? Or could I process the GTF to a GRanges and then bring them along?

> edb_gtf <- EnsDb("Mus_musculus.GRCm38.102.sqlite")
> txps <- transcripts(edb_gtf)
> txps[82575,]
GRanges object with 1 range and 6 metadata columns:
                     seqnames              ranges strand |              tx_id
                        <Rle>           <IRanges>  <Rle> |        <character>
  ENSMUST00000029812        3 135584655-135691546      - | ENSMUST00000029812
                         tx_biotype tx_cds_seq_start tx_cds_seq_end
                        <character>        <integer>      <integer>
  ENSMUST00000029812 protein_coding        135585355      135667785
                                gene_id            tx_name
                            <character>        <character>
  ENSMUST00000029812 ENSMUSG00000028163 ENSMUST00000029812
  -------
  seqinfo: 22 sequences from GRCm38 genome
>

I'll look into that and add that functionality. I've opened an issue here: https://github.com/jorainer/ensembldb/issues/121
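
Until that functionality exists, one possible workaround is to pull the names out of the GTF directly and join them on the transcript ID. The sketch below uses only base R; `parse_gtf_attributes` is a hypothetical helper, and the attribute string is an abridged copy of the Nfkb1 record quoted in the question.

```r
# Hypothetical helper: parse a GTF attribute string into a named character
# vector. Repeated keys (e.g. "tag") are all kept.
parse_gtf_attributes <- function(attr_string) {
    # Each attribute has the form: key "value";
    m <- regmatches(attr_string,
                    gregexpr('[A-Za-z0-9_]+ "[^"]*"', attr_string))[[1]]
    keys <- sub(' ".*$', "", m)
    values <- sub('^[^"]*"([^"]*)"$', "\\1", m)
    setNames(values, keys)
}

# Abridged attribute column of the transcript record from the question.
gtf_attr <- paste0(
    'gene_id "ENSMUSG00000028163"; gene_version "17"; ',
    'transcript_id "ENSMUST00000029812"; transcript_version "13"; ',
    'gene_name "Nfkb1"; transcript_name "Nfkb1-201"; ',
    'transcript_biotype "protein_coding"; tag "CCDS"; tag "basic";')

attrs <- parse_gtf_attributes(gtf_attr)
attrs[["transcript_id"]]    # "ENSMUST00000029812"
attrs[["transcript_name"]]  # "Nfkb1-201"
```

For a whole file, `rtracklayer::import()` reads the GTF attributes into metadata columns of a GRanges, so `transcript_id` and `transcript_name` can be matched to the `tx_id` column of the EnsDb transcripts in the same way.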
