annotating gene for transcript type/biotype
Entering edit mode
sjmonkley ▴ 30
Last seen 5 weeks ago

I am trying to annotate a table of genes with transcript type (TXTYPE) using Homo.sapiens annotation package within Annotation.Dbi . I want to do this so I can select for only rotein coding genes and remove all the RNA encoding and psuedogenes

I m using the following code:

all_genes$TxType <- mapIds(Homo.sapiens,

However in the Tx.type column is just full of n/a's. The code works with GeneName as column. 

Is the problem that only transcripts rather than genes can be annotated with TXTYPE information?


R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=Swedish_Sweden.1252  LC_CTYPE=Swedish_Sweden.1252    LC_MONETARY=Swedish_Sweden.1252
[4] LC_NUMERIC=C                    LC_TIME=Swedish_Sweden.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_0.4.3                             Homo.sapiens_1.3.1                     
 [3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2                     
 [5] GO.db_3.2.2                             RSQLite_1.0.0                          
 [7] DBI_0.3.1                               OrganismDbi_1.12.0                     
 [9] GenomicFeatures_1.22.2                  GenomicRanges_1.22.0                   
[11] GenomeInfoDb_1.6.0                      AnnotationDbi_1.32.0                   
[13] IRanges_2.4.0                           S4Vectors_0.8.0                        
[15] Biobase_2.30.0                          BiocGenerics_0.16.0                    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1                graph_1.48.0               magrittr_1.5              
 [4] XVector_0.10.0             zlibbioc_1.16.0            GenomicAlignments_1.6.1   
 [7] BiocParallel_1.4.0         R6_2.1.1                   tools_3.2.2               
[10] SummarizedExperiment_1.0.0 lambda.r_1.1.7             futile.logger_1.4.1       
[13] lazyeval_0.1.10            assertthat_0.1             RBGL_1.46.0               
[16] rtracklayer_1.30.1         futile.options_1.0.0       bitops_1.0-6              
[19] RCurl_1.95-4.7             biomaRt_2.26.0             BiocInstaller_1.20.0      
[22] Biostrings_2.38.0          Rsamtools_1.22.0           XML_3.98-1.3   

annotation homo.sapiens • 4.4k views
Entering edit mode
Johannes Rainer ★ 2.0k
Last seen 12 days ago

Alternatively, you could use ensembldb based annotation packages, as e.g. the EnsDb.Hsapiens.v75 which provides annotations from Ensembl version 75. In the code below we retrieve the mapping between gene ids, gene names, gene biotypes, transcript biotypes and transcript ids.

> library(EnsDb.Hsapiens.v75)

> edb <- EnsDb.Hsapiens.v75> txtypes <- genes(edb, columns=c("gene_name", "gene_biotype", "tx_biotype", "tx_id"))
> head(txtypes)
GRanges object with 6 ranges and 5 metadata columns:
                  seqnames               ranges strand |         gene_id
                     <Rle>            <IRanges>  <Rle> |     <character>
  ENSG00000000003        X [99883667, 99894988]      - | ENSG00000000003
  ENSG00000000003        X [99883667, 99894988]      - | ENSG00000000003
  ENSG00000000003        X [99883667, 99894988]      - | ENSG00000000003
  ENSG00000000005        X [99839799, 99854882]      + | ENSG00000000005
  ENSG00000000005        X [99839799, 99854882]      + | ENSG00000000005
  ENSG00000000419       20 [49551404, 49575092]      - | ENSG00000000419
                    gene_name   gene_biotype           tx_biotype
                  <character>    <character>          <character>
  ENSG00000000003      TSPAN6 protein_coding       protein_coding
  ENSG00000000003      TSPAN6 protein_coding processed_transcript
  ENSG00000000003      TSPAN6 protein_coding processed_transcript
  ENSG00000000005        TNMD protein_coding       protein_coding
  ENSG00000000005        TNMD protein_coding processed_transcript
  ENSG00000000419        DPM1 protein_coding       protein_coding
  ENSG00000000003 ENST00000373020
  ENSG00000000003 ENST00000494424
  ENSG00000000003 ENST00000496771
  ENSG00000000005 ENST00000373031
  ENSG00000000005 ENST00000485971
  ENSG00000000419 ENST00000371582
  seqinfo: 273 sequences from GRCh37 genome


Using filters you can also retrieve just protein coding genes from the database:

> protGenes <- genes(edb, filter=GenebiotypeFilter("protein_coding"))
> protGenes
GRanges object with 22810 ranges and 5 metadata columns:
                               seqnames                 ranges strand   |
                                  <Rle>              <IRanges>  <Rle>   |
  ENSG00000000003                     X [ 99883667,  99894988]      -   |
  ENSG00000000005                     X [ 99839799,  99854882]      +   |
  ENSG00000000419                    20 [ 49551404,  49575092]      -   |
  ENSG00000000457                     1 [169818772, 169863408]      -   |
  ENSG00000000460                     1 [169631245, 169823221]      +   |
              ...                   ...                    ...    ... ...
  ENSG00000273467 HSCHR19LRC_LRC_S_CTG1   [54704726, 54711511]      +   |
  ENSG00000273469  HSCHR19LRC_COX1_CTG1   [54618837, 54635140]      +   |
  ENSG00000273470 HSCHR19LRC_LRC_T_CTG1   [54677107, 54693733]      -   |
  ENSG00000273482           HG957_PATCH   [53190025, 53226733]      +   |
  ENSG00000273490 HSCHR19LRC_LRC_J_CTG1   [54693789, 54697585]      +   |
                          gene_id   gene_name    entrezid   gene_biotype
                      <character> <character> <character>    <character>
  ENSG00000000003 ENSG00000000003      TSPAN6        7105 protein_coding
  ENSG00000000005 ENSG00000000005        TNMD       64102 protein_coding
  ENSG00000000419 ENSG00000000419        DPM1        8813 protein_coding
  ENSG00000000457 ENSG00000000457       SCYL3       57147 protein_coding
  ENSG00000000460 ENSG00000000460    C1orf112       55732 protein_coding
              ...             ...         ...         ...            ...
  ENSG00000273467 ENSG00000273467        RPS9        6203 protein_coding
  ENSG00000273469 ENSG00000273469      PRPF31       26121 protein_coding
  ENSG00000273470 ENSG00000273470      MBOAT7       79143 protein_coding
  ENSG00000273482 ENSG00000273482       PRKCD        5580 protein_coding
  ENSG00000273490 ENSG00000273490      TSEN34       79042 protein_coding
  ENSG00000000003       chromosome
  ENSG00000000005       chromosome
  ENSG00000000419       chromosome
  ENSG00000000457       chromosome
  ENSG00000000460       chromosome
              ...              ...
  ENSG00000273467       chromosome
  ENSG00000273469       chromosome
  ENSG00000273470       chromosome
  ENSG00000273482       chromosome
  ENSG00000273490       chromosome
  seqinfo: 221 sequences from GRCh37 genome


If you need more recent Ensembl based annotations you can check the vignette of the ensembldb package on how to create such annotation packages.

hope that helps.

cheers, jo



Entering edit mode
Last seen 16 hours ago
United States

This looks like there might be a bug somewhere, but I haven't got it tracked down just yet. But first, do note that there isn't a TXTYPE for the UCSC-based TxDb packages. This is because UCSC doesn't provide that sort of information - the underlying transcripts table in e.g., TxDb.Hsapiens.UCSC.hg19.knownGene has all NA values under the tx_type column, which is where those data come from.

However, Ensembl does have this sort of information, so if you build a TxDb based on their data, then it should in theory work. I sort of went the long way round, but here is the jist:

> txdb <- makeTxDbFromBiomart()
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .normarg_makeTxDb_chrominfo(chrominfo, transcripts$tx_chrom,  :
  chromosome lengths and circularity flags are not available for this TxDb object

> makeTxDbPackage(txdb, "0.1.1", "me <>", "me")
Creating package in ./TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3

> install.packages("TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3/", repos = NULL)
* installing *source* package  TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3  ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3)

> library(Homo.sapiens)
Loading required package: OrganismDbi
Loading required package: GO.db
Loading required package: DBI

Loading required package:

Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
Now getting the GODb Object directly
Now getting the OrgDb Object directly
Now getting the TxDb Object directly

Note that this Homo.sapiens is using the TxDb.Hsapiens.UCSC.hg19.knownGene package, but we can over-ride

> TxDb(Homo.sapiens) <- txdb
Now getting the GODb Object directly
Now getting the OrgDb Object directly
Now loading the TxDb Object from a package

> Homo.sapiens
OrganismDb Object:
# Includes GODb Object:  GO.db
# With data about:  Gene Ontology
# Includes OrgDb Object:
# Gene data about:  Homo sapiens
# Taxonomy Id:  9606
# Includes TxDb Object:  TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3
# Transcriptome data about:  Homo sapiens
# Based on genome:   
# The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .

> z <- select(Homo.sapiens, keys(Homo.sapiens, "ENSEMBL")[1:500], "TXTYPE","ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
> head(z)
1 ENSG00000121410   <NA>
2 ENSG00000175899   <NA>
3 ENSG00000256069   <NA>
4 ENSG00000171428   <NA>
5 ENSG00000156006   <NA>
6 ENSG00000196136   <NA>

Which is a bummer. But the transcripts table does have this info!

> con <- dbConnect(SQLite(), paste0(path.package('TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3'), '/extdata/TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3.sqlite'))
> dbGetQuery(con, "select _tx_id, tx_name, tx_type from transcript limit 5;")
  _tx_id         tx_name              tx_type
1      1 ENST00000627793                snRNA
2      2 ENST00000628518 processed_pseudogene
3      3 ENST00000628983       protein_coding
4      4 ENST00000629165              lincRNA
5      5 ENST00000630948                 rRNA

So evidently there is a problem extracting the tx_type data correctly. I don't have the time just now to track this down, but unless somebody else volunteers I will try to get to it ASAP.

Entering edit mode

But do note that you CAN get these data via either transcripts() or transcriptsBy():

> transcriptsBy(Homo.sapiens, by = "gene", columns = c("TXTYPE","TXNAME"))
'select()' returned 1:1 mapping between keys and columns
GRangesList object of length 65988:
GRanges object with 5 ranges and 3 metadata columns:
      seqnames                 ranges strand |         tx_name          TXNAME
         <Rle>              <IRanges>  <Rle> |     <character> <CharacterList>
  [1]        X [100627109, 100637104]      - | ENST00000612152 ENST00000612152
  [2]        X [100628670, 100636806]      - | ENST00000373020 ENST00000373020
  [3]        X [100632063, 100637104]      - | ENST00000614008 ENST00000614008
  [4]        X [100632541, 100636689]      - | ENST00000496771 ENST00000496771
  [5]        X [100633442, 100639991]      - | ENST00000494424 ENST00000494424
  [1]       protein_coding
  [2]       protein_coding
  [3]       protein_coding
  [4] processed_transcript
  [5] processed_transcript

GRanges object with 2 ranges and 3 metadata columns:
      seqnames                 ranges strand |         tx_name          TXNAME
  [1]        X [100584802, 100599885]      + | ENST00000373031 ENST00000373031
  [2]        X [100593624, 100597531]      + | ENST00000485971 ENST00000485971
  [1]       protein_coding
  [2] processed_transcript

GRanges object with 6 ranges and 3 metadata columns:
      seqnames               ranges strand |         tx_name          TXNAME
  [1]       20 [50934867, 50958550]      - | ENST00000371588 ENST00000371588
  [2]       20 [50934867, 50958550]      - | ENST00000466152 ENST00000466152
  [3]       20 [50934867, 50958555]      - | ENST00000371582 ENST00000371582
  [4]       20 [50934896, 50945861]      - | ENST00000494752 ENST00000494752
  [5]       20 [50934945, 50958521]      - | ENST00000371584 ENST00000371584
  [6]       20 [50936148, 50958532]      - | ENST00000413082 ENST00000413082
  [1]       protein_coding
  [2] processed_transcript
  [3]       protein_coding
  [4] processed_transcript
  [5]       protein_coding
  [6]       protein_coding

<65985 more elements>
seqinfo: 331 sequences from an unspecified genome; no seqlengths
> transcripts(Homo.sapiens, columns = c("TXTYPE","TXNAME"))
'select()' returned 1:1 mapping between keys and columns
GRanges object with 216133 ranges and 2 metadata columns:
                  seqnames               ranges strand   |          TXNAME
                     <Rle>            <IRanges>  <Rle>   | <CharacterList>
       [1] CHR_HG126_PATCH [72577040, 72577203]      +   | ENST00000627793
       [2] CHR_HG126_PATCH [72583719, 72583884]      +   | ENST00000628518
       [3] CHR_HG126_PATCH [72374621, 72447493]      -   | ENST00000628983
       [4] CHR_HG126_PATCH [72505075, 72550889]      -   | ENST00000629165
       [5] CHR_HG126_PATCH [72692054, 72692160]      -   | ENST00000630948
       ...             ...                  ...    ... ...             ...
  [216129]      KI270750.1     [148668, 148843]      +   | ENST00000612925
  [216130]      KI270752.1     [   144,    268]      +   | ENST00000618580
  [216131]      KI270752.1     [ 21813,  21944]      +   | ENST00000620980
  [216132]      KI270752.1     [  3497,   3623]      -   | ENST00000620677
  [216133]      KI270752.1     [  9943,  10067]      -   | ENST00000611978
       [1]                snRNA
       [2] processed_pseudogene
       [3]       protein_coding
       [4]              lincRNA
       [5]                 rRNA
       ...                  ...
  [216129]                snRNA
  [216130]                miRNA
  [216131]                miRNA
  [216132]                miRNA
  [216133]                miRNA
  seqinfo: 331 sequences from an unspecified genome; no seqlengths
Entering edit mode

Hi Jim, sjmonkley,

Looks like a bug in the SQL generated by select(). This is good timing because I actually started to revisit/refactor all the SQL generation in GenomicFeatures a couple of weeks ago, with some interesting speed improvements. All the extractors (except select()) now use the new SQL generator. I was going to do select() next but got distracted by other things. Moving it back close to the top of my TODO list.


Edit: After closer examination, the problem seems to be in the "select" method for OrganismDb objects (Homo.sapiens), and not in the method for TxDb objects as I thought initially. This one seems to work as expected:

txdb <- makeTxDbFromBiomart(dataset="celegans_gene_ensembl")
head(select(txdb, head(keys(txdb, "TXID"))[1:500], "TXTYPE", "TXID"))
# 'select()' returned many:1 mapping between keys and columns
#   TXID         TXTYPE
# 1    1 protein_coding
# 2    2 protein_coding
# 3    3 protein_coding
# 4    4 protein_coding
# 5    5 protein_coding
# 6    6 protein_coding

So it's not clear that my current work on SQL generation in GenomicFeatures will help address this.

Entering edit mode

Hi Hervé,

It looks like the problem comes up in OrganismDbi, specifically OrganismDbi:::.getSelects(), which makes the assumption (probably warranted) that the underlying data sources are the same for all packages wrapped up under Homo.sapiens. In other words, in OrganismDbi:::.getSelects(), the TxDb is inspected for the correct keytype to be used, but it isn't inspected to ensure that the 'fromKeys' match up with the type of keys in 'toKey'. And then select() is run with skipValidKeysTest = TRUE, so no error is generated (e.g., Entrez Gene Ids are passed into select(), using GENEID as the keytype, but for the TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3 package, the GENEIDs are Ensembl Gene IDs, not Entrez Gene IDs).

It may well be that mixing and matching annotation sources under an OrganismDbi object is a non-starter, and if the OP wants to use a Homo.sapiens package to do this sort of thing, then it should be constructed entirely of Ensembl based objects. Or, probably easier, just use the EnsDb objects instead.


Entering edit mode

Thanks Jim for taking the time to dig into this. That's helpful. The ability for the user to plug his/her own TxDb into an OrganismDb object is a feature that Marc had on his list and he actually started to do some work on this a few months ago. Don't know how much progress was made but it's definitely a useful feature and something we want to pursue. Some compatibility checks would be needed so the user can't plug anything with anything, or at least s/he gets a warning when s/he tries to put together Mouse, Pig, and Human in his/her OrganismDb object for Frankenstein. But plugging a TxDb from Ensembl into Homo.sapiens should be supported.


Entering edit mode

The above doesn't work for me. That is,

transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene, columns="TXTYPE")

just yields a column of NAs.

<h2>Package Info</h2>
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Users/mfansler/miniconda3/envs/sce_r40/lib/libopenblasp-r0.3.10.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0 GenomicFeatures_1.40.0                  
[3] AnnotationDbi_1.50.0                     Biobase_2.48.0                          
[5] GenomicRanges_1.40.0                     GenomeInfoDb_1.24.0                     
[7] IRanges_2.22.1                           S4Vectors_0.26.0                        
[9] BiocGenerics_0.34.0                     

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.18.1 progress_1.2.2              tidyselect_1.1.0           
 [4] purrr_0.3.4                 lattice_0.20-41             vctrs_0.3.2                
 [7] generics_0.0.2              BiocFileCache_1.12.0        rtracklayer_1.48.0         
[10] blob_1.2.1                  XML_3.99-0.3                rlang_0.4.7                
[13] pillar_1.4.6                glue_1.4.1                  DBI_1.1.0                  
[16] BiocParallel_1.22.0         rappdirs_0.3.1              bit64_4.0.5                
[19] dbplyr_1.4.4                matrixStats_0.56.0          GenomeInfoDbData_1.2.3     
[22] lifecycle_0.2.0             stringr_1.4.0               zlibbioc_1.34.0            
[25] Biostrings_2.56.0           memoise_1.1.0               biomaRt_2.44.0             
[28] curl_4.3                    Rcpp_1.0.5                  openssl_1.4.2              
[31] DelayedArray_0.14.0         XVector_0.28.0              bit_4.0.4                  
[34] Rsamtools_2.4.0             hms_0.5.3                   askpass_1.1                
[37] digest_0.6.25               stringi_1.4.6               dplyr_1.0.2                
[40] grid_4.0.2                  tools_4.0.2                 bitops_1.0-6               
[43] magrittr_1.5                RCurl_1.98-1.2              RSQLite_2.2.0              
[46] tibble_3.0.3                crayon_1.3.4                pkgconfig_2.0.3            
[49] Matrix_1.2-18               ellipsis_0.3.1              prettyunits_1.1.1          
[52] assertthat_0.2.1            httr_1.4.2                  rstudioapi_0.11            
[55] R6_2.4.1                    GenomicAlignments_1.24.0    compiler_4.0.2
Entering edit mode

Please don't just add a comment to a five year old post. If you have a question, please make a new post.

Entering edit mode
sjmonkley ▴ 30
Last seen 5 weeks ago

Hi all

First can I say thank you all so much for all the effort you have all gone to on my behalf. I have to admit that with my limited level of understanding of annotation databases I did not follow all the explanations/suggestions given but I got the gist of it. I approached the problem as I did (Homo.sapiens/AnnotationDbi) because that was based on the way it was done in the RNA Seq workflow I used previously and didn't realise there were so many alternative databases/annotation tools.

In the end I used Johannes solution as it was simple and didn't involve accessing Biomart (which is problematic due to firewalls). It didn't allow me to annotate my dataset as such but did allow me to subset out the protein coding genes which was all I really needed.

 thanks again



Entering edit mode

Actually, you could use the EnsDb objects just like you would use TxDb objects for e.g. feature counting and RNA-seq annotation.

cheers, jo


Login before adding your answer.

Traffic: 465 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6