Search
Question: annotating gene for transcript type/biotype
0
2.6 years ago by
sjmonkley20
Sweden
sjmonkley20 wrote:

I am trying to annotate a table of genes with transcript type (TXTYPE) using Homo.sapiens annotation package within Annotation.Dbi . I want to do this so I can select for only rotein coding genes and remove all the RNA encoding and psuedogenes

I m using the following code:

all_genes$TxType <- mapIds(Homo.sapiens, keys=row.names(all_genes), column="TXTYPE", keytype="SYMBOL", multiVals="first"). However in the Tx.type column is just full of n/a's. The code works with GeneName as column. Is the problem that only transcripts rather than genes can be annotated with TXTYPE information? sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252 LC_MONETARY=Swedish_Sweden.1252 [4] LC_NUMERIC=C LC_TIME=Swedish_Sweden.1252 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_0.4.3 Homo.sapiens_1.3.1 [3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 org.Hs.eg.db_3.2.3 [5] GO.db_3.2.2 RSQLite_1.0.0 [7] DBI_0.3.1 OrganismDbi_1.12.0 [9] GenomicFeatures_1.22.2 GenomicRanges_1.22.0 [11] GenomeInfoDb_1.6.0 AnnotationDbi_1.32.0 [13] IRanges_2.4.0 S4Vectors_0.8.0 [15] Biobase_2.30.0 BiocGenerics_0.16.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.1 graph_1.48.0 magrittr_1.5 [4] XVector_0.10.0 zlibbioc_1.16.0 GenomicAlignments_1.6.1 [7] BiocParallel_1.4.0 R6_2.1.1 tools_3.2.2 [10] SummarizedExperiment_1.0.0 lambda.r_1.1.7 futile.logger_1.4.1 [13] lazyeval_0.1.10 assertthat_0.1 RBGL_1.46.0 [16] rtracklayer_1.30.1 futile.options_1.0.0 bitops_1.0-6 [19] RCurl_1.95-4.7 biomaRt_2.26.0 BiocInstaller_1.20.0 [22] Biostrings_2.38.0 Rsamtools_1.22.0 XML_3.98-1.3 ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by sjmonkley20 2 2.6 years ago by Johannes Rainer1.3k Italy Johannes Rainer1.3k wrote: Alternatively, you could use ensembldb based annotation packages, as e.g. the EnsDb.Hsapiens.v75 which provides annotations from Ensembl version 75. In the code below we retrieve the mapping between gene ids, gene names, gene biotypes, transcript biotypes and transcript ids. > library(EnsDb.Hsapiens.v75) > edb <- EnsDb.Hsapiens.v75> txtypes <- genes(edb, columns=c("gene_name", "gene_biotype", "tx_biotype", "tx_id")) > head(txtypes) GRanges object with 6 ranges and 5 metadata columns: seqnames ranges strand | gene_id <Rle> <IRanges> <Rle> | <character> ENSG00000000003 X [99883667, 99894988] - | ENSG00000000003 ENSG00000000003 X [99883667, 99894988] - | ENSG00000000003 ENSG00000000003 X [99883667, 99894988] - | ENSG00000000003 ENSG00000000005 X [99839799, 99854882] + | ENSG00000000005 ENSG00000000005 X [99839799, 99854882] + | ENSG00000000005 ENSG00000000419 20 [49551404, 49575092] - | ENSG00000000419 gene_name gene_biotype tx_biotype <character> <character> <character> ENSG00000000003 TSPAN6 protein_coding protein_coding ENSG00000000003 TSPAN6 protein_coding processed_transcript ENSG00000000003 TSPAN6 protein_coding processed_transcript ENSG00000000005 TNMD protein_coding protein_coding ENSG00000000005 TNMD protein_coding processed_transcript ENSG00000000419 DPM1 protein_coding protein_coding tx_id <character> ENSG00000000003 ENST00000373020 ENSG00000000003 ENST00000494424 ENSG00000000003 ENST00000496771 ENSG00000000005 ENST00000373031 ENSG00000000005 ENST00000485971 ENSG00000000419 ENST00000371582 ------- seqinfo: 273 sequences from GRCh37 genome Using filters you can also retrieve just protein coding genes from the database: > protGenes <- genes(edb, filter=GenebiotypeFilter("protein_coding")) > protGenes GRanges object with 22810 ranges and 5 metadata columns: seqnames ranges strand | <Rle> <IRanges> <Rle> | ENSG00000000003 X [ 99883667, 99894988] - | ENSG00000000005 X [ 99839799, 99854882] + | ENSG00000000419 20 [ 49551404, 49575092] - | ENSG00000000457 1 [169818772, 169863408] - | ENSG00000000460 1 [169631245, 169823221] + | ... ... ... ... ... ENSG00000273467 HSCHR19LRC_LRC_S_CTG1 [54704726, 54711511] + | ENSG00000273469 HSCHR19LRC_COX1_CTG1 [54618837, 54635140] + | ENSG00000273470 HSCHR19LRC_LRC_T_CTG1 [54677107, 54693733] - | ENSG00000273482 HG957_PATCH [53190025, 53226733] + | ENSG00000273490 HSCHR19LRC_LRC_J_CTG1 [54693789, 54697585] + | gene_id gene_name entrezid gene_biotype <character> <character> <character> <character> ENSG00000000003 ENSG00000000003 TSPAN6 7105 protein_coding ENSG00000000005 ENSG00000000005 TNMD 64102 protein_coding ENSG00000000419 ENSG00000000419 DPM1 8813 protein_coding ENSG00000000457 ENSG00000000457 SCYL3 57147 protein_coding ENSG00000000460 ENSG00000000460 C1orf112 55732 protein_coding ... ... ... ... ... ENSG00000273467 ENSG00000273467 RPS9 6203 protein_coding ENSG00000273469 ENSG00000273469 PRPF31 26121 protein_coding ENSG00000273470 ENSG00000273470 MBOAT7 79143 protein_coding ENSG00000273482 ENSG00000273482 PRKCD 5580 protein_coding ENSG00000273490 ENSG00000273490 TSEN34 79042 protein_coding seq_coord_system <character> ENSG00000000003 chromosome ENSG00000000005 chromosome ENSG00000000419 chromosome ENSG00000000457 chromosome ENSG00000000460 chromosome ... ... ENSG00000273467 chromosome ENSG00000273469 chromosome ENSG00000273470 chromosome ENSG00000273482 chromosome ENSG00000273490 chromosome ------- seqinfo: 221 sequences from GRCh37 genome If you need more recent Ensembl based annotations you can check the vignette of the ensembldb package on how to create such annotation packages. hope that helps. cheers, jo ADD COMMENTlink written 2.6 years ago by Johannes Rainer1.3k 1 2.6 years ago by United States James W. MacDonald46k wrote: This looks like there might be a bug somewhere, but I haven't got it tracked down just yet. But first, do note that there isn't a TXTYPE for the UCSC-based TxDb packages. This is because UCSC doesn't provide that sort of information - the underlying transcripts table in e.g., TxDb.Hsapiens.UCSC.hg19.knownGene has all NA values under the tx_type column, which is where those data come from. However, Ensembl does have this sort of information, so if you build a TxDb based on their data, then it should in theory work. I sort of went the long way round, but here is the jist: > txdb <- makeTxDbFromBiomart() Download and preprocess the 'transcripts' data frame ... OK Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped) Download and preprocess the 'splicings' data frame ... OK Download and preprocess the 'genes' data frame ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Warning message: In .normarg_makeTxDb_chrominfo(chrominfo, transcripts$tx_chrom,  :
chromosome lengths and circularity flags are not available for this TxDb object

> makeTxDbPackage(txdb, "0.1.1", "me <me@mine.org>", "me")
Creating package in ./TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3

> install.packages("TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3/", repos = NULL)
* installing *source* package  TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3  ...
** R
** inst
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3)

> library(Homo.sapiens)

Now getting the GODb Object directly
Now getting the OrgDb Object directly
Now getting the TxDb Object directly

Note that this Homo.sapiens is using the TxDb.Hsapiens.UCSC.hg19.knownGene package, but we can over-ride

> TxDb(Homo.sapiens) <- txdb
Now getting the GODb Object directly
Now getting the OrgDb Object directly

> Homo.sapiens
OrganismDb Object:
# Includes GODb Object:  GO.db
# With data about:  Gene Ontology
# Includes OrgDb Object:  org.Hs.eg.db
# Gene data about:  Homo sapiens
# Taxonomy Id:  9606
# Includes TxDb Object:  TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3
# Transcriptome data about:  Homo sapiens
# Based on genome:
# The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .

> z <- select(Homo.sapiens, keys(Homo.sapiens, "ENSEMBL")[1:500], "TXTYPE","ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
ENSEMBL TXTYPE
1 ENSG00000121410   <NA>
2 ENSG00000175899   <NA>
3 ENSG00000256069   <NA>
4 ENSG00000171428   <NA>
5 ENSG00000156006   <NA>
6 ENSG00000196136   <NA>

Which is a bummer. But the transcripts table does have this info!

> con <- dbConnect(SQLite(), paste0(path.package('TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3'), '/extdata/TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3.sqlite'))
> dbGetQuery(con, "select _tx_id, tx_name, tx_type from transcript limit 5;")
_tx_id         tx_name              tx_type
1      1 ENST00000627793                snRNA
2      2 ENST00000628518 processed_pseudogene
3      3 ENST00000628983       protein_coding
4      4 ENST00000629165              lincRNA
5      5 ENST00000630948                 rRNA


So evidently there is a problem extracting the tx_type data correctly. I don't have the time just now to track this down, but unless somebody else volunteers I will try to get to it ASAP.

But do note that you CAN get these data via either transcripts() or transcriptsBy():

> transcriptsBy(Homo.sapiens, by = "gene", columns = c("TXTYPE","TXNAME"))
'select()' returned 1:1 mapping between keys and columns
GRangesList object of length 65988:
$ENSG00000000003 GRanges object with 5 ranges and 3 metadata columns: seqnames ranges strand | tx_name TXNAME <Rle> <IRanges> <Rle> | <character> <CharacterList> [1] X [100627109, 100637104] - | ENST00000612152 ENST00000612152 [2] X [100628670, 100636806] - | ENST00000373020 ENST00000373020 [3] X [100632063, 100637104] - | ENST00000614008 ENST00000614008 [4] X [100632541, 100636689] - | ENST00000496771 ENST00000496771 [5] X [100633442, 100639991] - | ENST00000494424 ENST00000494424 TXTYPE <CharacterList> [1] protein_coding [2] protein_coding [3] protein_coding [4] processed_transcript [5] processed_transcript$ENSG00000000005
GRanges object with 2 ranges and 3 metadata columns:
seqnames                 ranges strand |         tx_name          TXNAME
[1]        X [100584802, 100599885]      + | ENST00000373031 ENST00000373031
[2]        X [100593624, 100597531]      + | ENST00000485971 ENST00000485971
TXTYPE
[1]       protein_coding
[2] processed_transcript

\$ENSG00000000419
GRanges object with 6 ranges and 3 metadata columns:
seqnames               ranges strand |         tx_name          TXNAME
[1]       20 [50934867, 50958550]      - | ENST00000371588 ENST00000371588
[2]       20 [50934867, 50958550]      - | ENST00000466152 ENST00000466152
[3]       20 [50934867, 50958555]      - | ENST00000371582 ENST00000371582
[4]       20 [50934896, 50945861]      - | ENST00000494752 ENST00000494752
[5]       20 [50934945, 50958521]      - | ENST00000371584 ENST00000371584
[6]       20 [50936148, 50958532]      - | ENST00000413082 ENST00000413082
TXTYPE
[1]       protein_coding
[2] processed_transcript
[3]       protein_coding
[4] processed_transcript
[5]       protein_coding
[6]       protein_coding

...
<65985 more elements>
-------
seqinfo: 331 sequences from an unspecified genome; no seqlengths
> transcripts(Homo.sapiens, columns = c("TXTYPE","TXNAME"))
'select()' returned 1:1 mapping between keys and columns
GRanges object with 216133 ranges and 2 metadata columns:
seqnames               ranges strand   |          TXNAME
<Rle>            <IRanges>  <Rle>   | <CharacterList>
[1] CHR_HG126_PATCH [72577040, 72577203]      +   | ENST00000627793
[2] CHR_HG126_PATCH [72583719, 72583884]      +   | ENST00000628518
[3] CHR_HG126_PATCH [72374621, 72447493]      -   | ENST00000628983
[4] CHR_HG126_PATCH [72505075, 72550889]      -   | ENST00000629165
[5] CHR_HG126_PATCH [72692054, 72692160]      -   | ENST00000630948
...             ...                  ...    ... ...             ...
[216129]      KI270750.1     [148668, 148843]      +   | ENST00000612925
[216130]      KI270752.1     [   144,    268]      +   | ENST00000618580
[216131]      KI270752.1     [ 21813,  21944]      +   | ENST00000620980
[216132]      KI270752.1     [  3497,   3623]      -   | ENST00000620677
[216133]      KI270752.1     [  9943,  10067]      -   | ENST00000611978
TXTYPE
<CharacterList>
[1]                snRNA
[2] processed_pseudogene
[3]       protein_coding
[4]              lincRNA
[5]                 rRNA
...                  ...
[216129]                snRNA
[216130]                miRNA
[216131]                miRNA
[216132]                miRNA
[216133]                miRNA
-------
seqinfo: 331 sequences from an unspecified genome; no seqlengths
1

Hi Jim, sjmonkley,

Looks like a bug in the SQL generated by select(). This is good timing because I actually started to revisit/refactor all the SQL generation in GenomicFeatures a couple of weeks ago, with some interesting speed improvements. All the extractors (except select()) now use the new SQL generator. I was going to do select() next but got distracted by other things. Moving it back close to the top of my TODO list.

H.

Edit: After closer examination, the problem seems to be in the "select" method for OrganismDb objects (Homo.sapiens), and not in the method for TxDb objects as I thought initially. This one seems to work as expected:

txdb <- makeTxDbFromBiomart(dataset="celegans_gene_ensembl")
# 'select()' returned many:1 mapping between keys and columns
#   TXID         TXTYPE
# 1    1 protein_coding
# 2    2 protein_coding
# 3    3 protein_coding
# 4    4 protein_coding
# 5    5 protein_coding
# 6    6 protein_coding

So it's not clear that my current work on SQL generation in GenomicFeatures will help address this.

Hi Hervé,

It looks like the problem comes up in OrganismDbi, specifically OrganismDbi:::.getSelects(), which makes the assumption (probably warranted) that the underlying data sources are the same for all packages wrapped up under Homo.sapiens. In other words, in OrganismDbi:::.getSelects(), the TxDb is inspected for the correct keytype to be used, but it isn't inspected to ensure that the 'fromKeys' match up with the type of keys in 'toKey'. And then select() is run with skipValidKeysTest = TRUE, so no error is generated (e.g., Entrez Gene Ids are passed into select(), using GENEID as the keytype, but for the TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3 package, the GENEIDs are Ensembl Gene IDs, not Entrez Gene IDs).

It may well be that mixing and matching annotation sources under an OrganismDbi object is a non-starter, and if the OP wants to use a Homo.sapiens package to do this sort of thing, then it should be constructed entirely of Ensembl based objects. Or, probably easier, just use the EnsDb objects instead.

Jim

Thanks Jim for taking the time to dig into this. That's helpful. The ability for the user to plug his/her own TxDb into an OrganismDb object is a feature that Marc had on his list and he actually started to do some work on this a few months ago. Don't know how much progress was made but it's definitely a useful feature and something we want to pursue. Some compatibility checks would be needed so the user can't plug anything with anything, or at least s/he gets a warning when s/he tries to put together Mouse, Pig, and Human in his/her OrganismDb object for Frankenstein. But plugging a TxDb from Ensembl into Homo.sapiens should be supported.

H.

0
2.6 years ago by
sjmonkley20
Sweden
sjmonkley20 wrote:

Hi all

First can I say thank you all so much for all the effort you have all gone to on my behalf. I have to admit that with my limited level of understanding of annotation databases I did not follow all the explanations/suggestions given but I got the gist of it. I approached the problem as I did (Homo.sapiens/AnnotationDbi) because that was based on the way it was done in the RNA Seq workflow I used previously and didn't realise there were so many alternative databases/annotation tools.

In the end I used Johannes solution as it was simple and didn't involve accessing Biomart (which is problematic due to firewalls). It didn't allow me to annotate my dataset as such but did allow me to subset out the protein coding genes which was all I really needed.

thanks again

Sue

1

Actually, you could use the EnsDb objects just like you would use TxDb objects for e.g. feature counting and RNA-seq annotation.

cheers, jo