I am trying to annotate a table of genes with transcript type (TXTYPE) using Homo.sapiens annotation package within Annotation.Dbi . I want to do this so I can select for only rotein coding genes and remove all the RNA encoding and psuedogenes
I m using the following code:
all_genes$TxType <- mapIds(Homo.sapiens,
keys=row.names(all_genes),
column="TXTYPE",
keytype="SYMBOL",
multiVals="first").
However in the Tx.type column is just full of n/a's. The code works with GeneName as column.
Is the problem that only transcripts rather than genes can be annotated with TXTYPE information?
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252 LC_MONETARY=Swedish_Sweden.1252
[4] LC_NUMERIC=C LC_TIME=Swedish_Sweden.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.3 Homo.sapiens_1.3.1
[3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 org.Hs.eg.db_3.2.3
[5] GO.db_3.2.2 RSQLite_1.0.0
[7] DBI_0.3.1 OrganismDbi_1.12.0
[9] GenomicFeatures_1.22.2 GenomicRanges_1.22.0
[11] GenomeInfoDb_1.6.0 AnnotationDbi_1.32.0
[13] IRanges_2.4.0 S4Vectors_0.8.0
[15] Biobase_2.30.0 BiocGenerics_0.16.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 graph_1.48.0 magrittr_1.5
[4] XVector_0.10.0 zlibbioc_1.16.0 GenomicAlignments_1.6.1
[7] BiocParallel_1.4.0 R6_2.1.1 tools_3.2.2
[10] SummarizedExperiment_1.0.0 lambda.r_1.1.7 futile.logger_1.4.1
[13] lazyeval_0.1.10 assertthat_0.1 RBGL_1.46.0
[16] rtracklayer_1.30.1 futile.options_1.0.0 bitops_1.0-6
[19] RCurl_1.95-4.7 biomaRt_2.26.0 BiocInstaller_1.20.0
[22] Biostrings_2.38.0 Rsamtools_1.22.0 XML_3.98-1.3

But do note that you CAN get these data via either transcripts() or transcriptsBy():
> transcriptsBy(Homo.sapiens, by = "gene", columns = c("TXTYPE","TXNAME")) 'select()' returned 1:1 mapping between keys and columns GRangesList object of length 65988: $ENSG00000000003 GRanges object with 5 ranges and 3 metadata columns: seqnames ranges strand | tx_name TXNAME <Rle> <IRanges> <Rle> | <character> <CharacterList> [1] X [100627109, 100637104] - | ENST00000612152 ENST00000612152 [2] X [100628670, 100636806] - | ENST00000373020 ENST00000373020 [3] X [100632063, 100637104] - | ENST00000614008 ENST00000614008 [4] X [100632541, 100636689] - | ENST00000496771 ENST00000496771 [5] X [100633442, 100639991] - | ENST00000494424 ENST00000494424 TXTYPE <CharacterList> [1] protein_coding [2] protein_coding [3] protein_coding [4] processed_transcript [5] processed_transcript $ENSG00000000005 GRanges object with 2 ranges and 3 metadata columns: seqnames ranges strand | tx_name TXNAME [1] X [100584802, 100599885] + | ENST00000373031 ENST00000373031 [2] X [100593624, 100597531] + | ENST00000485971 ENST00000485971 TXTYPE [1] protein_coding [2] processed_transcript $ENSG00000000419 GRanges object with 6 ranges and 3 metadata columns: seqnames ranges strand | tx_name TXNAME [1] 20 [50934867, 50958550] - | ENST00000371588 ENST00000371588 [2] 20 [50934867, 50958550] - | ENST00000466152 ENST00000466152 [3] 20 [50934867, 50958555] - | ENST00000371582 ENST00000371582 [4] 20 [50934896, 50945861] - | ENST00000494752 ENST00000494752 [5] 20 [50934945, 50958521] - | ENST00000371584 ENST00000371584 [6] 20 [50936148, 50958532] - | ENST00000413082 ENST00000413082 TXTYPE [1] protein_coding [2] processed_transcript [3] protein_coding [4] processed_transcript [5] protein_coding [6] protein_coding ... <65985 more elements> ------- seqinfo: 331 sequences from an unspecified genome; no seqlengths > transcripts(Homo.sapiens, columns = c("TXTYPE","TXNAME")) 'select()' returned 1:1 mapping between keys and columns GRanges object with 216133 ranges and 2 metadata columns: seqnames ranges strand | TXNAME <Rle> <IRanges> <Rle> | <CharacterList> [1] CHR_HG126_PATCH [72577040, 72577203] + | ENST00000627793 [2] CHR_HG126_PATCH [72583719, 72583884] + | ENST00000628518 [3] CHR_HG126_PATCH [72374621, 72447493] - | ENST00000628983 [4] CHR_HG126_PATCH [72505075, 72550889] - | ENST00000629165 [5] CHR_HG126_PATCH [72692054, 72692160] - | ENST00000630948 ... ... ... ... ... ... [216129] KI270750.1 [148668, 148843] + | ENST00000612925 [216130] KI270752.1 [ 144, 268] + | ENST00000618580 [216131] KI270752.1 [ 21813, 21944] + | ENST00000620980 [216132] KI270752.1 [ 3497, 3623] - | ENST00000620677 [216133] KI270752.1 [ 9943, 10067] - | ENST00000611978 TXTYPE <CharacterList> [1] snRNA [2] processed_pseudogene [3] protein_coding [4] lincRNA [5] rRNA ... ... [216129] snRNA [216130] miRNA [216131] miRNA [216132] miRNA [216133] miRNA ------- seqinfo: 331 sequences from an unspecified genome; no seqlengthsHi Jim, sjmonkley,
Looks like a bug in the SQL generated by
select(). This is good timing because I actually started to revisit/refactor all the SQL generation in GenomicFeatures a couple of weeks ago, with some interesting speed improvements. All the extractors (exceptselect()) now use the new SQL generator. I was going to doselect()next but got distracted by other things. Moving it back close to the top of my TODO list.H.
Edit: After closer examination, the problem seems to be in the
"select"method for OrganismDb objects (Homo.sapiens), and not in the method for TxDb objects as I thought initially. This one seems to work as expected:So it's not clear that my current work on SQL generation in GenomicFeatures will help address this.
Hi Hervé,
It looks like the problem comes up in OrganismDbi, specifically
OrganismDbi:::.getSelects(), which makes the assumption (probably warranted) that the underlying data sources are the same for all packages wrapped up under Homo.sapiens. In other words, inOrganismDbi:::.getSelects(), the TxDb is inspected for the correct keytype to be used, but it isn't inspected to ensure that the 'fromKeys' match up with the type of keys in 'toKey'. And thenselect()is run with skipValidKeysTest = TRUE, so no error is generated (e.g., Entrez Gene Ids are passed into select(), using GENEID as the keytype, but for the TxDb.Hsapiens.BioMart.ensembl.GRCh38.p3 package, the GENEIDs are Ensembl Gene IDs, not Entrez Gene IDs).It may well be that mixing and matching annotation sources under an OrganismDbi object is a non-starter, and if the OP wants to use a Homo.sapiens package to do this sort of thing, then it should be constructed entirely of Ensembl based objects. Or, probably easier, just use the EnsDb objects instead.
Jim
Thanks Jim for taking the time to dig into this. That's helpful. The ability for the user to plug his/her own TxDb into an OrganismDb object is a feature that Marc had on his list and he actually started to do some work on this a few months ago. Don't know how much progress was made but it's definitely a useful feature and something we want to pursue. Some compatibility checks would be needed so the user can't plug anything with anything, or at least s/he gets a warning when s/he tries to put together Mouse, Pig, and Human in his/her OrganismDb object for Frankenstein. But plugging a TxDb from Ensembl into
Homo.sapiensshould be supported.H.
The above doesn't work for me. That is,
transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene, columns="TXTYPE")just yields a column of
<h2>Package Info</h2>NAs.Please don't just add a comment to a five year old post. If you have a question, please make a new post.