goldberg.jm
Hi All,
How do I map TxDb.Hsapiens.UCSC.hg19.knownGene TXID keys to ENSEMBL transcript IDs or RefSeq IDs? I know this must be a basic task, but I have been playing around with it and googling all afternoon and have not cracked it. I am sure Valerie Obenchain and many others can answer this easily.
Thank you,
Jon Goldberg

But perhaps you really meant the TXNAME?
> select(Homo.sapiens, head(keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXNAME")), "ENSEMBLTRANS", "TXNAME")
'select()' returned 1:many mapping between keys and columns
      TXNAME    ENSEMBLTRANS
1 uc001aaa.3 ENST00000456328
2 uc001aaa.3 ENST00000450305
3 uc010nxq.1 ENST00000456328
4 uc010nxq.1 ENST00000450305
5 uc010nxr.1 ENST00000456328
6 uc010nxr.1 ENST00000450305
7 uc001aal.1 ENST00000335137
8 uc001aaq.2            <NA>
9 uc001aar.2            <NA>

Thank you... works like a charm! I really do want TXID-ENST pairings but it's good to see other examples as well.
In theory TXID/TXNAME - ENST mappings should be 1-to-1, right? Do you know why they are not?
That isn't really a valid assumption. Hypothetically, there should be some known set of transcripts that everybody agrees upon, and the only difference between UCSC and Ensembl would be what they called them. But that's not the case.
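One way to check the 1-to-1 assumption directly is to count how many distinct partners each ID has in both directions. The following is a minimal sketch in base R, assuming the rows of the select() output earlier in the thread are typed in by hand; the names `map`, `mapped`, `ens_per_ucsc`, `ucsc_per_ens` and `is_one_to_one` are illustrative and do not come from any package.

```r
# Rows copied by hand from the select() output earlier in the thread
map <- data.frame(
  TXNAME = c("uc001aaa.3", "uc001aaa.3", "uc010nxq.1", "uc010nxq.1",
             "uc010nxr.1", "uc010nxr.1", "uc001aal.1", "uc001aaq.2",
             "uc001aar.2"),
  ENSEMBLTRANS = c("ENST00000456328", "ENST00000450305",
                   "ENST00000456328", "ENST00000450305",
                   "ENST00000456328", "ENST00000450305",
                   "ENST00000335137", NA, NA),
  stringsAsFactors = FALSE
)

# Drop unmapped rows and duplicate pairs before counting
mapped <- unique(map[!is.na(map$ENSEMBLTRANS), ])

# Number of Ensembl transcripts per UCSC transcript, and the reverse
ens_per_ucsc <- table(mapped$TXNAME)
ucsc_per_ens <- table(mapped$ENSEMBLTRANS)

# A 1-to-1 mapping would have every count equal to 1 in both tables
is_one_to_one <- all(ens_per_ucsc == 1) && all(ucsc_per_ens == 1)
print(ens_per_ucsc)
print(ucsc_per_ens)
print(is_one_to_one)
```

The same counting should work on the full data frame returned by select(), since it only relies on the two ID columns.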
As an example, let's take Entrez Gene ID 1. Both UCSC and Ensembl agree that this is a gene:
> select(org.Hs.eg.db, "1", c("SYMBOL", "ENSEMBL"))
'select()' returned 1:1 mapping between keys and columns
  ENTREZID SYMBOL         ENSEMBL
1        1   A1BG ENSG00000121410

But they don't agree on the transcripts:
> select(Homo.sapiens, "1", c("TXNAME", "ENSEMBLTRANS"), "ENTREZID")
'select()' returned 1:many mapping between keys and columns
  ENTREZID ENSEMBLTRANS     TXNAME
1        1         <NA> uc002qsd.4
2        1         <NA> uc002qsf.2

We can see where they differ, using both versions of GRCh37:
> txuc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txens <- transcriptsBy(EnsDb.Hsapiens.v75)
> txuc[1]
GRangesList object of length 1:
$1
GRanges object with 2 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 [58858172, 58864865]      - |     70455
  [2]    chr19 [58859832, 58874214]      - |     70456

> txens["ENSG00000121410"]
GRangesList object of length 1:
$ENSG00000121410
GRanges object with 5 ranges and 6 metadata columns:
      seqnames               ranges strand |           tx_id
         <Rle>            <IRanges>  <Rle> |     <character>
  [1]       19 [58858216, 58864865]      - | ENST00000263100
  [2]       19 [58858224, 58864857]      - | ENST00000595014
  [3]       19 [58861960, 58864495]      - | ENST00000600966
  [4]       19 [58858226, 58859023]      - | ENST00000598345
  [5]       19 [58856544, 58859000]      - | ENST00000596924

Here UCSC says there are two transcripts for this gene, and Ensembl says there are five(!), and none of them are the same as what UCSC says. And even if there is a 'mapping' between the two, it's not really a 1-1 correspondence:
> select(Homo.sapiens, "10", c("TXNAME", "ENSEMBLTRANS","ENSEMBL"), "ENTREZID")
'select()' returned 1:many mapping between keys and columns
  ENTREZID         ENSEMBL    ENSEMBLTRANS     TXNAME
1       10 ENSG00000156006 ENST00000286479 uc003wyw.1
2       10 ENSG00000156006 ENST00000520116 uc003wyw.1
> txuc["10"]
GRangesList object of length 1:
$10
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand |     tx_id
         <Rle>            <IRanges>  <Rle> | <integer>
  [1]     chr8 [18248755, 18258723]      + |     31944

> txens["ENSG00000156006"]
GRangesList object of length 1:
$ENSG00000156006
GRanges object with 2 ranges and 6 metadata columns:
      seqnames               ranges strand |           tx_id
         <Rle>            <IRanges>  <Rle> |     <character> <character>
  [1]        8 [18248755, 18258728]      + | ENST00000286479
  [2]        8 [18248797, 18258503]      + | ENST00000520116
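
To make the mismatch for Entrez Gene ID 10 concrete, the coordinates printed above can be compared directly. This is a rough sketch in base R with the values typed in by hand; it only looks at the outer transcript bounds (not exon structure), and the names `ucsc_start`, `ucsc_end`, `ens`, `identical_bounds` and `overlap_bp` are made up for illustration.

```r
# Coordinates copied by hand from the txuc["10"] and
# txens["ENSG00000156006"] output above (chr8 / 8, + strand)
ucsc_start <- 18248755
ucsc_end   <- 18258723

ens <- data.frame(
  tx_id = c("ENST00000286479", "ENST00000520116"),
  start = c(18248755, 18248797),
  end   = c(18258728, 18258503),
  stringsAsFactors = FALSE
)

# Does either Ensembl transcript have exactly the UCSC start and end?
ens$identical_bounds <- ens$start == ucsc_start & ens$end == ucsc_end

# Number of bases each Ensembl transcript span shares with the UCSC span
# (closed intervals, hence the + 1; pmax(0, ...) guards against no overlap)
ens$overlap_bp <- pmax(0, pmin(ens$end, ucsc_end) -
                          pmax(ens$start, ucsc_start) + 1)

print(ens)
```

Both Ensembl transcripts overlap the single UCSC transcript, but neither has identical bounds, which is consistent with the mapping being an association rather than an identity. With the real GRanges objects, functions such as findOverlaps() from GenomicRanges would be the more natural tool, after harmonizing the "chr8" versus "8" sequence names.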