Map TxDb.Hsapiens.UCSC.hg19.knownGene TXID keys to ENSEMBL transcript ID or Refseq ID
goldberg.jm ▴ 10
@goldbergjm-9751

Hi All,

How do I map TxDb.Hsapiens.UCSC.hg19.knownGene TXID keys to ENSEMBL transcript IDs or Refseq IDs?  I know this must be a basic task, but I have been playing around with it and googling all afternoon and have not cracked it. I am sure Valerie Obenchain and many others can answer this easily.

Thank you,

Jon Goldberg

annotationdbi • 3.3k views
@james-w-macdonald-5106

Use the Homo.sapiens package, which will do the mapping.

> library(Homo.sapiens)
> select(Homo.sapiens, head(keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXID")), "ENSEMBLTRANS", "TXID")
'select()' returned 1:many mapping between keys and columns
  TXID    ENSEMBLTRANS
1    1 ENST00000456328
2    1 ENST00000450305
3    2 ENST00000456328
4    2 ENST00000450305
5    3 ENST00000456328
6    3 ENST00000450305
7    4 ENST00000335137
8    5            <NA>
9    6            <NA>
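If you want one row per TXID instead of the long 1:many table, here is a minimal base R sketch. It assumes the result of `select()` is an ordinary data frame with `TXID` and `ENSEMBLTRANS` columns; the rows below are copied from the output above so the example runs without any annotation packages, and the names `map`, `by_txid`, `n_hits` and `collapsed` are made up for illustration.

```r
# Rows copied from the select() output above (first six TXID keys).
map <- data.frame(
  TXID = c(1, 1, 2, 2, 3, 3, 4, 5, 6),
  ENSEMBLTRANS = c("ENST00000456328", "ENST00000450305",
                   "ENST00000456328", "ENST00000450305",
                   "ENST00000456328", "ENST00000450305",
                   "ENST00000335137", NA, NA),
  stringsAsFactors = FALSE
)

# One list element per TXID, holding every Ensembl transcript ID it maps to.
by_txid <- split(map$ENSEMBLTRANS, map$TXID)

# Number of non-missing Ensembl IDs per TXID (0 means no mapping was found).
n_hits <- vapply(by_txid, function(x) sum(!is.na(x)), integer(1))

# Collapse to one row per TXID, joining multiple IDs with ";".
collapsed <- data.frame(
  TXID = names(by_txid),
  ENSEMBLTRANS = vapply(by_txid, function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
  }, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Keeping `by_txid` as a list is usually the safer choice if you need the individual IDs later; the ";"-joined column is only convenient for writing a table to disk.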

But perhaps you really meant the TXNAME?

> select(Homo.sapiens, head(keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXNAME")), "ENSEMBLTRANS", "TXNAME")
'select()' returned 1:many mapping between keys and columns
      TXNAME    ENSEMBLTRANS
1 uc001aaa.3 ENST00000456328
2 uc001aaa.3 ENST00000450305
3 uc010nxq.1 ENST00000456328
4 uc010nxq.1 ENST00000450305
5 uc010nxr.1 ENST00000456328
6 uc010nxr.1 ENST00000450305
7 uc001aal.1 ENST00000335137
8 uc001aaq.2            <NA>
9 uc001aar.2            <NA>

Thank you... works like a charm! I really do want TXID-ENST pairings but it's good to see other examples as well.


In theory, TXID/TXNAME to ENST mappings should be 1-to-1, right? Do you know why they are not?


That isn't really a valid assumption. Hypothetically, there should be some known set of transcripts that everybody agrees upon, and the only difference between UCSC and Ensembl would be what they called them. But that's not the case.

As an example, let's take Entrez Gene ID 1. Both UCSC and Ensembl agree that this is a gene:

> select(org.Hs.eg.db, "1", c("SYMBOL", "ENSEMBL"))
 'select()' returned 1:1 mapping between keys and columns
   ENTREZID SYMBOL         ENSEMBL
 1        1   A1BG ENSG00000121410

But they don't agree on the transcripts:

> select(Homo.sapiens, "1", c("TXNAME", "ENSEMBLTRANS"), "ENTREZID")
 'select()' returned 1:many mapping between keys and columns
   ENTREZID ENSEMBLTRANS     TXNAME
 1        1         <NA> uc002qsd.4
 2        1         <NA> uc002qsf.2

We can see where they differ, using both versions of GRCh37:

> library(EnsDb.Hsapiens.v75)
> txuc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txens <- transcriptsBy(EnsDb.Hsapiens.v75)

> txuc[1]
 GRangesList object of length 1:
 $1
 GRanges object with 2 ranges and 2 metadata columns:
       seqnames               ranges strand |     tx_id     tx_name
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]    chr19 [58858172, 58864865]      - |     70455  uc002qsd.4
   [2]    chr19 [58859832, 58874214]      - |     70456  uc002qsf.2

> txens["ENSG00000121410"]
 GRangesList object of length 1:
 $ENSG00000121410
 GRanges object with 5 ranges and 6 metadata columns:
       seqnames               ranges strand |           tx_id
          <Rle>            <IRanges>  <Rle> |     <character>
   [1]       19 [58858216, 58864865]      - | ENST00000263100
   [2]       19 [58858224, 58864857]      - | ENST00000595014
   [3]       19 [58861960, 58864495]      - | ENST00000600966
   [4]       19 [58858226, 58859023]      - | ENST00000598345
   [5]       19 [58856544, 58859000]      - | ENST00000596924


Here UCSC says there are two transcripts for this gene, and Ensembl says there are five(!), and none of them are the same as what UCSC says. And even if there is a 'mapping' between the two, it's not really a 1-1 correspondence:

 > select(Homo.sapiens, "10", c("TXNAME", "ENSEMBLTRANS","ENSEMBL"), "ENTREZID")
 'select()' returned 1:many mapping between keys and columns
   ENTREZID         ENSEMBL    ENSEMBLTRANS     TXNAME
 1       10 ENSG00000156006 ENST00000286479 uc003wyw.1
 2       10 ENSG00000156006 ENST00000520116 uc003wyw.1

 > txuc["10"]
 GRangesList object of length 1:
 $10
 GRanges object with 1 range and 2 metadata columns:
       seqnames               ranges strand |     tx_id     tx_name
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]     chr8 [18248755, 18258723]      + |     31944  uc003wyw.1

 
> txens["ENSG00000156006"]
 GRangesList object of length 1:
 $ENSG00000156006
 GRanges object with 2 ranges and 6 metadata columns:
       seqnames               ranges strand |           tx_id
          <Rle>            <IRanges>  <Rle> |     <character>
   [1]        8 [18248755, 18258728]      + | ENST00000286479
   [2]        8 [18248797, 18258503]      + | ENST00000520116
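To make the "none of them are the same" point concrete, here is a rough sketch that compares the coordinates printed above using plain data frames. The numbers are copied from the `txuc` and `txens` output in this post; the data frame names (`ucsc`, `ensembl`, `exact`, `pairs`, `partial`) are made up for illustration, and chromosome and strand are left out because each gene sits on one chromosome and strand in both annotations. On the real GRanges objects you would more likely harmonise the seqlevels and use something like `findOverlaps(..., type = "equal")`, which I have not run here.

```r
# Transcript coordinates copied from the txuc and txens output above.
# "gene" is the Entrez Gene ID; Ensembl genes are matched to it by hand.
ucsc <- data.frame(
  gene  = c("1", "1", "10"),
  start = c(58858172, 58859832, 18248755),
  end   = c(58864865, 58874214, 18258723)
)
ensembl <- data.frame(
  gene  = c(rep("1", 5), rep("10", 2)),
  tx_id = c("ENST00000263100", "ENST00000595014", "ENST00000600966",
            "ENST00000598345", "ENST00000596924",
            "ENST00000286479", "ENST00000520116"),
  start = c(58858216, 58858224, 58861960, 58858226, 58856544,
            18248755, 18248797),
  end   = c(58864865, 58864857, 58864495, 58859023, 58859000,
            18258728, 18258503)
)

# Transcripts whose start AND end are identical in both annotations.
exact <- merge(ucsc, ensembl, by = c("gene", "start", "end"))
nrow(exact)

# Relaxed comparison: pairs from the same gene that share a start OR an end.
pairs <- merge(ucsc, ensembl, by = "gene", suffixes = c(".ucsc", ".ens"))
partial <- pairs[pairs$start.ucsc == pairs$start.ens |
                 pairs$end.ucsc == pairs$end.ens, ]
partial[, c("gene", "tx_id")]
```

For these two genes no Ensembl transcript has exactly the same span as a UCSC transcript; the closest ones only share a single boundary, which is why an ID-to-ID table can't be treated as a statement that the transcripts are identical.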


 
