Question

Map TxDb.Hsapiens.UCSC.hg19.knownGene TXID keys to ENSEMBL transcript ID or Refseq ID

0

goldberg.jm ▴ 10

@goldbergjm-9751

Last seen 6.5 years ago

Hi All,

How do I map TxDb.Hsapiens.UCSC.hg19.knownGene TXID keys to ENSEMBL transcript IDs or Refseq IDs? I know this must be a basic task, but I have been playing around with it and googling all afternoon and have not cracked it. I am sure Valerie Obenchain and many others can answer this easily.

Thank you,

Jon Goldberg

annotationdbi • 3.3k views

ADD COMMENT • link updated 8.1 years ago by James W. MacDonald 67k • written 8.1 years ago by goldberg.jm ▴ 10

score 1 · Answer 1 · 2016-12-15

1

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Use the Homo.sapiens package, which will do the mapping.

> select(Homo.sapiens, head(keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXID")), "ENSEMBLTRANS", "TXID")
'select()' returned 1:many mapping between keys and columns
  TXID    ENSEMBLTRANS
1    1 ENST00000456328
2    1 ENST00000450305
3    2 ENST00000456328
4    2 ENST00000450305
5    3 ENST00000456328
6    3 ENST00000450305
7    4 ENST00000335137
8    5            <NA>
9    6            <NA>
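The original question also asked about RefSeq IDs. A minimal sketch of the same call with the packages loaded explicitly and a RefSeq column added is below. It assumes that "REFSEQ" is among columns(Homo.sapiens) (it should come from org.Hs.eg.db) and that the packages are installed from Bioconductor. As far as I know, Homo.sapiens links the TxDb to org.Hs.eg.db through the gene ID, so the ENSEMBLTRANS and REFSEQ values returned are those recorded for the transcript's gene rather than a per-transcript match.

```r
# Assumes Homo.sapiens and its dependencies are installed from Bioconductor.
library(Homo.sapiens)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb  <- TxDb.Hsapiens.UCSC.hg19.knownGene
txids <- head(keys(txdb, keytype = "TXID"))

# Assumption: "REFSEQ" is an available column; check with columns(Homo.sapiens).
res <- select(Homo.sapiens, keys = txids,
              columns = c("ENSEMBLTRANS", "REFSEQ"), keytype = "TXID")
head(res)
```

Expect a 1:many result here as well, since each gene can carry several Ensembl and RefSeq accessions.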

ADD COMMENT • link 8.1 years ago James W. MacDonald 67k

1

But perhaps you really meant the TXNAME?

> select(Homo.sapiens, head(keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXNAME")), "ENSEMBLTRANS", "TXNAME")
'select()' returned 1:many mapping between keys and columns
      TXNAME    ENSEMBLTRANS
1 uc001aaa.3 ENST00000456328
2 uc001aaa.3 ENST00000450305
3 uc010nxq.1 ENST00000456328
4 uc010nxq.1 ENST00000450305
5 uc010nxr.1 ENST00000456328
6 uc010nxr.1 ENST00000450305
7 uc001aal.1 ENST00000335137
8 uc001aaq.2            <NA>
9 uc001aar.2            <NA>

ADD REPLY • link 8.1 years ago James W. MacDonald 67k

0

Thank you... works like a charm! I really do want TXID-ENST pairings but it's good to see other examples as well.

ADD REPLY • link 8.1 years ago goldberg.jm ▴ 10

0

In theory, TXID/TXNAME to ENST mappings should be 1-to-1, right? Do you know why they are not?

ADD REPLY • link 8.1 years ago goldberg.jm ▴ 10

1

That isn't really a valid assumption. Hypothetically, there should be some known set of transcripts that everybody agrees upon, and the only difference between UCSC and Ensembl would be what they called them. But that's not the case.

As an example, let's take Entrez Gene ID 1. Both UCSC and Ensembl agree that this is a gene:

> select(org.Hs.eg.db, "1", c("SYMBOL", "ENSEMBL"))
 'select()' returned 1:1 mapping between keys and columns
   ENTREZID SYMBOL         ENSEMBL
 1        1   A1BG ENSG00000121410

But they don't agree on the transcripts:

> select(Homo.sapiens, "1", c("TXNAME", "ENSEMBLTRANS"), "ENTREZID")
 'select()' returned 1:many mapping between keys and columns
   ENTREZID ENSEMBLTRANS     TXNAME
 1        1         <NA> uc002qsd.4
 2        1         <NA> uc002qsf.2

We can see where they differ, using both versions of GRCh37:

 > txuc <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)

 > txens <- transcriptsBy(EnsDb.Hsapiens.v75)

> txuc[1]
 GRangesList object of length 1:
 $1
 GRanges object with 2 ranges and 2 metadata columns:
       seqnames               ranges strand |     tx_id     
          <Rle>            <IRanges>  <Rle> | <integer> <character>
   [1]    chr19 [58858172, 58864865]      - |     70455  
   [2]    chr19 [58859832, 58874214]      - |     70456  

> txens["ENSG00000121410"]
 GRangesList object of length 1:
 $ENSG00000121410
 GRanges object with 5 ranges and 6 metadata columns:
       seqnames               ranges strand |           tx_id
          <Rle>            <IRanges>  <Rle> |     <character>
   [1]       19 [58858216, 58864865]      - | ENST00000263100
   [2]       19 [58858224, 58864857]      - | ENST00000595014
   [3]       19 [58861960, 58864495]      - | ENST00000600966
   [4]       19 [58858226, 58859023]      - | ENST00000598345
   [5]       19 [58856544, 58859000]      - | ENST00000596924

Here UCSC says there are two transcripts for this gene, Ensembl says there are five(!), and none of the Ensembl transcripts has the same coordinates as either UCSC transcript. And even when there is a 'mapping' between the two, it's not really a 1-1 correspondence:

 > select(Homo.sapiens, "10", c("TXNAME", "ENSEMBLTRANS","ENSEMBL"), "ENTREZID")
 'select()' returned 1:many mapping between keys and columns
   ENTREZID         ENSEMBL    ENSEMBLTRANS     TXNAME
 1       10 ENSG00000156006 ENST00000286479 uc003wyw.1
 2       10 ENSG00000156006 ENST00000520116 uc003wyw.1

 > txuc["10"]
 GRangesList object of length 1:
 $10
 GRanges object with 1 range and 2 metadata columns:
       seqnames               ranges strand |     tx_id     
          <Rle>            <IRanges>  <Rle> | <integer> 
   [1]     chr8 [18248755, 18258723]      + |     31944  

 
!> txens["ENSG00000156006"]
 GRangesList object of length 1:
 $ENSG00000156006
 GRanges object with 2 ranges and 6 metadata columns:
       seqnames               ranges strand |           tx_id     
          <Rle>            <IRanges>  <Rle> |     <character>    <character>
   [1]        8 [18248755, 18258728]      + | ENST00000286479 
   [2]        8 [18248797, 18258503]      + | ENST00000520116
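The coordinate comparison above can be checked with a small base R sketch. The start and end values are copied by hand from the output shown in this thread (Entrez Gene IDs 1 and 10); the data frame names ucsc, ens and exact are hypothetical and used only for illustration. With the real objects one would compare the GRanges directly instead.

```r
# Transcript ranges copied from the txuc / txens output above.
ucsc <- data.frame(
  gene  = c("1", "1", "10"),
  start = c(58858172, 58859832, 18248755),
  end   = c(58864865, 58874214, 18258723)
)
ens <- data.frame(
  gene  = c(rep("1", 5), rep("10", 2)),
  tx_id = c("ENST00000263100", "ENST00000595014", "ENST00000600966",
            "ENST00000598345", "ENST00000596924",
            "ENST00000286479", "ENST00000520116"),
  start = c(58858216, 58858224, 58861960, 58858226, 58856544,
            18248755, 18248797),
  end   = c(58864865, 58864857, 58864495, 58859023, 58859000,
            18258728, 18258503)
)

# Keep only pairs whose gene, start and end all agree exactly.
exact <- merge(ucsc, ens, by = c("gene", "start", "end"))
nrow(exact)  # 0: no UCSC transcript has identical coordinates in Ensembl
```

Even for gene 10, where a mapping is reported, the closest Ensembl transcript shares the start but ends five bases later.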

ADD REPLY • link 8.1 years ago James W. MacDonald 67k