how to get transcript annotation (unique)
K ▴ 50


I have a list of transcripts (from an RNA-seq quantification output) that I would like to annotate. The annotation should include the gene name, the NM ID or some other information about the mRNA, the chromosome, and the transcript start and end, for example:

ucscid      gene       mrna          refseq        ucscid      chr    beg        end
uc001yee.1  AK127179   AK127179                    uc001yee.1  chr14  95643819   95646270
uc010hxc.3  MFN1       U95822        NM_033540     uc010hxc.3  chr3   179080145  179111008
uc021xcy.1  GOLGB1     AB593126                    uc021xcy.1  chr3   121382047  121468602
uc010jai.3  LOC644936  NR_004845                   uc010jai.3  chr5   79594916   79596297
uc001lkt.3  PPP2R2D    BC045531                    uc001lkt.3  chr10  133747959  133770053
uc002wgt.4  EBF4       NM_001110514  NM_001110514  uc002wgt.4  chr20  2673523    2740754
uc001kxg.4  CALHM3     NM_001129742  NM_001129742  uc001kxg.4  chr10  105232560  105238997

I tried using the TxDb object with a SQL-like query, but the result (a) only has the gene ID and not the gene name, and (b) returns multiple rows per transcript instead of one.

Question: should I be using another hg19 object or another type of query? Any advice appreciated.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

columns(txdb) # check which columns are available

x <- c("uc001aal.1", "uc001aaa.3", "uc001aae.4") # example input
cols <- columns(txdb)
m <- select(txdb, keys = x, columns = cols, keytype = "TXNAME") # returns one-to-many rows

Thanks, K


You should probably be using the Homo.sapiens package, which wraps up a bunch of annotation packages to make this easier. Also, if you just want one row per transcript, you should use mapIds rather than select. Something like this seems appropriate:

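> library(Homo.sapiens)
> keys <- keys(Homo.sapiens, "TXNAME")[1:500] ## let's not get crazy here
> ## columns to fetch; ACCNUM is inferred from the accession values in the output
> colstoget <- c("SYMBOL", "ENTREZID", "ACCNUM", "REFSEQ", "TXCHROM", "TXSTART", "TXEND")
> z1 <- as.data.frame(lapply(colstoget, function(x) mapIds(Homo.sapiens, keys, x, "TXNAME")))
> names(z1) <-  colstoget
> head(z1)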
uc001aaa.3 DDX11L1 100287102 AM992871    NR_046018    chr1   11874  14409
uc010nxq.1 DDX11L1 100287102 AM992871    NR_046018    chr1   11874  14409
uc010nxr.1 DDX11L1 100287102 AM992871    NR_046018    chr1   11874  14409
uc001aal.1   OR4F5     79501 BAC05820 NM_001005484    chr1   69091  70008
uc001aaq.2    <NA>      <NA>     <NA>         <NA>    chr1  321084 321115
uc001aar.2    <NA>      <NA>     <NA>         <NA>    chr1  321146 321207

But this only takes the first of any duplicate values, which is somewhat naive. If you can handle the ambiguity, you can keep all of the values for each transcript instead:

> z2 <- as(lapply(colstoget, function(x) mapIds(Homo.sapiens, keys, x, "TXNAME", multiVals = "CharacterList")), "DataFrame")

> names(z2) <-  colstoget
> z2
DataFrame with 500 rows and 7 columns
                    SYMBOL        ENTREZID
           <CharacterList> <CharacterList>
uc001aaa.3         DDX11L1       100287102
uc010nxq.1         DDX11L1       100287102
uc010nxr.1         DDX11L1       100287102
uc001aal.1           OR4F5           79501
uc001aaq.2              NA              NA
...                    ...             ...
uc001bap.3       ARHGEF10L           55160
uc010ocr.1       ARHGEF10L           55160
uc001baq.3       ARHGEF10L           55160
uc001bar.3       ARHGEF10L           55160
uc010ocs.2       ARHGEF10L           55160
                                           ACCNUM
                                  <CharacterList>
uc001aaa.3         AM992871,BC032353,BC070227,...
uc010nxq.1         AM992871,BC032353,BC070227,...
uc010nxr.1         AM992871,BC032353,BC070227,...
uc001aal.1 BAC05820,NM_001005484,NP_001005484,...
uc001aaq.2                                     NA
...                                           ...
uc001bap.3         AAH65561,AAH80596,AAI17172,...
uc010ocr.1         AAH65561,AAH80596,AAI17172,...
uc001baq.3         AAH65561,AAH80596,AAI17172,...
uc001bar.3         AAH65561,AAH80596,AAI17172,...
uc010ocs.2         AAH65561,AAH80596,AAI17172,...
                                            REFSEQ         TXCHROM
                                   <CharacterList> <CharacterList>
uc001aaa.3                               NR_046018            chr1
uc010nxq.1                               NR_046018            chr1
uc010nxr.1                               NR_046018            chr1
uc001aal.1               NM_001005484,NP_001005484            chr1
uc001aaq.2                                      NA            chr1
...                                            ...             ...
uc001bap.3 NM_001011722,NM_018125,NP_001011722,...            chr1
uc010ocr.1 NM_001011722,NM_018125,NP_001011722,...            chr1
uc001baq.3 NM_001011722,NM_018125,NP_001011722,...            chr1
uc001bar.3 NM_001011722,NM_018125,NP_001011722,...            chr1
uc010ocs.2 NM_001011722,NM_018125,NP_001011722,...            chr1
                   TXSTART           TXEND
           <CharacterList> <CharacterList>
uc001aaa.3           11874           14409
uc010nxq.1           11874           14409
uc010nxr.1           11874           14409
uc001aal.1           69091           70008
uc001aaq.2          321084          321115
...                    ...             ...
uc001bap.3        17907048        18024370
uc010ocr.1        17914911        17966476
uc001baq.3        17941583        18024370
uc001bar.3        17944811        18024370
uc010ocs.2        17944811        18024370
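
One possible way to resolve the ambiguity in a list-valued column such as REFSEQ is to keep only the mRNA accessions (those starting with NM_) for each transcript. The following is a minimal base R sketch, not part of the original answer: the input is a small hand-made list that mimics the REFSEQ column using values from the output above, and pick_mrna is a hypothetical helper name.

```r
# Toy stand-in for the list-valued REFSEQ column of z2
# (values copied from the output above; not a real query result).
refseq <- list(
  "uc001aaa.3" = "NR_046018",
  "uc001aal.1" = c("NM_001005484", "NP_001005484"),
  "uc001aaq.2" = NA_character_
)

# Hypothetical helper: keep only mRNA accessions (NM_ prefix) and
# collapse them into a single comma-separated string, or NA if none.
pick_mrna <- function(ids) {
  nm <- ids[!is.na(ids) & startsWith(ids, "NM_")]
  if (length(nm) == 0) NA_character_ else paste(nm, collapse = ",")
}

# One value per transcript, named by transcript ID.
refseq_nm <- vapply(refseq, pick_mrna, character(1))
print(refseq_nm)
```

With the real data, something like as.list(z2$REFSEQ) should give a plain list that can be passed to the same helper, although I have not checked that against this exact object. Transcripts that map to several NM_ accessions still end up with more than one ID in the string, so a further rule is needed if exactly one is required.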


K ▴ 50

Thank you! This is great. I will decide what to do about the duplicates.

