Question

how to get transcript annotation (unique)

KB ▴ 50

Hello,

I have a list of transcripts (from an RNA-seq quantification output) that I would like to annotate. The annotation should contain the gene name, the NM ID or some other information about the mRNA, the chromosome, and the start and end coordinates, like this:

ucscid	gene	mrna	refseq	ucscid	chr	beg	end
uc001yee.1	AK127179	AK127179		uc001yee.1	chr14	95643819	95646270
uc010hxc.3	MFN1	U95822	NM_033540	uc010hxc.3	chr3	179080145	179111008
uc021xcy.1	GOLGB1	AB593126		uc021xcy.1	chr3	121382047	121468602
uc010jai.3	LOC644936	NR_004845		uc010jai.3	chr5	79594916	79596297
uc001lkt.3	PPP2R2D	BC045531		uc001lkt.3	chr10	133747959	133770053
uc002wgt.4	EBF4	NM_001110514	NM_001110514	uc002wgt.4	chr20	2673523	2740754
uc001kxg.4	CALHM3	NM_001129742	NM_001129742	uc001kxg.4	chr10	105232560	105238997

I tried using the txdb object with an SQL-like query, but the resulting annotation (a) only has the gene ID and not the gene name, and (b) returns multiple rows per transcript instead of one.

Question: should I be using another hg19 object or another type of query? Any advice is appreciated.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

columns(txdb)   # list the columns that can be retrieved
keytypes(txdb)  # list the key types that can be used for lookup

x <- c("uc001aal.1", "uc001aaa.3", "uc001aae.4")  # example input
cols <- columns(txdb)
m <- select(txdb, keys = x, columns = cols, keytype = "TXNAME")  # returns multiple rows per key

Thanks, K

score 0 · Answer 1 · 2016-04-26

You should probably be using the Homo.sapiens package, which wraps up a bunch of annotation packages to make this easier. Also, if you just want one row per transcript, you should use mapIds rather than select. Something like this seems appropriate:

> library(Homo.sapiens)
> keys <- keys(Homo.sapiens, "TXNAME")[1:500] ## let's not get crazy here
> colstoget <- c("SYMBOL","ENTREZID","ACCNUM","REFSEQ","TXCHROM","TXSTART","TXEND")
> z1 <- as.data.frame(lapply(colstoget, function(x) mapIds(Homo.sapiens, keys, x, "TXNAME")))
> names(z1) <-  colstoget
> head(z1)
            SYMBOL  ENTREZID   ACCNUM       REFSEQ TXCHROM TXSTART  TXEND
uc001aaa.3 DDX11L1 100287102 AM992871    NR_046018    chr1   11874  14409
uc010nxq.1 DDX11L1 100287102 AM992871    NR_046018    chr1   11874  14409
uc010nxr.1 DDX11L1 100287102 AM992871    NR_046018    chr1   11874  14409
uc001aal.1   OR4F5     79501 BAC05820 NM_001005484    chr1   69091  70008
uc001aaq.2    <NA>      <NA>     <NA>         <NA>    chr1  321084 321115
uc001aar.2    <NA>      <NA>     <NA>         <NA>    chr1  321146 321207

But by default mapIds only takes the first of any duplicate values (multiVals = "first"), which is a fairly naive way to resolve one-to-many mappings. If you can handle ambiguity, you can keep all of the values instead by asking for a CharacterList:

> z2 <- as(lapply(colstoget, function(x) mapIds(Homo.sapiens, keys, x, "TXNAME", multiVals = "CharacterList")), "DataFrame")
> names(z2) <-  colstoget
> z2
DataFrame with 500 rows and 7 columns
                    SYMBOL        ENTREZID
           <CharacterList> <CharacterList>
uc001aaa.3         DDX11L1       100287102
uc010nxq.1         DDX11L1       100287102
uc010nxr.1         DDX11L1       100287102
uc001aal.1           OR4F5           79501
uc001aaq.2              NA              NA
...                    ...             ...
uc001bap.3       ARHGEF10L           55160
uc010ocr.1       ARHGEF10L           55160
uc001baq.3       ARHGEF10L           55160
uc001bar.3       ARHGEF10L           55160
uc010ocs.2       ARHGEF10L           55160
                                           ACCNUM
                                  <CharacterList>
uc001aaa.3         AM992871,BC032353,BC070227,...
uc010nxq.1         AM992871,BC032353,BC070227,...
uc010nxr.1         AM992871,BC032353,BC070227,...
uc001aal.1 BAC05820,NM_001005484,NP_001005484,...
uc001aaq.2                                     NA
...                                           ...
uc001bap.3         AAH65561,AAH80596,AAI17172,...
uc010ocr.1         AAH65561,AAH80596,AAI17172,...
uc001baq.3         AAH65561,AAH80596,AAI17172,...
uc001bar.3         AAH65561,AAH80596,AAI17172,...
uc010ocs.2         AAH65561,AAH80596,AAI17172,...
                                            REFSEQ         TXCHROM
                                   <CharacterList> <CharacterList>
uc001aaa.3                               NR_046018            chr1
uc010nxq.1                               NR_046018            chr1
uc010nxr.1                               NR_046018            chr1
uc001aal.1               NM_001005484,NP_001005484            chr1
uc001aaq.2                                      NA            chr1
...                                            ...             ...
uc001bap.3 NM_001011722,NM_018125,NP_001011722,...            chr1
uc010ocr.1 NM_001011722,NM_018125,NP_001011722,...            chr1
uc001baq.3 NM_001011722,NM_018125,NP_001011722,...            chr1
uc001bar.3 NM_001011722,NM_018125,NP_001011722,...            chr1
uc010ocs.2 NM_001011722,NM_018125,NP_001011722,...            chr1
                   TXSTART           TXEND
           <CharacterList> <CharacterList>
uc001aaa.3           11874           14409
uc010nxq.1           11874           14409
uc010nxr.1           11874           14409
uc001aal.1           69091           70008
uc001aaq.2          321084          321115
...                    ...             ...
uc001bap.3        17907048        18024370
uc010ocr.1        17914911        17966476
uc001baq.3        17941583        18024370
uc001bar.3        17944811        18024370
uc010ocs.2        17944811        18024370
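
If you would rather stay with select() and still end up with one row per transcript, a third option is to collapse the duplicates yourself. The following is only a sketch in base R, not something taken from the packages above: the data frame `long` is a made-up stand-in for a select() result, and `collapse_unique()` and `collapse_by()` are hypothetical helpers written for this example.

```r
# Hypothetical long-format table shaped like a select() result:
# one row per (transcript, accession) pair. The values are for illustration only.
long <- data.frame(
  TXNAME = c("uc001aaa.3", "uc001aaa.3", "uc001aal.1", "uc001aal.1", "uc001aaq.2"),
  SYMBOL = c("DDX11L1", "DDX11L1", "OR4F5", "OR4F5", NA),
  ACCNUM = c("AM992871", "BC032353", "BAC05820", "NM_001005484", NA),
  stringsAsFactors = FALSE
)

# Join the distinct non-missing values of a vector with commas;
# return NA when nothing is left.
collapse_unique <- function(v) {
  v <- unique(v[!is.na(v)])
  if (length(v) == 0) NA_character_ else paste(v, collapse = ",")
}

# Apply collapse_unique() within each group and return a plain character vector.
collapse_by <- function(values, groups) {
  unname(vapply(split(values, groups), collapse_unique, character(1)))
}

# Group by transcript, keeping the order of first appearance.
groups <- factor(long$TXNAME, levels = unique(long$TXNAME))

# One row per transcript, with multiple values joined by commas.
flat <- data.frame(
  TXNAME = levels(groups),
  SYMBOL = collapse_by(long$SYMBOL, groups),
  ACCNUM = collapse_by(long$ACCNUM, groups),
  stringsAsFactors = FALSE
)
```

Unlike multiVals = "first", this keeps every distinct value, and unlike a CharacterList column it gives plain character columns that write.table() can export directly. Whether comma-joined strings are acceptable depends on what you do with the table afterwards.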

score 0 · Answer 2 · 2016-04-26
