Question

probe annotation in hugene20sttranscriptcluster.db

sylvia ▴ 10

Hello,

I'm currently working with hugene20sttranscriptcluster.db and noticed that some probes map to multiple genes, for example:

> select(hugene20sttranscriptcluster.db, '17080408', c("SYMBOL","ENTREZID"))
   PROBEID  SYMBOL  ENTREZID
1 17080408   RAD21      5885
2 17080408 MIR3610 100500914

Is there a specific factor that determines the order of the gene symbols, or is it just random? If I want to annotate the mRNA profile, would you recommend collapsing all the possible gene symbols for each probe, or just using the first entry for each probe?

Best,

Sylvia

hugene20sttranscriptcluster.db • 1.3k views

updated 8.5 years ago by James W. MacDonald 66k • written 8.5 years ago by sylvia ▴ 10

score 1 · Answer 1 · 2016-01-20

The ChipDb packages are really just SQLite databases with an API wrapper that lets you make queries without having to know SQL. Because the underlying functions generate SQL queries, and rows returned from a DB are in general unordered, I don't think there is any particular order to the multiple matches returned for a single probe. However, the order of the input keys is guaranteed to be preserved in the output (e.g., the probeset IDs will come back in the same order you supplied them).

How to deal with probes that map to multiple genes is up to you and whomever you are working with, and depends on how you plan to present the data. I tend to use mapIds(), which by default (multiVals = "first") just takes the first entry, because I am simple like that. You can, however, return a list or, possibly better, a CharacterList. But if you are using things like limma and/or ReportingTools for analysis and presentation, the list structure isn't as pleasant to deal with.
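To make the two options concrete, here is a minimal sketch in base R. It uses a small data frame typed in by hand from the select() output in the question, so it runs without the annotation package. The object names (ann, first_only, collapsed) and the ";" separator are illustrative choices, not anything the package prescribes. The optional part at the end shows the corresponding mapIds() calls and only runs if AnnotationDbi and hugene20sttranscriptcluster.db are installed.

```r
# Toy data frame copied by hand from the select() output in the question.
ann <- data.frame(
  PROBEID  = c("17080408", "17080408"),
  SYMBOL   = c("RAD21", "MIR3610"),
  ENTREZID = c("5885", "100500914"),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first row for each probe.
# This mirrors what mapIds() does with multiVals = "first".
first_only <- ann[!duplicated(ann$PROBEID), ]

# Option 2: collapse every mapping for a probe into one string per column.
# The ";" separator is an arbitrary choice for this sketch.
collapsed <- aggregate(
  ann[c("SYMBOL", "ENTREZID")],
  by = ann["PROBEID"],
  FUN = paste, collapse = ";"
)

print(first_only)
print(collapsed)

# Optional: the same two ideas via mapIds(), if the packages are available.
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("hugene20sttranscriptcluster.db", quietly = TRUE)) {
  db <- hugene20sttranscriptcluster.db::hugene20sttranscriptcluster.db
  # Default behaviour: one symbol per probe (the first one returned).
  print(AnnotationDbi::mapIds(db, keys = "17080408", column = "SYMBOL",
                              keytype = "PROBEID", multiVals = "first"))
  # All symbols per probe, kept as a CharacterList.
  print(AnnotationDbi::mapIds(db, keys = "17080408", column = "SYMBOL",
                              keytype = "PROBEID", multiVals = "CharacterList"))
}
```

The collapsed form keeps one row per probe, so it still fits into a limma or ReportingTools table, at the cost of having compound strings in the annotation columns.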