I am trying to figure out why thousands of transcript clusters are not mapped in ragene20sttranscriptcluster.db
while they are perfectly present in the manufacturer's materials. Just one example: transcriptcluster 17610996 is well described in the manufacturer's annotation (support files at https://www.thermofisher.com/order/catalog/product/902124), with chromosome coordinates given (chr1+:23093113-23164282), gene symbol (Enpp3), transcript ID (NM_019370) etc. But in ragene20sttranscriptcluster.db
this cluster is not mapped in any of the sets besides ragene20sttranscriptclusterACCNUM
- at what stage was it 'lost'?
And what would you recommend as a solution to map those missing transcripts? (in my case, I need to get gene symbol for each transcript)
library("ragene20sttranscriptcluster.db")
# example 1: transcript is found and mapped, all OK
'17736437' %in% mappedkeys(ragene20sttranscriptclusterCHR) # TRUE
# example 2: transcript is mapped only in ragene20sttranscriptclusterACCNUM
'17610996' %in% mappedkeys(ragene20sttranscriptclusterACCNUM) # TRUE
'17610996' %in% mappedkeys(ragene20sttranscriptclusterCHR) # FALSE
'17610996' %in% keys(ragene20sttranscriptclusterCHR) # TRUE
# example 3: transcript is not mapped even in ragene20sttranscriptclusterACCNUM
'17707702' %in% mappedkeys(ragene20sttranscriptclusterACCNUM) # FALSE
'17707702' %in% mappedkeys(ragene20sttranscriptclusterCHR) # FALSE
'17707702' %in% keys(ragene20sttranscriptclusterCHR) # TRUE
Or if you just want the RefSeq ID:
James, thank you so much for your fast and informative reply! If I may post couple additional questions:
1) I was actually using
mget()
in my real code - it is also considered old interface? I usedata.table
, and my command was as following:Would it be the right way just to replace
mget()
withmapIds()
? For now I replaced like this:2) Could you explain (or point me to the right source) what happens when we call
select(ragene20sttranscriptcluster.db, "17610996", "SYMBOL")
- what conversions does it make?3) there are still cases like 17707702:
select(ragene20sttranscriptcluster.db, "17707702", "SYMBOL")
gives nothing while in fact this transcript is well annotated (see on affy site, or Fisher page) - any recommendations how to handle it?Thanks again, I really appreciate your help
1.) Yes,
mget
is old tech as well. Use eitherselect
ormapIds
.If you are just trying to annotate an ExpressionSet, you might consider using
annotateEset
from my affycoretools package. Why exactly are you usingdata.table
? I can't imagine that is particularly useful in this context (where by 'this context' I mean reading in Affy data and then fitting a model using limma, which is the normal course of action for this sort of data). I would normally do something likeand then proceed with fitting a model - the
topTable
output will now contain all the annotations thatannotateEset
got for you.2.) Have you looked at the help page? That seems like the best thing to try if you don't know what a function does. The basic idea is what I already told you - the ChipDb package is really just a SQLite database that you can query. But it only has one table that maps probeset IDs to Gene IDs. Any other mapping is done by the org.Rn.eg.db, package which is a SQLite database with multiple different tables. Mapping ID to SYMBOL is just an innner join between the ChipDb table and the table in the OrgDb that has Gene IDs and SYMBOLs.
Do note that the 'eg' in org.Rn.eg.db is shorthand for Entrez Gene - these mappings are all based on data from NCBI, which brings me to the next point.
3.) As I mentioned before, we start with Affy ID -> Entrez Gene IDs and then do all the mappings from there. If you look closely at the netaffx page you link to, there are no Entrez Gene IDs on that page. There is an Ensembl Transcript ID (ENSRNOT00000081634), but that is all.
To get a bit further down into the weeds, up until recently I generated the ChipDb packages for each release. These Gene ST arrays are pretty nasty to parse, because Affy just jams a bunch of info in there, and trying to programmatically get stuff back out is non trivial. As an example, here is the line I would parse for this particular probeset:
The first entry there is the probeset ID, and the second bit, starting with ENSRNOT00000046053 is the gene annotation I would parse, looking for any NCBI type IDs that I could use. As you can see there, they do provide the Ensembl Transcript ID as well as the Ensembl Gene ID, but the main ID comes at the beginning of the string, or after any /// delimiter. So I could get out the Ensembl Transcript ID or that Noncode ID, neither of which would be particularly helpful, unless I map them to Entrez Gene IDs! I actually did at one time try to recapture all the Ensembl-only mappings in there, but the gain in mappings wasn't worth the work it took to get them out.
Thanks again!
1) In this particular case I don't have raw data, I have to work with what I have. I will definitely consider using
annotateEset
if I get .cel files!2) Sure, I looked at the help page - but it was not explaining how exactly the mapping is done in this case. From your explanation, now my understanding is that
select(ragene20sttranscriptcluster.db, "17610996", "SYMBOL")
does mapping in 2 steps: first, it usesprobes
table ofragene20sttranscriptcluster.sqlite
database to mapprobe_id
togene_id
. Then, in theorg.Rn.eg.sqlite
database,gene_id
is mapped tosymbol
field ofgene_info
table (not directly but via internal_id
found throughgenes
table). Is it correct?3) You're right, now I see that netaffx is not so helpful. I will then use support files from Fisher page. I used
RaGene-2_0-st-v1.na36.rn5.transcript.csv
which seems to be a next release comparing to what you mention (36 vs 35) and has updated information about that record in themrna_assignment
field, whilegene_assignment
is still empty. But there's actually something they called 'Design Time Annotation Files', and in the archive 'RaGene-2_0-st-v1-design-time-20120706.zip' there'sRaGene-2_0-st-v1.design-time.20120706.transcript.csv
- and this one contains bothmrna_assignment
andgene_assignment
, with valid 'NM_' accessions. I did not have time to figure out why these are different (I mean,.na36.rn5.transcript.csv
vs.design-time.20120706.transcript.csv
) but I will probably just parse the latter one for my conversions.2.) Not really. It attaches the two databases and then does an inner join query between the tables. But I am not sure you would get any different results if you did the two-step procedure you outline.
3.) The two files are different because Affy goes back on a semi-regular basis and ensures that the original mappings (which were based on what we thought about the genome in 2012 for this array) are still correct. The genome and annotations are still quite fluid, and what we thought was true in 2012 may no longer hold.