Why so many entries are 'not mapped' in ragene20sttranscriptcluster.db?
1
0
Entering edit mode
aush ▴ 40
@aush-9128
Last seen 6 months ago
United States

I am trying to figure out why thousands of transcript clusters are not mapped in ragene20sttranscriptcluster.db while they are perfectly present in the manufacturer's materials. Just one example: transcriptcluster 17610996 is well described in the manufacturer's annotation (support files at https://www.thermofisher.com/order/catalog/product/902124), with chromosome coordinates given (chr1+:23093113-23164282), gene symbol (Enpp3), transcript ID (NM_019370) etc. But in ragene20sttranscriptcluster.db this cluster is not mapped in any of the sets besides ragene20sttranscriptclusterACCNUM - at what stage was it 'lost'?

And what would you recommend as a solution to map those missing transcripts? (in my case, I need to get gene symbol for each transcript)

library("ragene20sttranscriptcluster.db")

# example 1: transcript is found and mapped, all OK
'17736437' %in% mappedkeys(ragene20sttranscriptclusterCHR) # TRUE

# example 2: transcript is mapped only in ragene20sttranscriptclusterACCNUM
'17610996' %in% mappedkeys(ragene20sttranscriptclusterACCNUM) # TRUE
'17610996' %in% mappedkeys(ragene20sttranscriptclusterCHR) # FALSE
'17610996' %in%       keys(ragene20sttranscriptclusterCHR) # TRUE

# example 3: transcript is not mapped even in ragene20sttranscriptclusterACCNUM
'17707702' %in% mappedkeys(ragene20sttranscriptclusterACCNUM) # FALSE
'17707702' %in% mappedkeys(ragene20sttranscriptclusterCHR) # FALSE
'17707702' %in%       keys(ragene20sttranscriptclusterCHR) # TRUE

ragene20sttranscriptcluster.db • 672 views
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

The answer is a bit more complicated than what you might think. First, using the old BiMap interface (e.g., using things like ragene20sttranscriptclusterCHR directly) is probably suboptimal at this point, and you should be using either select or mapIds instead.

Second, these days the ChipDb objects only contain one table that maps the chip ID to an Entrez Gene ID, and all of the heavy lifting is done by (in this case) the org.Rn.eg.db package. That's a convoluted way of saying that the chr/pos information that Affy (or more correctly ThermoFisherAllScientificCompaniesInOne) supply are not used at all. What you get from the annotation csv from Affy is the actual position for the set of probes, whereas what you would get from the ChipDb would be the position of the gene being measured:

> select(ragene20sttranscriptcluster.db, "17610996", c("CHR", "CHRLOC","CHRLOCEND"))
'select()' returned 1:many mapping between keys and columns
PROBEID CHR   CHRLOC CHRLOCCHR CHRLOCEND
1 17610996   1 21613147         1  21684482
2 17610996   1       NA      <NA>        NA
Warning message:
In .deprecatedColsMessage() :
Accessing gene location information via 'CHR','CHRLOC','CHRLOCEND' is
deprecated. Please use a range based accessor like genes(), or select()
with columns values like TXCHROM and TXSTART on a TxDb or OrganismDb
object instead.

And third, as you can see from the warning, we have split the annotation data (things like what the gene is, what it does, the protein name, etc) off from the positional information (where the gene is in the genome, where its exons are , the CDS, etc). People in general don't particularly care where the probes for an affy probeset align, but instead want to know where the gene it measures comes from.

But in the end you seem to just want the symbol and/or transcript ID:

> select(ragene20sttranscriptcluster.db, "17610996", "SYMBOL")
'select()' returned 1:many mapping between keys and columns
PROBEID       SYMBOL
1 17610996        Enpp3
2 17610996 LOC103690935
> select(ragene20sttranscriptcluster.db, "17610996", "ACCNUM")
'select()' returned 1:many mapping between keys and columns
PROBEID       ACCNUM
1  17610996     AAB61535
2  17610996     AAB61536
3  17610996     AAH97326
4  17610996     AJ250374
5  17610996     BAA06333
6  17610996     BC097326
7  17610996     CAA88029
8  17610996     CAB59427
9  17610996       D30649
10 17610996     EDL87774
11 17610996     FQ218696
12 17610996    NM_019370
13 17610996    NP_062243
14 17610996       P97675
15 17610996       U78787
16 17610996       U78788
17 17610996       Z47987
18 17610996 XM_008758654
19 17610996 XP_008756876

0
Entering edit mode

Or if you just want the RefSeq ID:

select(ragene20sttranscriptcluster.db, "17610996", "REFSEQ")
'select()' returned 1:many mapping between keys and columns
PROBEID       REFSEQ
1 17610996    NM_019370
2 17610996    NP_062243
3 17610996 XM_008758654
4 17610996 XP_008756876

0
Entering edit mode

James, thank you so much for your fast and informative reply! If I may post couple additional questions:

1) I was actually using mget() in my real code - it is also considered old interface? I use data.table, and my command was as following:

dt.dat[, ratSym1:=mget(transcriptID,ragene20sttranscriptclusterSYMBOL,ifnotfound = NA)]


Would it be the right way just to replace mget() with mapIds()? For now I replaced like this:

dt.dat[,ratSym1a:=mapIds(ragene20sttranscriptcluster.db, transcriptID, column = "SYMBOL", keytype = 'PROBEID')]


2) Could you explain (or point me to the right source) what happens when we call select(ragene20sttranscriptcluster.db, "17610996", "SYMBOL") - what conversions does it make?

3) there are still cases like 17707702: select(ragene20sttranscriptcluster.db, "17707702", "SYMBOL") gives nothing while in fact this transcript is well annotated (see on affy site, or Fisher page) - any recommendations how to handle it?

Thanks again, I really appreciate your help

0
Entering edit mode

1.) Yes, mget is old tech as well. Use either select or mapIds.

If you are just trying to annotate an ExpressionSet, you might consider using annotateEset from my affycoretools package. Why exactly are you using data.table? I can't imagine that is particularly useful in this context (where by 'this context' I mean reading in Affy data and then fitting a model using limma, which is the normal course of action for this sort of data). I would normally do something like

dat <- read.celfiles(<vector of cel file names goes here>)
eset <- rma(dat)
eset <- annotateEset(eset, ragene20sttranscriptcluster.db)

and then proceed with fitting a model - the topTable output will now contain all the annotations that annotateEset got for you.

2.) Have you looked at the help page? That seems like the best thing to try if you don't know what a function does. The basic idea is what I already told you - the ChipDb package is really just a SQLite database that you can query. But it only has one table that maps probeset IDs to Gene IDs. Any other mapping is done by the org.Rn.eg.db, package which is a SQLite database with multiple different tables. Mapping ID to SYMBOL is just an innner join between the ChipDb table and the table in the OrgDb that has Gene IDs and SYMBOLs.

Do note that the 'eg' in org.Rn.eg.db is shorthand for Entrez Gene - these mappings are all based on data from NCBI, which brings me to the next point.

3.) As I mentioned before, we start with Affy ID -> Entrez Gene IDs and then do all the mappings from there. If you look closely at the netaffx page you link to, there are no Entrez Gene IDs on that page. There is an Ensembl Transcript ID (ENSRNOT00000081634), but that is all.

To get a bit further down into the weeds, up until recently I generated the ChipDb packages for each release. These Gene ST arrays are pretty nasty to parse, because Affy just jams a bunch of info in there, and trying to programmatically get stuff back out is non trivial. As an example, here is the line I would parse for this particular probeset:

\$ grep "17707702" RaGene-2_0-st-v1.na35.rn5.transcript.csv | cut -d, -f 1,9
"17707702","ENSRNOT00000046053 // ENSEMBL // ensembl:known chromosome:Rnor_5.0:16:37301123:37303599:1 gene:ENSRNOG00000038953 gene_biotype:protein_coding transcript_biotype:protein_coding // chr16 // 100 // 100 // 12 // 12 // 0 /// NONRATT022456 // NONCODE // Non-coding transcript identified by NONCODE // chr16 // 100 // 100 // 12 // 12 // 0"

The first entry there is the probeset ID, and the second bit, starting with ENSRNOT00000046053 is the gene annotation I would parse, looking for any NCBI type IDs that I could use. As you can see there, they do provide the Ensembl Transcript ID as well as the Ensembl Gene ID, but the main ID comes at the beginning of the string, or after any /// delimiter. So I could get out the Ensembl Transcript ID or that Noncode ID, neither of which would be particularly helpful, unless I map them to Entrez Gene IDs! I actually did at one time try to recapture all the Ensembl-only mappings in there, but the gain in mappings wasn't worth the work it took to get them out.

0
Entering edit mode

Thanks again!

1) In this particular case I don't have raw data, I have to work with what I have. I will definitely consider using annotateEset if I get .cel files!

2) Sure, I looked at the help page - but it was not explaining how exactly the mapping is done in this case. From your explanation, now my understanding is that select(ragene20sttranscriptcluster.db, "17610996", "SYMBOL") does mapping in 2 steps: first, it uses probes table of ragene20sttranscriptcluster.sqlite database to map probe_id to gene_id. Then, in the org.Rn.eg.sqlite database, gene_id is mapped to symbol field of gene_info table (not directly but via internal _id found through genes table). Is it correct?

3) You're right, now I see that netaffx is not so helpful. I will then use support files from Fisher page. I used RaGene-2_0-st-v1.na36.rn5.transcript.csv which seems to be a next release comparing to what you mention (36 vs 35) and has updated information about that record in the mrna_assignment field, while gene_assignment is still empty. But there's actually something they called 'Design Time Annotation Files', and in the archive 'RaGene-2_0-st-v1-design-time-20120706.zip' there's RaGene-2_0-st-v1.design-time.20120706.transcript.csv - and this one contains both  mrna_assignment and gene_assignment, with valid 'NM_' accessions. I did not have time to figure out why these are different (I mean, .na36.rn5.transcript.csv vs .design-time.20120706.transcript.csv) but I will probably just parse the latter one for my conversions.

0
Entering edit mode

2.) Not really. It attaches the two databases and then does an inner join query between the tables. But I am not sure you would get any different results if you did the two-step procedure you outline.

3.) The two files are different because Affy goes back on a semi-regular basis and ensures that the original mappings (which were based on what we thought about the genome in 2012 for this array) are still correct. The genome and annotations are still quite fluid, and what we thought was true in 2012 may no longer hold.