Question: Obsolete REFSEQ annotation
Ed Siefker (United States) wrote, 25 days ago:

I have reads counted against rna.fa.gz from:
ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/BUILD.37.1/RNA/

It's labeled with IDs that look like this:

gi|142348699|ref|NM_010886.2|

I've extracted the REFSEQs, and used org.Mm.eg.db to get the SYMBOLs.

                             X1      X2
1 gi|126352347|ref|NM_028260.2|  Immp1l
2 gi|142348699|ref|NM_010886.2|  Ndufa4


Unfortunately, a number of REFSEQs seem to be obsolete. 

> select(org.Mm.eg.db, keys="XM_001478046", keytype="REFSEQ", columns="SYMBOL")
Error in .testForValidKeys(x, keys, keytype, fks) :
  None of the keys entered are valid keys for 'REFSEQ'. Please use the keys method to see a listing of valid arguments.

https://www.ncbi.nlm.nih.gov/nuccore/XM_001478046

NCBI Reference Sequence: XM_001478046.1 (click to see this obsolete version)

How do I use Bioconductor to convert XM_001478046 to NM_001142441 (and then to Sap18b)?

Written 25 days ago by Ed Siefker

I could also try to annotate using the GI number.  Are there annotation packages that use the GI number?  It doesn't seem to be a column in org.Mm.eg.db. 

> columns(org.Mm.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"
[11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"
[16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROSITE"
[21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

 

Written 25 days ago by Ed Siefker

Could you add the output of sessionInfo() to your question?

Written 22 days ago (modified 22 days ago) by daniel.vantwisk

This is something that has been asked for in the past. We provide the most current ID mappings from NCBI, but all of the revision history is opaque to the annotation packages.
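
In practice that means accessions the current package no longer knows have to be identified and handled separately. A minimal sketch of that partitioning step follows; `split_by_validity` is a hypothetical helper, and a stand-in vector replaces the real key list so the example runs without the annotation package installed.

```r
# Hypothetical helper: partition accessions by whether the current
# annotation package still lists them as valid keys.
split_by_validity <- function(ids, valid_keys) {
  known <- ids %in% valid_keys
  list(current = ids[known], obsolete = ids[!known])
}

# With the real package, valid_keys would come from:
#   library(org.Mm.eg.db)
#   valid_keys <- keys(org.Mm.eg.db, keytype = "REFSEQ")
# Stand-in values (taken from this thread) so the sketch is self-contained:
valid_keys <- c("NM_028260", "NM_010886")

ids <- c("NM_028260", "XM_001478046", "NM_010886")
parts <- split_by_validity(ids, valid_keys)
parts$current   # safe to pass to select()
parts$obsolete  # need a revision-history lookup of some kind
```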

I don't think it would be insurmountable to add the Gene ID history, which is provided at ftp.ncbi.nlm.nih.gov/gene/DATA/gene_history.gz. The revision history for RefSeq might be parseable from ftp://ftp.ncbi.nlm.nih.gov/refseq/removed/, although that's a lot of data.
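
As a rough illustration of what using the Gene ID history could look like, here is a sketch that follows discontinued IDs to their replacements. It assumes the file has `GeneID` and `Discontinued_GeneID` columns, with `-` marking a discontinued ID that has no replacement; check the header of the downloaded file before relying on that. `resolve_gene_id` is a hypothetical helper, and the rows below are made up purely to exercise it.

```r
# Hypothetical helper: follow Discontinued_GeneID -> GeneID links until
# reaching an ID that is not itself listed as discontinued.
resolve_gene_id <- function(id, history, max_hops = 10) {
  id <- as.character(id)
  for (i in seq_len(max_hops)) {
    hit <- match(id, history$Discontinued_GeneID)
    if (is.na(hit)) return(id)                     # not discontinued: current
    replacement <- history$GeneID[hit]
    if (replacement == "-") return(NA_character_)  # withdrawn, no successor
    id <- replacement
  }
  NA_character_  # gave up: chain longer than max_hops
}

# Reading the real file might look like this (assumed layout, not run here):
#   history <- read.delim(gzfile("gene_history.gz"), colClasses = "character")
#   names(history)[1] <- "tax_id"                    # header starts with "#"
#   history <- history[history$tax_id == "10090", ]  # mouse only

# Made-up rows for illustration only (not real Gene IDs)
history <- data.frame(
  GeneID              = c("200", "300", "-"),
  Discontinued_GeneID = c("100", "200", "400"),
  stringsAsFactors    = FALSE
)
resolve_gene_id("100", history)
```

This only covers Gene IDs; mapping retired RefSeq accessions would still need the RefSeq-side history.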

Anyway, there is some interest from people who have old data, but who want to annotate using their current Bioconductor installation. The alternative of trying to install successively older versions of R/BioC in order to match with the era of your underlying data is probably a suboptimal strategy.

The revision history might also be accessed using RCurl and querying nuccore, with a correctly formatted URI, like https://www.ncbi.nlm.nih.gov/nuccore/XM_001478046?report=girevhist, which shows the history for the RefSeq ID that Ed cares about.
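
Building that URI for a set of accessions is straightforward. The sketch below only constructs the URLs; `revhist_url` is a hypothetical helper. The actual fetch is left as a comment because it needs network access, and the returned page may need further parsing that is not shown here.

```r
# Hypothetical helper: build the nuccore revision-history URL for one or
# more unversioned RefSeq accessions.
revhist_url <- function(accession) {
  sprintf("https://www.ncbi.nlm.nih.gov/nuccore/%s?report=girevhist",
          accession)
}

urls <- revhist_url(c("XM_001478046", "NM_010886"))
urls

# Fetching requires the RCurl package and a network connection:
#   page <- RCurl::getURL(urls[1])
```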

Written 22 days ago by James W. MacDonald