Question

How to change XLOC ID to gene symbol from Cuffdiff?

mg.mahabad1365 • 0

Hi all, when I did my RNA-Seq analysis, I used a GTF file from NCBI. In the cuffdiff output, the official gene symbols were replaced with LOC IDs such as:

LOC110534079

LOC110534540

LOC110537830

LOC110485322

LOC110487655

LOC110491675

LOC110492686

LOC110498361

LOC110500236

LOC110502506

Example:

LOC110537830 (ID) = mknk1 (Gene symbol)

Is there any way to convert these LOC IDs back to gene symbols?

I searched the NCBI Gene database repeatedly, but the results still came back as LOC IDs.

I also tried the following sites, without success:

https://biit.cs.ut.ee/gprofiler/convert
https://biodbnet-abcc.ncifcrf.gov/db/dbFind.php
https://www.uniprot.org/uniprot/?query=LOC110525276&sort=score

I also read all the relevant Biostars posts but did not find a solution: https://www.biostars.org/p/129299/

I also tried the following commands from the cummeRbund package in R, but the output still contained only LOC IDs.

library(cummeRbund)

cuff <- readCufflinks()

# Retrieve significant gene IDs (XLOC) with a pre-specified alpha
diffGeneIDs <- getSig(cuff, level = "genes", alpha = 0.05)

# Use the returned identifiers to create a CuffGeneSet object with all relevant info for the given genes
diffGenes <- getGenes(cuff, diffGeneIDs)

# gene_short_name values (and corresponding XLOC_* values) can be retrieved from the CuffGeneSet by using:
names <- featureNames(diffGenes)
row.names(names) <- names$tracking_id
diffGenesNames <- as.matrix(names)
# drop = FALSE keeps a one-column matrix, so the gene_short_name column name survives the merge
diffGenesNames <- diffGenesNames[, -1, drop = FALSE]

# Get the data for the significant genes
diffGenesData <- diffData(diffGenes)
row.names(diffGenesData) <- diffGenesData$gene_id
diffGenesData <- diffGenesData[, -1]

# Merge the two tables by row names
diffGenesOutput <- merge(diffGenesNames, diffGenesData, by = "row.names")

Does anyone have a solution to this problem? Regards

RNA-seq / XLOC cuffdiff

Written 3.9 years ago by mg.mahabad1365 • updated 2.2 years ago by Lan

Dear Dr. Kevin Blighe, I would appreciate your attention to this matter. Thank you very much for the prompt reply and the information. Sincere regards.

Reply • 3.9 years ago • mg.mahabad1365

Try James's solution first. It will prove a lot easier.

Reply • 3.9 years ago • Kevin Blighe

score 1 · Answer 1 · 2020-05-18

Two other alternatives are:

Use AnnotationHub

> library(AnnotationHub)
> hub <- AnnotationHub()
  |======================================================================| 100%

snapshotDate(): 2020-04-27
> query(hub, "mykiss")
AnnotationHub with 4 records
# snapshotDate(): 2020-04-27
# $dataprovider: Ensembl
# $species: Oncorhynchus mykiss
# $rdataclass: GRanges, EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH79739"]]' 

            title                                        
  AH79739 | Ensembl 100 EnsDb for Oncorhynchus mykiss    
  AH80248 | Oncorhynchus_mykiss.Omyk_1.0.100.abinitio.gtf
  AH80249 | Oncorhynchus_mykiss.Omyk_1.0.100.chr.gtf     
  AH80250 | Oncorhynchus_mykiss.Omyk_1.0.100.gtf         
> ensdb <- hub[["AH79739"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> z <- scan("clipboard","c")
Read 10 items
> z
 [1] "LOC110534079" "LOC110534540" "LOC110537830" "LOC110485322" "LOC110487655"
 [6] "LOC110491675" "LOC110492686" "LOC110498361" "LOC110500236" "LOC110502506"
## LOCs are just the NCBI Gene ID prepended with a LOC
> z <- gsub("LOC", "", z)
> select(ensdb, z, "GENENAME","ENTREZID")
   ENTREZID GENENAME
1 110534079     her6
2 110534540         
3 110537830    mknk1
4 110485322     tal1
5 110502506  tax1bp3

The downside of doing that, IMO, is that you are relying on the mapping Ensembl ID -> NCBI Gene ID -> gene symbol.

There are often technical reasons for genes not to map between Ensembl and NCBI, so you can lose lots of genes that way. Here, only 5 of the 10 IDs came back, and one of those has no gene name.
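
To keep track of what gets lost, one option is to join the mapping back onto the full ID list and fall back to the LOC ID wherever no symbol was found. This is only a minimal base R sketch: the `mapping` data frame simply retypes the `select()` output shown above, and the object names (`loc_ids`, `mapping`, `ids`, `annotated`) are made up for illustration.

```r
# The ten IDs from the question
loc_ids <- c("LOC110534079", "LOC110534540", "LOC110537830", "LOC110485322",
             "LOC110487655", "LOC110491675", "LOC110492686", "LOC110498361",
             "LOC110500236", "LOC110502506")

# Retyped from the select() output above (hypothetical object name)
mapping <- data.frame(
  ENTREZID = c("110534079", "110534540", "110537830", "110485322", "110502506"),
  GENENAME = c("her6", "", "mknk1", "tal1", "tax1bp3"),
  stringsAsFactors = FALSE
)

# LOC IDs are the NCBI Gene ID with "LOC" prepended, so strip the prefix
ids <- data.frame(loc_id = loc_ids,
                  ENTREZID = sub("^LOC", "", loc_ids),
                  stringsAsFactors = FALSE)

# Left join: all.x = TRUE keeps the IDs that could not be mapped
annotated <- merge(ids, mapping, by = "ENTREZID", all.x = TRUE, sort = FALSE)

# Fall back to the LOC ID when the symbol is missing or empty
unmapped <- is.na(annotated$GENENAME) | annotated$GENENAME == ""
annotated$label <- ifelse(unmapped, annotated$loc_id, annotated$GENENAME)

annotated[, c("loc_id", "label")]
```

The `label` column can then be merged into the cuffdiff results by `loc_id`, so no gene silently disappears from the table.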

Make your own NCBI-based OrgDb package

> library(AnnotationForge)
> makeOrgPackageFromNCBI("0.0.1","me <me@mine.org>", "me",".","8022", "Oncorhynchus","mykiss")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene2unigene
[5] gene_info.gz
[6] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
<snip>

I only show the very beginning of the output, as this function downloads lots of data from NCBI and then parses it, so it can take quite a while. But the upside of doing this is that you are starting with NCBI identifiers, so getting the gene symbol doesn't require mapping between annotation services. After waiting for the package to build you then just do

install.packages("org.Omykiss.eg.db", repos = NULL) ## if you are on Windows, add type = "source" to that function call
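
Once the package is installed, the lookup could look roughly like the sketch below. It assumes the built package exposes `ENTREZID` and `SYMBOL` columns, which is worth confirming with `keytypes()` and `columns()` on your own build. `loc_to_entrez()` and `loc_ids` are hypothetical names used only for illustration.

```r
# Strip the "LOC" prefix to get plain NCBI Gene IDs (hypothetical helper)
loc_to_entrez <- function(x) sub("^LOC", "", x)

loc_ids <- c("LOC110534079", "LOC110537830", "LOC110485322")
entrez_ids <- loc_to_entrez(loc_ids)

# Only query if the self-built package is actually installed
if (requireNamespace("org.Omykiss.eg.db", quietly = TRUE)) {
  library(org.Omykiss.eg.db)
  # Assumption: ENTREZID and SYMBOL exist; check keytypes(org.Omykiss.eg.db)
  res <- AnnotationDbi::select(org.Omykiss.eg.db,
                               keys = entrez_ids,
                               columns = "SYMBOL",
                               keytype = "ENTREZID")
  print(res)
}
```

Since this package is built from NCBI's own files, a gene that NCBI itself only lists under a LOC placeholder will presumably come back with that placeholder as its symbol.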

score 0 · Answer 2 · 2020-05-17

It looks like your species is Rainbow Trout (Oncorhynchus mykiss) (?). It seems to have only relatively recently been sequenced and annotated (see HERE).

It does not yet appear to be included in Ensembl Biomart, but it's possible to search for these IDs on Ensembl's website. You can likely also use NCBI's e-Utils.

Another programmatic, if somewhat cumbersome, way is via cURL and Ensembl's REST server:

First, get the Ensembl ID:

curl 'https://rest.ensembl.org/xrefs/symbol/oncorhynchus_mykiss/LOC110537830' -H 'Content-type:text/xml'

<opt>
  <data id="ENSOMYG00000003938" type="gene"/>
</opt>

now get the gene name:

curl 'https://rest.ensembl.org/xrefs/id/ENSOMYG00000003938?' -H 'Content-type:text/xml'

<opt>
  <data db_display_name="NCBI gene (formerly Entrezgene)" dbname="EntrezGene" description="MAP kinase-interacting serine/threonine-protein kinase 1-like" display_id="LOC110537830" info_text="" info_type="DEPENDENT" primary_id="110537830" version="0">
  </data>
  <data db_display_name="WikiGene" dbname="WikiGene" description="MAP kinase-interacting serine/threonine-protein kinase 1-like" display_id="LOC110537830" info_text="" info_type="DEPENDENT" primary_id="110537830" version="0">
  </data>
  <data db_display_name="Projected ZFIN" dbname="ZFIN_ID" description="MAPK interacting serine/threonine kinase 1" display_id="mknk1" info_text="from danio_rerio gene ENSDARG00000018411" info_type="PROJECTION" primary_id="mknk1" version="0">
  </data>
</opt>

A bit cumbersome, but possible to code for all your 'LOC' IDs. It's also possible to do all of this within R - see here: https://rest.ensembl.org/
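
As a rough idea of what the R version of the two calls above might look like, here is a sketch using the httr and jsonlite packages. The helper names (`xref_symbol_url`, `xref_id_url`, `get_json`, `loc_to_symbol`) are invented for this example. Picking the first xref whose `info_type` is `PROJECTION` is an assumption based only on the single response shown above, so inspect the returned table for your own genes before relying on it.

```r
# Build the two REST URLs used in the cURL calls above (hypothetical helpers)
xref_symbol_url <- function(loc_id, species = "oncorhynchus_mykiss") {
  sprintf("https://rest.ensembl.org/xrefs/symbol/%s/%s", species, loc_id)
}
xref_id_url <- function(ensembl_id) {
  sprintf("https://rest.ensembl.org/xrefs/id/%s", ensembl_id)
}

# Fetch a URL as JSON and parse it; needs the httr and jsonlite packages
get_json <- function(url) {
  r <- httr::GET(url, httr::content_type("application/json"))
  httr::stop_for_status(r)
  jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))
}

# LOC ID -> Ensembl gene ID -> display_id of a projected xref
loc_to_symbol <- function(loc_id) {
  hits <- get_json(xref_symbol_url(loc_id))
  if (length(hits) == 0) return(NA_character_)
  xrefs <- get_json(xref_id_url(hits$id[1]))
  # Assumption: the projected xref carries the gene symbol, as in the example above
  projected <- xrefs$display_id[which(xrefs$info_type == "PROJECTION")]
  if (length(projected) == 0) NA_character_ else projected[1]
}

# Usage (requires network access):
# vapply(c("LOC110537830", "LOC110485322"), loc_to_symbol, character(1))
```

Returning `NA` for IDs without an Ensembl hit or without a projected name makes it easy to see afterwards which LOC IDs still need manual curation.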

Kevin