How to change XLOC ID to Gene symbol from Cuffdiff?
@mgmahabad1365-23539

Hi all, for my RNA-Seq analysis I used a GTF file from NCBI. In the cuffdiff output, the official gene symbols were replaced with IDs such as:

LOC110534079

LOC110534540

LOC110537830

LOC110485322

LOC110487655

LOC110491675

LOC110492686

LOC110498361

LOC110500236

LOC110502506

Example :

LOC110537830 (ID) = mknk1 (Gene symbol)

Is there any way to convert these LOC IDs back to gene symbols?

I searched the NCBI Gene database repeatedly, but the results still came back as LOC IDs.

I also tried the following sites, without success:

  1. https://biit.cs.ut.ee/gprofiler/convert

  2. https://biodbnet-abcc.ncifcrf.gov/db/dbFind.php

  3. https://www.uniprot.org/uniprot/?query=LOC110525276&sort=score

  4. I also read all the relevant biostars content but did not get the required result: https://www.biostars.org/p/129299/

I also tried the following commands from the cummeRbund R package, but the output still contained LOC IDs.

library(cummeRbund)

cuff <- readCufflinks()

# Retrieve significant gene IDs (XLOC) with a pre-specified alpha
diffGeneIDs <- getSig(cuff,level="genes",alpha=0.05)

#Use returned identifiers to create a CuffGeneSet object with all relevant info for given genes
diffGenes<-getGenes(cuff,diffGeneIDs)

#gene_short_name values (and corresponding XLOC_* values) can be retrieved from the CuffGeneSet by using:
names<-featureNames(diffGenes)
row.names(names)=names$tracking_id
diffGenesNames<-as.matrix(names)
diffGenesNames<-diffGenesNames[,-1]

# get the data for the significant genes
diffGenesData<-diffData(diffGenes)
row.names(diffGenesData)=diffGenesData$gene_id
diffGenesData<-diffGenesData[,-1]

# merge the two matrices by row names
diffGenesOutput<-merge(diffGenesNames,diffGenesData,by="row.names")

Does anyone have a solution to this problem? Regards


Dear Dr. Kevin Blighe, I would appreciate your immediate attention to this matter. Thank you very much for the prompt reply and your information. Sincere regards.


Try the solution by James, first. It will prove a lot easier.

@james-w-macdonald-5106
United States

Two other alternatives are:

Use the AnnotationHub

> library(AnnotationHub)
> hub <- AnnotationHub()
  |======================================================================| 100%

snapshotDate(): 2020-04-27
> query(hub, "mykiss")
AnnotationHub with 4 records
# snapshotDate(): 2020-04-27
# $dataprovider: Ensembl
# $species: Oncorhynchus mykiss
# $rdataclass: GRanges, EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH79739"]]' 

            title                                        
  AH79739 | Ensembl 100 EnsDb for Oncorhynchus mykiss    
  AH80248 | Oncorhynchus_mykiss.Omyk_1.0.100.abinitio.gtf
  AH80249 | Oncorhynchus_mykiss.Omyk_1.0.100.chr.gtf     
  AH80250 | Oncorhynchus_mykiss.Omyk_1.0.100.gtf         
> ensdb <- hub[["AH79739"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> z <- scan("clipboard","c")
Read 10 items
> z
 [1] "LOC110534079" "LOC110534540" "LOC110537830" "LOC110485322" "LOC110487655"
 [6] "LOC110491675" "LOC110492686" "LOC110498361" "LOC110500236" "LOC110502506"
## LOCs are just the NCBI Gene ID prepended with a LOC
> z <- gsub("LOC", "", z)
> select(ensdb, z, "GENENAME","ENTREZID")
   ENTREZID GENENAME
1 110534079     her6
2 110534540         
3 110537830    mknk1
4 110485322     tal1
5 110502506  tax1bp3

The downside of doing that, IMO, is that you are relying on the mapping Ensembl ID -> Gene ID -> Gene symbol.

And there are often technical reasons for genes to not map from Ensembl to NCBI, so you can lose lots of genes that way.
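
To see which IDs drop out in a mapping like the one above, the query can be compared against what came back. A minimal base-R sketch, assuming the result is a data frame with the ENTREZID and GENENAME columns shown in the output above (the `res` object is typed in by hand from that output, and `loc_to_gene_id` is a hypothetical helper, not part of any package):

```r
# LOC IDs from the question
loc_ids <- c("LOC110534079", "LOC110534540", "LOC110537830", "LOC110485322",
             "LOC110487655", "LOC110491675", "LOC110492686", "LOC110498361",
             "LOC110500236", "LOC110502506")

# Hypothetical helper: strip a leading "LOC" only when the rest is all digits,
# so real symbols that merely contain "LOC" are left untouched.
loc_to_gene_id <- function(x) sub("^LOC([0-9]+)$", "\\1", x)

gene_ids <- loc_to_gene_id(loc_ids)

# Result of the EnsDb select() call, copied by hand from the output above
res <- data.frame(
  ENTREZID = c("110534079", "110534540", "110537830", "110485322", "110502506"),
  GENENAME = c("her6", "", "mknk1", "tal1", "tax1bp3"),
  stringsAsFactors = FALSE
)

# IDs with no row at all in the result
not_returned <- setdiff(gene_ids, res$ENTREZID)

# IDs that were returned but carry an empty gene name
no_symbol <- res$ENTREZID[is.na(res$GENENAME) | res$GENENAME == ""]

not_returned
no_symbol
```

Keeping the two groups separate is a design choice: an ID that is missing entirely probably has no Ensembl cross-reference, while an ID with an empty name was matched but has no symbol assigned.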

Make your own NCBI-based OrgDb package

> library(AnnotationForge)
> makeOrgPackageFromNCBI("0.0.1","me <me@mine.org>", "me",".","8022", "Oncorhynchus","mykiss")
If files are not cached locally this may take awhile to assemble a 12 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene2unigene
[5] gene_info.gz
[6] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
<snip>

I only show the very beginning of the output, as this function downloads lots of data from NCBI and then parses it, so it can take quite a while. But the upside of doing this is that you are starting with NCBI identifiers, so getting the gene symbol doesn't require mapping between annotation services. After waiting for the package to build you then just do

install.packages("org.Omykiss.eg.db", repos = NULL) ## if you are on Windows, add type = "source" to that function call


Thanks. That's a lot easier than mine!


Dear Dr. James W. MacDonald, these commands are very useful and wonderful. Thank you very much for your kindness and your valuable information. I wish you success and health. Best regards, Mohammad


Just to follow up, the OrgDb package might be a bit less useful because for non-model organisms NCBI often likes to just prepend a LOC on the Gene ID, which you already have.

> z <- scan("clipboard","c")
Read 10 items
> z
 [1] "LOC110534079" "LOC110534540" "LOC110537830" "LOC110485322" "LOC110487655"
 [6] "LOC110491675" "LOC110492686" "LOC110498361" "LOC110500236" "LOC110502506"
> z <- gsub("LOC", "", z)
> select(org.Omykiss.eg.db, z, "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
         GID       SYMBOL
1  110534079 LOC110534079
2  110534540 LOC110534540
3  110537830        mknk1
4  110485322 LOC110485322
5  110487655 LOC110487655
6  110491675 LOC110491675
7  110492686 LOC110492686
8  110498361 LOC110498361
9  110500236 LOC110500236
10 110502506 LOC110502506
> select(org.Omykiss.eg.db, z, "GENENAME")
'select()' returned 1:1 mapping between keys and columns
         GID                                              GENENAME
1  110534079                       transcription factor HES-1-like
2  110534540                                        claudin-1-like
3  110537830            MAPK interacting serine/threonine kinase 1
4  110485322   T-cell acute lymphocytic leukemia protein 1 homolog
5  110487655                                  late histone H1-like
6  110491675                          uncharacterized LOC110491675
7  110492686                       transcription factor jun-D-like
8  110498361                        cystathionine gamma-lyase-like
9  110500236 megakaryocyte-associated tyrosine-protein kinase-like
10 110502506                                tax1-binding protein 3

And I haven't yet found a good way to collapse the gene names to gene symbols.
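
This does not collapse names to symbols, but one possible stopgap is a display label that keeps real symbols and falls back to the description wherever the symbol is only a LOC placeholder. A sketch in base R, with `ann` typed in by hand from three rows of the two outputs above and `label` a made-up column name:

```r
ann <- data.frame(
  GID = c("110534079", "110537830", "110491675"),
  SYMBOL = c("LOC110534079", "mknk1", "LOC110491675"),
  GENENAME = c("transcription factor HES-1-like",
               "MAPK interacting serine/threonine kinase 1",
               "uncharacterized LOC110491675"),
  stringsAsFactors = FALSE
)

# TRUE where SYMBOL is just "LOC" followed by digits
is_placeholder <- grepl("^LOC[0-9]+$", ann$SYMBOL)

# Keep real symbols; otherwise use "description (LOC id)" so the label
# stays unique even when several genes share the same description
ann$label <- ifelse(is_placeholder,
                    paste0(ann$GENENAME, " (", ann$SYMBOL, ")"),
                    ann$SYMBOL)

ann[, c("GID", "label")]
```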


Thanks so much for the valuable discussion on this issue! I am working with scRNA-seq data for rainbow trout and have the same issue. Following your discussion, I got a data frame of gene symbol annotations from AnnotationHub. My question is: how can I replace each LOCxxx in my Seurat object with the corresponding gene symbol from that data frame? (Only some of the gene IDs have corresponding gene symbols, but it is helpful to have whatever is available.) Thanks very much for your help!

Here is what I have in my Seurat object:

> head(RTC_1.0@assays[["RNA"]]@data@Dimnames[[1]], 20)  

 [1] "LOC110523613" "nit2"         "LOC110523066" "LOC110523873" "LOC110523811" "LOC110523943" "LOC100135976"
 [8] "LOC110523159" "LOC110523991" "dpt"          "LOC110524124" "LOC110524232" "LOC110524342" "tp63"
[15] "zbtb11"       "LOC110524726" "LOC110525040" "LOC110525276" "LOC100653484" "LOC110502362"

I then got some gene symbols from AnnotationHub (it's a data.frame, and I took a screenshot of a small part of it):

View(annotations)

[Image: screenshot of part of the annotations data.frame]
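
One way to approach the renaming is to build the new names as a plain character vector first. A minimal base-R sketch, assuming the annotation data frame has `ENTREZID` and `GENENAME` columns (the real column names in `annotations` are not visible in this post, so those names and the helper `rename_features` are assumptions); the example rows use only mappings reported earlier in this thread:

```r
rename_features <- function(features, annotations) {
  # Candidate Gene ID for names of the form LOC<digits>; NA for everything else
  gene_id <- ifelse(grepl("^LOC[0-9]+$", features),
                    sub("^LOC", "", features), NA_character_)
  # Look up the symbol for each candidate ID (NA when there is no match)
  symbol <- annotations$GENENAME[match(gene_id, annotations$ENTREZID)]
  has_symbol <- !is.na(symbol) & nzchar(symbol)
  # Fall back to the original name when no usable symbol exists
  new_names <- ifelse(has_symbol, symbol, features)
  # Feature names must stay unique, so disambiguate any repeated symbols
  make.unique(new_names)
}

# Stand-in for the AnnotationHub result (assumed column names)
annotations <- data.frame(
  ENTREZID = c("110534079", "110534540", "110537830"),
  GENENAME = c("her6", "", "mknk1"),
  stringsAsFactors = FALSE
)

features <- c("LOC110534079", "nit2", "LOC110534540", "LOC110537830", "tp63")
rename_features(features, annotations)
```

The result has the same length and order as the input, so it could be assigned to the row names of the count matrix before the Seurat object is created. Renaming features inside an existing object touches several slots and is not covered by this sketch.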

Kevin Blighe ★ 3.9k
@kevin
Republic of Ireland

It looks like your species is Rainbow Trout (Oncorhynchus mykiss) (?). It seems to have only relatively recently been sequenced and annotated (see HERE).

It does not yet appear to be included in Ensembl BioMart, but it's possible to search for these IDs on Ensembl's website. You can likely also use NCBI's E-utilities.

Another programmatic way, but somewhat cumbersome, is via cURL and Ensembl's REST server:

get the Ensembl ID:

curl 'https://rest.ensembl.org/xrefs/symbol/oncorhynchus_mykiss/LOC110537830' -H 'Content-type:text/xml'

<opt>
  <data id="ENSOMYG00000003938" type="gene"/>
</opt>

now get the gene name:

curl 'https://rest.ensembl.org/xrefs/id/ENSOMYG00000003938?' -H 'Content-type:text/xml'

<opt>
  <data db_display_name="NCBI gene (formerly Entrezgene)" dbname="EntrezGene" description="MAP kinase-interacting serine/threonine-protein kinase 1-like" display_id="LOC110537830" info_text="" info_type="DEPENDENT" primary_id="110537830" version="0">
  </data>
  <data db_display_name="WikiGene" dbname="WikiGene" description="MAP kinase-interacting serine/threonine-protein kinase 1-like" display_id="LOC110537830" info_text="" info_type="DEPENDENT" primary_id="110537830" version="0">
  </data>
  <data db_display_name="Projected ZFIN" dbname="ZFIN_ID" description="MAPK interacting serine/threonine kinase 1" display_id="mknk1" info_text="from danio_rerio gene ENSDARG00000018411" info_type="PROJECTION" primary_id="mknk1" version="0">
  </data>
</opt>

A bit cumbersome, but possible to code for all your 'LOC' IDs. It's also possible to do all of this within R - see here: https://rest.ensembl.org/
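
The two REST calls above could be wrapped for a whole vector of IDs roughly as follows. This is a sketch under stated assumptions, not a tested pipeline: it assumes the jsonlite package is installed, that the server returns JSON when `content-type=application/json` is appended to the URL, and that the wanted symbol is the first `display_id` that is not itself a LOC ID. All helper names are made up for illustration.

```r
base_url <- "https://rest.ensembl.org"
species <- "oncorhynchus_mykiss"

# Build the URL for step 1 (LOC ID -> Ensembl gene ID)
symbol_url <- function(loc_id) {
  paste0(base_url, "/xrefs/symbol/", species, "/", loc_id,
         "?content-type=application/json")
}

# Build the URL for step 2 (Ensembl gene ID -> external references)
xref_url <- function(ensembl_id) {
  paste0(base_url, "/xrefs/id/", ensembl_id, "?content-type=application/json")
}

# Pick the first display_id that is not a LOC placeholder; NA if there is none
pick_symbol <- function(xrefs) {
  ids <- xrefs$display_id
  ids <- ids[!is.na(ids) & !grepl("^LOC[0-9]+$", ids)]
  if (length(ids) > 0) ids[1] else NA_character_
}

# Network lookup for one ID (assumes jsonlite and internet access)
lookup_symbol <- function(loc_id) {
  hits <- jsonlite::fromJSON(symbol_url(loc_id))
  if (length(hits) == 0) return(NA_character_)
  pick_symbol(jsonlite::fromJSON(xref_url(hits$id[1])))
}

# Offline check of pick_symbol, using the display_id values from the response above
example <- data.frame(dbname = c("EntrezGene", "WikiGene", "ZFIN_ID"),
                      display_id = c("LOC110537830", "LOC110537830", "mknk1"),
                      stringsAsFactors = FALSE)
pick_symbol(example)
```

Something like `vapply(loc_ids, lookup_symbol, character(1))` would then run the lookup over all IDs; for long lists it is probably wise to pause between requests so the server's rate limits are respected.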

Kevin
