Question

LOC gene IDs for characterised genes in annotation files

0

Entering edit mode

Chris • 0

@6c640a95

Last seen 2.9 years ago

United Kingdom

I've noticed that there are a lot of genes (>1/3 of total) which have names and/or symbols in NCBI that are not in the reference annotation files. The oviAri4.ncbiRefSeq.gtf and GCF_002742125.1_Oar_rambouillet_v1.0_genomic.gtf annotation files both have the problem.

To demonstrate, and to calculate the extent of the problem, I have used the following code:

First I import the gtf file and find the total number of genes, and the total that have a LOC gene ID.

txdb <- makeTxDbFromGFF(file=gffFile, format=c("gtf"))
txdf <- AnnotationDbi::select(txdb, keys(txdb,"GENEID"), 
        "TXNAME", "GENEID")
no_of_genes <- nrow(unique(txdf[1]))
LOC_only <- txdf[1] %>%
  filter(str_starts(GENEID, "LOC")) %>%
  unique()

Then I query each against the NCBI database and extract the gene symbol ("name") and the gene name ("description").

for (i in 1:nrow(LOC_only)){
  tryCatch({
    e_name <- 
      entrez_summary(db="gene",
           id=as.numeric(gsub("LOC","",
                    LOC_only$GENEID[i])))$name
    e_description <- 
      entrez_summary(db="gene",
            id=as.numeric(gsub("LOC","",
                    LOC_only$GENEID[i])))$description
    LOC_only$symbol[i] <- e_name
    LOC_only$description[i] <- e_description
  }, error=function(e){})
}

Doing this I have discovered that a large proportion of the genes using a LOC ID have names and symbols on the NCBI database. I quantified this using the following code:

LOC_uncharacterized<- LOC_only %>%
  filter(str_starts(description, "uncharacterized")) %>%
  nrow()
LOC_with_name <- LOC_only %>%
  filter(!str_starts(description, "uncharacterized")) %>%
  nrow()
LOC_with_Symbol <- LOC_only %>%
  filter(!str_starts(symbol, "LOC")) %>%
  nrow()
LOC_with_name_no_symbol <- LOC_only %>%
  filter(str_starts(symbol, "LOC")) %>%
  filter(!str_starts(description, "uncharacterized")) %>%
  nrow()

The results were:

For the oviAri4.ncbiRefSeq.gtf:

Out of the 24478 genes in the annotation, 8462 only have a LOC symbol = 8462/24478 = 34.6%.

3778 were uncharacterized = 3778/8462 = 44.6%. 4654 had a description containing a name = 4654/8462 = 55.0%. 526 also had an NCBI symbol = 526/8462 = 6.2%. 4128 had a description containing a name but no symbol = 4128/8462 = 48.8%.

For the Oar_rambouillet_v1.0.gtf:

Out of the 28524 genes in the annotation, 10441 only have a LOC symbol = 10441/28524 = 36.6%.

3153 were uncharacterized = 3153/10441 = 30.1%. 7288 had a description containing a name = 7288/10441 = 69.8%. 15 also had an NCBI symbol = 15/10441 = 0.1%. 7273 had a description containing a name but no symbol = 7273/10441 = 69.7%.

As you can see, over 6% of the genes in oviAri4.ncbiRefSeq.gtf that have a LOC ID actually have a gene symbol in NCBI. This is only 0.1% for Oar_rambouillet_v1.0.gtf, which is not too bad. However, of the genes that have a LOC ID, 55% in oviAri4.ncbiRefSeq.gtf and nearly 70% in Oar_rambouillet_v1.0.gtf have names in NCBI but no symbol.

Examples:

> head(LOC_only,20)
         GENEID       symbol                                       description
1  LOC100037664 LOC100037664                        60S ribosomal protein L35a
2  LOC100037668 LOC100037668                        40S ribosomal protein S15a
3  LOC100101238 LOC100101238                           regakine 1-like protein
4  LOC100125610 LOC100125610                       elongation factor 1-alpha 1
5  LOC100141295 LOC100141295 keratin, type II microfibrillar, component 5-like
6  LOC100144429 LOC100144429                                        tryptase-2
7  LOC100191064 LOC100191064                                     antigen WC1.1
8  LOC100192427 LOC100192427                               Fc gamma 2 receptor
9  LOC100526780 LOC100526780                                        keratin 83
10 LOC100526782 LOC100526782                                       keratin 33B
11 LOC100846990 LOC100846990                                                N2
12 LOC101101846 LOC101101846          pregnancy-associated glycoprotein 1-like
13 LOC101101848 LOC101101848                           olfactory receptor 10S1
14 LOC101101856 LOC101101856                    ceramide synthase 5 pseudogene
15 LOC101101861 LOC101101861                           olfactory receptor 2AE1
16 LOC101101862 LOC101101862                      uncharacterized LOC101101862
18 LOC101101875 LOC101101875     histone-lysine N-methyltransferase PRDM9-like
19 LOC101101878         AATK              apoptosis associated tyrosine kinase
20 LOC101101888 LOC101101888                  GTPase IMAP family member 7-like
21 LOC101101889 LOC101101889                              hyaluronidase-4-like

So, my question...is there a way this can be fixed, or is this something to do with gene symbols needing to be signed off before incorporated, or something? If this can't be fixed, is there a better way that I can correct the annotation files manually? The above code is very slow and takes a number of hours.

R session info below.

Thank you!

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] data.table_1.14.0      rentrez_1.2.3          stringr_1.4.0         
 [4] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3   Biobase_2.48.0        
 [7] GenomicRanges_1.40.0   GenomeInfoDb_1.24.2    IRanges_2.22.2        
[10] S4Vectors_0.26.1       BiocGenerics_0.34.0    dplyr_1.0.4           
[13] readr_1.4.0           

loaded via a namespace (and not attached):
 [1] httr_1.4.2                  edgeR_3.30.3               
 [3] bit64_4.0.5                 jsonlite_1.7.2             
 [5] assertthat_0.2.1            askpass_1.1                
 [7] BiocFileCache_1.12.1        blob_1.2.1                 
 [9] GenomeInfoDbData_1.2.3      Rsamtools_2.4.0            
[11] yaml_2.2.1                  progress_1.2.2             
[13] pillar_1.5.0                RSQLite_2.2.3              
[15] lattice_0.20-41             limma_3.44.3               
[17] glue_1.4.2                  digest_0.6.27              
[19] XVector_0.28.0              colorspace_2.0-0           
[21] plyr_1.8.6                  htmltools_0.5.1.9000       
[23] Matrix_1.3-2                XML_3.99-0.5               
[25] pkgconfig_2.0.3             biomaRt_2.44.4             
[27] zlibbioc_1.34.0             purrr_0.3.4                
[29] scales_1.1.1                BiocParallel_1.22.0        
[31] tibble_3.0.6                openssl_1.4.3              
[33] ggplot2_3.3.3               generics_0.1.0             
[35] ellipsis_0.3.1              cachem_1.0.4               
[37] SummarizedExperiment_1.18.2 magrittr_2.0.1             
[39] crayon_1.4.1                memoise_2.0.0              
[41] evaluate_0.14               fansi_0.4.2                
[43] xml2_1.3.2                  tools_4.0.2                
[45] prettyunits_1.1.1           hms_1.0.0                  
[47] lifecycle_1.0.0             matrixStats_0.58.0         
[49] munsell_0.5.0               locfit_1.5-9.4             
[51] DelayedArray_0.14.1         Biostrings_2.56.0          
[53] compiler_4.0.2              tinytex_0.29               
[55] rlang_0.4.10                grid_4.0.2                 
[57] RCurl_1.98-1.2              rstudioapi_0.13            
[59] DRIMSeq_1.16.1              rappdirs_0.3.3             
[61] bitops_1.0-6                rmarkdown_2.7              
[63] gtable_0.3.0                DBI_1.1.1                  
[65] curl_4.3                    reshape2_1.4.4             
[67] R6_2.5.0                    GenomicAlignments_1.24.0   
[69] knitr_1.31                  rtracklayer_1.48.0         
[71] fastmap_1.1.0               bit_4.0.4                  
[73] utf8_1.1.4                  stringi_1.5.3              
[75] Rcpp_1.0.6                  vctrs_0.3.6                
[77] dbplyr_2.1.0                tidyselect_1.1.0           
[79] xfun_0.21

Annotation RNAseq • 5.1k views

ADD COMMENT • link updated 2.2 years ago by Lan • 0 • written 3.0 years ago by Chris • 0

James W. MacDonald · Accepted Answer · 2021-05-12

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 9 hours ago

United States

The TxDb objects are intended to provide genomic positions for genes/exons/transcripts/etc. They are not intended to provide other non-positional annotations like gene symbols. Those data come from OrgDb objects, which you can get from the AnnotationHub

> library(AnnotationHub)
> hub <- AnnotationHub()
> 
  |======================================================================| 100%

snapshotDate(): 2020-10-27

> query(hub, c("ovis","orgdb"))
AnnotationHub with 4 records
# snapshotDate(): 2020-10-27
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Ovis ovis, Ovis orientalis_aries, Ovis aries, Ovis ammon_aries
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH85594"]]' 

            title                              
  AH85594 | org.Ovis_ammon_aries.eg.sqlite     
  AH85595 | org.Ovis_aries.eg.sqlite           
  AH85596 | org.Ovis_orientalis_aries.eg.sqlite
  AH85597 | org.Ovis_ovis.eg.sqlite            
> z <- hub[["AH85595"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
> z
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Ovis aries
| SPECIES: Ovis aries
| CENTRALID: GID
| Taxonomy ID: 9940
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information
> length(keys(z))
[1] 34328
## How many have no symbol?
> sum(is.na(mapIds(z, keys(z), "ENTREZID", "GID")))
'select()' returned 1:1 mapping between keys and columns
[1] 0
#What do some of those symbols look like?
> select(z, head(keys(z)), "SYMBOL", "GID")
'select()' returned 1:1 mapping between keys and columns
     GID SYMBOL
1 442992  GLUT4
2 442993 UGT1A9
3 442996   GHRL
4 442997 CXCL10
5 442998   PYGM
6 442999   AQP1

The central ID for this OrgDb is listed as the NCBI GID, but it's actually the NCBI Gene ID (formerly known as the Entrez Gene ID). Your 'LOC' IDs are simply the NCBI Gene ID with a 'LOC' stuck on the front.

> zz <- select(z, keys(z), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns

> head(zz[grep("^LOC", zz[,2]),])
       GID    SYMBOL
16  443015 LOC443015
31  443035 LOC443035
102 443144 LOC443144
110 443162 LOC443162
127 443187 LOC443187
128 443188 LOC443188

So if you need to map between gene locations that have a 'LOC' and actual Gene IDs, just strip the 'LOC' off.

ADD COMMENT • link 3.0 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks James, that's great information. I didn't realise this fundamental difference and have modified my workflows to use OrgDb objects for annotation instead. But there's still lots of genes that have a GENENAME and no SYMBOL. I see the OrgDb object has aliases too, but the majority don't have any.

library(AnnotationHub)
hub <- AnnotationHub()
z <- hub[["AH85595"]]

zzz<- select(z, keys(z), c("SYMBOL","ALIAS","GENENAME")) %>% 
  dplyr::nest_by(GID, SYMBOL, GENENAME, .key="ALIASES")
nrow(zzz)  #34328 genes

LOC_zzz <- zzz[which(
  str_starts(zzz$SYMBOL, "LOC") == T),]

Examples of no SYMBOL but has GENENAME. Most have no ALIAS, others have one or more.

   GID       SYMBOL       GENENAME                                                         ALIASES
   <chr>     <chr>        <chr>                                                 <list<tibble[,1]>>
 1 100037664 LOC100037664 60S ribosomal protein L35a                                       [1 x 1]
 2 100037668 LOC100037668 40S ribosomal protein S15a                                       [1 x 1]
 3 100037688 LOC100037688 kappa light chain                                                [1 x 1]
 4 100037697 LOC100037697 galanin receptor 1                                               [1 x 1]
 5 100037698 LOC100037698 galanin receptor 2                                               [1 x 1]
 6 100037699 LOC100037699 galanin receptor 3                                               [1 x 1]
 7 100037700 LOC100037700 GALP                                                             [1 x 1]
 8 100101232 LOC100101232 glyceraldehyde-3-phosphate dehydrogenase                         [1 x 1]
 9 100101238 LOC100101238 regakine 1-like protein                                          [1 x 1]
10 100125610 LOC100125610 elongation factor 1-alpha 1                                      [1 x 1]
11 100134870 LOC100134870 beta-C globin                                                    [2 x 1]
12 100135684 LOC100135684 serum amyloid A3.2                                               [1 x 1]
13 100141295 LOC100141295 keratin, type II microfibrillar, component 5-like                [3 x 1]
14 100144429 LOC100144429 tryptase-2                                                       [1 x 1]
15 100144759 LOC100144759 T19 protein                                                      [1 x 1]
16 100188976 LOC100188976 BIIIB3 high-sulfur keratin pseudogene                            [1 x 1]
17 100191064 LOC100191064 antigen WC1.1                                                    [1 x 1]
18 100192427 LOC100192427 Fc gamma 2 receptor                                              [1 x 1]
19 100526780 LOC100526780 keratin 83                                                       [1 x 1]
20 100526781 LOC100526781 keratin, type I microfibrillar 48 kDa, component 8C-1            [2 x 1]

To quantify:

nrow(LOC_zzz)   #15541 with LOC/GID as SYMBOL = 45.3% total genes

LOC_zzz_unchar <- LOC_zzz[which(
  str_starts(LOC_zzz$GENENAME, "uncharacterized") == T),] 

nrow(LOC_zzz_unchar)   #3205 of these are uncharacterised = 20.6% of LOC/GID genes

nrow(LOC_zzz) - nrow(LOC_zzz_unchar) #12336 have GENENAMES but no SYMBOL

As you can see, nearly half of all the genes in that OrgDb have the LOC/GID for SYMBOL, but 79.4% of those have a GENENAME entry.

Is this a case of waiting for these genes to have symbols assigned officially, or similar, or is it something that I can do a work around for?

Having the aliases on hand is useful, so I can get them and incorporate as a list, or select the first one or something, but it's not really making a dent on the number that have a GENENAME but no SYMBOL.

library(tidyverse)
LOC_zzz <- LOC_zzz %>%
  rowwise() %>%
  mutate(num_aliases = nrow(ALIASES))

has_alias <- LOC_zzz[which(LOC_zzz$num_aliases>1),]
nrow(has_alias)   #690 of those with LOC/GUI for symbol have an alias

I ask because when we do RNA seq, there are always loads of genes that come up that have the LOC/GID. Then it's a laborious task of finding out what they all are, and some are really important genes. For instance:

> zzz[which(zzz$GENENAME=="cytochrome c"),]
# A tibble: 5 x 4
# Rowwise:  GID, SYMBOL, GENENAME
  GID       SYMBOL       GENENAME                ALIASES
  <chr>     <chr>        <chr>        <list<tibble[,1]>>
1 101107359 LOC101107359 cytochrome c            [1 x 1]
2 101107954 LOC101107954 cytochrome c            [3 x 1]
3 101117527 LOC101117527 cytochrome c            [1 x 1]
4 106990092 LOC106990092 cytochrome c            [1 x 1]
5 114108777 LOC114108777 cytochrome c            [1 x 1]

Thanks for having a look at this and humouring a fool! Every time I think I'm getting a grasp of it I find out I don't understand as much as I thought I did!

Edit:- Side question: the OrgDb doesn't seem to have transcript IDs either...how do you go about identifying differential splicing without these?

ADD REPLY • link 3.0 years ago Chris • 0

1

Entering edit mode

You are making the unwarranted assumption that the gene symbol for say Gene ID 100037644 isn't LOC100037644. But I'm here to tell you that it is. Or at least according to NCBI, which is where those data come from. And presumably NCBI is waiting for HUGO or whomever to come up with the officially designated gene symbol.

And to be clear, all of these packages are simply an attempt to make it easier for people to access available data. We don't do any annotating ourselves, but simply take what data we can get from various annotation services and repackage in a format that is more useful to our end users.

The OrgDb does have transcript names!

> select(z, head(keys(z)), "REFSEQ")
'select()' returned 1:many mapping between keys and columns
      GID         REFSEQ
1  442992           <NA>
2  442993 NP_001009189.1
3  442993 NM_001009189.1
4  442996 NP_001009721.1
5  442996 NM_001009721.3
6  442997 NP_001009191.1
7  442997 NM_001009191.1
8  442998 NP_001009192.1
9  442998 NM_001009192.2
10 442999 NP_001009194.1
11 442999 NM_001009194.1

Again, for those genes that NCBI says there is a transcript. If you use Ensembl instead you will almost certainly get a different answer.

Anyway, the NCBI GTF file should have what you need for transcript work.

> download.file("https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Ovis_aries/latest_assembly_versions/GCF_002742125.1_Oar_rambouillet_v1.0/GCF_002742125.1_Oar_rambouillet_v1.0_genomic.gtf.gz", "./Oar_rambouillet_v1.0_genomic.gtf.gz")
trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Ovis_aries/latest_assembly_versions/GCF_002742125.1_Oar_rambouillet_v1.0/GCF_002742125.1_Oar_rambouillet_v1.0_genomic.gtf.gz'
Content type 'application/x-gzip' length 14016343 bytes (13.4 MB)
downloaded 13.4 MB

> library(rtracklayer)
> zz <- import("Oar_rambouillet_v1.0_genomic.gtf.gz")

> library(GenomicFeatures)
> zzz <- makeTxDbFromGRanges(zz)

> transcriptsBy(zzz)
GRangesList object of length 28524:
$A1BG
GRanges object with 1 range and 2 metadata columns:
         seqnames            ranges strand |     tx_id        tx_name
            <Rle>         <IRanges>  <Rle> | <integer>    <character>
  [1] NC_040265.1 70903705-70908207      - |     34408 XM_027978951.1
  -------
  seqinfo: 395 sequences from an unspecified genome; no seqlengths

$A1CF
GRanges object with 3 ranges and 2 metadata columns:
         seqnames          ranges strand |     tx_id        tx_name
            <Rle>       <IRanges>  <Rle> | <integer>    <character>
  [1] NC_040273.1 9814449-9905829      + |     44757 XM_004020007.4
  [2] NC_040273.1 9814449-9905829      + |     44758 XM_004020008.4
  [3] NC_040273.1 9814451-9905829      + |     44759 XM_027960211.1
  -------
  seqinfo: 395 sequences from an unspecified genome; no seqlengths

$A2ML1
GRanges object with 4 ranges and 2 metadata columns:
         seqnames              ranges strand |     tx_id        tx_name
            <Rle>           <IRanges>  <Rle> | <integer>    <character>
  [1] NC_040254.1 221898659-221949146      - |     13777 XM_004006891.3
  [2] NC_040254.1 221898659-221949146      - |     13778 XM_012175470.2
  [3] NC_040254.1 221898659-221949146      - |     13779 XM_012175471.2
  [4] NC_040254.1 221898659-221949146      - |     13780 XM_012175472.3
  -------
  seqinfo: 395 sequences from an unspecified genome; no seqlengths

...
<28521 more elements>
>

There's probably other GTF files at NCBI you could use as well. I just grabbed the first one I saw.

ADD REPLY • link 3.0 years ago James W. MacDonald 65k

0

Entering edit mode

Brilliant, thank you! Just really wanted to make sure what I was seeing was in fact the current state of play and not just me being a fool! I shall continue waiting until NCBI / HUGO / anyone else feels it's time for some attention to the sheep annotation.

Thank you for your time and the illuminating insight!

ADD REPLY • link 3.0 years ago Chris • 0

0

Entering edit mode

Hi Chris and James, thanks so much for the valuable discussion on this issue! I am dealing with scRNA-seq for rainbow trout and having the same issue here. I got a data frame of gene symbol annotation from AnnotationHub according to your discussion, my questeion is how can I replace each of the LOCxxx in my seurat object with the gene symbol according to the data I got from AnnotationHub. (although there are only some of the gene ID have correspoding gene sybmols, but it's helpful to have whatever is available out there). Thanks very much for your help!

Here is what I have in my seurat object:

> head(RTC_1.0@assays[["RNA"]]@data@Dimnames[[1]], 20)  

 [1] "LOC110523613" "nit2"         "LOC110523066" "LOC110523873" "LOC110523811" "LOC110523943" "LOC100135976"

 [8] "LOC110523159" "LOC110523991" "dpt"          "LOC110524124" "LOC110524232" "LOC110524342" "tp63" 

[15] "zbtb11"       "LOC110524726" "LOC110525040" "LOC110525276" "LOC100653484" "LOC110502362"

And then I have got some gene symbols from AnnotationHub (it's a data.frame, and I screenshot a small part of it):

View(annotations)

enter image description here

ADD REPLY • link 2.2 years ago Lan • 0

0

Entering edit mode

Please don't piggyback on old posts! If you have a question, please ask in a new thread.

LOC values are NCBI gene symbols (which in turn are just LOC followed by the NCBI Gene ID). What you show from AnnotationHub are Ensembl Gene IDs, and it's not worth the time to try to map between NCBI and Ensembl. If you have aligned your data using gene locations from NCBI, then you should stick with NCBI for the gene mappings.

Also, just a quick check for two of those IDs at NCBI indicates that they have been replaced by the actual NCBI Gene ID and have symbols. The O. mykiss genome is getting pretty good. But they unfortunately don't always just strip off the LOC from the Gene ID when they update. Because of that, mapping from those LOC symbols might be a bit daunting, and it might just be easier to regenerate the counts/gene using an updated GTF (or update however you did the original alignment/gene counting).

ADD REPLY • link 2.2 years ago James W. MacDonald 65k

0

Entering edit mode

Hi Dr. MacDonald, thanks very much for your reply! I thought my question is very related to this post, so I piggybacked on here, but will try to avoid that in the future.

The last column of the table I got from AnnotationHub is actually the NCBI gene ID (formerly Entrez ID), without LOC. I tried using the most updated genome assembly USDA_OmykA_1.1 (https://www.ncbi.nlm.nih.gov/assembly/GCF_013265735.2/#/def_asm_Primary_Assembly) and annotation gtf file GCF_013265735.2_genomic.gtf (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/265/735/GCF_013265735.2_USDA_OmykA_1.1/GCF_013265735.2_USDA_OmykA_1.1_genomic.gtf.gz) from NCBI, annotation has been improved, but still found a big proportion of genes noted as "LOCxxxx".

    head(C1@assays[["RNA"]]@data@Dimnames[[1]], 100)
 [1] "LOC110523811"      "sft2d2a"           "tmem45a"           "LOC110532027"      "tbc1d23"          
  [6] "nit2"              "plcxd1"            "LOC110520928"      "rp2"               "slc9a7"           
 [11] "chst7"             "tubgcp5"           "cyfip1"            "nipa2"             "nipa1"            
 [16] "gabrb3"            "atp10a"            "LOC110520782"      "LOC100135976"      "tmem131"          
 [21] "si:ch211-201h21.5" "zgc:172121"        "nme7"              "atp1b1a"           "dpt"              
 [26] "zbtb11"            "tp63"              "LOC110523943"      "LOC110518923"      "bcl6aa"           
 [31] "smx5"              "inpp5d"            "cep63"             "LOC110524898"      "LOC110524778"     
 [36] "LOC110524992"      "crocc2"            "klhl30"            "LOC110486761"      "LOC110525986"     
 [41] "LOC110526196"      "LOC110526525"      "mfsd8l1"           "polr2h"            "LOC110527749"     
 [46] "b3gnt7"            "LOC110527966"      "ncl"               "LOC118966752"      "LOC110528054"     
 [51] "LOC110528129"      "ankle1"            "ddx59"             "LOC110528753"      "LOC110528839"     
 [56] "LOC110528925"      "nmur1a"            "dynlt2b"           "LOC110529397"      "LOC100135997"     
 [61] "LOC110529481"      "LOC110529671"      "LOC110529747"      "LOC110529807"      "LOC110508833"     
 [66] "LOC118966096"      "ndufb5"            "usp13"             "pex5la"            "LOC110531220"     
 [71] "LOC110531147"      "psmd1"             "htr2b"             "LOC110530621"      "ppp1r2"           
 [76] "LOC110532571"      "LOC110533060"      "LOC110533192"      "ftsj3"             "asb3"             
 [81] "LOC110533496"      "gipc3"             "LOC110522842"      "LOC110533913"      "LOC110533997"     
 [86] "LOC110523899"      "LOC110534079"      "LOC110524396"      "LOC110534151"      "ccdc50"           
 [91] "LOC110534688"      "cpox"              "LOC110534884"      "LOC110535080"      "tab3"             
 [96] "mindy4"            "LOC110535719"      "LOC110535978"      "armh1"             "LOC110536107"

If you search, e.g., LOC110532027 in NCBI website you will get the gene discripted as "mitochondrial import receptor subunit TOM70", however, in gtf file gene_id is still the "LOC110532027". On the other hand, if I search "entrezid = 110532027" in the data.frame I got from AnnotationHub, there is a corresponding gene_name "tomm70a", see screenshot below: enter image description here

so there are some gene IDs, which are originally "LOCxxx", can be replaced to be gene symbol base on AnnotationHub data. I just don't know how to achieve that. I can also redo the alignment and counting if you have any idea of replacing the LOC with gene symbol in the gtf file instead of in the seurat object. I also have the list of all genes as a tsv file (as a part of alignment and counting outputs), if I can replace LOCs with gene symbols in there, it's also can be very helpful.

Thanks again!

ADD REPLY • link updated 2.2 years ago by James W. MacDonald 65k • written 2.2 years ago by Lan • 0

0

Entering edit mode

The most updated gene names should come from that GTF file.

> library(rtracklayer)
> library(GenomicFeatures)
> z <- import("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/265/735/GCF_013265735.2_USDA_OmykA_1.1/GCF_013265735.2_USDA_OmykA_1.1_genomic.gtf.gz")
trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/265/735/GCF_013265735.2_USDA_OmykA_1.1/GCF_013265735.2_USDA_OmykA_1.1_genomic.gtf.gz'
Content type 'application/x-gzip' length 32019068 bytes (30.5 MB)
downloaded 30.5 MB

> zz <- makeTxDbFromGRanges(z)
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
2: In makeTxDbFromGRanges(z) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  unknown_transcript_1
3: In makeTxDbFromGRanges(z) :
  The following transcripts were dropped because no genomic ranges could
  be found for them and their ranges could not be inferred from their
  exons either (because they have them on both strands):
  unknown_transcript_1
4: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS
  (only the first CDS was kept for each exon): NM_001142780.1,
  NM_001142783.1
> genes(zz)["LOC110532027"]
GRanges object with 1 range and 1 metadata column:
                  seqnames        ranges strand |      gene_id
                     <Rle>     <IRanges>  <Rle> |  <character>
  LOC110532027 NC_048565.1 384207-421233      + | LOC110532027
  -------
  seqinfo: 416 sequences from an unspecified genome; no seqlengths
> genes(zz)["LOC110523991"]
Error: subscript contains invalid names

And that last bit is the problem. There are still plenty of LOC Gene IDs, but for those that have been updated, some have changed. For example, LOC110523991 got changed to zbtb11, which is Gene ID 110524418. And it's hard to track that down

> genes(zz)["LOC110523991"]
Error: subscript contains invalid names
> genes(zz)["110524418"]
Error: subscript contains invalid names
> z[grep("110524418", z$db_xref),]
GRanges object with 155 ranges and 26 metadata columns:
           seqnames          ranges strand |   source        type     score
              <Rle>       <IRanges>  <Rle> | <factor>    <factor> <numeric>
    [1] NC_048565.1 2212135-2258325      + |   Gnomon        gene        NA
    [2] NC_048565.1 2212136-2212659      + |   Gnomon        exon        NA
    [3] NC_048565.1 2221226-2221485      + |   Gnomon        exon        NA
    [4] NC_048565.1 2230665-2230896      + |   Gnomon        exon        NA
    [5] NC_048565.1 2231164-2232359      + |   Gnomon        exon        NA
    ...         ...             ...    ... .      ...         ...       ...
  [151] NC_048565.1 2247608-2247684      + |   Gnomon CDS                NA
  [152] NC_048565.1 2254459-2254634      + |   Gnomon CDS                NA
  [153] NC_048565.1 2256732-2257249      + |   Gnomon CDS                NA
  [154] NC_048565.1 2230716-2230718      + |   Gnomon start_codon        NA
  [155] NC_048565.1 2257250-2257252      + |   Gnomon stop_codon         NA
            phase     gene_id  transcript_id          db_xref       gbkey
        <integer> <character>    <character>      <character> <character>
    [1]      <NA>      zbtb11                GeneID:110524418        Gene
    [2]      <NA>      zbtb11 XR_005052082.1 GeneID:110524418    misc_RNA
    [3]      <NA>      zbtb11 XR_005052082.1 GeneID:110524418    misc_RNA
    [4]      <NA>      zbtb11 XR_005052082.1 GeneID:110524418    misc_RNA
    [5]      <NA>      zbtb11 XR_005052082.1 GeneID:110524418    misc_RNA

But if you used the TxDb that I just generated to get the counts/gene, you wouldn't have to do that mapping any more because the counts for zbtb11 would be assigned to ztbt11 instead of LOC110523991. So it is probably easier to just go back and regenerate the counts.

ADD REPLY • link 2.2 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks for your reply again!

The example of LOC110523991 came from my first post, which used the old annotation, but I can see the zbtb11 gene in my counts/gene matrix generted by the updated annotation OmykA_1.1/GCF_013265735.2, the same situation of the TxDb code you gave, you can see the zbtb11 gene name, so I guess using the TxDb to get the counts/gene will generate the exactly the same results since both are using OmykA_1.1/GCF_013265735.2 gtf file, make sense?

Maybe the best solution would be somebody can generate a new version of annotation that has all the current best available knowledge of the gene symbols as there are still substantial genes only labelled LOC (although we already know what genes are they). Data from AnnotationHub have slightly more genes with known gene symbols (as the example of 110532027 corresponding to tomm70a), but mapping that down to OmykA_1.1/GCF_013265735.2 can only solve a small part of the issue, let alone I don't have any idea to acheive that with code.

Anyway, thanks so much for the discussion help with me.

ADD REPLY • link 2.2 years ago Lan • 0