LOC gene IDs for characterised genes in annotation files
1
0
Entering edit mode
Chris • 0
@6c640a95
Last seen 9 weeks ago
United Kingdom

I've noticed that there are a lot of genes (>1/3 of total) which have names and/or symbols in NCBI that are not in the reference annotation files. The oviAri4.ncbiRefSeq.gtf and GCF_002742125.1_Oar_rambouillet_v1.0_genomic.gtf annotation files both have the problem.

To demonstrate, and to calculate the extent of the problem, I have used the following code:

First I import the gtf file and find the total number of genes, and the total that have a LOC gene ID.

txdb <- makeTxDbFromGFF(file=gffFile, format=c("gtf"))
txdf <- AnnotationDbi::select(txdb, keys(txdb,"GENEID"), 
        "TXNAME", "GENEID")
no_of_genes <- nrow(unique(txdf[1]))
LOC_only <- txdf[1] %>%
  filter(str_starts(GENEID, "LOC")) %>%
  unique()

Then I query each against the NCBI database and extract the gene symbol ("name") and the gene name ("description").

for (i in 1:nrow(LOC_only)){
  tryCatch({
    e_name <- 
      entrez_summary(db="gene",
           id=as.numeric(gsub("LOC","",
                    LOC_only$GENEID[i])))$name
    e_description <- 
      entrez_summary(db="gene",
            id=as.numeric(gsub("LOC","",
                    LOC_only$GENEID[i])))$description
    LOC_only$symbol[i] <- e_name
    LOC_only$description[i] <- e_description
  }, error=function(e){})
}

Doing this I have discovered that a large proportion of the genes using a LOC ID have names and symbols on the NCBI database. I quantified this using the following code:

LOC_uncharacterized<- LOC_only %>%
  filter(str_starts(description, "uncharacterized")) %>%
  nrow()
LOC_with_name <- LOC_only %>%
  filter(!str_starts(description, "uncharacterized")) %>%
  nrow()
LOC_with_Symbol <- LOC_only %>%
  filter(!str_starts(symbol, "LOC")) %>%
  nrow()
LOC_with_name_no_symbol <- LOC_only %>%
  filter(str_starts(symbol, "LOC")) %>%
  filter(!str_starts(description, "uncharacterized")) %>%
  nrow()

The results were:

For the oviAri4.ncbiRefSeq.gtf:

Out of the 24478 genes in the annotation, 8462 only have a LOC symbol = 8462/24478 = 34.6%.

3778 were uncharacterized = 3778/8462 = 44.6%. 4654 had a description containing a name = 4654/8462 = 55.0%. 526 also had an NCBI symbol = 526/8462 = 6.2%. 4128 had a description containing a name but no symbol = 4128/8462 = 48.8%.

For the Oar_rambouillet_v1.0.gtf:

Out of the 28524 genes in the annotation, 10441 only have a LOC symbol = 10441/28524 = 36.6%.

3153 were uncharacterized = 3153/10441 = 30.1%. 7288 had a description containing a name = 7288/10441 = 69.8%. 15 also had an NCBI symbol = 15/10441 = 0.1%. 7273 had a description containing a name but no symbol = 7273/10441 = 69.7%.

As you can see, over 6% of the genes in oviAri4.ncbiRefSeq.gtf that have a LOC ID actually have a gene symbol in NCBI. This is only 0.1% for Oar_rambouillet_v1.0.gtf, which is not too bad. However, of the genes that have a LOC ID, 55% in oviAri4.ncbiRefSeq.gtf and nearly 70% in Oar_rambouillet_v1.0.gtf have names in NCBI but no symbol.

Examples:

> head(LOC_only,20)
         GENEID       symbol                                       description
1  LOC100037664 LOC100037664                        60S ribosomal protein L35a
2  LOC100037668 LOC100037668                        40S ribosomal protein S15a
3  LOC100101238 LOC100101238                           regakine 1-like protein
4  LOC100125610 LOC100125610                       elongation factor 1-alpha 1
5  LOC100141295 LOC100141295 keratin, type II microfibrillar, component 5-like
6  LOC100144429 LOC100144429                                        tryptase-2
7  LOC100191064 LOC100191064                                     antigen WC1.1
8  LOC100192427 LOC100192427                               Fc gamma 2 receptor
9  LOC100526780 LOC100526780                                        keratin 83
10 LOC100526782 LOC100526782                                       keratin 33B
11 LOC100846990 LOC100846990                                                N2
12 LOC101101846 LOC101101846          pregnancy-associated glycoprotein 1-like
13 LOC101101848 LOC101101848                           olfactory receptor 10S1
14 LOC101101856 LOC101101856                    ceramide synthase 5 pseudogene
15 LOC101101861 LOC101101861                           olfactory receptor 2AE1
16 LOC101101862 LOC101101862                      uncharacterized LOC101101862
18 LOC101101875 LOC101101875     histone-lysine N-methyltransferase PRDM9-like
19 LOC101101878         AATK              apoptosis associated tyrosine kinase
20 LOC101101888 LOC101101888                  GTPase IMAP family member 7-like
21 LOC101101889 LOC101101889                              hyaluronidase-4-like

So, my question...is there a way this can be fixed, or is this something to do with gene symbols needing to be signed off before incorporated, or something? If this can't be fixed, is there a better way that I can correct the annotation files manually? The above code is very slow and takes a number of hours.

R session info below.

Thank you!

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] data.table_1.14.0      rentrez_1.2.3          stringr_1.4.0         
 [4] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3   Biobase_2.48.0        
 [7] GenomicRanges_1.40.0   GenomeInfoDb_1.24.2    IRanges_2.22.2        
[10] S4Vectors_0.26.1       BiocGenerics_0.34.0    dplyr_1.0.4           
[13] readr_1.4.0           

loaded via a namespace (and not attached):
 [1] httr_1.4.2                  edgeR_3.30.3               
 [3] bit64_4.0.5                 jsonlite_1.7.2             
 [5] assertthat_0.2.1            askpass_1.1                
 [7] BiocFileCache_1.12.1        blob_1.2.1                 
 [9] GenomeInfoDbData_1.2.3      Rsamtools_2.4.0            
[11] yaml_2.2.1                  progress_1.2.2             
[13] pillar_1.5.0                RSQLite_2.2.3              
[15] lattice_0.20-41             limma_3.44.3               
[17] glue_1.4.2                  digest_0.6.27              
[19] XVector_0.28.0              colorspace_2.0-0           
[21] plyr_1.8.6                  htmltools_0.5.1.9000       
[23] Matrix_1.3-2                XML_3.99-0.5               
[25] pkgconfig_2.0.3             biomaRt_2.44.4             
[27] zlibbioc_1.34.0             purrr_0.3.4                
[29] scales_1.1.1                BiocParallel_1.22.0        
[31] tibble_3.0.6                openssl_1.4.3              
[33] ggplot2_3.3.3               generics_0.1.0             
[35] ellipsis_0.3.1              cachem_1.0.4               
[37] SummarizedExperiment_1.18.2 magrittr_2.0.1             
[39] crayon_1.4.1                memoise_2.0.0              
[41] evaluate_0.14               fansi_0.4.2                
[43] xml2_1.3.2                  tools_4.0.2                
[45] prettyunits_1.1.1           hms_1.0.0                  
[47] lifecycle_1.0.0             matrixStats_0.58.0         
[49] munsell_0.5.0               locfit_1.5-9.4             
[51] DelayedArray_0.14.1         Biostrings_2.56.0          
[53] compiler_4.0.2              tinytex_0.29               
[55] rlang_0.4.10                grid_4.0.2                 
[57] RCurl_1.98-1.2              rstudioapi_0.13            
[59] DRIMSeq_1.16.1              rappdirs_0.3.3             
[61] bitops_1.0-6                rmarkdown_2.7              
[63] gtable_0.3.0                DBI_1.1.1                  
[65] curl_4.3                    reshape2_1.4.4             
[67] R6_2.5.0                    GenomicAlignments_1.24.0   
[69] knitr_1.31                  rtracklayer_1.48.0         
[71] fastmap_1.1.0               bit_4.0.4                  
[73] utf8_1.1.4                  stringi_1.5.3              
[75] Rcpp_1.0.6                  vctrs_0.3.6                
[77] dbplyr_2.1.0                tidyselect_1.1.0           
[79] xfun_0.21
Annotation RNAseq • 242 views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

The TxDb objects are intended to provide genomic positions for genes/exons/transcripts/etc. They are not intended to provide other non-positional annotations like gene symbols. Those data come from OrgDb objects, which you can get from the AnnotationHub

> library(AnnotationHub)
> hub <- AnnotationHub()
> 
  |======================================================================| 100%

snapshotDate(): 2020-10-27

> query(hub, c("ovis","orgdb"))
AnnotationHub with 4 records
# snapshotDate(): 2020-10-27
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Ovis ovis, Ovis orientalis_aries, Ovis aries, Ovis ammon_aries
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH85594"]]' 

            title                              
  AH85594 | org.Ovis_ammon_aries.eg.sqlite     
  AH85595 | org.Ovis_aries.eg.sqlite           
  AH85596 | org.Ovis_orientalis_aries.eg.sqlite
  AH85597 | org.Ovis_ovis.eg.sqlite            
> z <- hub[["AH85595"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
> z
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Ovis aries
| SPECIES: Ovis aries
| CENTRALID: GID
| Taxonomy ID: 9940
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information
> length(keys(z))
[1] 34328
## How many have no symbol?
> sum(is.na(mapIds(z, keys(z), "ENTREZID", "GID")))
'select()' returned 1:1 mapping between keys and columns
[1] 0
#What do some of those symbols look like?
> select(z, head(keys(z)), "SYMBOL", "GID")
'select()' returned 1:1 mapping between keys and columns
     GID SYMBOL
1 442992  GLUT4
2 442993 UGT1A9
3 442996   GHRL
4 442997 CXCL10
5 442998   PYGM
6 442999   AQP1

The central ID for this OrgDb is listed as the NCBI GID, but it's actually the NCBI Gene ID (formerly known as the Entrez Gene ID). Your 'LOC' IDs are simply the NCBI Gene ID with a 'LOC' stuck on the front.

> zz <- select(z, keys(z), "SYMBOL")
'select()' returned 1:1 mapping between keys and columns

> head(zz[grep("^LOC", zz[,2]),])
       GID    SYMBOL
16  443015 LOC443015
31  443035 LOC443035
102 443144 LOC443144
110 443162 LOC443162
127 443187 LOC443187
128 443188 LOC443188

So if you need to map between gene locations that have a 'LOC' and actual Gene IDs, just strip the 'LOC' off.

0
Entering edit mode

Thanks James, that's great information. I didn't realise this fundamental difference and have modified my workflows to use OrgDb objects for annotation instead. But there's still lots of genes that have a GENENAME and no SYMBOL. I see the OrgDb object has aliases too, but the majority don't have any.

library(AnnotationHub)
hub <- AnnotationHub()
z <- hub[["AH85595"]]

zzz<- select(z, keys(z), c("SYMBOL","ALIAS","GENENAME")) %>% 
  dplyr::nest_by(GID, SYMBOL, GENENAME, .key="ALIASES")
nrow(zzz)  #34328 genes

LOC_zzz <- zzz[which(
  str_starts(zzz$SYMBOL, "LOC") == T),]

Examples of no SYMBOL but has GENENAME. Most have no ALIAS, others have one or more.

   GID       SYMBOL       GENENAME                                                         ALIASES
   <chr>     <chr>        <chr>                                                 <list<tibble[,1]>>
 1 100037664 LOC100037664 60S ribosomal protein L35a                                       [1 x 1]
 2 100037668 LOC100037668 40S ribosomal protein S15a                                       [1 x 1]
 3 100037688 LOC100037688 kappa light chain                                                [1 x 1]
 4 100037697 LOC100037697 galanin receptor 1                                               [1 x 1]
 5 100037698 LOC100037698 galanin receptor 2                                               [1 x 1]
 6 100037699 LOC100037699 galanin receptor 3                                               [1 x 1]
 7 100037700 LOC100037700 GALP                                                             [1 x 1]
 8 100101232 LOC100101232 glyceraldehyde-3-phosphate dehydrogenase                         [1 x 1]
 9 100101238 LOC100101238 regakine 1-like protein                                          [1 x 1]
10 100125610 LOC100125610 elongation factor 1-alpha 1                                      [1 x 1]
11 100134870 LOC100134870 beta-C globin                                                    [2 x 1]
12 100135684 LOC100135684 serum amyloid A3.2                                               [1 x 1]
13 100141295 LOC100141295 keratin, type II microfibrillar, component 5-like                [3 x 1]
14 100144429 LOC100144429 tryptase-2                                                       [1 x 1]
15 100144759 LOC100144759 T19 protein                                                      [1 x 1]
16 100188976 LOC100188976 BIIIB3 high-sulfur keratin pseudogene                            [1 x 1]
17 100191064 LOC100191064 antigen WC1.1                                                    [1 x 1]
18 100192427 LOC100192427 Fc gamma 2 receptor                                              [1 x 1]
19 100526780 LOC100526780 keratin 83                                                       [1 x 1]
20 100526781 LOC100526781 keratin, type I microfibrillar 48 kDa, component 8C-1            [2 x 1]

To quantify:

nrow(LOC_zzz)   #15541 with LOC/GID as SYMBOL = 45.3% total genes

LOC_zzz_unchar <- LOC_zzz[which(
  str_starts(LOC_zzz$GENENAME, "uncharacterized") == T),] 

nrow(LOC_zzz_unchar)   #3205 of these are uncharacterised = 20.6% of LOC/GID genes

nrow(LOC_zzz) - nrow(LOC_zzz_unchar) #12336 have GENENAMES but no SYMBOL

As you can see, nearly half of all the genes in that OrgDb have the LOC/GID for SYMBOL, but 79.4% of those have a GENENAME entry.

Is this a case of waiting for these genes to have symbols assigned officially, or similar, or is it something that I can do a work around for?

Having the aliases on hand is useful, so I can get them and incorporate as a list, or select the first one or something, but it's not really making a dent on the number that have a GENENAME but no SYMBOL.

library(tidyverse)
LOC_zzz <- LOC_zzz %>%
  rowwise() %>%
  mutate(num_aliases = nrow(ALIASES))

has_alias <- LOC_zzz[which(LOC_zzz$num_aliases>1),]
nrow(has_alias)   #690 of those with LOC/GUI for symbol have an alias

I ask because when we do RNA seq, there are always loads of genes that come up that have the LOC/GID. Then it's a laborious task of finding out what they all are, and some are really important genes. For instance:

> zzz[which(zzz$GENENAME=="cytochrome c"),]
# A tibble: 5 x 4
# Rowwise:  GID, SYMBOL, GENENAME
  GID       SYMBOL       GENENAME                ALIASES
  <chr>     <chr>        <chr>        <list<tibble[,1]>>
1 101107359 LOC101107359 cytochrome c            [1 x 1]
2 101107954 LOC101107954 cytochrome c            [3 x 1]
3 101117527 LOC101117527 cytochrome c            [1 x 1]
4 106990092 LOC106990092 cytochrome c            [1 x 1]
5 114108777 LOC114108777 cytochrome c            [1 x 1]

Thanks for having a look at this and humouring a fool! Every time I think I'm getting a grasp of it I find out I don't understand as much as I thought I did!

Edit:- Side question: the OrgDb doesn't seem to have transcript IDs either...how do you go about identifying differential splicing without these?

ADD REPLY
1
Entering edit mode

You are making the unwarranted assumption that the gene symbol for say Gene ID 100037644 isn't LOC100037644. But I'm here to tell you that it is. Or at least according to NCBI, which is where those data come from. And presumably NCBI is waiting for HUGO or whomever to come up with the officially designated gene symbol.

And to be clear, all of these packages are simply an attempt to make it easier for people to access available data. We don't do any annotating ourselves, but simply take what data we can get from various annotation services and repackage in a format that is more useful to our end users.

The OrgDb does have transcript names!

> select(z, head(keys(z)), "REFSEQ")
'select()' returned 1:many mapping between keys and columns
      GID         REFSEQ
1  442992           <NA>
2  442993 NP_001009189.1
3  442993 NM_001009189.1
4  442996 NP_001009721.1
5  442996 NM_001009721.3
6  442997 NP_001009191.1
7  442997 NM_001009191.1
8  442998 NP_001009192.1
9  442998 NM_001009192.2
10 442999 NP_001009194.1
11 442999 NM_001009194.1

Again, for those genes that NCBI says there is a transcript. If you use Ensembl instead you will almost certainly get a different answer.

Anyway, the NCBI GTF file should have what you need for transcript work.

> download.file("https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Ovis_aries/latest_assembly_versions/GCF_002742125.1_Oar_rambouillet_v1.0/GCF_002742125.1_Oar_rambouillet_v1.0_genomic.gtf.gz", "./Oar_rambouillet_v1.0_genomic.gtf.gz")
trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Ovis_aries/latest_assembly_versions/GCF_002742125.1_Oar_rambouillet_v1.0/GCF_002742125.1_Oar_rambouillet_v1.0_genomic.gtf.gz'
Content type 'application/x-gzip' length 14016343 bytes (13.4 MB)
downloaded 13.4 MB

> library(rtracklayer)
> zz <- import("Oar_rambouillet_v1.0_genomic.gtf.gz")

> library(GenomicFeatures)
> zzz <- makeTxDbFromGRanges(zz)

> transcriptsBy(zzz)
GRangesList object of length 28524:
$A1BG
GRanges object with 1 range and 2 metadata columns:
         seqnames            ranges strand |     tx_id        tx_name
            <Rle>         <IRanges>  <Rle> | <integer>    <character>
  [1] NC_040265.1 70903705-70908207      - |     34408 XM_027978951.1
  -------
  seqinfo: 395 sequences from an unspecified genome; no seqlengths

$A1CF
GRanges object with 3 ranges and 2 metadata columns:
         seqnames          ranges strand |     tx_id        tx_name
            <Rle>       <IRanges>  <Rle> | <integer>    <character>
  [1] NC_040273.1 9814449-9905829      + |     44757 XM_004020007.4
  [2] NC_040273.1 9814449-9905829      + |     44758 XM_004020008.4
  [3] NC_040273.1 9814451-9905829      + |     44759 XM_027960211.1
  -------
  seqinfo: 395 sequences from an unspecified genome; no seqlengths

$A2ML1
GRanges object with 4 ranges and 2 metadata columns:
         seqnames              ranges strand |     tx_id        tx_name
            <Rle>           <IRanges>  <Rle> | <integer>    <character>
  [1] NC_040254.1 221898659-221949146      - |     13777 XM_004006891.3
  [2] NC_040254.1 221898659-221949146      - |     13778 XM_012175470.2
  [3] NC_040254.1 221898659-221949146      - |     13779 XM_012175471.2
  [4] NC_040254.1 221898659-221949146      - |     13780 XM_012175472.3
  -------
  seqinfo: 395 sequences from an unspecified genome; no seqlengths

...
<28521 more elements>
>

There's probably other GTF files at NCBI you could use as well. I just grabbed the first one I saw.

ADD REPLY
0
Entering edit mode

Brilliant, thank you! Just really wanted to make sure what I was seeing was in fact the current state of play and not just me being a fool! I shall continue waiting until NCBI / HUGO / anyone else feels it's time for some attention to the sheep annotation.

Thank you for your time and the illuminating insight!

ADD REPLY

Login before adding your answer.

Traffic: 316 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6