I've noticed that there are a lot of genes (>1/3 of total) which have names and/or symbols in NCBI that are not in the reference annotation files. The oviAri4.ncbiRefSeq.gtf and GCF_002742125.1_Oar_rambouillet_v1.0_genomic.gtf annotation files both have the problem.
To demonstrate, and to calculate the extent of the problem, I have used the following code:
First I import the gtf file and find the total number of genes, and the total that have a LOC gene ID.
txdb <- makeTxDbFromGFF(file=gffFile, format=c("gtf"))
txdf <- AnnotationDbi::select(txdb, keys(txdb,"GENEID"),
"TXNAME", "GENEID")
no_of_genes <- nrow(unique(txdf[1]))
LOC_only <- txdf[1] %>%
filter(str_starts(GENEID, "LOC")) %>%
unique()
Then I query each against the NCBI database and extract the gene symbol ("name") and the gene name ("description").
for (i in 1:nrow(LOC_only)){
tryCatch({
e_name <-
entrez_summary(db="gene",
id=as.numeric(gsub("LOC","",
LOC_only$GENEID[i])))$name
e_description <-
entrez_summary(db="gene",
id=as.numeric(gsub("LOC","",
LOC_only$GENEID[i])))$description
LOC_only$symbol[i] <- e_name
LOC_only$description[i] <- e_description
}, error=function(e){})
}
Doing this I have discovered that a large proportion of the genes using a LOC ID have names and symbols on the NCBI database. I quantified this using the following code:
LOC_uncharacterized<- LOC_only %>%
filter(str_starts(description, "uncharacterized")) %>%
nrow()
LOC_with_name <- LOC_only %>%
filter(!str_starts(description, "uncharacterized")) %>%
nrow()
LOC_with_Symbol <- LOC_only %>%
filter(!str_starts(symbol, "LOC")) %>%
nrow()
LOC_with_name_no_symbol <- LOC_only %>%
filter(str_starts(symbol, "LOC")) %>%
filter(!str_starts(description, "uncharacterized")) %>%
nrow()
The results were:
For the oviAri4.ncbiRefSeq.gtf:
Out of the 24478 genes in the annotation, 8462 only have a LOC symbol = 8462/24478 = 34.6%.
3778 were uncharacterized = 3778/8462 = 44.6%. 4654 had a description containing a name = 4654/8462 = 55.0%. 526 also had an NCBI symbol = 526/8462 = 6.2%. 4128 had a description containing a name but no symbol = 4128/8462 = 48.8%.
For the Oar_rambouillet_v1.0.gtf:
Out of the 28524 genes in the annotation, 10441 only have a LOC symbol = 10441/28524 = 36.6%.
3153 were uncharacterized = 3153/10441 = 30.1%. 7288 had a description containing a name = 7288/10441 = 69.8%. 15 also had an NCBI symbol = 15/10441 = 0.1%. 7273 had a description containing a name but no symbol = 7273/10441 = 69.7%.
As you can see, over 6% of the genes in oviAri4.ncbiRefSeq.gtf that have a LOC ID actually have a gene symbol in NCBI. This is only 0.1% for Oar_rambouillet_v1.0.gtf, which is not too bad. However, of the genes that have a LOC ID, 55% in oviAri4.ncbiRefSeq.gtf and nearly 70% in Oar_rambouillet_v1.0.gtf have names in NCBI but no symbol.
Examples:
> head(LOC_only,20)
GENEID symbol description
1 LOC100037664 LOC100037664 60S ribosomal protein L35a
2 LOC100037668 LOC100037668 40S ribosomal protein S15a
3 LOC100101238 LOC100101238 regakine 1-like protein
4 LOC100125610 LOC100125610 elongation factor 1-alpha 1
5 LOC100141295 LOC100141295 keratin, type II microfibrillar, component 5-like
6 LOC100144429 LOC100144429 tryptase-2
7 LOC100191064 LOC100191064 antigen WC1.1
8 LOC100192427 LOC100192427 Fc gamma 2 receptor
9 LOC100526780 LOC100526780 keratin 83
10 LOC100526782 LOC100526782 keratin 33B
11 LOC100846990 LOC100846990 N2
12 LOC101101846 LOC101101846 pregnancy-associated glycoprotein 1-like
13 LOC101101848 LOC101101848 olfactory receptor 10S1
14 LOC101101856 LOC101101856 ceramide synthase 5 pseudogene
15 LOC101101861 LOC101101861 olfactory receptor 2AE1
16 LOC101101862 LOC101101862 uncharacterized LOC101101862
18 LOC101101875 LOC101101875 histone-lysine N-methyltransferase PRDM9-like
19 LOC101101878 AATK apoptosis associated tyrosine kinase
20 LOC101101888 LOC101101888 GTPase IMAP family member 7-like
21 LOC101101889 LOC101101889 hyaluronidase-4-like
So, my question...is there a way this can be fixed, or is this something to do with gene symbols needing to be signed off before incorporated, or something? If this can't be fixed, is there a better way that I can correct the annotation files manually? The above code is very slow and takes a number of hours.
R session info below.
Thank you!
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] data.table_1.14.0 rentrez_1.2.3 stringr_1.4.0
[4] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3 Biobase_2.48.0
[7] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2 IRanges_2.22.2
[10] S4Vectors_0.26.1 BiocGenerics_0.34.0 dplyr_1.0.4
[13] readr_1.4.0
loaded via a namespace (and not attached):
[1] httr_1.4.2 edgeR_3.30.3
[3] bit64_4.0.5 jsonlite_1.7.2
[5] assertthat_0.2.1 askpass_1.1
[7] BiocFileCache_1.12.1 blob_1.2.1
[9] GenomeInfoDbData_1.2.3 Rsamtools_2.4.0
[11] yaml_2.2.1 progress_1.2.2
[13] pillar_1.5.0 RSQLite_2.2.3
[15] lattice_0.20-41 limma_3.44.3
[17] glue_1.4.2 digest_0.6.27
[19] XVector_0.28.0 colorspace_2.0-0
[21] plyr_1.8.6 htmltools_0.5.1.9000
[23] Matrix_1.3-2 XML_3.99-0.5
[25] pkgconfig_2.0.3 biomaRt_2.44.4
[27] zlibbioc_1.34.0 purrr_0.3.4
[29] scales_1.1.1 BiocParallel_1.22.0
[31] tibble_3.0.6 openssl_1.4.3
[33] ggplot2_3.3.3 generics_0.1.0
[35] ellipsis_0.3.1 cachem_1.0.4
[37] SummarizedExperiment_1.18.2 magrittr_2.0.1
[39] crayon_1.4.1 memoise_2.0.0
[41] evaluate_0.14 fansi_0.4.2
[43] xml2_1.3.2 tools_4.0.2
[45] prettyunits_1.1.1 hms_1.0.0
[47] lifecycle_1.0.0 matrixStats_0.58.0
[49] munsell_0.5.0 locfit_1.5-9.4
[51] DelayedArray_0.14.1 Biostrings_2.56.0
[53] compiler_4.0.2 tinytex_0.29
[55] rlang_0.4.10 grid_4.0.2
[57] RCurl_1.98-1.2 rstudioapi_0.13
[59] DRIMSeq_1.16.1 rappdirs_0.3.3
[61] bitops_1.0-6 rmarkdown_2.7
[63] gtable_0.3.0 DBI_1.1.1
[65] curl_4.3 reshape2_1.4.4
[67] R6_2.5.0 GenomicAlignments_1.24.0
[69] knitr_1.31 rtracklayer_1.48.0
[71] fastmap_1.1.0 bit_4.0.4
[73] utf8_1.1.4 stringi_1.5.3
[75] Rcpp_1.0.6 vctrs_0.3.6
[77] dbplyr_2.1.0 tidyselect_1.1.0
[79] xfun_0.21
Thanks James, that's great information. I didn't realise this fundamental difference and have modified my workflows to use
OrgDb
objects for annotation instead. But there's still lots of genes that have aGENENAME
and noSYMBOL
. I see theOrgDb
object has aliases too, but the majority don't have any.Examples of no
SYMBOL
but hasGENENAME
. Most have noALIAS
, others have one or more.To quantify:
As you can see, nearly half of all the genes in that
OrgDb
have the LOC/GID forSYMBOL
, but 79.4% of those have aGENENAME
entry.Is this a case of waiting for these genes to have symbols assigned officially, or similar, or is it something that I can do a work around for?
Having the aliases on hand is useful, so I can get them and incorporate as a list, or select the first one or something, but it's not really making a dent on the number that have a
GENENAME
but noSYMBOL
.I ask because when we do RNA seq, there are always loads of genes that come up that have the LOC/GID. Then it's a laborious task of finding out what they all are, and some are really important genes. For instance:
Thanks for having a look at this and humouring a fool! Every time I think I'm getting a grasp of it I find out I don't understand as much as I thought I did!
Edit:- Side question: the
OrgDb
doesn't seem to have transcript IDs either...how do you go about identifying differential splicing without these?You are making the unwarranted assumption that the gene symbol for say Gene ID 100037644 isn't LOC100037644. But I'm here to tell you that it is. Or at least according to NCBI, which is where those data come from. And presumably NCBI is waiting for HUGO or whomever to come up with the officially designated gene symbol.
And to be clear, all of these packages are simply an attempt to make it easier for people to access available data. We don't do any annotating ourselves, but simply take what data we can get from various annotation services and repackage in a format that is more useful to our end users.
The
OrgDb
does have transcript names!Again, for those genes that NCBI says there is a transcript. If you use Ensembl instead you will almost certainly get a different answer.
Anyway, the NCBI GTF file should have what you need for transcript work.
There's probably other GTF files at NCBI you could use as well. I just grabbed the first one I saw.
Brilliant, thank you! Just really wanted to make sure what I was seeing was in fact the current state of play and not just me being a fool! I shall continue waiting until NCBI / HUGO / anyone else feels it's time for some attention to the sheep annotation.
Thank you for your time and the illuminating insight!
Hi Chris and James, thanks so much for the valuable discussion on this issue! I am dealing with scRNA-seq for rainbow trout and having the same issue here. I got a data frame of gene symbol annotation from AnnotationHub according to your discussion, my questeion is how can I replace each of the LOCxxx in my seurat object with the gene symbol according to the data I got from AnnotationHub. (although there are only some of the gene ID have correspoding gene sybmols, but it's helpful to have whatever is available out there). Thanks very much for your help!
Here is what I have in my seurat object:
And then I have got some gene symbols from AnnotationHub (it's a data.frame, and I screenshot a small part of it):
Please don't piggyback on old posts! If you have a question, please ask in a new thread.
LOC values are NCBI gene symbols (which in turn are just LOC followed by the NCBI Gene ID). What you show from AnnotationHub are Ensembl Gene IDs, and it's not worth the time to try to map between NCBI and Ensembl. If you have aligned your data using gene locations from NCBI, then you should stick with NCBI for the gene mappings.
Also, just a quick check for two of those IDs at NCBI indicates that they have been replaced by the actual NCBI Gene ID and have symbols. The O. mykiss genome is getting pretty good. But they unfortunately don't always just strip off the LOC from the Gene ID when they update. Because of that, mapping from those LOC symbols might be a bit daunting, and it might just be easier to regenerate the counts/gene using an updated GTF (or update however you did the original alignment/gene counting).
Hi Dr. MacDonald, thanks very much for your reply! I thought my question is very related to this post, so I piggybacked on here, but will try to avoid that in the future.
The last column of the table I got from AnnotationHub is actually the NCBI gene ID (formerly Entrez ID), without LOC. I tried using the most updated genome assembly USDA_OmykA_1.1 (https://www.ncbi.nlm.nih.gov/assembly/GCF_013265735.2/#/def_asm_Primary_Assembly) and annotation gtf file GCF_013265735.2_genomic.gtf (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/265/735/GCF_013265735.2_USDA_OmykA_1.1/GCF_013265735.2_USDA_OmykA_1.1_genomic.gtf.gz) from NCBI, annotation has been improved, but still found a big proportion of genes noted as "LOCxxxx".
If you search, e.g., LOC110532027 in NCBI website you will get the gene discripted as "mitochondrial import receptor subunit TOM70", however, in gtf file gene_id is still the "LOC110532027". On the other hand, if I search "entrezid = 110532027" in the data.frame I got from AnnotationHub, there is a corresponding gene_name "tomm70a", see screenshot below:
so there are some gene IDs, which are originally "LOCxxx", can be replaced to be gene symbol base on AnnotationHub data. I just don't know how to achieve that. I can also redo the alignment and counting if you have any idea of replacing the LOC with gene symbol in the gtf file instead of in the seurat object. I also have the list of all genes as a tsv file (as a part of alignment and counting outputs), if I can replace LOCs with gene symbols in there, it's also can be very helpful.
Thanks again!
The most updated gene names should come from that GTF file.
And that last bit is the problem. There are still plenty of LOC Gene IDs, but for those that have been updated, some have changed. For example, LOC110523991 got changed to zbtb11, which is Gene ID 110524418. And it's hard to track that down
But if you used the
TxDb
that I just generated to get the counts/gene, you wouldn't have to do that mapping any more because the counts for zbtb11 would be assigned to ztbt11 instead of LOC110523991. So it is probably easier to just go back and regenerate the counts.Thanks for your reply again!
The example of LOC110523991 came from my first post, which used the old annotation, but I can see the zbtb11 gene in my counts/gene matrix generted by the updated annotation OmykA_1.1/GCF_013265735.2, the same situation of the TxDb code you gave, you can see the zbtb11 gene name, so I guess using the TxDb to get the counts/gene will generate the exactly the same results since both are using OmykA_1.1/GCF_013265735.2 gtf file, make sense?
Maybe the best solution would be somebody can generate a new version of annotation that has all the current best available knowledge of the gene symbols as there are still substantial genes only labelled LOC (although we already know what genes are they). Data from AnnotationHub have slightly more genes with known gene symbols (as the example of 110532027 corresponding to tomm70a), but mapping that down to OmykA_1.1/GCF_013265735.2 can only solve a small part of the issue, let alone I don't have any idea to acheive that with code.
Anyway, thanks so much for the discussion help with me.