Inconsistent data within org.Sc.sgd.db package
1
0
Entering edit mode
@felixgmernst-14764
Last seen 5 months ago
Germany

Hi,

I have a problem using the org.Sc.sgd.db package and have questions regarding the validity of the information within the package.

The results seams to be inconsistent, since YAL068C_mRNA describes PAU8 and YGL261C_mRNA == PAU11. The transcripts name for PAU9 is YBL108C-A_mRNA and the genename is YBL108C-A, which is not in the database.

I crossreferenced the genes using the resources on https://www.yeastgenome.org/.

Would you be able the comment on this? Did I use a wrong query??

Best regards Felix

library(org.Sc.sgd.db)
#> [1] "YAL068C"   "YGL261C"   "YAL067W-A" "YAL067C"   "YAL065C"   "YAL064W-B"
select(org.Sc.sgd.db,"YAL068C",c("ENSEMBLTRANS","ENTREZID","GENENAME"),"ENSEMBL")
#> 'select()' returned 1:many mapping between keys and columns
#>   ENSEMBL ENSEMBLTRANS ENTREZID GENENAME
#> 1 YAL068C YAL068C_mRNA   851229     PAU8
#> 2 YAL068C YGL261C_mRNA   851229     PAU8
#> 3 YAL068C YAL068C_mRNA   852163     PAU9
#> 4 YAL068C YGL261C_mRNA   852163     PAU9
#> 5 YAL068C YAL068C_mRNA   852630    PAU11
#> 6 YAL068C YGL261C_mRNA   852630    PAU11
select(org.Sc.sgd.db,"PAU9",c("ENSEMBLTRANS","ENTREZID","ENSEMBL"),"GENENAME")
#> 'select()' returned 1:many mapping between keys and columns
#>   GENENAME ENSEMBLTRANS ENTREZID ENSEMBL
#> 1     PAU9 YAL068C_mRNA   852163 YAL068C
#> 2     PAU9 YAL068C_mRNA   852163 YGL261C
#> 3     PAU9 YGL261C_mRNA   852163 YAL068C
#> 4     PAU9 YGL261C_mRNA   852163 YGL261C
select(org.Sc.sgd.db,"YBL108C-A",c("ENSEMBLTRANS","ENTREZID","GENENAME"),"ENSEMBL")
#> Error in .testForValidKeys(x, keys, keytype, fks): None of the keys entered are valid keys for 'ENSEMBL'.
#> Please use the keys method to see a listing of valid arguments.
sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Germany.1252
#>
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets
#> [8] methods   base
#>
#> other attached packages:
#> [1] org.Sc.sgd.db_3.10.0 AnnotationDbi_1.48.0 IRanges_2.20.2
#> [4] S4Vectors_0.24.3     Biobase_2.46.0       BiocGenerics_0.32.0
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.3      knitr_1.27      magrittr_1.5    bit_1.1-15.1
#>  [5] rlang_0.4.2     stringr_1.4.0   blob_1.2.1      highr_0.8
#>  [9] tools_3.6.2     xfun_0.12       DBI_1.1.0       htmltools_0.4.0
#> [13] yaml_2.2.0      bit64_0.9-7     digest_0.6.23   tibble_2.1.3
#> [17] crayon_1.3.4    vctrs_0.2.1     zeallot_0.1.0   memoise_1.1.0
#> [21] evaluate_0.14   RSQLite_2.2.0   rmarkdown_2.1   stringi_1.4.4
#> [25] pillar_1.4.3    compiler_3.6.2  backports_1.1.5 pkgconfig_2.0.3

org.Sc.sgd.db AnnotationDbi • 385 views
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

The data for that OrgDb package were downloaded in June and July last year, and will remain static until the next release in April this year. Because of that, and given the rapidity of changes that can occur at the annotation services (which update on much smaller time scales), it's possible to get out of sync with the current data.

I'm not saying that's what happened, but there's this at NCBI:

PAU9 seripauperin PAU9 [ Saccharomyces cerevisiae S288C ]
Gene ID: 852163, updated on 1-Nov-2019



And under the summary section:

Gene symbol
PAU9
Gene description
seripauperin PAU9
Primary source
SGD:S000007592
Locus tag
YBL108C-A


NCBI is the primary source for our annotations, and something was updated in November. So if the gene symbol Locus tag was updated by Yeast Genome after the data were downloaded, you will end up with different results from the OrgDb package as compared to what you find online.

This is a trade-off. It's not possible for us to regenerate OrgDb packages any time there is an update at one or more of the annotation services (which are weekly, in the case of NCBI, our primary source). In addition, people tend to like having stable annotations, and can rely on the fact that a given release of Bioconductor won't result in changes in annotation if they re-run later. But if you need completely current data, you could use biomaRt:

> library(biomaRt)
> mart <- useEnsembl("ensembl","scerevisiae_gene_ensembl", mirror = "useast")
> getBM(c("ensembl_gene_id","ensembl_transcript_id","external_gene_name"), "external_gene_name", "PAU9", mart)
ensembl_gene_id ensembl_transcript_id external_gene_name
1       YBL108C-A        YBL108C-A_mRNA               PAU9


0
Entering edit mode

That is definitly a possibility. However, this doesn't explain the broken relation between PAU8, PAU9 and PAU11. This should not be the case in any event and it is really puzzling, how this could happen, since all transcripts stem from different chromosomes.

I would really like to debug this, but I couldn't find the source for generating the org.Sc package's data, so I will have to use biomaRt resources for now.

0
Entering edit mode

That's a different story. You could imagine that NCBI and EBI/EMBL completely agree on what is and isn't a gene, and which one is which, and what transcripts are part of each gene, but you would be gravely mistaken. As an example:

> select(org.Sc.sgd.db, c("PAU8","PAU9","PAU11"), c("ENTREZID","GENENAME"), "GENENAME")
'select()' returned 1:1 mapping between keys and columns
GENENAME ENTREZID
1     PAU8   851229
2     PAU9   852163
3    PAU11   852630
> select(org.Sc.sgd.db, c("PAU8","PAU9","PAU11"), c("ENTREZID","GENENAME", "ENSEMBL"), "GENENAME")
'select()' returned 1:many mapping between keys and columns
GENENAME ENTREZID ENSEMBL
1     PAU8   851229 YAL068C
2     PAU8   851229 YGL261C
3     PAU9   852163 YAL068C
4     PAU9   852163 YGL261C
5    PAU11   852630 YAL068C
6    PAU11   852630 YGL261C
> select(org.Sc.sgd.db, c("PAU8","PAU9","PAU11"), c("ENTREZID","GENENAME", "ENSEMBL","ENSEMBLTRANS"), "GENENAME")
'select()' returned 1:many mapping between keys and columns
GENENAME ENTREZID ENSEMBL ENSEMBLTRANS
1      PAU8   851229 YAL068C YAL068C_mRNA
2      PAU8   851229 YAL068C YGL261C_mRNA
3      PAU8   851229 YGL261C YAL068C_mRNA
4      PAU8   851229 YGL261C YGL261C_mRNA
5      PAU9   852163 YAL068C YAL068C_mRNA
6      PAU9   852163 YAL068C YGL261C_mRNA
7      PAU9   852163 YGL261C YAL068C_mRNA
8      PAU9   852163 YGL261C YGL261C_mRNA
9     PAU11   852630 YAL068C YAL068C_mRNA
10    PAU11   852630 YAL068C YGL261C_mRNA
11    PAU11   852630 YGL261C YAL068C_mRNA
12    PAU11   852630 YGL261C YGL261C_mRNA



NCBI says there are three Gene IDs, one for each of PAU8, 9, and 11. And if we try to map to Ensembl IDs there are issues. We can try to go the other way:

> getBM(c("entrezgene_id","ensembl_gene_id","external_gene_name"), "external_gene_name", paste0("PAU", c(8,9,11)), mart)
entrezgene_id ensembl_gene_id external_gene_name
1            NA       YBL108C-A               PAU9
2        851229         YGL261C              PAU11
3        852630         YGL261C              PAU11
4        852163         YGL261C              PAU11
5        851229         YAL068C               PAU8
6        852630         YAL068C               PAU8
7        852163         YAL068C               PAU8


You could try to figure out what the issues are, and why there is a disagreement between NCBI and EBI, but this is just one gene, and there are thousands of them. I take the alternative viewpoint, which is that the two annotation services disagree about lots of things, and I don't really care about that, but I do know that trying to map between them is a fool's errand, and it's much better to just pick one and stick with it for a given analysis.

0
Entering edit mode

Ok the whole thing becomes a bit clearer now. I haven't found any disagreements on the major website, for example for PAU9:

https://www.ncbi.nlm.nih.gov/gene/852163 https://www.ensembl.org/Saccharomycescerevisiae/Gene/Summary?g=YBL108C-A;r=II:7605-7733;t=YBL108C-AmRNA https://www.yeastgenome.org/locus/S000007592

However on Uniprot there is a discrepency:

https://www.uniprot.org/uniprot/Q3E770

Here all the ids are put in the same blender and the results mirrors the contents of org.Sc.sgd.db. Since I don't know, how the contents of the org.Sc.sgd.db package are created, it is quite unclear to me where the problem arise and which resource is to blame. In the end the name of the package is org.Sc.sgd.db, but the content does not reflect SGD information as much as I would like it to do. So I think thats it then.

Thanks for the explanation and the time taken. Much appreciated.

0
Entering edit mode

This issue has less to do with the source of the information than trying to map things between annotation sources. NCBI says there are three Gene IDs, and they map to all the right things:

> select(org.Sc.sgd.db, c("PAU8","PAU9","PAU11"), c("ENTREZID","GENENAME", "ORF"), "GENENAME")
'select()' returned 1:1 mapping between keys and columns
GENENAME ENTREZID       ORF        SGD
1     PAU8   851229   YAL068C S000002142
2     PAU9   852163 YBL108C-A S000007592
3    PAU11   852630   YGL261C S000003230


Which seems OK to me? You seem to think that Ensembl IDs are the same thing as SGD IDs, which isn't the case. It may be that Ensembl co-opted the SGD ORF IDs rather than generating their own, but that doesn't make Ensembl IDs SGD IDs. If you want the SGD data you have to ask for it, not Ensembl.

0
Entering edit mode

Thats what I thought, I would get. I was totally confused since the package name is org.Sc.sgd.db and has nothing about Ensembl in the name like for org.Hs.eg.db.

I dug deeper into the reference manual yesterday evening: Two of the three links provided as data source are basically invalid. ftp://genome-ftp.stanford.edu/pub does not exist anymore, ftp://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/ only contains fasta files (which I think are not helpful mapping ids, but I haven't looked at them in detail), and ftp://ftp.ncbi.nlm.nih.gov/gene/DATA contains some data (which I haven't had the time to dig into). Without the actual code for importing the data, it is hard to guess, what is actually going on.

1
Entering edit mode

Just to say that 'we' (Bioc core team) are looking into this a bit more thoroughly; superficially it's consistent with James' explanation, but like Felix I'd like to understand where the problem arises. I also consulted the help page and saw the dead / pretty incomprehensible links / templated description.

The code generating the underlying data bases is available in https://github.com/Bioconductor/BioconductorAnnotationPipeline but frankly is very 'complicated' to understand; the basics are in https://github.com/Bioconductor/BioconductorAnnotationPipeline/tree/master/annosrc/yeast/script . There is some additional packaging done in the AnnotationForge package (github).

0
Entering edit mode

I will check this code out and see if anything obvious comes to mind. Thanks for the link.

0
Entering edit mode

So here is a bit of good news:

1. I still know how to write SQL statements (sort of) :)
2. HeidiSQL now works with sqlite in the latest builds, which makes debugging very straight forward
3. it is not a general problem

1. the following tables definitly have a problem:

ensembl*

uniprot

1. jugdging from duplicate ids as indicator of broken data, it appears that a few genes (~100 / 1.5% of all genes in Sce) have a problem. You can check yourself using the followin queries:

SELECT * FROM ensembl WHERE _id IN (SELECT _id FROM ensembl GROUP BY _id HAVING COUNT(_id) > 1)

SELECT * FROM uniprot WHERE _id IN (SELECT _id FROM uniprot GROUP BY _id HAVING COUNT(_id) > 1)

(and so on)

1. Some of the non unique id have consistent problems accross the table, but some don't. ensembl* tables have a distinct problem, which differs from the uniprot table

2. In case of the uniprot table the overlaps can be directly followed up using the uniprot website: Example

https://www.uniprot.org/uniprot/?query=YDR316W-A

All the SGD ids associated with these entries overlap for the KEGG information. This overlapp ends up up in the uniprot table of org.Sc.sgd.db. Interestingly all problematic uniprot entries share a function, e.g. they are homologs and share a KEGG entry, and stem from stretches of uniprot ids, for example the largest stretch P0CX08-P0CY09.

So I think the problem is really on the ensembl/uniprot side of this annotation data.

~~~This cannot be fixed here I guess, but I also don't have the time to follow up on this with ensembl and uniprot, since they can always claim that the way this package was assembled is wrong.~~~

The queries writing the inconsistencies to the org.Sc.sgd.db package are

INSERT INTO ensembl SELECT DISTINCT s._id, n.ensid FROM (SELECT * FROM NCBIsrc.ensembl INNER JOIN NCBIsrc.locus_tags USING(_id)) AS n, sgd as s WHERE s.systematic_name=n.locus_tag UNION SELECT DISTINCT s._id, n.ensid FROM (SELECT * FROM NCBIsrc.ensembl INNER JOIN NCBIsrc.sgd_ids USING(_id)) AS n, sgd as s WHERE s.sgd_id=n.sgd_id;

and

INSERT INTO uniprot SELECT DISTINCT s._id, n.uniprot_id FROM (SELECT * FROM NCBIsrc.uniprot INNER JOIN NCBIsrc.locus_tags USING(_id)) AS n, sgd as s WHERE s.systematic_name=n.locus_tag UNION SELECT DISTINCT s._id, n.uniprot_id FROM (SELECT * FROM NCBIsrc.uniprot INNER JOIN NCBIsrc.sgd_ids USING(_id)) AS n, sgd as s WHERE s.sgd_id=n.sgd_id;

in https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/master/annosrc/yeast/script/bindb2.sql

The queries causing the inconsistencies in the first place are probably not found in the yeast specific part of BioconductorAnnotationPipeline, but in the more general https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/master/annosrc/organism_annotation/script/getdb.sh in which the NCBIsrc.* tables are created as chipsrc_yeastNCBI.sqlite

For now I will stick with the gff file of sacCer3/R64-2-1 as the source of truth. Ensembl and Uniprot data is definity broken for 1.5% of yeast genes.

1
Entering edit mode

Just as a follow up from the Bioc core team, I have been able to trace this back to a biomaRt issue. When we parse the Ensembl downloaded data, we run a function from the biomaRt package that gets ensembl to NCBI mappings. After running this code is when we see the duplicates introduced. I did some digging around on biomaRt and found these similar duplicates which leads me to believe this is a biomaRt issue. I have reached out to biomaRt to point out this issue and hopefully they will correct it so that we can report the correct information in our package. Thank you for bringing this to our attention and I hope we can have it straightened out soon.