Hi,
I have a problem using the org.Sc.sgd.db package and have questions regarding the validity of the information within the package.
The results seams to be inconsistent, since YAL068C_mRNA
describes PAU8 and YGL261C_mRNA == PAU11
. The transcripts name for PAU9 is YBL108C-A_mRNA
and the genename is YBL108C-A
, which is not in the database.
I crossreferenced the genes using the resources on https://www.yeastgenome.org/.
Would you be able the comment on this? Did I use a wrong query??
Thanks for any advice.
Best regards Felix
library(org.Sc.sgd.db)
head(keys(org.Sc.sgd.db,"ENSEMBL"))
#> [1] "YAL068C" "YGL261C" "YAL067W-A" "YAL067C" "YAL065C" "YAL064W-B"
select(org.Sc.sgd.db,"YAL068C",c("ENSEMBLTRANS","ENTREZID","GENENAME"),"ENSEMBL")
#> 'select()' returned 1:many mapping between keys and columns
#> ENSEMBL ENSEMBLTRANS ENTREZID GENENAME
#> 1 YAL068C YAL068C_mRNA 851229 PAU8
#> 2 YAL068C YGL261C_mRNA 851229 PAU8
#> 3 YAL068C YAL068C_mRNA 852163 PAU9
#> 4 YAL068C YGL261C_mRNA 852163 PAU9
#> 5 YAL068C YAL068C_mRNA 852630 PAU11
#> 6 YAL068C YGL261C_mRNA 852630 PAU11
select(org.Sc.sgd.db,"PAU9",c("ENSEMBLTRANS","ENTREZID","ENSEMBL"),"GENENAME")
#> 'select()' returned 1:many mapping between keys and columns
#> GENENAME ENSEMBLTRANS ENTREZID ENSEMBL
#> 1 PAU9 YAL068C_mRNA 852163 YAL068C
#> 2 PAU9 YAL068C_mRNA 852163 YGL261C
#> 3 PAU9 YGL261C_mRNA 852163 YAL068C
#> 4 PAU9 YGL261C_mRNA 852163 YGL261C
select(org.Sc.sgd.db,"YBL108C-A",c("ENSEMBLTRANS","ENTREZID","GENENAME"),"ENSEMBL")
#> Error in .testForValidKeys(x, keys, keytype, fks): None of the keys entered are valid keys for 'ENSEMBL'.
#> Please use the keys method to see a listing of valid arguments.
sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Germany.1252
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] org.Sc.sgd.db_3.10.0 AnnotationDbi_1.48.0 IRanges_2.20.2
#> [4] S4Vectors_0.24.3 Biobase_2.46.0 BiocGenerics_0.32.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.3 knitr_1.27 magrittr_1.5 bit_1.1-15.1
#> [5] rlang_0.4.2 stringr_1.4.0 blob_1.2.1 highr_0.8
#> [9] tools_3.6.2 xfun_0.12 DBI_1.1.0 htmltools_0.4.0
#> [13] yaml_2.2.0 bit64_0.9-7 digest_0.6.23 tibble_2.1.3
#> [17] crayon_1.3.4 vctrs_0.2.1 zeallot_0.1.0 memoise_1.1.0
#> [21] evaluate_0.14 RSQLite_2.2.0 rmarkdown_2.1 stringi_1.4.4
#> [25] pillar_1.4.3 compiler_3.6.2 backports_1.1.5 pkgconfig_2.0.3
That is definitly a possibility. However, this doesn't explain the broken relation between
PAU8
,PAU9
andPAU11
. This should not be the case in any event and it is really puzzling, how this could happen, since all transcripts stem from different chromosomes.I would really like to debug this, but I couldn't find the source for generating the
org.Sc
package's data, so I will have to usebiomaRt
resources for now.That's a different story. You could imagine that NCBI and EBI/EMBL completely agree on what is and isn't a gene, and which one is which, and what transcripts are part of each gene, but you would be gravely mistaken. As an example:
NCBI says there are three Gene IDs, one for each of PAU8, 9, and 11. And if we try to map to Ensembl IDs there are issues. We can try to go the other way:
You could try to figure out what the issues are, and why there is a disagreement between NCBI and EBI, but this is just one gene, and there are thousands of them. I take the alternative viewpoint, which is that the two annotation services disagree about lots of things, and I don't really care about that, but I do know that trying to map between them is a fool's errand, and it's much better to just pick one and stick with it for a given analysis.
Ok the whole thing becomes a bit clearer now. I haven't found any disagreements on the major website, for example for PAU9:
https://www.ncbi.nlm.nih.gov/gene/852163 https://www.ensembl.org/Saccharomycescerevisiae/Gene/Summary?g=YBL108C-A;r=II:7605-7733;t=YBL108C-AmRNA https://www.yeastgenome.org/locus/S000007592
However on Uniprot there is a discrepency:
https://www.uniprot.org/uniprot/Q3E770
Here all the ids are put in the same blender and the results mirrors the contents of
org.Sc.sgd.db
. Since I don't know, how the contents of theorg.Sc.sgd.db
package are created, it is quite unclear to me where the problem arise and which resource is to blame. In the end the name of the package isorg.Sc.sgd.db
, but the content does not reflect SGD information as much as I would like it to do. So I think thats it then.Thanks for the explanation and the time taken. Much appreciated.
This issue has less to do with the source of the information than trying to map things between annotation sources. NCBI says there are three Gene IDs, and they map to all the right things:
Which seems OK to me? You seem to think that Ensembl IDs are the same thing as SGD IDs, which isn't the case. It may be that Ensembl co-opted the SGD ORF IDs rather than generating their own, but that doesn't make Ensembl IDs SGD IDs. If you want the SGD data you have to ask for it, not Ensembl.
Thats what I thought, I would get. I was totally confused since the package name is
org.Sc.sgd.db
and has nothing about Ensembl in the name like fororg.Hs.eg.db
.I dug deeper into the reference manual yesterday evening: Two of the three links provided as data source are basically invalid.
ftp://genome-ftp.stanford.edu/pub
does not exist anymore,ftp://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/
only contains fasta files (which I think are not helpful mapping ids, but I haven't looked at them in detail), andftp://ftp.ncbi.nlm.nih.gov/gene/DATA
contains some data (which I haven't had the time to dig into). Without the actual code for importing the data, it is hard to guess, what is actually going on.Just to say that 'we' (Bioc core team) are looking into this a bit more thoroughly; superficially it's consistent with James' explanation, but like Felix I'd like to understand where the problem arises. I also consulted the help page and saw the dead / pretty incomprehensible links / templated description.
The code generating the underlying data bases is available in https://github.com/Bioconductor/BioconductorAnnotationPipeline but frankly is very 'complicated' to understand; the basics are in https://github.com/Bioconductor/BioconductorAnnotationPipeline/tree/master/annosrc/yeast/script . There is some additional packaging done in the AnnotationForge package (github).
I will check this code out and see if anything obvious comes to mind. Thanks for the link.
So here is a bit of good news:
The bad news:
ensembl*
uniprot
Sce
) have a problem. You can check yourself using the followin queries:SELECT * FROM ensembl WHERE _id IN (SELECT _id FROM ensembl GROUP BY _id HAVING COUNT(_id) > 1)
SELECT * FROM uniprot WHERE _id IN (SELECT _id FROM uniprot GROUP BY _id HAVING COUNT(_id) > 1)
(and so on)
Some of the non unique id have consistent problems accross the table, but some don't.
ensembl*
tables have a distinct problem, which differs from theuniprot
tableIn case of the uniprot table the overlaps can be directly followed up using the uniprot website: Example
https://www.uniprot.org/uniprot/?query=YDR316W-A
All the SGD ids associated with these entries overlap for the KEGG information. This overlapp ends up up in the uniprot table of
org.Sc.sgd.db
. Interestingly all problematic uniprot entries share a function, e.g. they are homologs and share a KEGG entry, and stem from stretches of uniprot ids, for example the largest stretchP0CX08-P0CY09
.So I think the problem is really on the ensembl/uniprot side of this annotation data.
~~~This cannot be fixed here I guess, but I also don't have the time to follow up on this with ensembl and uniprot, since they can always claim that the way this package was assembled is wrong.~~~
The queries writing the inconsistencies to the
org.Sc.sgd.db
package areINSERT INTO ensembl SELECT DISTINCT s._id, n.ensid FROM (SELECT * FROM NCBIsrc.ensembl INNER JOIN NCBIsrc.locus_tags USING(_id)) AS n, sgd as s WHERE s.systematic_name=n.locus_tag UNION SELECT DISTINCT s._id, n.ensid FROM (SELECT * FROM NCBIsrc.ensembl INNER JOIN NCBIsrc.sgd_ids USING(_id)) AS n, sgd as s WHERE s.sgd_id=n.sgd_id;
and
INSERT INTO uniprot SELECT DISTINCT s._id, n.uniprot_id FROM (SELECT * FROM NCBIsrc.uniprot INNER JOIN NCBIsrc.locus_tags USING(_id)) AS n, sgd as s WHERE s.systematic_name=n.locus_tag UNION SELECT DISTINCT s._id, n.uniprot_id FROM (SELECT * FROM NCBIsrc.uniprot INNER JOIN NCBIsrc.sgd_ids USING(_id)) AS n, sgd as s WHERE s.sgd_id=n.sgd_id;
in
https://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/master/annosrc/yeast/script/bindb2.sql
The queries causing the inconsistencies in the first place are probably not found in the yeast specific part of
BioconductorAnnotationPipeline
, but in the more generalhttps://github.com/Bioconductor/BioconductorAnnotationPipeline/blob/master/annosrc/organism_annotation/script/getdb.sh
in which theNCBIsrc.*
tables are created aschipsrc_yeastNCBI.sqlite
For now I will stick with the gff file of
sacCer3
/R64-2-1
as the source of truth. Ensembl and Uniprot data is definity broken for 1.5% of yeast genes.Just as a follow up from the Bioc core team, I have been able to trace this back to a biomaRt issue. When we parse the Ensembl downloaded data, we run a function from the biomaRt package that gets ensembl to NCBI mappings. After running this code is when we see the duplicates introduced. I did some digging around on biomaRt and found these similar duplicates which leads me to believe this is a biomaRt issue. I have reached out to biomaRt to point out this issue and hopefully they will correct it so that we can report the correct information in our package. Thank you for bringing this to our attention and I hope we can have it straightened out soon.
How does one track the progress on this issue? I'm interested in understanding how this is resolved, but also a timeline for an update to sacCer3/R64-3-1, which was released in 2021-04