I found the following presentation:
Bassi, Sebastian and Gonzalez, Virginia. New checksum functions for Biopython. Available from Nature Precedings, 2007
Abstract: Checksum algorithms are used in biological databases for integrity check and identification purposes. CRC64 is the only checksum algorithm already included in Biopython. This work proposes two new implementation of known algorithms (GCG Checksum and SEGUID). There is also an application based on SEGUID: Looking for redundancy between two FASTA files full of protein sequences based only in sequence information, by comparing the SEGUIDs of both files. The code is shown in the manuscript and may be available at Biopython.org.
Download presentation: http://dx.doi.org/10.1038/npre.2007.278.1 (PDF/PPT without paywall)
To summarize, they mention the following checksum algorithms:
- CRC64: used for proteins in UniProt.
- GCG checksum: used for DNA and protein sequences in the file format of GCG and compatible programs.
- SEGUID ("A SEquence Globally Unique IDentifier"): used by the SEGUID Proteome Database.
According to the slides, the first two are not strong enough, i.e. two different sequences can end up with identical checksums. SEGUID looks very promising:
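To make the collision problem concrete, here is a minimal sketch of the GCG checksum as I understand it (position weights cycling from 1 to 57, weighted sum of character codes, taken modulo 10000). The function name `gcg_checksum` is mine, and the algorithm details are my assumption rather than something taken from the slides. Because the result is always between 0 and 9999, collisions are unavoidable, and a two-letter DNA example already shows one:

```python
def gcg_checksum(sequence: str) -> int:
    """GCG-style checksum: weighted character sum, modulo 10000.

    Assumed algorithm: the weight of each position cycles 1, 2, ..., 57, 1, ...
    """
    total = 0
    for i, char in enumerate(sequence.upper()):
        weight = i % 57 + 1  # weights restart after every 57 residues
        total += weight * ord(char)
    return total % 10000


# Two different dinucleotides with the same checksum:
# "GA" -> 1*71 + 2*65 = 201 and "CC" -> 1*67 + 2*67 = 201
print(gcg_checksum("GA"), gcg_checksum("CC"))  # prints: 201 201
```

With only 10000 possible values, such a checksum is fine for spotting accidental file corruption but not for identifying sequences.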
“We propose the use of a unique sequence identifier (SEGUID) that is derived from the primary sequence itself and easily generated by any user. SEGUIDs are resilient to changes in public and private databases as they remain constant throughout the lifetime of a given protein sequence. The SEGUID Proteome Database (http://bioinformatics.anl.gov/seguid/) provides aliases for the annotated entries available from several public databases and can be downloaded or generated easily at remote sites. SEGUIDs have been used in our proteomics laboratory for years and proved to be useful integrating mass spectrometry results, two-dimensional gel electrophoresis data, and bioinformatics information”
Source: SEGUID: Overview. http://bioinformatics.anl.gov/seguid/overview.aspx (broken URL; most recent Web Archive version: http://web.archive.org/web/20130214121710/http://bioinformatics.anl.gov/seguid/overview.aspx)
There is also a reference to a 2006 Proteomics article (http://dx.doi.org/10.1002/pmic.200600032), which is behind a paywall and which I have not read.
In other words, it might be that SEGUID is a better checksum algorithm for genomic sequences than a generic algorithm such as MD5. The fact that Biopython has decided to implement it may help with the decision and with initial validation.
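For a quick experiment, here is a minimal standard-library sketch of a SEGUID as I understand it: the SHA-1 digest of the upper-cased sequence, Base64-encoded, with the trailing `=` padding removed. The helper `shared_sequences` and the toy records are hypothetical and only stand in for two parsed FASTA files; they mimic the redundancy check described in the abstract.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """SHA-1 of the upper-cased sequence, Base64-encoded, '=' padding removed."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def shared_sequences(records_a: dict, records_b: dict) -> dict:
    """Map each SEGUID found in both inputs to (name_in_a, name_in_b)."""
    index_a = {seguid(seq): name for name, seq in records_a.items()}
    shared = {}
    for name, seq in records_b.items():
        key = seguid(seq)
        if key in index_a:
            shared[key] = (index_a[key], name)
    return shared


# Hypothetical toy data standing in for two parsed FASTA files.
file_a = {"protA1": "MKTAYIAKQR", "protA2": "MSDNELLK"}
file_b = {"protB1": "mktayiakqr", "protB2": "MSDNELLQ"}

for key, (name_a, name_b) in shared_sequences(file_a, file_b).items():
    print(key, name_a, name_b)
```

Only the first pair is reported, since the comparison ignores case and sequence names. If Biopython is installed, `Bio.SeqUtils.CheckSum.seguid` should give the same strings and can be used to cross-check this sketch.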