Question: DNAString: Standard checksum function?
0
3.6 years ago by
United States
Henrik Bengtsson2.4k wrote:

Hi,

in a reverse-engineering task, I'd like to compare the content of two FASTA files (for which I cannot trace the origin).  When comparing the FASTA index files, the sequence/chromosome names as well as the lengths of the individual sequences agree, although in a different order.  Next I'd like to compare the actual sequences.  I could read these in using the FaFile class of Rsamtools and do pairwise comparison in memory.

However, if I had to redo the same task in the future I figured that why not calculate checksums of the sequences and store those on file (imagine an enhanced FASTA index file containing also the checksums per sequence).  Then one could quite quickly load and compare checksums instead of having to parse the whole FASTA file.

To my question:

I cannot be the first one doing this, so is there a standard/preferred checksum algorithm I should use for genomic sequences?  Is there even a method for this that I'm not aware of?  If I would implement this myself, I would probably go with md5 on toupper(as.character(seq)), but I prefer to use a de facto standard if that already exist.

Thxs

sequencing biostrings fasta • 1.1k views
modified 3.6 years ago • written 3.6 years ago by Henrik Bengtsson2.4k
0
3.6 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Henrik,

MD5 sounds good. I guess any decent checksum algo would probably do. Note that if seq is a DNAString object, you don't need to pass the result of as.character(seq)) thru toupper().

Maybe I should add something like this to Biostrings, but that means I would need to commit to whatever checksum algo I choose so people in the future can rely on the checksums they've calculated (and saved) in the past. I guess the reason I didn't add this to Biostrings so far is to avoid this kind of commitment.

H.

0
3.6 years ago by
United States
Henrik Bengtsson2.4k wrote:

I found the following presentation:

Bassi, Sebastian and Gonzalez, Virginia. New checksum functions for Biopython. Available from Nature Precedings, 2007

Abstract: Checksum algorithms are used in biological databases for integrity check and identification purposes. CRC64 is the only checksum algorithm already included in Biopython. This work proposes two new implementation of known algorithms (GCG Checksum and SEGUID). There is also an application based on SEGUID: Looking for redundancy between two FASTA files full of protein sequences based only in sequence information, by comparing the SEGUIDs of both files.  The code is shown in the manuscript and may be available at Biopython.org.

To summarize, they mention the following checksum algorithms:

1. CRC64: Proteins in Uniprot.
2. GCG-Checksum: DNA and Protein sequences in the file format of GCG and compatible programs.
3. SEGUID: “A SEquence Globally Unique IDentifier” Proteome Database

If you read the slides you find that the first two are not strong enough, i.e. two different sequences can get identical checksums.  The SEGUID looks very promising:

“We propose the use of a unique sequence identifier (SEGUID) that is derived from the primary sequence itself and easily generated by any user. SEGUIDs are resilient to changes in public and private databases as they remain constant throughout the lifetime of a given protein sequence. The SEGUID Proteome Database (http://bioinformatics.anl.gov/seguid/ ) provides aliases for the annotated entries available from several public databases and can be downloaded or generated easily at remote sites. SEGUIDs have been used in our proteomics laboratory for years and proved to be useful integrating mass spectrometry results, two-dimensional gelelectrophoresis data, and bioinformatics information”

Source: SEGUID: Overview. http://bioinformatics.anl.gov/seguid/overview.aspx (broken URL; Most recent Web Archive version: http://web.archive.org/web/20130214121710/http://bioinformatics.anl.gov/seguid/overview.aspx)

There is also a reference to a 2006 Proteomics article (http://dx.doi.org/10.1002/pmic.200600032), which is behind a paywall, and that I can't be bothered to read.

In other words, it might be that SEGUID is a better checksum algorithm for genomic sequences than a generic algorithms such as MD5.  The fact that Biopython has decided to implement may help the decision and to do initial validation.

All finite-length hash functions have collisions. The longer the checksum, the less likely it is to generate collisions. But long checksums are also more expensive to generate, store, and compare. So it's a question of compromise.

It seems that SEGUID was primarily designed for protein sequences. IIUC there still seems to be a lack of consensus about which checksum the Bioinformatics community as a whole should adopt for DNA sequences. Right now different DNA sequence databases use different sequence IDs but it would be great if they all agreed on an ID that is unique across databases.

A big advantage of md5 is that it's already available via the digest package.

H.

I would favor long enough checksums to minimize the risk for collisions; the cost for troubleshooting when there are collisions is much higher.  It's fairly simple to memoize/cache results to file.

> Right now different DNA sequence databases use different sequence IDs but it would be great if they all agreed on an ID that is unique across databases.

At least, should we try to gather what the common ones are?

> A big advantage of md5 is that it's already available via the digest package.

Agree about MD5.  SEGUID and others could be kept on the todo list.  Also, when using digest it's fairly easy to select among its supported checksum algorithms; currently md5, sha1, crc32, sha256, sha512, xxhash32, xxhash64, murmur32.

However, it would be helpful to define exactly what should be checksum:ed, e.g.

• upper or lower case characters/symbols?
• the symbols as represented in ASCII?
• ...?