Hi,
Just to clarify, PDict()
supports a variable length set of strings. It's just that you need to specify a "trusted band" when preprocessing the strings. The trusted band is a rectangular area obtained by clipping the strings on one side or the other (or on both sides) and used internally to build the Aho-Corasick tree. The user specifies the trusted band via the tb.start
and tb.end
arguments. Here is an example where the set of strings is not rectangular. It also has an N in it that is treated as an ambiguity (this works because the N is not in the trusted band):
library(BSgenome.Scerevisiae.UCSC.sacCer2)
dna <- c("GTAAGT", "ACAGGTNG", "CGTAAGC", "CACTTG")
pdict <- PDict(dna, tb.end=min(width(dna)))
m <- matchPDict(pdict, Scerevisiae$chrI, fixed=FALSE)
The thing to keep in mind is: the wider the trusted band is, the faster the algorithm will run. See ?PDict
for more information.
PDict
/matchPDict
was optimized towards big rectangular dictionaries of short DNA patterns. When the set of patterns is rectangular, then no pattern can be a substring of another pattern. This has some interesting consequences when you build the Aho-Corasick tree out of this rectangular set of patterns e.g. all the leaf nodes are at the same depth in the tree, and, when you walk the tree along the subject, you get a hit iff you reach a leaf node. So you don't need to map any intermediate nodes back to a pattern, only the leaf nodes (this mapping is needed in order to report the hits). This allows for some optimizations in the way nodes are represented in memory.
For another AC tree implementation in an R package, see match_ac()
in the Starr package.
Cheers,
H.
Hi,
The primary use case we wanted to support with this implementation of Aho-Corasick was fast alignment of a dictionary of short DNA patterns to a reference genome. We wanted to be able to do this with big dictionaries of tens or hundreds of millions patterns. In this case the Aho-Corasick tree is huge with hundreds of millions of nodes. One of the challenges of the current
PDict
/matchPDict
implementation was to come up with a representation of the tree (the PDict object) that would be as compact as possible in memory. In particular the C structures used to represent the nodes of the Aho-Corasick tree were designed from the very start to take advantage of the fact that the nucleotide alphabet is very small (only 4 letters). Also special care was taken to avoid reallocations when the tree grows (during construction, or later when using it to walk along a subject, in which case it keeps growing when failure links are added to it). The no-reallocation strategy also takes advantage of the fact that the alphabet has only 4 letters.All this to say that supporting amino acid sequences would require using different C structures for the tree. That would probably require a significant amount of work.
Cheers,
H.
Oops, I should have added a comment, not an answer. I have a decent standard C++ implementation of Aho Corasick which I use for peptide/protein lookup in my C++ programs. It's templated to work with different keyword types and alphabet sizes. It uses TR1's std::shared_ptr to store the keywords in the results. It's certainly not as efficient as the PDict implementation, but I've used it with hundreds of thousands of peptides and it doesn't consume too much memory. I don't know anything about implementing C++ into Bioconductor though. How much work would that be? Here's my implementation:
How much work would that be... for you or for me? ;-)
Integrating your C++ code to Biostrings would require to:
If you're not familiar with R package development, it will take some time to learn. The usual place to start for this is http://cran.fhcrc.org/doc/manuals/r-release/R-exts.html You would also need to learn a bit about the internals of the Biostrings package.
Probably easier for me to do but not a trivial task either because item 1 above would require a significant amount of modifications/additions to your code.
Plus let's not forget the usual final polish of the product i.e. update the documentation and add unit tests.
Unfortunately I don't have time to work on this at the moment, sorry. Patch welcome.
Thanks,
H.
I forgot to mention that you can also use
matchPDict()
without preprocessing. In that case it's equivalent to callingmatchPattern()
in a loop except that the returned object is an MIndex object and the loop happens at the C level. That means none of the restrictions discussed previously apply i.e. you can use it on a variable length AAStringSet object. Of course it's not as fast as using an Aho-Corasick tree but still pretty fast e.g. it takes about 20 sec on my laptop to match 200k short peptides to a subject of length 76k:Might be fast enough for your use case.
Note that in that case (i.e. no preprocessing) you can also allow a small number of mismatches and/or indels. For more information please see documentation of the
pdict
argument in?matchPDict
and also "D. USING A NON-PREPROCESSED DICTIONARY" in the examples section of that man page.Hope this helps,
H.