PDict restricts the strings to constant length, and possibly also to nucleotides. I'd like to use it (or something similar) for matching peptides to proteins. But I'd like to relax the length restriction, rather than needing 1 PDict for each peptide length. What kind of performance improvement does pmatch get for having that constant length restriction? Alternatively, are there any general AC trie implementations in another R package? My Google searching turned up nothing (nor for Commentz-Walters).
Just to clarify,
PDict() supports a variable length set of strings. It's just that you need to specify a "trusted band" when preprocessing the strings. The trusted band is a rectangular area obtained by clipping the strings on one side or the other (or on both sides) and used internally to build the Aho-Corasick tree. The user specifies the trusted band via the
tb.end arguments. Here is an example where the set of strings is not rectangular. It also has an N in it that is treated as an ambiguity (this works because the N is not in the trusted band):
library(BSgenome.Scerevisiae.UCSC.sacCer2) dna <- c("GTAAGT", "ACAGGTNG", "CGTAAGC", "CACTTG") pdict <- PDict(dna, tb.end=min(width(dna))) m <- matchPDict(pdict, Scerevisiae$chrI, fixed=FALSE)
The thing to keep in mind is: the wider the trusted band is, the faster the algorithm will run. See
?PDict for more information.
matchPDict was optimized towards big rectangular dictionaries of short DNA patterns. When the set of patterns is rectangular, then no pattern can be a substring of another pattern. This has some interesting consequences when you build the Aho-Corasick tree out of this rectangular set of patterns e.g. all the leaf nodes are at the same depth in the tree, and, when you walk the tree along the subject, you get a hit iff you reach a leaf node. So you don't need to map any intermediate nodes back to a pattern, only the leaf nodes (this mapping is needed in order to report the hits). This allows for some optimizations in the way nodes are represented in memory.
For another AC tree implementation in an R package, see
match_ac() in the Starr package.
I wasn't able to find an amino-acid-friendly existing solution and my parallelized `lapply(peptides, gregexpr, proteins)` approach was still not fast enough, so I learned Rcpp and implemented this myself. Now CRAN has a good general purpose multiple-keyword search package: https://cran.r-project.org/web/packages/AhoCorasickTrie/index.html
If I trust all the amino acids in my peptides, I can create a trusted band that is as long as my shortest peptide? Then I could split the dataset in half for short peptides (6 AAs trusted) and long peptides (about 15 AAs trusted). Unfortunately the nucleotide restriction seems hard-coded. How hard would it be to extend it for AAStringSet?
I will look at match_ac() as well. My Google skills really failed me there. I tried "R corasick" but not "bioconductor corasick".
edit: Starr is hard-coded for nucleotides as well. I guess I could reverse translate my peptides and proteins...but I don't think I'll go there. :)