Multiple (variable length) string searching in large text (e.g. Aho-Corasick)?
3
0
Entering edit mode
@mattchambers42-10186
Last seen 7.6 years ago

PDict restricts the strings to constant length, and possibly also to nucleotides. I'd like to use it (or something similar) for matching peptides to proteins. But I'd like to relax the length restriction, rather than needing 1 PDict for each peptide length. What kind of performance improvement does pmatch get for having that constant length restriction? Alternatively, are there any general AC trie implementations in another R package? My Google searching turned up nothing (nor for Commentz-Walters).

Biostrings • 1.7k views
ADD COMMENT
2
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

Hi,

Just to clarify, PDict() supports a variable length set of strings. It's just that you need to specify a "trusted band" when preprocessing the strings. The trusted band is a rectangular area obtained by clipping the strings on one side or the other (or on both sides) and used internally to build the Aho-Corasick tree. The user specifies the trusted band via the tb.start and tb.end arguments. Here is an example where the set of strings is not rectangular. It also has an N in it that is treated as an ambiguity (this works because the N is not in the trusted band):

library(BSgenome.Scerevisiae.UCSC.sacCer2)
dna <- c("GTAAGT", "ACAGGTNG", "CGTAAGC", "CACTTG")
pdict <- PDict(dna, tb.end=min(width(dna)))
m <- matchPDict(pdict, Scerevisiae$chrI, fixed=FALSE)

The thing to keep in mind is: the wider the trusted band is, the faster the algorithm will run. See ?PDict for more information.

PDict/matchPDict was optimized towards big rectangular dictionaries of short DNA patterns. When the set of patterns is rectangular, then no pattern can be a substring of another pattern. This has some interesting consequences when you build the Aho-Corasick tree out of this rectangular set of patterns e.g. all the leaf nodes are at the same depth in the tree, and, when you walk the tree along the subject, you get a hit iff you reach a leaf node. So you don't need to map any intermediate nodes back to a pattern, only the leaf nodes (this mapping is needed in order to report the hits). This allows for some optimizations in the way nodes are represented in memory.

For another AC tree implementation in an R package, see match_ac() in the Starr package.

Cheers,

H.

ADD COMMENT
2
Entering edit mode
@mattchambers42-10186
Last seen 7.6 years ago

I wasn't able to find an amino-acid-friendly existing solution and my parallelized `lapply(peptides, gregexpr, proteins)` approach was still not fast enough, so I learned Rcpp and implemented this myself. Now CRAN has a good general purpose multiple-keyword search package: https://cran.r-project.org/web/packages/AhoCorasickTrie/index.html

ADD COMMENT
0
Entering edit mode
@mattchambers42-10186
Last seen 7.6 years ago

If I trust all the amino acids in my peptides, I can create a trusted band that is as long as my shortest peptide? Then I could split the dataset in half for short peptides (6 AAs trusted) and long peptides (about 15 AAs trusted). Unfortunately the nucleotide restriction seems hard-coded. How hard would it be to extend it for AAStringSet?

I will look at match_ac() as well. My Google skills really failed me there. I tried "R corasick" but not "bioconductor corasick". 

edit: Starr is hard-coded for nucleotides as well. I guess I could reverse translate my peptides and proteins...but I don't think I'll go there. :)

ADD COMMENT
0
Entering edit mode

Hi,

The primary use case we wanted to support with this implementation of Aho-Corasick was fast alignment of a dictionary of short DNA patterns to a reference genome. We wanted to be able to do this with big dictionaries of tens or hundreds of millions patterns. In this case the Aho-Corasick tree is huge with hundreds of millions of nodes. One of the challenges of the current PDict/matchPDict implementation was to come up with a representation of the tree (the PDict object) that would be as compact as possible in memory. In particular the C structures used to represent the nodes of the Aho-Corasick tree were designed from the very start to take advantage of the fact that the nucleotide alphabet is very small (only 4 letters). Also special care was taken to avoid reallocations when the tree grows (during construction, or later when using it to walk along a subject, in which case it keeps growing when failure links are added to it). The no-reallocation strategy also takes advantage of the fact that the alphabet has only 4 letters.

All this to say that supporting amino acid sequences would require using different C structures for the tree. That would probably require a significant amount of work.

Cheers,

H.

ADD REPLY
0
Entering edit mode

Oops, I should have added a comment, not an answer. I have a decent standard C++ implementation of Aho Corasick which I use for peptide/protein lookup in my C++ programs. It's templated to work with different keyword types and alphabet sizes. It uses TR1's std::shared_ptr to store the keywords in the results. It's certainly not as efficient as the PDict implementation, but I've used it with hundreds of thousands of peptides and it doesn't consume too much memory. I don't know anything about implementing C++ into Bioconductor though. How much work would that be? Here's my implementation: 

ADD REPLY
0
Entering edit mode

How much work would that be... for you or for me? ;-)

Integrating your C++ code to Biostrings would require to:

  1. Adapt the code to make it work with AAStringSet objects and return an MIndex object.
  2. Add all the glue between the C++ code and the PDict/matchPDict front end.

If you're not familiar with R package development, it will take some time to learn. The usual place to start for this is http://cran.fhcrc.org/doc/manuals/r-release/R-exts.html You would also need to learn a bit about the internals of the Biostrings package.

Probably easier for me to do but not a trivial task either because item 1 above would require a significant amount of modifications/additions to your code.

Plus let's not forget the usual final polish of the product i.e. update the documentation and add unit tests.

Unfortunately I don't have time to work on this at the moment, sorry. Patch welcome.

Thanks,

H.

ADD REPLY
0
Entering edit mode

I forgot to mention that you can also use matchPDict() without preprocessing. In that case it's equivalent to calling matchPattern() in a loop except that the returned object is an MIndex object and the loop happens at the C level. That means none of the restrictions discussed previously apply i.e. you can use it on a variable length AAStringSet object. Of course it's not as fast as using an Aho-Corasick tree but still pretty fast e.g. it takes about 20 sec on my laptop to match 200k short peptides to a subject of length 76k:

library(Biostrings)
library(hgu95av2probe)
peptides <- translate(DNAStringSet(hgu95av2probe, end=-2))
peptides
#   A AAStringSet instance of length 201800
#          width seq
#      [1]     8 WLLLRSPF
#      [2]     8 GCEFLYIF
#      [3]     8 ASIPLCFN
#      [4]     8 AV*QSMLC
#      ...   ... ...
# [201797]     8 IAFCQSII
# [201798]     8 FCQSIIST
# [201799]     8 QSIISTSP
# [201800]     8 VLLVNSAP

library(BSgenome.Scerevisiae.UCSC.sacCer2)
subject <- translate(Scerevisiae$chrI)
system.time(m <- matchPDict(peptides, subject))
#    user  system elapsed 
#  20.770   0.007  20.793 

Might be fast enough for your use case.

Note that in that case (i.e. no preprocessing) you can also allow a small number of mismatches and/or indels. For more information please see documentation of the pdict argument in ?matchPDict and also "D. USING A NON-PREPROCESSED DICTIONARY" in the examples section of that man page.

Hope this helps,

H.

ADD REPLY

Login before adding your answer.

Traffic: 568 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6