I have been getting a lot of use out of Biostrings lately. My collaborator has a series of regulatory elements that were randomly joined together, put into T cells, and then sequenced after the cells had been put under selective pressure. He basically wants to know which combinations of elements (which sequences) are most popular in his cells. The TL;DR is that I was able to use your Biostrings package to quickly learn what order his specific sequences had been seen in. I gave each of these elements a one-letter code to simplify the representation, so now I have a bunch of strings that look like this:
ABDFOYT QEWNILL UDFNHOA
And so on, for many thousands of strings.
What we want to ask now is: which elements are most common? letterFrequency() is great for that first layer, since it counts individual letters.
But the next thing we want to know is: which combinations of adjacent letters are most common? In other words, how can I tabulate how often I see "AB", or even "ABD", and so on?
I tried passing "AB" as a string to letterFrequency(). But it treats that as "A|B" (A OR B), when what I really want is "A followed by B", or possibly "B followed by A" (in my case those two would be equivalent). Can letterFrequency() be repurposed to do anything like that?
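
To make the output I am after concrete, here is a rough sketch in plain Python rather than Biostrings. The function name count_kmers and its unordered flag are made up for illustration, and sorting the letters to merge "AB" with "BA" is just one assumed way to express that equivalence:

```python
from collections import Counter

def count_kmers(seqs, k, unordered=False):
    """Count every run of k adjacent letters across all strings (hypothetical helper)."""
    counts = Counter()
    for s in seqs:
        # Slide a window of width k along the string, one position at a time.
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if unordered:
                # Assumption: "AB" and "BA" are equivalent, so sort the letters
                # to give both the same key.
                kmer = "".join(sorted(kmer))
            counts[kmer] += 1
    return counts

seqs = ["ABDFOYT", "QEWNILL", "UDFNHOA"]
pairs = count_kmers(seqs, 2, unordered=True)
triples = count_kmers(seqs, 3)
print(pairs.most_common(5))
```

With the three example strings above, the pair "DF" is counted twice and every other pair once. What I am hoping for is something that produces this kind of table directly from my set of strings in R.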