Dear all,
I have an AAStringSet. Some of the sequences contain a stop codon and need to be filtered out, and the remaining AAStrings need to match my pattern LVXXXLXXXL, where X (IUPAC code) represents any amino acid.
stopcodon = vmatchPattern("*",seqs.frame$aa,max.mismatch=0)
stopcodon.frame=as.data.frame(stopcodon)
I have the indices of those AAStrings now, but how do I remove them from seqs.frame?
I guess I can use vmatchPattern again for the LVXXXLXXL pattern once the above is figured out.
Thank you
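One way to drop the sequences that contain a stop codon is sketched below. This is only a minimal sketch: the toy object aa is hypothetical and stands in for seqs.frame$aa, and it counts hits with vcountPattern() instead of converting the vmatchPattern() result to a data frame.

```r
suppressPackageStartupMessages(library(Biostrings))

# Hypothetical toy data standing in for seqs.frame$aa
aa <- AAStringSet(c(s1 = "LVAAALAAAL", s2 = "LVA*ALAAAL", s3 = "MKT"))

# One count per sequence; a count above zero means the sequence has a "*"
has_stop <- vcountPattern("*", aa) > 0

# Keep only the sequences without a stop codon
aa_clean <- aa[!has_stop]
```

If the sequences are a column of a table, the same logical vector could be used to subset its rows, e.g. seqs.frame[!has_stop, ].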
Thank you. Regarding the pattern filter, fixed=FALSE can only be used with DNAString and RNAString. When I use vmatchPattern or vcountPattern with 'X' (extended IUPAC code), everything returns no match.
compareStrings("LXXXLXXXL", seqs.frame$aa[1]) returns "L???L???L" for a fit and "?????????" for a non-fit. But it only compares one string to another, so I would need a for loop with an if here.
Is there a better way to compare/analyze AAStrings?
Thank you.
XIA
Unfortunately, fixed=FALSE is not supported on AAStringSet objects. One (imperfect) way to work around this is to use gregexpr() with a regular expression. Note, however, that this solution misses hits that overlap with other hits.
Another way to work around this problem is to use a trick that takes advantage of the simplicity of the specific pattern. The trick is to convert aa into a DNAStringSet object after replacing its amino acid letters with letters from the DNA alphabet.
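The two workarounds described above can be compared with a rough sketch like the one below. The toy sequences and the particular mapping (L to A, V to C, every other letter to G, and X in the pattern to N) are assumptions made for illustration; they are not necessarily the replacements originally posted.

```r
suppressPackageStartupMessages(library(Biostrings))

# Hypothetical toy sequences; the first one contains two overlapping hits
aa <- AAStringSet(c("LVMMMLMMMLVMMMLMMML", "MKTLVAAALAAAL", "LVAAALAA"))

# Regular-expression route: "." plays the role of X.
# gregexpr() resumes searching after the end of each hit, so a hit that
# overlaps a previous one is not reported.
re_hits <- gregexpr("LV...L...L", as.character(aa))
re_starts <- lapply(re_hits, function(m) as.integer(m[m > 0]))

# DNA-alphabet route (assumed mapping: L -> A, V -> C, anything else -> G)
s <- as.character(aa)
s <- gsub("[^LV]", "G", s)   # every letter other than L or V becomes G
s <- chartr("LV", "AC", s)   # then L becomes A and V becomes C
dna <- DNAStringSet(s)

# In the pattern, X becomes N, which matches any base when fixed = FALSE
hits <- vmatchPattern(DNAString("ACNNNANNNA"), dna, fixed = FALSE)
dna_starts <- lapply(startIndex(hits), as.integer)
```

Comparing re_starts with dna_starts for the first sequence shows the difference between the two routes when hits overlap.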
So now you can call vmatchPattern() with fixed=FALSE.
This trick can be slightly adapted to work with patterns that use more than 4 non-X letters from the AA alphabet. More precisely, the adapted version calls vmatchPattern() with fixed="subject". The vmatchPattern2() function implements this. Lightly tested only!
H.
The AAs in certain positions are fixed and the rest are random, e.g. L is located at the first, fourth and last positions.
So I use substr and paste to replace the random AAs with A and then apply vcountPattern/vmatchPattern.
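A rough sketch of this masking idea is given below. It assumes peptides of equal length with fixed positions 1, 4 and 8. The helper mask_variable() and the toy peptides are hypothetical, and the sketch uses strsplit() where the post mentions substr().

```r
suppressPackageStartupMessages(library(Biostrings))

# Hypothetical helper: overwrite every position except the fixed ones
mask_variable <- function(seqs, fixed_pos, filler = "A") {
  vapply(strsplit(seqs, ""), function(ch) {
    ch[-fixed_pos] <- filler        # variable positions get the filler
    paste(ch, collapse = "")
  }, character(1))
}

# Hypothetical 8-residue peptides; L is expected at positions 1, 4 and 8
peptides <- c("LKTLGHSL", "LKTAGHSL", "MKTLGHSL")
masked <- mask_variable(peptides, fixed_pos = c(1, 4, 8))

# After masking, an exact match against the masked pattern is enough
fits <- vcountPattern("LAALAAAL", AAStringSet(masked)) > 0
```

Because only the variable positions are overwritten, a peptide passes only if its fixed positions already hold the expected residues.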
A big thank you!