Dear all,
I have a AAstringSet. Some of them have stop codon that I need to filter out and the aastring need to fit to my pattern LVXXXLXXXL, X in IUPAC code, represent any amino acids.
stopcodon = vmatchPattern("*",seqs.frame$aa,max.mismatch=0)
stopcodon.frame=as.data.frame(stopcodon)
I have the index of the AAStrings now, but how do I get rid of them from the seqs.frame ?
I guess I can use vmatchPattern again with for the LVXXXLXXL once the above is figured out.
Thank you

Thank you. Regarding the pattern filter, fixed=FALSE can only be used for DNASting and RNAString. when use vmatchpattern or vcountpattern with 'X' (extended IPUAC code), all return with no match.
CompareStrings("LXXXLXXXL",seqs.frame$aa[1]) returns "L???L???L"---fit and "?????????" --non-fit. But only compares string to string, need a for and if loop here.
Is there a better way to compare/analysis AAstrings?
Thank you.
XIA
Unfortunately
fixed=TRUEis not supported on AAStringSet objects. One (imperfect) way to work around this is to usegregexpr()with a regular expression. However note that this solution misses hits that overlap with other hits e.g.:aa <- AAStringSet(toupper(c("LVaaaLVaaLLaaaL", "LVaaaLaaaLVaaaLaaaL", "LVaaaLaaaLLVaaaLaaaL")) gregexpr("LV...L...L", aa) # all 3 sequences in 'aa' have 2 hits but # gregexpr() only reports 1 hit in the 1st # and 1 hit in the 2nd sequenceAnother way to work around this problem is to use a trick that takes advantage of the simplicity of the specific pattern. The trick is to convert
aainto a DNAStringSet object after doing the following replacements:Something like this:
dna <- BStringSet(aa) old <- paste(AA_ALPHABET[!(AA_ALPHABET %in% c("L", "V"))], collapse="") new <- strrep("A", nchar(old)) dna <- chartr(old, new, dna) dna <- chartr("L", "C", dna) dna <- chartr("V", "G", dna) dna <- DNAStringSet(dna) dna # A DNAStringSet instance of length 3 # width seq # [1] 15 CGAAACGAACCAAAC # [2] 19 CGAAACAAACGAAACAAAC # [3] 20 CGAAACAAACCGAAACAAACSo now you can call
vmatchPattern()withfixed=FALSE:vmatchPattern("CGNNNCNNNC", dna, fixed=FALSE) # MIndex object of length 3 # [[1]] # IRanges object with 2 ranges and 0 metadata columns: # start end width # <integer> <integer> <integer> # [1] 1 10 10 # [2] 6 15 10 # # [[2]] # IRanges object with 2 ranges and 0 metadata columns: # start end width # <integer> <integer> <integer> # [1] 1 10 10 # [2] 10 19 10 # ...This trick can be slightly adapted to work with patterns that use more than 4 non-X letters from the AA alphabet. More precisely:
vmatchPattern()withfixed="subject".The following
vmatchPattern2()function implements this:## A version of vmatchPattern() that treats Xs in AA pattern as ## ambiguities. vmatchPattern2 <- function(pattern, subject, max.mismatch=0, min.mismatch=0, with.indels=FALSE) { stopifnot(is(subject, "AAStringSet")) pattern <- AAString(pattern) old <- uniqueLetters(pattern) if (length(setdiff(old, "X")) >= length(DNA_ALPHABET)) stop("too many different unique letters in 'pattern'") ## Turn 'pattern' into a DNAString object. new <- old new[old == "X"] <- "N" new[old != "X"] <- head(DNA_ALPHABET[DNA_ALPHABET != "N"], n=length(old)-1) pattern <- DNAString(chartr(paste(old, collapse=""), paste(new, collapse=""), BString(pattern))) ## Turn 'subject' into a DNAStringSet object. old2 <- uniqueLetters(subject) new2 <- rep("N", length(old2)) m <- match(old2, old) new2[!is.na(m)] <- new[m[!is.na(m)]] subject <- DNAStringSet(chartr(paste(old2, collapse=""), paste(new2, collapse=""), BStringSet(subject))) ## Call vmatchPattern() with fixed="subject". vmatchPattern(pattern, subject, max.mismatch=max.mismatch, min.mismatch=min.mismatch, with.indels=with.indels, fixed="subject") }Lightly tested only!
H.
The AAs in certain position are fixed, the rest are random, eg: L is located at the first, fourth and last position.
So I use substr and paste to replace the random AAs to A and then apply vcountPattern/vmatchPattern.
A big thank you!