How to get rid of the AAStrings with a stop codon
Entering edit mode
XIA.PAN ▴ 20
Last seen 2.6 years ago

Dear all,

I have a AAstringSet. Some of them have stop codon that I need to filter out and the aastring need to fit to my pattern LVXXXLXXXL, X in IUPAC code, represent any amino acids.

stopcodon = vmatchPattern("*",seqs.frame$aa,max.mismatch=0)

I have the index of the AAStrings now, but how do I get rid of them from the seqs.frame ?

I guess I can use vmatchPattern again with for the LVXXXLXXL once the above is figured out.

Thank you

aastringset aastring • 964 views
Entering edit mode
Last seen 1 day ago
Seattle, WA, United States


Since it seems that you don't really care about the exact location of the stop codon within each AA sequence, you could use vcountPattern() instead of vmatchPattern():

nstopcodon <- vcountPattern("*", seqs.frame$aa, max.mismatch=0)

Then just subset your data frame with:

seqs.frame <- seqs.frame[nstopcodon != 0, ]



Entering edit mode

Thank you. Regarding the pattern filter, fixed=FALSE can only be used for DNASting and RNAString. when use vmatchpattern or vcountpattern with 'X' (extended IPUAC code), all return with no match.

CompareStrings("LXXXLXXXL",seqs.frame$aa[1]) returns "L???L???L"---fit and "?????????" --non-fit.   But only compares string to string, need a for and if loop here.

Is there a better way to compare/analysis AAstrings?

Thank you.





Entering edit mode

Unfortunately fixed=TRUE is not supported on AAStringSet objects. One (imperfect) way to work around this is to use gregexpr() with a regular expression. However note that this solution misses hits that overlap with other hits e.g.:

aa <- AAStringSet(toupper(c("LVaaaLVaaLLaaaL",
gregexpr("LV...L...L", aa)  # all 3 sequences in 'aa' have 2 hits but
                            # gregexpr() only reports 1 hit in the 1st
                            # and 1 hit in the 2nd sequence 

Another way to work around this problem is to use a trick that takes advantage of the simplicity of the specific pattern. The trick is to convert aa into a DNAStringSet object after doing the following replacements:

  1. Replace all letters that are not L or V with As
  2. Replace Ls with Cs
  3. Replace Vs with Gs

Something like this:

dna <- BStringSet(aa)
old <- paste(AA_ALPHABET[!(AA_ALPHABET %in% c("L", "V"))], collapse="")
new <- strrep("A", nchar(old))
dna <- chartr(old, new, dna)
dna <- chartr("L", "C", dna)
dna <- chartr("V", "G", dna)
dna <- DNAStringSet(dna)
#   A DNAStringSet instance of length 3
#     width seq

So now you can call vmatchPattern() with fixed=FALSE:

vmatchPattern("CGNNNCNNNC", dna, fixed=FALSE)
# MIndex object of length 3
# [[1]]
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         1        10        10
#   [2]         6        15        10
# [[2]]
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         1        10        10
#   [2]        10        19        10
# ...

This trick can be slightly adapted to work with patterns that use more than 4 non-X letters from the AA alphabet. More precisely:

  1. Map Xs to N.
  2. Map non-X letters in the pattern to DNA letters using the full DNA alphabet, except N (already taken). Using the full DNA alphabet means using letters in DNA_ALPHABET that are not A, C, G, or T.
  3. Do the same mapping for the subject. Letters in the subject not present in the pattern are mapped to N.
  4. Call vmatchPattern() with fixed="subject".

The following vmatchPattern2() function implements this:

## A version of vmatchPattern() that treats Xs in AA pattern as
## ambiguities.
vmatchPattern2 <- function(pattern, subject,
                           max.mismatch=0, min.mismatch=0,
    stopifnot(is(subject, "AAStringSet"))
    pattern <- AAString(pattern)

    old <- uniqueLetters(pattern)
    if (length(setdiff(old, "X")) >= length(DNA_ALPHABET))
        stop("too many different unique letters in 'pattern'")

    ## Turn 'pattern' into a DNAString object.
    new <- old
    new[old == "X"] <- "N"
    new[old != "X"] <- head(DNA_ALPHABET[DNA_ALPHABET != "N"],
    pattern <- DNAString(chartr(paste(old, collapse=""),
                                paste(new, collapse=""),

    ## Turn 'subject' into a DNAStringSet object.
    old2 <- uniqueLetters(subject)
    new2 <- rep("N", length(old2))
    m <- match(old2, old)
    new2[!] <- new[m[!]]
    subject <- DNAStringSet(chartr(paste(old2, collapse=""),
                                   paste(new2, collapse=""),

    ## Call vmatchPattern() with fixed="subject".
    vmatchPattern(pattern, subject,
                  max.mismatch=max.mismatch, min.mismatch=min.mismatch,
                  with.indels=with.indels, fixed="subject")

Lightly tested only!


Entering edit mode

The AAs in certain position are fixed, the rest are random, eg: L is located at the first, fourth and last position.

So I use substr and paste to replace the random AAs to A and then apply vcountPattern/vmatchPattern.

A big thank you!  


Login before adding your answer.

Traffic: 193 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6