How to get rid of the AAStrings with a stop codon
1
0
Entering edit mode
XIA.PAN ▴ 20
@xiapan-12407
Last seen 5.1 years ago

Dear all,

I have a AAstringSet. Some of them have stop codon that I need to filter out and the aastring need to fit to my pattern LVXXXLXXXL, X in IUPAC code, represent any amino acids.

stopcodon = vmatchPattern("*",seqs.frame$aa,max.mismatch=0)
stopcodon.frame=as.data.frame(stopcodon)

I have the index of the AAStrings now, but how do I get rid of them from the seqs.frame ?

I guess I can use vmatchPattern again with for the LVXXXLXXL once the above is figured out.

Thank you

aastringset aastring • 1.8k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi,

Since it seems that you don't really care about the exact location of the stop codon within each AA sequence, you could use vcountPattern() instead of vmatchPattern():

nstopcodon <- vcountPattern("*", seqs.frame$aa, max.mismatch=0)

Then just subset your data frame with:

seqs.frame <- seqs.frame[nstopcodon != 0, ]

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Thank you. Regarding the pattern filter, fixed=FALSE can only be used for DNASting and RNAString. when use vmatchpattern or vcountpattern with 'X' (extended IPUAC code), all return with no match.

CompareStrings("LXXXLXXXL",seqs.frame$aa[1]) returns "L???L???L"---fit and "?????????" --non-fit.   But only compares string to string, need a for and if loop here.

Is there a better way to compare/analysis AAstrings?

Thank you.

XIA

 

 

 

ADD REPLY
0
Entering edit mode

Unfortunately fixed=TRUE is not supported on AAStringSet objects. One (imperfect) way to work around this is to use gregexpr() with a regular expression. However note that this solution misses hits that overlap with other hits e.g.:

aa <- AAStringSet(toupper(c("LVaaaLVaaLLaaaL",
                            "LVaaaLaaaLVaaaLaaaL",
                            "LVaaaLaaaLLVaaaLaaaL"))
gregexpr("LV...L...L", aa)  # all 3 sequences in 'aa' have 2 hits but
                            # gregexpr() only reports 1 hit in the 1st
                            # and 1 hit in the 2nd sequence 

Another way to work around this problem is to use a trick that takes advantage of the simplicity of the specific pattern. The trick is to convert aa into a DNAStringSet object after doing the following replacements:

  1. Replace all letters that are not L or V with As
  2. Replace Ls with Cs
  3. Replace Vs with Gs

Something like this:

dna <- BStringSet(aa)
old <- paste(AA_ALPHABET[!(AA_ALPHABET %in% c("L", "V"))], collapse="")
new <- strrep("A", nchar(old))
dna <- chartr(old, new, dna)
dna <- chartr("L", "C", dna)
dna <- chartr("V", "G", dna)
dna <- DNAStringSet(dna)
dna
#   A DNAStringSet instance of length 3
#     width seq
# [1]    15 CGAAACGAACCAAAC
# [2]    19 CGAAACAAACGAAACAAAC
# [3]    20 CGAAACAAACCGAAACAAAC

So now you can call vmatchPattern() with fixed=FALSE:

vmatchPattern("CGNNNCNNNC", dna, fixed=FALSE)
# MIndex object of length 3
# [[1]]
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         1        10        10
#   [2]         6        15        10
#
# [[2]]
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         1        10        10
#   [2]        10        19        10
# ...

This trick can be slightly adapted to work with patterns that use more than 4 non-X letters from the AA alphabet. More precisely:

  1. Map Xs to N.
  2. Map non-X letters in the pattern to DNA letters using the full DNA alphabet, except N (already taken). Using the full DNA alphabet means using letters in DNA_ALPHABET that are not A, C, G, or T.
  3. Do the same mapping for the subject. Letters in the subject not present in the pattern are mapped to N.
  4. Call vmatchPattern() with fixed="subject".

The following vmatchPattern2() function implements this:

## A version of vmatchPattern() that treats Xs in AA pattern as
## ambiguities.
vmatchPattern2 <- function(pattern, subject,
                           max.mismatch=0, min.mismatch=0,
                           with.indels=FALSE)
{
    stopifnot(is(subject, "AAStringSet"))
    pattern <- AAString(pattern)

    old <- uniqueLetters(pattern)
    if (length(setdiff(old, "X")) >= length(DNA_ALPHABET))
        stop("too many different unique letters in 'pattern'")

    ## Turn 'pattern' into a DNAString object.
    new <- old
    new[old == "X"] <- "N"
    new[old != "X"] <- head(DNA_ALPHABET[DNA_ALPHABET != "N"],
                            n=length(old)-1)
    pattern <- DNAString(chartr(paste(old, collapse=""),
                                paste(new, collapse=""),
                                BString(pattern)))

    ## Turn 'subject' into a DNAStringSet object.
    old2 <- uniqueLetters(subject)
    new2 <- rep("N", length(old2))
    m <- match(old2, old)
    new2[!is.na(m)] <- new[m[!is.na(m)]]
    subject <- DNAStringSet(chartr(paste(old2, collapse=""),
                                   paste(new2, collapse=""),
                                   BStringSet(subject)))

    ## Call vmatchPattern() with fixed="subject".
    vmatchPattern(pattern, subject,
                  max.mismatch=max.mismatch, min.mismatch=min.mismatch,
                  with.indels=with.indels, fixed="subject")
}

Lightly tested only!

H.

ADD REPLY
0
Entering edit mode

The AAs in certain position are fixed, the rest are random, eg: L is located at the first, fourth and last position.

So I use substr and paste to replace the random AAs to A and then apply vcountPattern/vmatchPattern.

A big thank you!  

ADD REPLY

Login before adding your answer.

Traffic: 605 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6