Question: How to get rid of the AAStrings with a stop codon
0
gravatar for XIA.PAN
2.8 years ago by
XIA.PAN20
XIA.PAN20 wrote:

Dear all,

I have a AAstringSet. Some of them have stop codon that I need to filter out and the aastring need to fit to my pattern LVXXXLXXXL, X in IUPAC code, represent any amino acids.

stopcodon = vmatchPattern("*",seqs.frame$aa,max.mismatch=0)
stopcodon.frame=as.data.frame(stopcodon)

I have the index of the AAStrings now, but how do I get rid of them from the seqs.frame ?

I guess I can use vmatchPattern again with for the LVXXXLXXL once the above is figured out.

Thank you

aastring aastringset • 626 views
ADD COMMENTlink modified 2.8 years ago by Hervé Pagès ♦♦ 14k • written 2.8 years ago by XIA.PAN20
Answer: How to get rid of the AAStrings with a stop codon
0
gravatar for Hervé Pagès
2.8 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi,

Since it seems that you don't really care about the exact location of the stop codon within each AA sequence, you could use vcountPattern() instead of vmatchPattern():

nstopcodon <- vcountPattern("*", seqs.frame$aa, max.mismatch=0)

Then just subset your data frame with:

seqs.frame <- seqs.frame[nstopcodon != 0, ]

Cheers,

H.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Hervé Pagès ♦♦ 14k

Thank you. Regarding the pattern filter, fixed=FALSE can only be used for DNASting and RNAString. when use vmatchpattern or vcountpattern with 'X' (extended IPUAC code), all return with no match.

CompareStrings("LXXXLXXXL",seqs.frame$aa[1]) returns "L???L???L"---fit and "?????????" --non-fit.   But only compares string to string, need a for and if loop here.

Is there a better way to compare/analysis AAstrings?

Thank you.

XIA

 

 

 

ADD REPLYlink written 2.8 years ago by XIA.PAN20

Unfortunately fixed=TRUE is not supported on AAStringSet objects. One (imperfect) way to work around this is to use gregexpr() with a regular expression. However note that this solution misses hits that overlap with other hits e.g.:

aa <- AAStringSet(toupper(c("LVaaaLVaaLLaaaL",
                            "LVaaaLaaaLVaaaLaaaL",
                            "LVaaaLaaaLLVaaaLaaaL"))
gregexpr("LV...L...L", aa)  # all 3 sequences in 'aa' have 2 hits but
                            # gregexpr() only reports 1 hit in the 1st
                            # and 1 hit in the 2nd sequence 

Another way to work around this problem is to use a trick that takes advantage of the simplicity of the specific pattern. The trick is to convert aa into a DNAStringSet object after doing the following replacements:

  1. Replace all letters that are not L or V with As
  2. Replace Ls with Cs
  3. Replace Vs with Gs

Something like this:

dna <- BStringSet(aa)
old <- paste(AA_ALPHABET[!(AA_ALPHABET %in% c("L", "V"))], collapse="")
new <- strrep("A", nchar(old))
dna <- chartr(old, new, dna)
dna <- chartr("L", "C", dna)
dna <- chartr("V", "G", dna)
dna <- DNAStringSet(dna)
dna
#   A DNAStringSet instance of length 3
#     width seq
# [1]    15 CGAAACGAACCAAAC
# [2]    19 CGAAACAAACGAAACAAAC
# [3]    20 CGAAACAAACCGAAACAAAC

So now you can call vmatchPattern() with fixed=FALSE:

vmatchPattern("CGNNNCNNNC", dna, fixed=FALSE)
# MIndex object of length 3
# [[1]]
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         1        10        10
#   [2]         6        15        10
#
# [[2]]
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         1        10        10
#   [2]        10        19        10
# ...

This trick can be slightly adapted to work with patterns that use more than 4 non-X letters from the AA alphabet. More precisely:

  1. Map Xs to N.
  2. Map non-X letters in the pattern to DNA letters using the full DNA alphabet, except N (already taken). Using the full DNA alphabet means using letters in DNA_ALPHABET that are not A, C, G, or T.
  3. Do the same mapping for the subject. Letters in the subject not present in the pattern are mapped to N.
  4. Call vmatchPattern() with fixed="subject".

The following vmatchPattern2() function implements this:

## A version of vmatchPattern() that treats Xs in AA pattern as
## ambiguities.
vmatchPattern2 <- function(pattern, subject,
                           max.mismatch=0, min.mismatch=0,
                           with.indels=FALSE)
{
    stopifnot(is(subject, "AAStringSet"))
    pattern <- AAString(pattern)

    old <- uniqueLetters(pattern)
    if (length(setdiff(old, "X")) >= length(DNA_ALPHABET))
        stop("too many different unique letters in 'pattern'")

    ## Turn 'pattern' into a DNAString object.
    new <- old
    new[old == "X"] <- "N"
    new[old != "X"] <- head(DNA_ALPHABET[DNA_ALPHABET != "N"],
                            n=length(old)-1)
    pattern <- DNAString(chartr(paste(old, collapse=""),
                                paste(new, collapse=""),
                                BString(pattern)))

    ## Turn 'subject' into a DNAStringSet object.
    old2 <- uniqueLetters(subject)
    new2 <- rep("N", length(old2))
    m <- match(old2, old)
    new2[!is.na(m)] <- new[m[!is.na(m)]]
    subject <- DNAStringSet(chartr(paste(old2, collapse=""),
                                   paste(new2, collapse=""),
                                   BStringSet(subject)))

    ## Call vmatchPattern() with fixed="subject".
    vmatchPattern(pattern, subject,
                  max.mismatch=max.mismatch, min.mismatch=min.mismatch,
                  with.indels=with.indels, fixed="subject")
}

Lightly tested only!

H.

ADD REPLYlink written 2.8 years ago by Hervé Pagès ♦♦ 14k

The AAs in certain position are fixed, the rest are random, eg: L is located at the first, fourth and last position.

So I use substr and paste to replace the random AAs to A and then apply vcountPattern/vmatchPattern.

A big thank you!  

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by XIA.PAN20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 282 users visited in the last hour