Question

How to get rid of the AAStrings with a stop codon

0

Entering edit mode

XIA.PAN ▴ 20

@xiapan-12407

Last seen 5.3 years ago

Dear all,

I have a AAstringSet. Some of them have stop codon that I need to filter out and the aastring need to fit to my pattern LVXXXLXXXL, X in IUPAC code, represent any amino acids.

stopcodon = vmatchPattern("*",seqs.frame$aa,max.mismatch=0)
stopcodon.frame=as.data.frame(stopcodon)

I have the index of the AAStrings now, but how do I get rid of them from the seqs.frame ?

I guess I can use vmatchPattern again with for the LVXXXLXXL once the above is figured out.

Thank you

aastringset aastring • 1.8k views

ADD COMMENT • link updated 7.9 years ago by Hervé Pagès 16k • written 7.9 years ago by XIA.PAN ▴ 20

score 0 · Answer 1 · 2017-02-23

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 11 days ago

Seattle, WA, United States

Hi,

Since it seems that you don't really care about the exact location of the stop codon within each AA sequence, you could use vcountPattern() instead of vmatchPattern():

nstopcodon <- vcountPattern("*", seqs.frame$aa, max.mismatch=0)

Then just subset your data frame with:

seqs.frame <- seqs.frame[nstopcodon != 0, ]

Cheers,

H.

ADD COMMENT • link 7.9 years ago Hervé Pagès 16k

0

Entering edit mode

Thank you. Regarding the pattern filter, fixed=FALSE can only be used for DNASting and RNAString. when use vmatchpattern or vcountpattern with 'X' (extended IPUAC code), all return with no match.

CompareStrings("LXXXLXXXL",seqs.frame$aa[1]) returns "L???L???L"---fit and "?????????" --non-fit. But only compares string to string, need a for and if loop here.

Is there a better way to compare/analysis AAstrings?

Thank you.

XIA

ADD REPLY • link 7.9 years ago XIA.PAN ▴ 20

0

Entering edit mode

Unfortunately fixed=TRUE is not supported on AAStringSet objects. One (imperfect) way to work around this is to use gregexpr() with a regular expression. However note that this solution misses hits that overlap with other hits e.g.:

aa <- AAStringSet(toupper(c("LVaaaLVaaLLaaaL",
                            "LVaaaLaaaLVaaaLaaaL",
                            "LVaaaLaaaLLVaaaLaaaL"))
gregexpr("LV...L...L", aa)  # all 3 sequences in 'aa' have 2 hits but
                            # gregexpr() only reports 1 hit in the 1st
                            # and 1 hit in the 2nd sequence

Another way to work around this problem is to use a trick that takes advantage of the simplicity of the specific pattern. The trick is to convert aa into a DNAStringSet object after doing the following replacements:

Replace all letters that are not L or V with As
Replace Ls with Cs
Replace Vs with Gs

Something like this:

dna <- BStringSet(aa)
old <- paste(AA_ALPHABET[!(AA_ALPHABET %in% c("L", "V"))], collapse="")
new <- strrep("A", nchar(old))
dna <- chartr(old, new, dna)
dna <- chartr("L", "C", dna)
dna <- chartr("V", "G", dna)
dna <- DNAStringSet(dna)
dna
#   A DNAStringSet instance of length 3
#     width seq
# [1]    15 CGAAACGAACCAAAC
# [2]    19 CGAAACAAACGAAACAAAC
# [3]    20 CGAAACAAACCGAAACAAAC

So now you can call vmatchPattern() with fixed=FALSE:

vmatchPattern("CGNNNCNNNC", dna, fixed=FALSE)
# MIndex object of length 3
# [[1]]
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         1        10        10
#   [2]         6        15        10
#
# [[2]]
# IRanges object with 2 ranges and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         1        10        10
#   [2]        10        19        10
# ...

This trick can be slightly adapted to work with patterns that use more than 4 non-X letters from the AA alphabet. More precisely:

Map Xs to N.
Map non-X letters in the pattern to DNA letters using the full DNA alphabet, except N (already taken). Using the full DNA alphabet means using letters in DNA_ALPHABET that are not A, C, G, or T.
Do the same mapping for the subject. Letters in the subject not present in the pattern are mapped to N.
Call vmatchPattern() with fixed="subject".

The following vmatchPattern2() function implements this:

## A version of vmatchPattern() that treats Xs in AA pattern as
## ambiguities.
vmatchPattern2 <- function(pattern, subject,
                           max.mismatch=0, min.mismatch=0,
                           with.indels=FALSE)
{
    stopifnot(is(subject, "AAStringSet"))
    pattern <- AAString(pattern)

    old <- uniqueLetters(pattern)
    if (length(setdiff(old, "X")) >= length(DNA_ALPHABET))
        stop("too many different unique letters in 'pattern'")

    ## Turn 'pattern' into a DNAString object.
    new <- old
    new[old == "X"] <- "N"
    new[old != "X"] <- head(DNA_ALPHABET[DNA_ALPHABET != "N"],
                            n=length(old)-1)
    pattern <- DNAString(chartr(paste(old, collapse=""),
                                paste(new, collapse=""),
                                BString(pattern)))

    ## Turn 'subject' into a DNAStringSet object.
    old2 <- uniqueLetters(subject)
    new2 <- rep("N", length(old2))
    m <- match(old2, old)
    new2[!is.na(m)] <- new[m[!is.na(m)]]
    subject <- DNAStringSet(chartr(paste(old2, collapse=""),
                                   paste(new2, collapse=""),
                                   BStringSet(subject)))

    ## Call vmatchPattern() with fixed="subject".
    vmatchPattern(pattern, subject,
                  max.mismatch=max.mismatch, min.mismatch=min.mismatch,
                  with.indels=with.indels, fixed="subject")
}

Lightly tested only!

H.

ADD REPLY • link 7.9 years ago Hervé Pagès 16k

0

Entering edit mode

The AAs in certain position are fixed, the rest are random, eg: L is located at the first, fourth and last position.

So I use substr and paste to replace the random AAs to A and then apply vcountPattern/vmatchPattern.

A big thank you!

ADD REPLY • link 7.9 years ago XIA.PAN ▴ 20