Search
Question: matching of AAStringSet vs. another AAStringSet
0
2.0 years ago by
tobias.kockmann20 wrote:
> e.coli <- readAAStringSet(filepath = "uniprot-proteome_UP000000625.fasta")
> summary(e.coli)
Length       Class        Mode
4306 AAStringSet          S4
> some.peptide <- AAString("DYWRALQNRIREGHVEDVYAYRRRQ")
> summary(some.peptide)
Length    Class     Mode
25 AAString       S4
> x <- matchPDict(pdict = some.peptide, subject = e.coli)
Error in matchPDict(pdict = some.peptide, subject = e.coli) :
please use vmatchPDict() when 'subject' is an XStringSet object (multiple sequence)

> x <- vmatchPDict(pdict = some.peptide, subject = e.coli)
Error in .local(pdict, subject, max.mismatch, min.mismatch, with.indels,  :
vmatchPDict() is not ready yet, sorry

What am I doing wrong? I would like to match a list of peptide sequences (here as AAStringSet) vs. a list of Proteins (also AAStringSet).

Greetings,

Tobi

modified 24 months ago • written 2.0 years ago by tobias.kockmann20
1
2.0 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Tobi,

You have only 1 peptide so why use matchPDict/vmatchPDict? Functions in the matchPDict family are meant to be used on a set of short motifs. If you only have 1 motif, use functions from the matchPattern family instead. More precisely, if you have 1 sequence in your subject, use matchPattern (but first subset your subject with e.coli[[1]] if it's an AAStringSet object). And if you have more than 1 sequence in your subject, use vmatchPattern

H.

1
2.0 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

I'm not aware of the notion of iterator in R in general.

If you only have 10-20k queries, you can simply use a for loop. The overhead of looping over an AAStringSet object of length 20k is not much, only 6 sec on my machine:

library(Biostrings)
library(hgu95av2probe)
peptides <- translate(DNAStringSet(hgu95av2probe)[1:20000])
system.time( for (i in seq_along(peptides)) p <- peptides[[i]] )
#   user  system elapsed
#  6.233   0.000   6.245 

H.

I followed your suggestion to use a for loop:

> peptides <- AAStringSet(peptides[1:10])
> m <- vector("list", length(peptides)) #pre-allocate an empty list of right length
> system.time(for (i in seq_along(peptides)){
+   m[[i]] <- vmatchPattern(pattern = peptides[[i]], subject = H.sapiens)
+ })
user  system elapsed
4.310   0.037   4.398 

But the performance is not really great. For 10 queries I already need 4.3 s. That means for 20k queries it will be around 140 min !!! Is this the best I can get from AAStrings match functions?

1
2.0 years ago by
tobias.kockmann20 wrote:

Following Martin Morgans talk yesterday at the European bioconductor meeting in Basel I also implemented a nicer looking version of the above:

> foo <- function(x){vmatchPattern(pattern = x, subject = H.sapiens)}
> system.time(
+   l <- BiocGenerics::lapply(peptides, foo)
+ )
user  system elapsed
4.349   0.044   4.453 

@Martin: I hope this the the "poetry" style of code you were mentioning! ;-)

...but I still have the feeling that this solution might be "correct and robust" but in no way "efficient". Another problem when writing those code was that the lapply() function to handle the S4 AAStringSet object was hidden in the BiocGenerics package and not loaded by default once you load Biostrings. What is the reason for this? It took me quite a while to find this out and it looks completely unnecessary to me. Why not put all functions that handle instances of a certain class in the package that def. the class???

0
2.0 years ago by
tobias.kockmann20 wrote:
> e.coli <- readAAStringSet(filepath = "uniprot-proteome_UP000000625.fasta")
> summary(e.coli)
Length       Class        Mode
4306 AAStringSet          S4
> summary(peptides)
Length       Class        Mode
2 AAStringSet          S4
> x <- vmatchPDict(pdict = peptides, subject = e.coli)
Error in .local(pdict, subject, max.mismatch, min.mismatch, with.indels,  :
vmatchPDict() is not ready yet, sorry

Finally, I want to match a set of AAStrings vs. another set of AAStrings. I simply used the trivial case of n =1 above. How to do this? If vmatchPict() is not ready yet, how can I loop over a set of AAStrings (queries) efficiently? I have 10-20k queries. The core of that loop could be:

vmatchPattern(pattern = peptides[[i]], subject = e.coli)

Is there something like an iterator for the AAStringSet class?

0
2.0 years ago by
tobias.kockmann20 wrote:

A downstream problem. I tried to use the index returned by vmatchPattern to the org. object and get this strange error message:

> extractAllMatches(subject = H.sapiens, mindex = l[[1]])
Error in extractAllMatches(subject = H.sapiens, mindex = l[[1]]) :
'subject' must be an XString or MaskedXString object
> class(H.sapiens)
[1] "AAStringSet"
attr(,"package")
[1] "Biostrings"



How can that be? T thought AAStrings is derived from XString?

ok. I see the problem now. I need to subset for subject and Mindex to make it work:

> extractAllMatches(subject = H.sapiens[[1]], mindex = l[[1]])
Views on a 246-letter AAString subject
subject: MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGA...SLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
views:
start end width
[1]    88 100    13 [IEAELQDICNDVL]
[2]    88 100    13 [IEAELQDICNDVL]
[3]    88 100    13 [IEAELQDICNDVL]
[4]    88 100    13 [IEAELQDICNDVL]
[5]    88 100    13 [IEAELQDICNDVL]
[6]    88 100    13 [IEAELQDICNDVL]
[7]    88 100    13 [IEAELQDICNDVL]

1

Hi Tobi,

Here you have a performant and IMHO easier to use solution based on pure CRAN packages seqinr and AhoCorasickTrie.

use :

seqinr::read.fasta(file = file, as.string = TRUE, seqtype="AA")

then perform the matching with:

system.time(res <- AhoCorasickTrie::AhoCorasickSearch(unique(pepseq) , unlist(fasta), alphabet = "aminoacid"))

I benchmarked this function with approx 100_000 peptides and approx 40_000 protein sequences and it returns in 10s.

In the package prozor on github https://github.com/protViz/prozor

you will find a function annotateAHO which does some massaging of the list returned by AhoCorasickSearch and produces a data.frame with 4 columns containing:

"proteinID"       "peptideSequence" "Offset"          "proteinSequence"

Witold

I'm glad that worked. There are considerable benefits to interoperability using Biostings in the context of Bioconductor; the approach would replace seqinr::read.fasta() with Biostrings::readDNAStringSet() and use as.character(unlist(fasta)) and as.character(fasta) for arguments. Also C: Match Pattern Vector to Subject Vector notes that AhoChorasickTrie is useful for exact matching only.

Pretty nice:

> e.coli <- readAAStringSet(filepath = "uniprot-proteome_UP000000625.fasta")

> pep <- BiocGenerics::lapply(e.coli[1:10], Views, start = 1, end = 10)

> p <- BiocGenerics::sapply(pep, toString)

> e  <- BiocGenerics::sapply(e.coli, toString)

> #p <- AAStringSet(p)

> system.time(

+   l <- AhoCorasickTrie::AhoCorasickSearch(keywords = p, text = e, alphabet = "aminoacid")

+ )

user  system elapsed

0.014   0.000   0.014 

Is there a difference between as.character(unlist(...)) and using sapply(..., toString)???

Broadly, as.character(unlist()) involves two R-level function calls, whereas sapply() involves length(.) function calls, and the former will be much more efficient for large objects.

Also, I think your use of Views is not required, just narrow(pep, 1, 10) or subseq(pep, 1, 10).

Great! That is a substantial time gain! Maybe one could convert the results of the AhoCorasick search into a set of IRanges that could be used on a AAStrings again...than one would be back in the bioc world.

0
24 months ago by
tobias.kockmann20 wrote:

A related question: If AhoCorasickTrie is soooo much faster for exact string matching why not use it under the hood when calling:

vmatchPattern(..., algorithm = "AhoCorasick-exact")

???

Because vmatchPattern() solves a different problem i.e. it matches a single pattern against a collection of reference sequences. AFAIK the Aho-Corasick algo doesn't handle that problem.

H.

Sorry...vmatchPDict()

Yes, vmatchPDict() will use the same Aho-Corasick implementation as matchPDict(). This is actually already the case for vcountPDict() and vwhichPDict(). Note that this Aho-Corasick implementation is not from the AhoCorasickTrie package but from the Biostrings package itself. The reason vmatchPDict() is not available yet is because it was not clear until recently what kind of object it should return (these days I would be inclined to use a Hits object now that they support repeated hits), and also because new developments/improvements to the Biostrings package have been neglected in the last few years because of other priorities.

H.

Oh really! Good to know. What are "hits objects"? The question of the return type is indeed not trivial. The list of MIndex objects returned by vmatchPattern() is not to easy to deal with...at least it is not obvious to me and it looks quite bulky (big memory footprint), because it returns an empty container for each subject.

1

Hits objects are described in their man page: ?Hits

Even though it's true that subsetting MIndex object mi with mi[[i]] returns a (possibly empty) IRanges object containing the matches for the i-th subject, that doesn't mean that all these IRanges objects are stored inside the MIndex object. That would indeed be very inefficient. Instead they are generated on-the-fly by the [[ method. In other words what you see coming out of an MIndex object doesn't reflect its internal representation. Note that this is true for other S4 containers in Bioconductor (e.g. GRangesList or DNAStringSet objects), not just MIndex objects. All these objects use tricks internally to minimize memory footprint and make some of the most common operation on them very efficient. There is a lot going on behind the scene.

If you want to assess the memory footprint of an MIndex object, use object.size(). You might be surprised. MIndex objects were designed to efficiently store hundreds of millions and even billions of matches (you can easily reach these numbers when matching tens of millions of short reads to the Human genome). This is actually one reason why I've been hesitant to abandon them. Replacing them with Hits objects (with metadata columns) is going to significantly increase memory footprint when there are hundreds of millions of matches or more.

H.

Hi Herve,

hmmm...I looked at the size of the list of MIndex objects returned and got this:

> object.size(m[[1]])
9458832 bytes

In total the list contains 10 of these MIndex objects, so the list of MIndex objects has a total memory footprint of 90 Mb! The corresponding list-of-list structure returned by the AhoCorasickTri package (https://github.com/chambm/AhoCorasickTrie) of Matt is:

> object.size(l)
10344 bytes

I used a toy example (10 peptide sequences as query that are found in 15 proteins). So yes, I am surprised ;-). Could it be that MIndex is not really memory efficient when the number of reference sequence is big (my proteome contains about 50 k sequences). Turned into a tidy data frame it looks like:

> df
pep
3         AASQPGELKDWFVGR
8             AAKLAPEFAKR
9             AAKLAPEFAKR
10            AAKLAPEFAKR
11            AAKLAPEFAKR
12              AACYTKLLE
13 ACIPQSFTVDSSKAGLAPLEVR
14   AAMQQDPVASSHKIDAKTYR
prot
1                             sp|P01027|CO3_MOUSE Complement C3 OS=Mus musculus GN=C3 PE=1 SV=3
2                sp|P62242|RS8_MOUSE 40S ribosomal protein S8 OS=Mus musculus GN=Rps8 PE=1 SV=2
3                      sp|Q00493|CBPE_MOUSE Carboxypeptidase E OS=Mus musculus GN=Cpe PE=1 SV=2
7  sp|P68037|UB2L3_MOUSE Ubiquitin-conjugating enzyme E2 L3 OS=Mus musculus GN=Ube2l3 PE=1 SV=1
8                      sp|O08709|PRDX6_MOUSE Peroxiredoxin-6 OS=Mus musculus GN=Prdx6 PE=1 SV=3
9                     tr|Q6GT24|Q6GT24_MOUSE Peroxiredoxin 6 OS=Mus musculus GN=Prdx6 PE=1 SV=1
10                          tr|Q8BG37|Q8BG37_MOUSE MCG48959 OS=Mus musculus GN=Prdx6b PE=1 SV=1
11                    tr|D3Z0Y2|D3Z0Y2_MOUSE Peroxiredoxin-6 OS=Mus musculus GN=Prdx6 PE=1 SV=1
12     sp|Q60864|STIP1_MOUSE Stress-induced-phosphoprotein 1 OS=Mus musculus GN=Stip1 PE=1 SV=1
13                             sp|Q80X90|FLNB_MOUSE Filamin-B OS=Mus musculus GN=Flnb PE=1 SV=3
14   sp|O09131|GSTO1_MOUSE Glutathione S-transferase omega-1 OS=Mus musculus GN=Gsto1 PE=1 SV=2
15                  sp|P62897|CYC_MOUSE Cytochrome c, somatic OS=Mus musculus GN=Cycs PE=1 SV=2
offset
1     657
2     185
3     165
4     150
5     150
6     135
7     123
8      54
9      54
10     54
11     30
12    401
13   1437
14    203
15     93

Tobi

Hi Tobi,

I can't really give a detailed explanation because I don't have enough information about how you obtained m. It seems that you used vmatchPattern() in an lapply() loop so you ended up with a list of 1 MIndex object per peptide sequence.

The memory footprint of an MIndex object is:

    some_overhead + nb_of_matches x 4 bytes

If the object was returned by vmatchPattern() and the vector of reference sequences has no names, then the overhead is proportional to the nb of reference sequences (roughly 8 bytes per reference sequence). However, this overhead is much bigger if the vector of reference sequences has names because the names are propagated to the MIndex object. So yes the overhead is important in your case because you have 50k reference sequences, and possibly 50k names. And the same overhead is repeated 10 times! Also it seems that you didn't get many matches so the total memory footprint consists almost entirely of this overhead.

As I said MIndex objects were designed to efficiently store hundreds of millions and even billions of matches in a memory efficient way. When an MIndex object contains so many matches, the overhead portion of the total memory footprint becomes negligible. What becomes important/relevant is that only 4 bytes are needed to store a match.

Hope this helps,

H.