Question

vcountPattern for Pattern Contained within or Containing Subject

Dario Strbenac ★ 1.5k

Is there an efficient way to search a subject with a query so that a match is counted in either direction: when the pattern is entirely contained in a subject string, or when a subject string is entirely contained in the pattern? The strings in the subject may be shorter or longer than the query string.

    > vcountPattern("ABCD", "ABCDEF")
    [1] 1
    > vcountPattern("ABCDEF", "ABCD")
    [1] 0

I'd like both of these to return 1.
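
To make the requested behaviour concrete, here is a minimal base R sketch of the containment test in both directions. It is an illustration only: it does not use Biostrings, it reports presence or absence rather than counts, and the name `contains_either_way` is made up for this example.

```r
# Hypothetical helper (base R only, not part of Biostrings):
# TRUE where the pattern occurs in a subject string, or where
# the subject string occurs in the pattern.
contains_either_way <- function(pattern, subjects) {
    stopifnot(length(pattern) == 1L)
    # Direction 1: the pattern is contained in each subject.
    pattern_in_subject <- grepl(pattern, subjects, fixed = TRUE)
    # Direction 2: each subject is contained in the pattern.
    subject_in_pattern <- vapply(
        subjects,
        function(s) grepl(s, pattern, fixed = TRUE),
        logical(1),
        USE.NAMES = FALSE
    )
    pattern_in_subject | subject_in_pattern
}

contains_either_way("ABCD", c("ABCDEF", "AB", "XYZ"))
# [1]  TRUE  TRUE FALSE
```

The second subject matches only because it is contained in the pattern, which is the case that a plain `vcountPattern()` call misses.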

biostrings countpattern • 1.2k views

ADD COMMENT • link 8.5 years ago Dario Strbenac ★ 1.5k

I have developed a vectorised version which is shorter and simpler.

    queries <- c("ABCD", "ABCDEFG", "BCDE")
    subjects <- c("ABCDEF", "CDE")

    queriesInSubjects <- colSums(sapply(queries, function(query) vcountPattern(query, subjects)))
    queriesContainSubjects <- rowSums(sapply(subjects, function(subject) sapply(vmatchPattern(subject, queries), length)))

    rowSums(matrix(c(queriesInSubjects, queriesContainSubjects), ncol = 2))

ADD COMMENT • link 8.5 years ago Dario Strbenac ★ 1.5k

OK, you're apparently satisfied with your own solution. But just for the record and in case anybody else is looking for a bi-directional vcountPattern():

Your solution fails on your own example (pattern="ABCD", subject="ABCDEF") with the following error:

      Error in colSums(sapply(pattern, function(p) vcountPattern(p, subject))) : 
        'x' must be an array of at least two dimensions
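
The error comes from `sapply()` simplifying its result to a plain vector when every call returns a single value, so `colSums()` has no dimensions to work with. The base R sketch below reproduces this and shows one possible workaround using `cbind()`. The function `count_fixed()` is a hypothetical stand-in for `vcountPattern()` so that the example runs without Biostrings; it counts non-overlapping matches, so its counts may differ from `vcountPattern()` when occurrences overlap.

```r
# Hypothetical stand-in for vcountPattern(): number of non-overlapping
# fixed-string matches of one pattern in each subject.
count_fixed <- function(pattern, subjects) {
    vapply(subjects, function(s) {
        hits <- gregexpr(pattern, s, fixed = TRUE)[[1]]
        sum(hits > 0L)  # gregexpr() returns -1 when there is no match
    }, integer(1), USE.NAMES = FALSE)
}

# One pattern, one subject: sapply() simplifies to a vector without dim.
simplified <- sapply("ABCD", function(p) count_fixed(p, "ABCDEF"))
is.null(dim(simplified))
# [1] TRUE

# cbind() always yields a matrix (rows = subjects, columns = patterns),
# so colSums() works for any number of patterns >= 1.
counts <- do.call(cbind, lapply("ABCD", function(p) count_fixed(p, "ABCDEF")))
colSums(counts)
# [1] 1
```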

When it does not fail, the count is wrong, e.g. with pattern=c("a", "bc") and subject=c("c", "b", "c") it returns:

      [1] 0 3

It's very inefficient. You said you wanted "an efficient way". Not that avoiding the loop like I did with vcountPattern2() was rocket science, but if you allow yourself to loop, then it's a one-liner:

      mapply(function(p, s) max(vcountPattern(p, s, ...),
                                vcountPattern(s, p, ...)),
             pattern, subject)

The only problem with this is that it's 1000x slower than vcountPattern2() when subject contains tens or hundreds of thousands of sequences. However, unlike vcountPattern2(), the mapply-based solution supports multiple patterns (my vcountPattern2() function did not because your original post didn't suggest that you needed that feature).
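
For readers who want to try the mapply-based idea directly, here is a minimal sketch wrapped in a function. It assumes Biostrings is installed; the name `vcountPatternPairwise` is made up for this example and is not part of Biostrings. It compares `pattern[i]` with `subject[i]` only (parallel pairs) and takes no extra arguments.

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings): bi-directional count for
# parallel pairs, i.e. pattern[i] is compared with subject[i] only.
vcountPatternPairwise <- function(pattern, subject) {
    mapply(
        # Count in both directions and keep the larger count.
        function(p, s) max(vcountPattern(p, s), vcountPattern(s, p)),
        pattern, subject,
        USE.NAMES = FALSE
    )
}

vcountPatternPairwise(c("ABCD", "ABCDEF"), c("ABCDEF", "ABCD"))
# [1] 1 1
```

Both pairs from the original question now give 1, at the cost of one R-level function call per pair.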

H.

ADD REPLY • link 8.5 years ago Hervé Pagès 16k

The question wasn't clearly written. You are comparing the patterns and subjects in parallel. I was thinking of each pattern against all of the subjects, in which case the counting gives the desired numbers. The mapply-based solution shouldn't have ... provided to vcountPattern.

ADD REPLY • link 8.5 years ago Dario Strbenac ★ 1.5k

score 2 · Accepted Answer · 2015-11-02

Hi,

I would do something like this:

    ## A bi-directional vcountPattern().
    vcountPattern2 <- function(pattern, subject, ...)
    {
        if (!is(subject, "XStringSet"))
            subject <- as(subject, "XStringSet")
        ans <- integer(length(subject))
        ## Swap roles wherever the pattern is longer than the subject.
        swap <- width(pattern) > width(subject)
        swap_idx <- which(swap)
        noswap_idx <- which(!swap)
        ## Usual direction: count the pattern in each subject.
        if (length(noswap_idx) != 0L) {
            ans[noswap_idx] <- vcountPattern(pattern,
                                             subject[noswap_idx],
                                             ...)
        }
        ## Swapped direction: count each short subject in the pattern.
        if (length(swap_idx) != 0L) {
            pattern2 <- subject[swap_idx]
            subject2 <- as(pattern, "XString")
            ans[swap_idx] <- countPDict(pattern2, subject2, ...)
        }
        ans
    }

Then:

    vcountPattern2("ABcdAB", c("ABcdABef", "DD", "AB"))
    # [1] 1 0 2

Cheers,

H.