Question: vcountPattern for Pattern Contained within or Containing Subject
2.0 years ago, Dario Strbenac (Australia) wrote:

Is there an efficient way to search a subject with a query so that a match is reported either when the pattern is entirely contained in the subject or when the subject is entirely contained in the pattern? The strings in the subject may be shorter or longer than the query string.

> vcountPattern("ABCD", "ABCDEF")
[1] 1
> vcountPattern("ABCDEF", "ABCD")
[1] 0

I'd like both of these to return 1.

2.0 years ago, Hervé Pagès (United States) wrote:

Hi,

I would do something like this:

library(Biostrings)

## A bi-directional vcountPattern(). 'pattern' must be a single string.
vcountPattern2 <- function(pattern, subject, ...)
{
    if (!is(subject, "XStringSet"))
        subject <- as(subject, "XStringSet")
    ans <- integer(length(subject))
    ## Swap roles wherever the pattern is longer than the subject element.
    swap <- width(pattern) > width(subject)
    swap_idx <- which(swap)
    noswap_idx <- which(!swap)
    ## Pattern fits in the subject: count it the usual way.
    if (length(noswap_idx) != 0L) {
        ans[noswap_idx] <- vcountPattern(pattern,
                                         subject[noswap_idx],
                                         ...)
    }
    ## Subject is shorter: count each short subject inside the pattern.
    if (length(swap_idx) != 0L) {
        pattern2 <- subject[swap_idx]
        subject2 <- as(pattern, "XString")
        ans[swap_idx] <- countPDict(pattern2, subject2, ...)
    }
    ans
}

Then:

vcountPattern2("ABcdAB", c("ABcdABef", "DD", "AB"))
# [1] 1 0 2

Cheers,

H.
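
For anyone unsure what the swapped branch is doing, here is a minimal sketch of that step in isolation. It assumes, as the function above does, that countPDict() accepts a plain BStringSet as the dictionary. The names short_subjects, long_pattern and swapped_counts are illustrative only and are not part of Biostrings.

```r
library(Biostrings)

## Illustrative inputs: two subjects that are shorter than the pattern.
short_subjects <- BStringSet(c("DD", "AB"))
long_pattern   <- BString("ABcdAB")

## Each short subject is looked up inside the single long pattern,
## i.e. the roles of pattern and subject are swapped.
swapped_counts <- countPDict(short_subjects, long_pattern)
swapped_counts
```

Going by the output shown above for "DD" and "AB", this should give 0 and 2.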

2.0 years ago, Dario Strbenac (Australia) wrote:

I have developed a vectorised version which is shorter and simpler.

    queries <- c("ABCD", "ABCDEFG", "BCDE")
    subjects <- c("ABCDEF", "CDE")

    queriesInSubjects <- colSums(sapply(queries, function(query) vcountPattern(query, subjects)))
    queriesContainSubjects <- rowSums(sapply(subjects, function(subject) sapply(vmatchPattern(subject, queries), length)))

    rowSums(matrix(c(queriesInSubjects, queriesContainSubjects), ncol = 2))

OK, you're apparently satisfied with your own solution. But just for the record and in case anybody else is looking for a bi-directional vcountPattern():

  • Your solution fails on your own example (pattern="ABCD", subject="ABCDEF") with the following error:
      Error in colSums(sapply(pattern, function(p) vcountPattern(p, subject))) : 
        'x' must be an array of at least two dimensions
  • When it does not fail, the count is wrong, e.g. with pattern=c("a", "bc") and subject=c("c", "b", "c") it returns:
      [1] 0 3
  • It's very inefficient. You said you wanted "an efficient way". Not that avoiding the loop like I did with vcountPattern2() was rocket science, but if you allow yourself to loop, then it's a one-liner:
      mapply(function(p, s) max(vcountPattern(p, s, ...),
                                vcountPattern(s, p, ...)),
             pattern, subject)

The only problem with this is that it's 1000x slower than vcountPattern2() when subject contains tens or hundreds of thousands of sequences. However, unlike vcountPattern2(), the mapply-based solution supports multiple patterns (my vcountPattern2() function did not because your original post didn't suggest that you needed that feature).

H.

Comment written 2.0 years ago by Hervé Pagès
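
For readers who want to try the mapply-based one-liner, here is a minimal sketch that wraps it in a function so that ... has somewhere to come from. The wrapper name bidirCountParallel is hypothetical, and the sketch assumes pattern and subject are character vectors of the same length, compared element by element.

```r
library(Biostrings)

## Hypothetical wrapper: pairs pattern[i] with subject[i] and keeps the
## larger of the two directional counts. Extra arguments are forwarded
## to vcountPattern() through '...'.
bidirCountParallel <- function(pattern, subject, ...)
{
    mapply(function(p, s) max(vcountPattern(p, s, ...),
                              vcountPattern(s, p, ...)),
           pattern, subject, USE.NAMES = FALSE)
}

## The two cases from the original question, run in parallel.
parallel_counts <- bidirCountParallel(c("ABCD", "ABCDEF"),
                                      c("ABCDEF", "ABCD"))
parallel_counts
# [1] 1 1
```

This loops over the pairs in R, so the speed caveat above still applies to large inputs.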

The question wasn't clearly written. You are comparing the patterns and subjects in parallel. I was thinking of each pattern against all of the subjects, in which case the counting gives the desired numbers. Also, the mapply-based solution shouldn't pass ... to vcountPattern.

 
Comment written 2.0 years ago by Dario Strbenac
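
The difference between the two readings can be made concrete. Below is a minimal sketch of the "each pattern against all of the subjects" interpretation, reusing the queries and subjects from the vectorised answer. The helper name bidirCount is hypothetical, and taking the larger of the two directional counts is an assumption about what should be reported per pair.

```r
library(Biostrings)

queries  <- c("ABCD", "ABCDEFG", "BCDE")
subjects <- c("ABCDEF", "CDE")

## Hypothetical helper: larger of the two directional counts for one pair.
bidirCount <- function(q, s) max(vcountPattern(q, s), vcountPattern(s, q))

## All query/subject pairs: rows are queries, columns are subjects.
counts <- outer(queries, subjects, Vectorize(bidirCount))
dimnames(counts) <- list(queries, subjects)
counts

## Per-query totals over all subjects.
rowSums(counts)
```

Like the mapply version, this loops in R over every pair, so it is meant to show the semantics rather than to be fast.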