vcountPattern for Pattern Contained within or Containing Subject
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Australia

Is there an efficient way to search a subject with a query so that a match is counted either when the pattern is entirely contained in the subject or when the subject is entirely contained in the pattern? The strings in the subject may be shorter or longer than the query string.

> vcountPattern("ABCD", "ABCDEF")
[1] 1
> vcountPattern("ABCDEF", "ABCD")
[1] 0

I'd like both of these to return 1.

@herve-pages-1542
Seattle, WA, United States

Hi,

I would do something like this:

## A bi-directional vcountPattern().
vcountPattern2 <- function(pattern, subject, ...)
{
    if (!is(subject, "XStringSet"))
        subject <- as(subject, "XStringSet")
    ans <- integer(length(subject))
    swap <- width(pattern) > width(subject)
    swap_idx <- which(swap)
    noswap_idx <- which(!swap)
    if (length(noswap_idx) != 0L) {
        ans[noswap_idx] <- vcountPattern(pattern,
                                         subject[noswap_idx],
                                         ...)
    }
    if (length(swap_idx) != 0L) {
        pattern2 <- subject[swap_idx]
        subject2 <- as(pattern, "XString")
        ans[swap_idx] <- countPDict(pattern2, subject2, ...)
    }
    ans
}

Then:

vcountPattern2("ABcdAB", c("ABcdABef", "DD", "AB"))
# [1] 1 0 2

Cheers,

H.

Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Australia

I have developed a vectorised version which is shorter and simpler.

    queries <- c("ABCD", "ABCDEFG", "BCDE")
    subjects <- c("ABCDEF", "CDE")

    queriesInSubjects <- colSums(sapply(queries, function(query) vcountPattern(query, subjects)))
    queriesContainSubjects <- rowSums(sapply(subjects, function(subject) sapply(vmatchPattern(subject, queries), length)))

    rowSums(matrix(c(queriesInSubjects, queriesContainSubjects), ncol = 2))

OK, you're apparently satisfied with your own solution. But just for the record and in case anybody else is looking for a bi-directional vcountPattern():

  • Your solution fails on your own example (pattern="ABCD", subject="ABCDEF") with the following error:
      Error in colSums(sapply(pattern, function(p) vcountPattern(p, subject))) : 
        'x' must be an array of at least two dimensions
  • When it does not fail, the count is wrong, e.g. with pattern=c("a", "bc") and subject=c("c", "b", "c") it returns:
      [1] 0 3
  • It's very inefficient. You said you wanted "an efficient way". Not that avoiding the loop like I did with vcountPattern2() was rocket science, but if you allow yourself to loop, then it's a one-liner:
      mapply(function(p, s) max(vcountPattern(p, s, ...),
                                vcountPattern(s, p, ...)),
             pattern, subject)

The only problem with this is that it's 1000x slower than vcountPattern2() when subject contains tens or hundreds of thousands of sequences. However, unlike vcountPattern2(), the mapply-based solution supports multiple patterns (my vcountPattern2() function did not because your original post didn't suggest that you needed that feature).
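
For anyone who wants to try the loop-based variant directly, here is a minimal sketch that wraps it in a function so that `...` is actually defined. The name vcountPatternPairwise is made up for illustration and is not part of Biostrings. It assumes pattern and subject are character vectors of the same length that are compared element by element, and it relies on vcountPattern() returning 0 when the pattern is longer than the subject, as in the original post.

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings): for each (pattern, subject)
# pair, count matches in both directions and keep the larger count.
vcountPatternPairwise <- function(pattern, subject, ...) {
    mapply(function(p, s) {
               # With exact matching, the direction in which the "pattern"
               # is the longer string contributes 0.
               max(vcountPattern(p, s, ...), vcountPattern(s, p, ...))
           },
           pattern, subject, USE.NAMES = FALSE)
}

# The two pairs from the original question.
counts <- vcountPatternPairwise(c("ABCD", "ABCDEF"), c("ABCDEF", "ABCD"))
counts
```

Both pairs from the original question should give a count of 1 here. Like the one-liner, this loops over the pairs in R, so it is only suitable for small inputs.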

H.


The question wasn't clearly written. You are comparing the patterns and subjects in parallel. I was thinking of each pattern against all of the subjects, in which case the counting gives the desired numbers. The mapply-based solution shouldn't have ... provided to vcountPattern.
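
To make the difference between the two readings concrete, here is a rough sketch of the "each pattern against all of the subjects" interpretation, using the example data from my earlier post. The helper name pairCount is hypothetical and only for illustration. It is a plain double loop, so it is not meant to be efficient.

```r
library(Biostrings)

queries  <- c("ABCD", "ABCDEFG", "BCDE")
subjects <- c("ABCDEF", "CDE")

# Hypothetical helper: bi-directional count for a single query/subject pair.
pairCount <- function(q, s) max(vcountPattern(q, s), vcountPattern(s, q))

# Matrix of counts: one row per query, one column per subject.
allPairs <- sapply(subjects, function(s) sapply(queries, pairCount, s = s))

# Total per query over all subjects.
totals <- rowSums(allPairs)
totals
```

For this data the per-query totals come out as 1, 2 and 2. In the parallel reading, by contrast, the i-th pattern is only ever compared with the i-th subject.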
