Is there an efficient way to search a subject with a query so that it finds matches where the pattern is entirely contained in the subject, or the subject is entirely contained in the pattern? The strings in the subject may be shorter or longer than the query string.
OK, you're apparently satisfied with your own solution. But just for the record and in case anybody else is looking for a bi-directional vcountPattern():
Your solution fails on your own example (pattern="ABCD", subject="ABCDEF") with the following error:
Error in colSums(sapply(pattern, function(p) vcountPattern(p, subject))) :
'x' must be an array of at least two dimensions
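For anybody wondering where that error comes from: it can be reproduced in base R without Biostrings. The sketch below assumes a hypothetical helper, count_fixed(), standing in for vcountPattern() (it counts non-overlapping exact matches, which is not identical to what vcountPattern() does). With a single pattern and a single subject, sapply() simplifies its result to a plain vector, and colSums() rejects anything with fewer than two dimensions. Building the matrix explicitly is one possible way around it.

```r
# Hypothetical stand-in for vcountPattern(): counts non-overlapping, exact
# occurrences of a single pattern `p` in each element of `s` (base R only).
count_fixed <- function(p, s) {
  hits <- gregexpr(p, s, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0L), integer(1))
}

pattern <- "ABCD"
subject <- "ABCDEF"

# One pattern and one subject: sapply() simplifies to a plain vector.
counts <- sapply(pattern, function(p) count_fixed(p, subject))
is.matrix(counts)  # FALSE

# colSums() needs at least two dimensions, so this call fails.
err <- tryCatch(colSums(counts), error = function(e) conditionMessage(e))

# Building the matrix explicitly keeps one row per subject and one column
# per pattern, whatever the input lengths are.
counts_mat <- matrix(
  unlist(lapply(pattern, count_fixed, s = subject)),
  nrow = length(subject),
  dimnames = list(NULL, pattern)
)
per_pattern <- colSums(counts_mat)
```

Here err holds the same message as the error quoted above, and per_pattern is a named vector with one entry per pattern.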
When it does not fail, the count is wrong, e.g. with pattern=c("a", "bc") and subject=c("c", "b", "c") it returns:
[1] 0 3
It's very inefficient. You said you wanted "an efficient way". Not that avoiding the loop like I did with vcountPattern2() was rocket science, but if you allow yourself to loop, then it's a one-liner:
The only problem with this is that it's 1000x slower than vcountPattern2() when subject contains tens or hundreds of thousands of sequences. However, unlike vcountPattern2(), the mapply-based solution supports multiple patterns (my vcountPattern2() function did not because your original post didn't suggest that you needed that feature).
The question wasn't clearly written. You are comparing the patterns and subjects in parallel. I was thinking of each pattern against all of the subjects, in which case the counting gives the desired numbers. The mapply-based solution shouldn't have ... provided to vcountPattern.
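To make the two readings concrete, here is a rough base R sketch. count_fixed() and count_both() are hypothetical helpers, not Biostrings functions, and count_fixed() only counts non-overlapping exact matches, so treat it as an illustration of the semantics rather than a replacement for vcountPattern(). The first part compares each pattern against all of the subjects and sums both directions; the second part pairs patterns and subjects in parallel with mapply().

```r
# Hypothetical stand-in for vcountPattern(): counts non-overlapping, exact
# occurrences of a single pattern `p` in each element of `s` (base R only).
count_fixed <- function(p, s) {
  hits <- gregexpr(p, s, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0L), integer(1))
}

# Bi-directional count for ONE pattern and ONE subject:
# occurrences of p in s plus occurrences of s in p.
count_both <- function(p, s) count_fixed(p, s) + count_fixed(s, p)

subject <- c("c", "b", "c")

# Reading 1: each pattern against all of the subjects, summed per pattern.
pattern <- c("a", "bc")
all_pairs <- vapply(pattern, function(p) {
  sum(vapply(subject, function(s) count_both(p, s), integer(1)))
}, integer(1))
unname(all_pairs)  # 0 3

# Reading 2: patterns and subjects paired in parallel (same length here,
# so mapply() does not have to recycle anything).
pattern_par <- c("ca", "bc", "a")
parallel <- mapply(count_both, pattern_par, subject, USE.NAMES = FALSE)
parallel  # 1 1 0
```

Note that count_both() counts a pair twice when pattern and subject are identical, since each then contains the other. Whether that is wanted depends on the use case.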