How about something like this? First we'll set up an example with 3 DNA sequences, and we're going to look for the pattern TACG in each of them, just like you've already done.
dna_initial <- DNAStringSet(c("ACGTACGGACGT",
hits = vmatchPattern("TACG", dna_initial, max.mismatch=0, fixed=FALSE)
hits.frame = as.data.frame(hits)
The pattern appears once in the first string, twice in the second, and not at all in the third. No idea if that's possible in your data, but it's nice to get a solution that applies to a general case.
group group_name start end width
1 1 <NA> 4 7 4
2 2 <NA> 2 5 4
3 2 <NA> 6 9 4
Now we'll use
mapply() to call the
subseq() function on three vectors (the set of strings to be subset, the start positions and the ends). The other things to note is that we use
hits.frame$group to select the strings we're going to process. This has the effect of dropping any strings that weren't found, and replicating those that have our substring multiple times.
dna_list <- mapply(subseq, dna_initial[hits.frame$group], hits.frame$start, hits.frame$end)
The result from this is a list of DNAStrings, which we can pass to
DNAStringSet() to get back a single set.
dna_regions <- DNAStringSet(dna_list)
A DNAStringSet instance of length 3
 4 TACG
 4 TACG
 4 TACG
The result in our case isn't very interesting, but it's found the three instances of the sequence we're looking for and extracted them from the original data.