How about something like this? First we'll set up an example with 3 DNA sequences, and we're going to look for the pattern TACG in each of them, just like you've already done.
dna_initial <- DNAStringSet(c("ACGTACGGACGT",
"GTACGTACGGAC",
"AAAAGGGAAAAG"))
hits = vmatchPattern("TACG", dna_initial, max.mismatch=0, fixed=FALSE)
hits.frame = as.data.frame(hits)
The pattern appears once in the first string, twice in the second, and not at all in the third. No idea if that's possible in your data, but it's nice to get a solution that applies to a general case.
> hits.frame
group group_name start end width
1 1 <NA> 4 7 4
2 2 <NA> 2 5 4
3 2 <NA> 6 9 4
Now we'll use mapply()
to call the subseq()
function on three vectors (the set of strings to be subset, the start positions and the ends). The other things to note is that we use hits.frame$group
to select the strings we're going to process. This has the effect of dropping any strings that weren't found, and replicating those that have our substring multiple times.
dna_list <- mapply(subseq, dna_initial[hits.frame$group], hits.frame$start, hits.frame$end)
The result from this is a list of DNAStrings, which we can pass to DNAStringSet()
to get back a single set.
dna_regions <- DNAStringSet(dna_list)
> dna_regions
A DNAStringSet instance of length 3
width seq
[1] 4 TACG
[2] 4 TACG
[3] 4 TACG
The result in our case isn't very interesting, but it's found the three instances of the sequence we're looking for and extracted them from the original data.
Hi H,
The difference with Mikes' below is that not every sequence contains the random region in my initial DNAStringSet. Yours will contain those rows without a hit, which will be useful if need mapping back to the original set
Thank you.
XIA