Question

how to extract part of the DNA sequences from a DNAstringset

1

Entering edit mode

XIA.PAN ▴ 20

@xiapan-12407

Last seen 5.1 years ago

Dear all,

I have a large DNAstringset from amplicons and only need the random region in each sequence from this set.

I got the position (or index) of the part that I needed use vmattchpattern function.

hits=vmatchPattern("AAANNNNNNNNAAA",sread,max.mismatch=1,fixed=FALSE)

hits.frame=as.data.frame(hits)

The start and end position is a bit different from sequence to sequence. how can I apply these index (start, end) back to the DNAstringset and get the random region as a new DNAstringset.

Thank you

XIA

DNAstringset • 6.1k views

ADD COMMENT • link updated 7.8 years ago by Hervé Pagès 16k • written 7.8 years ago by XIA.PAN ▴ 20

score 3 · Answer 1 · 2017-02-22

Hi XIA,

You can use extractAt to extract the sequences corresponding to the hits:

extractAt(sread, hits)

See ?extractAt for more information.

If you have off-limits hits, you can either get rid of them with:

hits <- as(hits, "CompressedIRangesList")
hits <- hits[start(hits) >= 1 & end(hits) <= width(sread)]

or trim them with:

hits <- as(hits, "CompressedIRangesList")
hits <- restrict(hits, 1, width(sread))

In both cases the reason I first turned hits into a CompressedIRangesList object is because CompressedIRangesList objects support the kind of operations that we need to perform (i.e. subsetting with [ in the first case, and restrict() in the 2nd case). Unfortunately, MIndex objects don't support these operations.

Hope this helps,

H.

score 2 · Answer 2 · 2017-02-22

How about something like this? First we'll set up an example with 3 DNA sequences, and we're going to look for the pattern TACG in each of them, just like you've already done.

dna_initial <- DNAStringSet(c("ACGTACGGACGT",
                              "GTACGTACGGAC",
                              "AAAAGGGAAAAG"))

hits = vmatchPattern("TACG", dna_initial, max.mismatch=0, fixed=FALSE)
hits.frame = as.data.frame(hits)

The pattern appears once in the first string, twice in the second, and not at all in the third. No idea if that's possible in your data, but it's nice to get a solution that applies to a general case.

> hits.frame
  group group_name start end width
1     1       <NA>     4   7     4
2     2       <NA>     2   5     4
3     2       <NA>     6   9     4

Now we'll use mapply() to call the subseq() function on three vectors (the set of strings to be subset, the start positions and the ends). The other things to note is that we use hits.frame$group to select the strings we're going to process. This has the effect of dropping any strings that weren't found, and replicating those that have our substring multiple times.

dna_list <- mapply(subseq, dna_initial[hits.frame$group], hits.frame$start, hits.frame$end)

The result from this is a list of DNAStrings, which we can pass to DNAStringSet() to get back a single set.

dna_regions <- DNAStringSet(dna_list)

> dna_regions
  A DNAStringSet instance of length 3
    width seq
[1]     4 TACG
[2]     4 TACG
[3]     4 TACG

The result in our case isn't very interesting, but it's found the three instances of the sequence we're looking for and extracted them from the original data.