Question

The way to window over sequences of a DNAStringSet

0

Entering edit mode

ben.ward ▴ 30

@benward-7169

Last seen 8.1 years ago

United Kingdom

Hi, if I have a DNAStringSet object, say 5 sequences, and I want to do a sliding window across those sequences and calculate `consensusMatrix` on only a some sites of the entire sequence, what is the best way to go about this? I think that for the selection of the sites I can use a mask, and I think I can also use a row mask to mask which sequences are not included, but how about the sliding windows? Do I have to use Ranges? and if so I or G ranges? Or do I have to use Views? I want to code a trial solution to this but I'm unsure which approach is the correct one to start with.

Thanks,

Ben.

biostrings window mask views • 1.9k views

ADD COMMENT • link updated 8.4 years ago by Hervé Pagès 16k • written 8.4 years ago by ben.ward ▴ 30

score 0 · Answer 1 · 2015-12-12

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Ben,

If you want to calculate consensusMatrix on only some sites of the entire sequences, why do you need a sliding window? Also please clarify what would be the input of consensusMatrix() at each site? Would it be 5 subsequences, one per sequence in the original DNAStringSet object? Would these subsequences be taken from the same window (e.g. position 50-70) on each original sequence? That suggests that the original sequences have the same length. Is that the case?

So for example to calculate consensusMatrix for a window of width 20 at positions 50, 75, 90, 108:

library(Biostrings)
file <- system.file("extdata", "someORF.fa", package="Biostrings")
orf <- readDNAStringSet(file)  # DNAStringSet object with 7 sequences
my_pos <- c(50, 75, 90, 108)
lapply(my_pos, function(pos) consensusMatrix(subseq(orf, pos, width=20)))

This returns a list of consensus matrices, one per position.

This won't be efficient if you have tens of thousands of positions. But before we worry about performance, let's make sure we understand what you are trying to do exactly. Showing us some naive code might help.

Cheers,

H.

ADD COMMENT • link 8.4 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, The sequences are aligned with gap characters, and so are the same length, and so the sub-sequences are taken from the same window on each original sequence. I use consensusMatrix to then calculate a few statistics like number of polymorphisms, allele frequencies, number of states per site, and such. My naive code is not much different to yours in that I do many sub-sequences and the consensusMatrix on those sub-sequences, but it felt naive, and I suspected there was a better way using masking or ranges or something, but I don't know enough of the functionality available to know what exactly.

ADD REPLY • link 8.4 years ago ben.ward ▴ 30

0

Entering edit mode

Hi Ben,

I'm not sure there is much I can offer. If your naive code does the job and if performance is not an issue then it's probably good enough. I don't think you need to use sliding windows or Views or masks for that. Note that Views and masks can only be defined on a single sequence (DNAString object) but not on a DNAStringSet object, so that is a non-starter.

If performance is an issue, there might be ways to work around it but we would need to know more about what you do after the consensusMatrix step. Computing and returning a list of consensus matrices is expensive. However if we know what you do after that we might be able to suggest efficient ways to go straight to it (i.e. without having to generate the list of consensus matrices).

Cheers,

H.

ADD REPLY • link 8.4 years ago Hervé Pagès 16k