Question: The way to window over sequences of a DNAStringSet
0
3.1 years ago by
ben.ward30
United Kingdom
ben.ward30 wrote:

Hi, if I have a DNAStringSet object, say 5 sequences, and I want to do a sliding window across those sequences and calculate consensusMatrix on only a some sites of the entire sequence, what is the best way to go about this? I think that for the selection of the sites I can use a mask, and I think I can also use a row mask to mask which sequences are not included, but how about the sliding windows? Do I have to use Ranges? and if so I or G ranges? Or do I have to use Views? I want to code a trial solution to this but I'm unsure which approach is the correct one to start with.

Thanks,

Ben.

mask views biostrings window • 626 views
modified 3.1 years ago by Hervé Pagès ♦♦ 13k • written 3.1 years ago by ben.ward30
Answer: The way to window over sequences of a DNAStringSet
0
3.1 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Ben,

If you want to calculate consensusMatrix on only some sites of the entire sequences, why do you need a sliding window? Also please clarify what would be the input of consensusMatrix() at each site? Would it be 5 subsequences, one per sequence in the original DNAStringSet object? Would these subsequences be taken from the same window (e.g. position 50-70) on each original sequence? That suggests that the original sequences have the same length. Is that the case?

So for example to calculate consensusMatrix for a window of width 20 at positions 50, 75, 90, 108:

library(Biostrings)
file <- system.file("extdata", "someORF.fa", package="Biostrings")
orf <- readDNAStringSet(file)  # DNAStringSet object with 7 sequences
my_pos <- c(50, 75, 90, 108)
lapply(my_pos, function(pos) consensusMatrix(subseq(orf, pos, width=20)))

This returns a list of consensus matrices, one per position.

This won't be efficient if you have tens of thousands of positions. But before we worry about performance, let's make sure we understand what you are trying to do exactly. Showing us some naive code might help.

Cheers,

H.

Hi Herve, The sequences are aligned with gap characters, and so are the same length, and so the sub-sequences are taken from the same window on each original sequence. I use consensusMatrix to then calculate a few statistics like number of polymorphisms, allele frequencies, number of states per site, and such. My naive code is not much different to yours in that I do many sub-sequences and the consensusMatrix on those sub-sequences, but it felt naive, and I suspected there was a better way using masking or ranges or something, but I don't know enough of the functionality available to know what exactly.

Hi Ben,

I'm not sure there is much I can offer. If your naive code does the job and if performance is not an issue then it's probably good enough. I don't think you need to use sliding windows or Views or masks for that. Note that Views and masks can only be defined on a single sequence (DNAString object) but not on a DNAStringSet object, so that is a non-starter.

If performance is an issue, there might be ways to work around it but we would need to know more about what you do after the consensusMatrix step. Computing and returning a list of consensus matrices is expensive. However if we know what you do after that we might be able to suggest efficient ways to go straight to it (i.e. without having to generate the list of consensus matrices).

Cheers,

H.