Hi, if I have a DNAStringSet object of, say, 5 sequences, and I want to slide a window across those sequences and calculate `consensusMatrix` on only some sites of the entire sequence, what is the best way to go about this? I think I can use a mask to select the sites, and I think I can also use a row mask to exclude some sequences, but how about the sliding windows? Do I have to use Ranges, and if so, IRanges or GRanges? Or do I have to use Views? I want to code a trial solution, but I'm unsure which approach is the right one to start with.
Thanks,
Ben.
Hi Herve, The sequences are aligned with gap characters, so they are all the same length, and the sub-sequences are taken from the same window on each original sequence. I then use consensusMatrix to calculate a few statistics, such as the number of polymorphisms, allele frequencies, and the number of states per site. My naive code is not much different from yours: I extract many sub-sequences and call consensusMatrix on each. It felt naive, though, and I suspected there was a better way using masking or ranges, but I don't know enough of the available functionality to know what exactly.
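For reference, a minimal sketch of the naive approach described above, assuming the Biostrings package is installed. The toy alignment and the names `aln`, `win_width`, `win_step`, `cm_list` and `n_poly` are made up for illustration and are not the actual code from this thread.

```r
library(Biostrings)

# Toy gapped alignment: 5 sequences of equal width (made-up data).
aln <- DNAStringSet(c(
  "ACGT-ACGTA",
  "ACGTTACGTA",
  "ACCT-ACGTA",
  "ACGT-ACTTA",
  "ACGTTACGTA"
))

win_width <- 4L  # hypothetical window size
win_step  <- 2L  # hypothetical step between window starts
aln_width <- width(aln)[1]
starts <- seq(1L, aln_width - win_width + 1L, by = win_step)

# One consensus matrix per window, built from the same columns of every sequence.
cm_list <- lapply(starts, function(s) {
  consensusMatrix(subseq(aln, start = s, width = win_width))
})

# Example statistic: polymorphic sites per window, i.e. columns in which more
# than one symbol has a non-zero count. Note that the gap "-" counts as a symbol here.
n_poly <- vapply(cm_list, function(cm) sum(colSums(cm > 0) > 1), integer(1))
```

Excluding sequences can be done by subsetting before the loop, e.g. `aln[keep]` with a logical or integer vector `keep` (a hypothetical name).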
Hi Ben,
I'm not sure there is much I can offer. If your naive code does the job and if performance is not an issue then it's probably good enough. I don't think you need to use sliding windows or Views or masks for that. Note that Views and masks can only be defined on a single sequence (DNAString object) but not on a DNAStringSet object, so that is a non-starter.
If performance is an issue, there might be ways to work around it, but we would need to know more about what you do after the `consensusMatrix` step. Computing and returning a list of consensus matrices is expensive. However, if we know what you do after that, we might be able to suggest efficient ways to go straight to it (i.e. without having to generate the list of consensus matrices).
Cheers,
H.
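One possible way to skip the list of consensus matrices, sketched under assumptions: because the windows all come from the same alignment columns, `consensusMatrix` can be called once on the whole alignment and per-window statistics derived from per-site values. This is only an illustration, not necessarily the shortcut the reply above had in mind; the toy data and the names `aln`, `win_width`, `win_step` and `n_poly` are hypothetical.

```r
library(Biostrings)

# Toy gapped alignment: 5 sequences of equal width (made-up data).
aln <- DNAStringSet(c(
  "ACGT-ACGTA",
  "ACGTTACGTA",
  "ACCT-ACGTA",
  "ACGT-ACTTA",
  "ACGTTACGTA"
))

# One consensus matrix for the whole alignment: rows are symbols, columns are sites.
cm <- consensusMatrix(aln)

# Per-site statistics computed once.
states_per_site <- colSums(cm > 0)      # distinct symbols per site (gaps included)
is_polymorphic  <- states_per_site > 1

win_width <- 4L  # hypothetical window size
win_step  <- 2L  # hypothetical step between window starts
starts <- seq(1L, ncol(cm) - win_width + 1L, by = win_step)

# Window totals from a cumulative sum: the count for a window starting at s
# is cs[s + win_width] - cs[s], where cs has a leading zero.
cs <- c(0L, cumsum(is_polymorphic))
n_poly <- cs[starts + win_width] - cs[starts]
```

Restricting the analysis to selected sites then amounts to subsetting columns (for example `cm[, sites]`, with `sites` a hypothetical index vector), and allele frequencies per site can be read from the same matrix by dividing each column by its sum.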