Question

homopolymer gap extension penalty in Biostrings

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 6 days ago

Seattle, WA, United States

Hi Noah, On 03/02/2012 12:16 PM, Noah Hoffman wrote: > Hello Herve, > > It has been a while... I hope you're doing well. I have a question about the implementation of needleman-wunsch in pairwiseAlignment. We have some ion torrent sequencing data. Like the 454, (except worse, apparently) it likes to insert runs of homopolymers. The alignments with the primers, etc would be improved considerably if we could specify a lower penalty for a gap extension when there is a homopolymeric insertion in the opposite strand (as opposed to an insertion that is not a homopolymer). That is, two different gap extension penalties. Is this possible in pairwiseAlignment? I guess that would be technically possible even though certainly not straightforward. The Needleman-Wunsch / Smith-Waterman algos are complicated algorithms and the current implementation in Biostrings is from Patrick Aboyoun. Unfortunately I'm not familiar enough with it to be able to tell exactly how hard it would be to add something like the "homopolymeric insertion penalty" feature you propose but my gut feeling is that it would probably require some significant amount of work. Just to clarify, why are you saying "in the opposite strand" when you say: specify a lower penalty for a gap extension when there is a homopolymeric insertion in the opposite strand Could you provide an example? One workaround you could explore is to use a 2-pass algo (UNTESTED, and I'm not sure it would do exactly what you want): 1st pass: Use pairwiseAlignment() with a gap penalty as low as the penalty you'd like to put on homopolymeric insertions. You said you were using Needleman-Wunsch so I'm assuming that you are doing "global-global" alignment so you probably don't need to work around one limitation of pairwiseAlignment() when doing "global-local" which is that it only returns the first best match. Let's assume you managed to obtain 1 alignment for each read with a score. 2nd pass: Look at each alignment and in particular at the insertions they contain. Adjust the score of the alignment based on the type of insertions it has (i.e. homopolymeric or not). Patrick implemented tools for dissecting the alignments returned by pairwiseAlignment(): they are described in the man page for PairwiseAlignedXStringSet objects (?PairwiseAlignedXStringSet). > > Another thing that would be useful would be per-strand gap creation and gap extension penalties. Can you please elaborate on this? Depending on what you are trying to do exactly, there might be an easy way to go or not. Please provide an example. > > Any ideas? Fred Ross (who you may not have met yet) is working on this and would like to hear your answer as well. I'm cc'ing the Bioconductor mailing list so other people can give some advice. Maybe someone knows a tool in another BioC package or out of Bioconductor that already provide those features? Cheers, H. > > Thanks a lot, > Noah -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Sequencing Alignment GO Cancer Sequencing Alignment GO Cancer • 1.9k views

ADD COMMENT • link updated 14.0 years ago by Noah Hoffman ▴ 10 • written 14.0 years ago by Hervé Pagès 16k

score 0 · Answer 1 · 2012-03-06

Hi Frederick, Let's please use the "Reply All" button so the discussion stays on the mailing list. On 03/06/2012 05:12 PM, Frederick Ross wrote: > On 3/6/12 4:56 PM, Hervé Pagès wrote: >> Just to clarify, why are you saying "in the opposite strand" when you >> say: >> >> specify a lower penalty for a gap extension when there is a >> homopolymeric insertion in the opposite strand >> >> Could you provide an example? > > Perhaps I can provide some context: the basic unit of pyrosequencing is > a homopolymer. I'm working with some preliminary data from Ion Torrent's > machine, and the primary (and I mean by an order of magnitude) error > rate is miscounting the length of homopolymers, often by large fractions > (measuring 10 bases in place of 2, for instance). > > If you set the gap penalties low enough to handle this, you break > alignment entirely since every mismatch and real gap goes all over the > place. If you set the gap penalties high enough, the aligner breaks again. > > I think the real solution is to work in homopolymers. I've been > homoencoding my sequencings, so AATCCGTTTA becomes A2 T1 C2 G1 T3 A1. In > this representation, I can write down an appropriate scoring function > that accounts for both base difference and length difference in > homopolymer, i.e., > > s(base1,length1,base2,length2) = > v*abs(length1-length2) > + if length1==length2 then 0 else a > + if base1==base2 then -k else k > > However, I don't know of a general enough implementation of > Smith-Waterman that I can hand a generalized function like this, along > with a generalized alphabet. I see. Those extra details are useful. Using a Run Length Encoding representation of the reads is an interesting idea. FWIW, the IRanges package implements the Rle class. Any regular atomic vector can be turned into an Rle object. For example: > library(IRanges) > rledna <- Rle(strsplit("AATCCGTTTA", "")[[1]]) > rledna 'character' Rle of length 10 with 6 runs Lengths: 2 1 2 1 3 1 Values : "A" "T" "C" "G" "T" "A" > runLength(rledna) [1] 2 1 2 1 3 1 > runValue(rledna) [1] "A" "T" "C" "G" "T" "A" Instead of trying to align seq1 to seq2, would it make sense to try to align runvals1 to runvals2 instead? Where runvals1 to runvals2 are defined as: rledna1 <- Rle(strsplit(seq1, "")[[1]]) runvals1 <- paste(runValue(rledna1), collapse="") rledna2 <- Rle(strsplit(seq2, "")[[1]]) runvals2 <- paste(runValue(rledna2), collapse="") Once you've aligned runvals1 to runvals2 you can infer the corresponding alignment between seq1 and seq2 and infer its score (by now looking at the lengths of the runs). It would be kind of like doing Needleman-Wunsch or Smith-Waterman but where the runs are the unbreakable units, not the individual letters, i.e. insertions, deletions and replacements are allowed for entire runs. Anyway, I'm still not sure why "in the opposite strand" in specify a lower penalty for a gap extension when there is a homopolymeric insertion *in the opposite strand* Also, I'm still not clear about this: Another thing that would be useful would be per-strand gap creation and gap extension penalties. Thanks, H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

score 0 · Answer 2 · 2012-03-06

Hello Herve, Thanks a lot for your suggestions - we are indeed doing exactly what you suggested (as Fred noted) with respect to run length encoding - and it's nice to see that this is so conveniently done using IRanges. > Also, I'm still not clear about this: >> >> Another thing that would be useful would be per-strand gap creation >> and gap extension penalties. What I meant here (and didn't express very clearly) is the idea that we may believe that one sequence in a pairwise alignment (sequence 1, for example a pyrosequencing read) is more likely to contain insertions than the other (sequence 2, eg a known primer sequence). In this case it would be useful to specify a different penalties for gap opening or extension in each sequence. Thanks a lot, Noah