Calculating alignment scores from aligned sequences

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, I thought I'd add to the pairwiseAlignment-related activity on the list today, so here's a bit of navel gazing. I *think* I want to calculate alignment scores for subsets of already aligned sequences and I don't think there's an easy and efficient way to do this without having to realign the subsets of the strings I'm interested in. For instance, say I have a pairwise alignment `pa`, and the result of `compareStrings(pattern(pa), subject(pa))` looks like something like this: "GTA?TTT?A-----TTTCATATC?TGT?TC? ------------------------------------------------------CAT" Given values for the substitutionMatrix, gapOpenening, and gapExtension penalties used during the pairwiseAlignment, one could (in principle), calculate the alignment score of the first half vs. the second half, or (say) running windows of length N along the alignment. Like I said, I *think* this is what I want to do and I was just curious if I'm not missing something in Biostrings that I can wire into ... although it's straight-forward enough to write something myself. In the gapOpening == gapExtension case, I can imagine massaging the `compareString` to an integer vector and then doing a simple matrix multiplication into an augmented substitutionMatrix, but I think the more common gapOpening != gapExtension case suggests I should really write a helper function in C(++). Thanks, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

Alignment Cancer Biostrings Alignment Cancer Biostrings • 1.2k views

ADD COMMENT • link updated 12.0 years ago by Michael Lawrence ★ 11k • written 12.0 years ago by Steve Lianoglou ★ 13k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

Hi Steve, I doubt there is anything in Biostrings for this, as it is an unusual case to score an alignment after generating the alignment, since the scoring and the alignment are intertwined in the algorithm. I can't speak for everyone else, but it could be a useful contribution to Biostrings. Michael On Tue, Apr 17, 2012 at 2:07 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > I thought I'd add to the pairwiseAlignment-related activity on the > list today, so here's a bit of navel gazing. > > I *think* I want to calculate alignment scores for subsets of already > aligned sequences and I don't think there's an easy and efficient way > to do this without having to realign the subsets of the strings I'm > interested in. > > For instance, say I have a pairwise alignment `pa`, and the result of > `compareStrings(pattern(pa), subject(pa))` looks like something like > this: > > > "GTA?TTT?A-----TTTCATATC?TGT?TC? ------------------------------------------------------CAT" > > Given values for the substitutionMatrix, gapOpenening, and > gapExtension penalties used during the pairwiseAlignment, one could > (in principle), calculate the alignment score of the first half vs. > the second half, or (say) running windows of length N along the > alignment. > > Like I said, I *think* this is what I want to do and I was just > curious if I'm not missing something in Biostrings that I can wire > into ... although it's straight-forward enough to write something > myself. In the gapOpening == gapExtension case, I can imagine > massaging the `compareString` to an integer vector and then doing a > simple matrix multiplication into an augmented substitutionMatrix, but > I think the more common gapOpening != gapExtension case suggests I > should really write a helper function in C(++). > > Thanks, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 12.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Steve, Michael, On 04/17/2012 04:08 PM, Michael Lawrence wrote: > Hi Steve, > > I doubt there is anything in Biostrings for this, as it is an unusual case > to score an alignment after generating the alignment, since the scoring and > the alignment are intertwined in the algorithm. I just had a look at the C code in Biostrings in charge of computing the score (src/align_pairwiseAlignment.c file, pairwiseAlignment function starting at line 90) and what I see confirms what Michael is saying: the scoring happens at the same time the alignment itself is computed. I guess that's just a consequence of the "dynamic programming" nature of the Smith-Waterman algo. A separate C function for computing the score of 2 already aligned sequences might not be too hard to implement but I suspect it won't reuse much of the existing code, just because the problem it tries to solve is different. It's also much simpler problem. Actually the problem might be easy to solve with pure R code (and in a way that might be efficient enough if you have a small number of aligned sequences). FWIW, here is a simple (and limited) solution that deals with DNA input only, "globally" aligned sequences only, where the scoring is defined by a substitution matrix and the gap opening and gap extension penalties only (no 'patternQuality', 'subjectQuality' or 'fuzzyMatrix' args): scoreTwoAlignedSequences <- function(pattern, subject, substitutionMatrix, gapOpening=-10, gapExtension=-4) { if (!is(pattern, "DNAString") || !is(subject, "DNAString") || length(pattern) != length(subject)) stop("'pattern' and 'subject' must be DNAString objects ", "of the same length") if (!is.matrix(substitutionMatrix)) stop("'substitutionMatrix' must be a matrix") pletters <- strsplit(as.character(pattern), split=NULL)[[1L]] sletters <- strsplit(as.character(subject), split=NULL)[[1L]] is_indel <- pletters == "-" | sletters =="-" ## Subsetting the substitutionMatrix matrix by a 2-column matrix ## (a little known feature of [ very useful here): ii <- cbind(pletters[!is_indel], sletters[!is_indel]) score <- sum(substitutionMatrix[ii]) prle <- Rle(pletters) pgaps_runs <- which(runValue(prle) == "-") score <- score + length(pgaps_runs) * gapOpening + sum(runLength(prle)[pgaps_runs]) * gapExtension srle <- Rle(sletters) sgaps_runs <- which(runValue(srle) == "-") score <- score + length(sgaps_runs) * gapOpening + sum(runLength(srle)[sgaps_runs]) * gapExtension score } Of course a serious candidate for inclusion to Biostrings would need to be extended to support all the sort of input supported by pairwiseAlignment()... and also use a better name (using a generic + methods would be better too)... and documented. Doesn't sound like just a 1- or 2-hour project to me :-/ H. > I can't speak for everyone > else, but it could be a useful contribution to Biostrings. > > Michael > > On Tue, Apr 17, 2012 at 2:07 PM, Steve Lianoglou< > mailinglist.honeypot at gmail.com> wrote: > >> Hi, >> >> I thought I'd add to the pairwiseAlignment-related activity on the >> list today, so here's a bit of navel gazing. >> >> I *think* I want to calculate alignment scores for subsets of already >> aligned sequences and I don't think there's an easy and efficient way >> to do this without having to realign the subsets of the strings I'm >> interested in. >> >> For instance, say I have a pairwise alignment `pa`, and the result of >> `compareStrings(pattern(pa), subject(pa))` looks like something like >> this: >> >> >> "GTA?TTT?A-----TTTCATATC?TGT?TC? ------------------------------------------------------CAT" >> >> Given values for the substitutionMatrix, gapOpenening, and >> gapExtension penalties used during the pairwiseAlignment, one could >> (in principle), calculate the alignment score of the first half vs. >> the second half, or (say) running windows of length N along the >> alignment. >> >> Like I said, I *think* this is what I want to do and I was just >> curious if I'm not missing something in Biostrings that I can wire >> into ... although it's straight-forward enough to write something >> myself. In the gapOpening == gapExtension case, I can imagine >> massaging the `compareString` to an integer vector and then doing a >> simple matrix multiplication into an augmented substitutionMatrix, but >> I think the more common gapOpening != gapExtension case suggests I >> should really write a helper function in C(++). >> >> Thanks, >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Hervé Pagès 16k

Login before adding your answer.