unexpected behaviour of pairwiseAlignment() in Biostrings
2
0
Entering edit mode
@coghlan-avril-3810
Last seen 7.1 years ago
Dear Patrick, all, I have noticed what I think may perhaps be a bug in the pairwiseAlignment() function in Biostrings. I want to use it to calculate the optimal pairwise alignment between these two strings: s1 = "TCAGTTGCCAAACCCGCT" s2 = "AGGGTTGACATCCGTTTT" I want to use a match score of 10, a mismatch penalty of -10, a gap opening penalty of -12 and a gap extension penalty of -3. I typed: s1 <- "TCAGTTGCCAAACCCGCT" s2 <- "AGGGTTGACATCCGTTTT" library(Biostrings) sigma <- nucleotideSubstitutionMatrix(match = 10, mismatch = -10, baseOnly = TRUE) pairwiseAlignment(s1, s2, substitutionMatrix = sigma, gapOpening = -12, gapExtension = -3, scoreOnly = FALSE, type="local") The output I get is: Local PairwiseAlignedFixedSubject (1 of 1) pattern: [4] GTTGCCAA subject: [4] GTTGACAT score: 52 If you look at the alignment printed out as the optimal alignment, it has 6 matches and 2 mismatches and so should have a score of (6*10)+(2*-10) = 60-20 = 40. This is different from the score printed out (52). When I use my own (home-made) Smith-Waterman R function to calculate the optimal alignment, it gives the optimal alignment score as 52 (in agreement with Biostrings), but gives a different optimal alignment: GTTGCCAAACCCG GTTG--ACATCCG If you look at this alignment, it has 9 matches, two mismatches, and one gap of length two, and so the alignment score should be (9*10)+(2*-10)+(-12-3-3) = 52, which is the score that Biostrings says that the optimal alignment should have. Therefore, I am wondering whether Biostrings is calculating the optimal alignment and optimal alignment score correctly (52 here), but is printing out the wrong alignment as the optimal one ? I would be grateful for your comments on this, I'm wondering if I have misunderstood something. One last small comment on the pairwiseAlignment() function - it took me a while to realise that when you specify a "gap opening penalty" (eg. -12 here) and "gap extension penalty" (-3 here) in its arguments, if you have a gap of length two then the first position in the gap is given a score of -12-3=-15. This took me a little while to figure out (I don't think it's clear from the Biostrings documentation), since in most textbooks the "gap opening penalty" means the score given to the first position in a gap (eg. see Deonier "Computational Genome Analysis page 161) and so I had expected that specifying a "gap opening penalty" of -12 would mean that the score for the first position in the gap would be -12. It would be good if you made this clearer in the documentation for the pairwiseAlignment() function. Kind Regards, Avril Dept. Microbiology University College Cork Ireland [[alternative HTML version deleted]]
Alignment Biostrings Alignment Biostrings • 1.0k views
0
Entering edit mode
@patrick-aboyoun-6734
Last seen 7.1 years ago
United States
Avril, Thanks for the bug report! It turns out that the dynamic programming traceback for local alignments by Biostrings::pairwiseAlignment was broken when the local alignment had a gap in it. I have fixed this bug in BioC 2.6 (Biostrings 2.16.7) and BioC 2.7 (2.17.13). This patched version will be available from bioconductor.org in roughly 36 hours. Patrick On 7/3/10 7:13 AM, Coghlan, Avril wrote: > > Dear Patrick, all, > > I have noticed what I think may perhaps be a bug in the > pairwiseAlignment() function in Biostrings. > I want to use it to calculate the optimal pairwise alignment between > these two strings: > s1 = "TCAGTTGCCAAACCCGCT" > s2 = "AGGGTTGACATCCGTTTT" > I want to use a match score of 10, a mismatch penalty of -10, a gap > opening penalty of -12 and a gap extension penalty of -3. I typed: > > s1 <- "TCAGTTGCCAAACCCGCT" > s2 <- "AGGGTTGACATCCGTTTT" > library(Biostrings) > sigma <- nucleotideSubstitutionMatrix(match = 10, mismatch = -10, > baseOnly = TRUE) > pairwiseAlignment(s1, s2, substitutionMatrix = sigma, gapOpening = > -12, gapExtension = -3, > scoreOnly = FALSE, type="local") > > The output I get is: > Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [4] GTTGCCAA > subject: [4] GTTGACAT > score: 52 > > If you look at the alignment printed out as the optimal alignment, it > has 6 matches and 2 mismatches and so should have a score of > (6*10)+(2*-10) = 60-20 = 40. This is different from the score printed > out (52). > > When I use my own (home-made) Smith-Waterman R function to calculate > the optimal alignment, it gives the optimal alignment score as 52 (in > agreement with Biostrings), but gives a different optimal alignment: > GTTGCCAAACCCG > GTTG--ACATCCG > If you look at this alignment, it has 9 matches, two mismatches, and > one gap of length two, and so the alignment score should be > (9*10)+(2*-10)+(-12-3-3) = 52, which is the score that Biostrings says > that the optimal alignment should have. > > Therefore, I am wondering whether Biostrings is calculating the > optimal alignment and optimal alignment score correctly (52 here), but > is printing out the wrong alignment as the optimal one ? > I would be grateful for your comments on this, I'm wondering if I have > misunderstood something. > > One last small comment on the pairwiseAlignment() function - it took > me a while to realise that when you specify a "gap opening penalty" > (eg. -12 here) and "gap extension penalty" (-3 here) in its arguments, > if you have a gap of length two then the first position in the gap is > given a score of -12-3=-15. This took me a little while to figure out > (I don't think it's clear from the Biostrings documentation), since in > most textbooks the "gap opening penalty" means the score given to the > first position in a gap (eg. see Deonier "Computational Genome > Analysis page 161) and so I had expected that specifying a "gap > opening penalty" of -12 would mean that the score for the first > position in the gap would be -12. It would be good if you made this > clearer in the documentation for the pairwiseAlignment() function. > > Kind Regards, > Avril > > Dept. Microbiology > University College Cork > Ireland > [[alternative HTML version deleted]]
0
Entering edit mode
@coghlan-avril-3810
Last seen 7.1 years ago
Dear Patrick, I've updated my copy of the Biostrings() library but the pairwiseAlignment() function doesn't seem to have been fixed yet - I wonder could you send me the patch directly please? Kind Regards, Avril ________________________________ From: Patrick Aboyoun [mailto:paboyoun@fhcrc.org] Sent: Tue 7/6/2010 9:03 To: Coghlan, Avril Cc: bioconductor@stat.math.ethz.ch Subject: Re: unexpected behaviour of pairwiseAlignment() in Biostrings Avril, Thanks for the bug report! It turns out that the dynamic programming traceback for local alignments by Biostrings::pairwiseAlignment was broken when the local alignment had a gap in it. I have fixed this bug in BioC 2.6 (Biostrings 2.16.7) and BioC 2.7 (2.17.13). This patched version will be available from bioconductor.org in roughly 36 hours. Patrick On 7/3/10 7:13 AM, Coghlan, Avril wrote: Dear Patrick, all, I have noticed what I think may perhaps be a bug in the pairwiseAlignment() function in Biostrings. I want to use it to calculate the optimal pairwise alignment between these two strings: s1 = "TCAGTTGCCAAACCCGCT" s2 = "AGGGTTGACATCCGTTTT" I want to use a match score of 10, a mismatch penalty of -10, a gap opening penalty of -12 and a gap extension penalty of -3. I typed: s1 <- "TCAGTTGCCAAACCCGCT" s2 <- "AGGGTTGACATCCGTTTT" library(Biostrings) sigma <- nucleotideSubstitutionMatrix(match = 10, mismatch = -10, baseOnly = TRUE) pairwiseAlignment(s1, s2, substitutionMatrix = sigma, gapOpening = -12, gapExtension = -3, scoreOnly = FALSE, type="local") The output I get is: Local PairwiseAlignedFixedSubject (1 of 1) pattern: [4] GTTGCCAA subject: [4] GTTGACAT score: 52 If you look at the alignment printed out as the optimal alignment, it has 6 matches and 2 mismatches and so should have a score of (6*10)+(2*-10) = 60-20 = 40. This is different from the score printed out (52). When I use my own (home-made) Smith-Waterman R function to calculate the optimal alignment, it gives the optimal alignment score as 52 (in agreement with Biostrings), but gives a different optimal alignment: GTTGCCAAACCCG GTTG--ACATCCG If you look at this alignment, it has 9 matches, two mismatches, and one gap of length two, and so the alignment score should be (9*10)+(2*-10)+(-12-3-3) = 52, which is the score that Biostrings says that the optimal alignment should have. Therefore, I am wondering whether Biostrings is calculating the optimal alignment and optimal alignment score correctly (52 here), but is printing out the wrong alignment as the optimal one ? I would be grateful for your comments on this, I'm wondering if I have misunderstood something. One last small comment on the pairwiseAlignment() function - it took me a while to realise that when you specify a "gap opening penalty" (eg. -12 here) and "gap extension penalty" (-3 here) in its arguments, if you have a gap of length two then the first position in the gap is given a score of -12-3=-15. This took me a little while to figure out (I don't think it's clear from the Biostrings documentation), since in most textbooks the "gap opening penalty" means the score given to the first position in a gap (eg. see Deonier "Computational Genome Analysis page 161) and so I had expected that specifying a "gap opening penalty" of -12 would mean that the score for the first position in the gap would be -12. It would be good if you made this clearer in the documentation for the pairwiseAlignment() function. Kind Regards, Avril Dept. Microbiology University College Cork Ireland [[alternative HTML version deleted]]
0
Entering edit mode
Hi Avril On Mon, Jul 12, 2010 at 10:53 AM, Coghlan, Avril <a.coghlan at="" ucc.ie=""> wrote: > Dear Patrick, > > I've updated my copy of the Biostrings() library but the pairwiseAlignment() function doesn't seem to have been fixed yet - I wonder could you send me the patch directly please? You could actually grab the source changes yourself if they haven't propagated to the release/package servers. If you're using R 2.11 (and bioc 2.6), the subversion repository can be found here: https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_2_6/madman/Rp acks/ You could "check out" the Biostrings package via svn and install it from source. Assuming you're using a *unix-esque OS (linux, os x, etc.), you could do this from teh command line: $svn --username readonly co https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_2_6/madman/Rp acks/Biostrings Biostrings ## The password is also: readonly$ R CMD INSTALL Biostrings and voila ... you would have installed the latest package from source. Hope that helps, -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
0
Entering edit mode
Avril, I checked the versions of Biostrings that are available at bioconductor.org and they are either at or above the version number of the package that contains the patch. I also double-checked the packages in release and devel and both contain the fix. Could you please send session information from your R session? As Steve noted, you can always get the latest version of any package directly from the Bioconductor svn. Cheers, Patrick On 7/12/10 8:08 AM, Steve Lianoglou wrote: > Hi Avril > > On Mon, Jul 12, 2010 at 10:53 AM, Coghlan, Avril<a.coghlan at="" ucc.ie=""> wrote: > >> Dear Patrick, >> >> I've updated my copy of the Biostrings() library but the pairwiseAlignment() function doesn't seem to have been fixed yet - I wonder could you send me the patch directly please? >> > You could actually grab the source changes yourself if they haven't > propagated to the release/package servers. > > If you're using R 2.11 (and bioc 2.6), the subversion repository can > be found here: > > https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_2_6/madman/ Rpacks/ > > You could "check out" the Biostrings package via svn and install it from source. > > Assuming you're using a *unix-esque OS (linux, os x, etc.), you could > do this from teh command line: > > $svn --username readonly co > https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_2_6/madman/ Rpacks/Biostrings > Biostrings > ## The password is also: readonly > >$ R CMD INSTALL Biostrings > > and voila ... you would have installed the latest package from source. > > Hope that helps, > >
0
Entering edit mode
Dear Patrick and Steve, Thank you for your helpful replies. I think that perhaps the problem was that I was using an old version of R (R 2.10). I had tried to update the Biostrings library by typing: >source("http://bioconductor.org/biocLite.R" >biocLite("Biostrings") >library("Biostrings") However, the pairwiseAlignment() function still seemed to give the wrong result. I updated my version of R (to R 2.11.1), and found that I could update the Biostrings library by typing the commands above. The pairwiseAlignment() function then seemed to work fine for me, and gave the correct result. I hadn't realised that using an old version of R (R 2.10) could be the problem, but it seemed to be. Thanks again for your help, and for fixing the function so quickly. Kind Regards, Avril Avril Coghlan Dept. Microbiology University College Cork Ireland -----Original Message----- From: Patrick Aboyoun [mailto:paboyoun@fhcrc.org] Sent: 12 July 2010 17:07 To: Coghlan, Avril Cc: Steve Lianoglou; bioconductor at stat.math.ethz.ch Subject: Re: [BioC] unexpected behaviour of pairwiseAlignment() in Biostrings Avril, I checked the versions of Biostrings that are available at bioconductor.org and they are either at or above the version number of the package that contains the patch. I also double-checked the packages in release and devel and both contain the fix. Could you please send session information from your R session? As Steve noted, you can always get the latest version of any package directly from the Bioconductor svn. Cheers, Patrick On 7/12/10 8:08 AM, Steve Lianoglou wrote: > Hi Avril > > On Mon, Jul 12, 2010 at 10:53 AM, Coghlan, Avril<a.coghlan at="" ucc.ie=""> wrote: > >> Dear Patrick, >> >> I've updated my copy of the Biostrings() library but the pairwiseAlignment() function doesn't seem to have been fixed yet - I wonder could you send me the patch directly please? >> > You could actually grab the source changes yourself if they haven't > propagated to the release/package servers. > > If you're using R 2.11 (and bioc 2.6), the subversion repository can > be found here: > > https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_2_6/madman/Rp ac ks/ > > You could "check out" the Biostrings package via svn and install it from source. > > Assuming you're using a *unix-esque OS (linux, os x, etc.), you could > do this from teh command line: > > $svn --username readonly co > https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_2_6/madman/Rp ac ks/Biostrings > Biostrings > ## The password is also: readonly > >$ R CMD INSTALL Biostrings > > and voila ... you would have installed the latest package from source. > > Hope that helps, > >