Help about 'pairwiseAlignment' in Biostrings package

0

Entering edit mode

li lilingdu ▴ 450

@li-lilingdu-1884

Last seen 7.6 years ago

Hello list, I'm using 'pairwiseAlignment' function of Biostrings package to do sequences alignment, and found that when do local alignment, if there were several fragments matched between 'pattern' and 'subject' sequence, only one was found by 'pairwiseAlignment' function, see the following example: ####################### set.seed(12345) DNAString(paste(sample(c("A","G","C","T"),200, rep=TRUE),collapse=""))->toquery DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC')->toinsert to.subject<-c(toquery, toinsert, subseq(toquery, start=30, end=150),toinsert) pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject) ####################### how could I found all the "matches"? LiGang

Alignment Biostrings Alignment Biostrings • 1.8k views

ADD COMMENT • link updated 14.8 years ago by Hervé Pagès 16k • written 14.8 years ago by li lilingdu ▴ 450

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Hi LiGang, Note that by default, pairwiseAlignment() performs "global" alignment, not "local". However, the example you provided looks more like a needle/haystack situation (short pattern that must be fully aligned to a substring of a long subject), so it looks like the type of alignment you are looking for is "global-local". > library(Biostrings) > set.seed(12345) > toquery <- DNAString(paste(sample(c("A","G","C","T"),200, rep=TRUE),collapse="")) > toinsert <- DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC') > to.subject<-c(toquery, toinsert, subseq(toquery, start=30, end=150),toinsert) > pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject, type="global-local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCA G subject: [251] CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCA G score: 140.7047 Note that this result is *very* different from what you get when doing "global" alignment: > pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject) Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] CTT--------------------------------...TCTATTACCCCCAGGCTCCTCTGACGTGTTTC AG subject: [1] CTTTGAGCCTAACAGGGGATGGTCCGCCAGTAACG...TCTATTACCCCCAGGCTCCTCTGACGTGTTTC AG score: -1119.295 This is because when doing "global" ("global" is short for "global- global"), pairwiseAlignment() tries to align the pattern to the full subject, so in your case (short pattern) it needs to "stretch" the pattern a lot (by introducing gaps in it) to make it "fit" the subject. Anyway, using "global-local" doesn't solve the problem that, when more than one match achieves the best possible score, pairwiseAlignment() only reports one of them (the last one I believe). This is a known limitation of pairwiseAlignment(). Here are 2 possible workarounds/alternatives: 1. If you really need all the richness/flexibility of the pairwiseAlignment() function (i.e. its ability to support all kinds of scoring scheme thru the 'patternQuality', 'subjectQuality', 'substitutionMatrix', 'fuzzyMatrix', 'gapOpening' and 'gapExtension' arguments), then you can call it again on the truncated subject to find any other potential match: > pairwiseAlignment(subseq(toquery,start=50,end=120), subseq(to.subject, end=250), type="global-local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCA G subject: [50] CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCA G score: 140.7047 You'll need to write your own function that uses a loop to truncate and search again. At each step you'll want to look at the score of the new match you get (the score should only go down) and make sure that it's close enough to the score of the original match. You exit the loop if it's not. Note that this solution doesn't deal properly with the (rare) situation where 2 best matches are overlapping. Also, it's probably not going to be very efficient. 2. If you don't need all the richness/flexibility of pairwiseAlignment(), and if it's ok for you to allow only a small number of indels, then you could always use the matchPattern() function: > matchPattern(subseq(toquery,start=50,end=120), to.subject, max.mismatch=10, with.indels=TRUE) Views on a 381-letter DNAString subject subject: CTTTGAGCCTAACAGGGGATGGTCCGCCAGTAAC...TGTGGCGGTGCCCGGGGCATCTCCGCGGCGGAA C views: start end width [1] 50 120 71 [CTTGACGCAGATGTTTATACTCCGGACT...CCCCCAGGCTCCTCTGACGTGTTTCAG] [2] 251 321 71 [CTTGACGCAGATGTTTATACTCCGGACT...CCCCCAGGCTCCTCTGACGTGTTTCAG] It's less flexible than pairwiseAlignment() but it returns all the matches. It's also faster and more memory efficient. See ?matchPattern for the details. Cheers, H. On 03/08/2011 10:05 PM, ligang wrote: > Hello list, > > I'm using 'pairwiseAlignment' function of Biostrings package to do sequences > alignment, and found that when do local alignment, if there were several > fragments matched between 'pattern' and 'subject' sequence, only one was found > by 'pairwiseAlignment' function, see the following example: > > > ####################### > > set.seed(12345) > DNAString(paste(sample(c("A","G","C","T"),200, rep=TRUE),collapse=""))->toquery > DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC')->toinsert > to.subject<-c(toquery, toinsert, subseq(toquery, start=30, end=150),toinsert) > pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject) > > > ####################### > > how could I found all the "matches"? > > LiGang > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 14.8 years ago Hervé Pagès 16k

Login before adding your answer.