Hi LiGang,
Note that by default, pairwiseAlignment() performs "global" alignment,
not "local". However, the example you provided looks more like
a needle/haystack situation (short pattern that must be fully aligned
to a substring of a long subject), so it looks like the type of
alignment you are looking for is "global-local".
> library(Biostrings)
> set.seed(12345)
> toquery <- DNAString(paste(sample(c("A","G","C","T"),200,
rep=TRUE),collapse=""))
> toinsert <- DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC')
> to.subject<-c(toquery, toinsert, subseq(toquery, start=30,
end=150),toinsert)
> pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject,
type="global-local")
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1]
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCA
G
subject: [251]
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCA
G
score: 140.7047
Note that this result is *very* different from what you get when doing
"global" alignment:
> pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject)
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1]
CTT--------------------------------...TCTATTACCCCCAGGCTCCTCTGACGTGTTTC
AG
subject: [1]
CTTTGAGCCTAACAGGGGATGGTCCGCCAGTAACG...TCTATTACCCCCAGGCTCCTCTGACGTGTTTC
AG
score: -1119.295
This is because when doing "global" ("global" is short for "global-
global"), pairwiseAlignment() tries to align the pattern to the full
subject, so in your case (short pattern) it needs to "stretch" the
pattern a lot (by introducing gaps in it) to make it "fit" the
subject.
Anyway, using "global-local" doesn't solve the problem that, when more
than one match achieves the best possible score, pairwiseAlignment()
only reports one of them (the last one I believe). This is a known
limitation of pairwiseAlignment().
Here are 2 possible workarounds/alternatives:
1. If you really need all the richness/flexibility of
the pairwiseAlignment() function (i.e. its ability to support
all kinds of scoring scheme thru the 'patternQuality',
'subjectQuality', 'substitutionMatrix', 'fuzzyMatrix',
'gapOpening' and 'gapExtension' arguments), then
you can call it again on the truncated subject to find any
other potential match:
> pairwiseAlignment(subseq(toquery,start=50,end=120),
subseq(to.subject, end=250), type="global-local")
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1]
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCA
G
subject: [50]
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCA
G
score: 140.7047
You'll need to write your own function that uses a loop to
truncate
and search again. At each step you'll want to look at the score
of
the new match you get (the score should only go down) and make
sure
that it's close enough to the score of the original match. You
exit
the loop if it's not.
Note that this solution doesn't deal properly with the (rare)
situation where 2 best matches are overlapping. Also, it's
probably not going to be very efficient.
2. If you don't need all the richness/flexibility of
pairwiseAlignment(), and if it's ok for you to allow only a
small
number of indels, then you could always use the matchPattern()
function:
> matchPattern(subseq(toquery,start=50,end=120), to.subject,
max.mismatch=10, with.indels=TRUE)
Views on a 381-letter DNAString subject
subject:
CTTTGAGCCTAACAGGGGATGGTCCGCCAGTAAC...TGTGGCGGTGCCCGGGGCATCTCCGCGGCGGAA
C
views:
start end width
[1] 50 120 71
[CTTGACGCAGATGTTTATACTCCGGACT...CCCCCAGGCTCCTCTGACGTGTTTCAG]
[2] 251 321 71
[CTTGACGCAGATGTTTATACTCCGGACT...CCCCCAGGCTCCTCTGACGTGTTTCAG]
It's less flexible than pairwiseAlignment() but it returns all
the
matches. It's also faster and more memory efficient. See
?matchPattern for the details.
Cheers,
H.
On 03/08/2011 10:05 PM, ligang wrote:
> Hello list,
>
> I'm using 'pairwiseAlignment' function of Biostrings package to do
sequences
> alignment, and found that when do local alignment, if there were
several
> fragments matched between 'pattern' and 'subject' sequence, only one
was found
> by 'pairwiseAlignment' function, see the following example:
>
>
> #######################
>
> set.seed(12345)
> DNAString(paste(sample(c("A","G","C","T"),200,
rep=TRUE),collapse=""))->toquery
> DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC')->toinsert
> to.subject<-c(toquery, toinsert, subseq(toquery, start=30,
end=150),toinsert)
> pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject)
>
>
> #######################
>
> how could I found all the "matches"?
>
> LiGang
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319