Question

inconsistent pairwiseAlignment behaviour (Biostrings) with multiple alignments with identical scores

0

Entering edit mode

vincent.plagnol ▴ 10

@vincentplagnol-8839

Last seen 7.8 years ago

United Kingdom

Dear all,

I do not understand the behaviour of pairwise alignment in the presence of multiple alignments with equal scores. Precisely, see the example at the bottom of this email (same DNA sequence but cut at different locations). In one case the mismatch is placed after the deletion event, and in the other the mismatch is placed after the deletion. And it's really not obvious that these are really the same mutation events based on the output of pairwiseAlignment.

I appreciate that the score is the same but the manual states that

"If more than one pairwise alignment produces the maximum alignment score, then the alignment with the smallest initial deletion whose mismatches occur before its insertions and deletions is chosen."

which should favour, I think, the second option (mismatch occurring before the deletion). So why is the first option chosen? It seems unclear to me. Any rationale on why that choice is made and ideally how to make the behaviour consistent would be much appreciated.

Thank you in advance,

Vincent

> pairwiseAlignment(pattern =
"ATCAAGGAACCATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAG", subject =
"ATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAG")
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] ATCAAGGA------------ACCATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAG
subject: [1] ATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAG
score: 27.26148

> pairwiseAlignment(pattern =
"TCTCTGTCATAGGGACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAACCA",
subject =
"TCTCTGTCATAGGGACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACA")
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1]
TCTCTGTCATAGGGACTCTGGATCCCAGAAGGTGA...TCCCGTCGCTATCAAGGAAC------------CA
subject: [1]
TCTCTGTCATAGGGACTCTGGATCCCAGAAGGTGA...TCCCGTCGCTATCAAGGAATTAAGAGAAGCAACA
score: 70.86008

biostrings pairwisealignment • 1.6k views

ADD COMMENT • link 8.6 years ago vincent.plagnol ▴ 10

score 0 · Answer 1 · 2015-09-23

Hi Vincent,

I think the problem here is that you're assuming that putting the mismatch before or after the deletion wouldn't change the score but I suspect it would. That's because you are replacing a C/T mismatch with a C/A mismatch. Unfortunately pairwiseAlignment() uses a quality-based scoring scheme by default (see man page) which is opaque. AFAIK with this scoring scheme there is no (easy) way for the end-user to know what the cost of the individual mismatches are and to verify/interpret the returned score.

When aligning oligonucleotides, I would recommended that you use a fixed substitution scoring scheme by providing your own substitution matrix. This makes the outcome of pairwiseAlignment() a lot more transparent, predictable, and stable:

mat <- nucleotideSubstitutionMatrix(match=0, mismatch=-1, baseOnly=TRUE)
mat
#    A  C  G  T
# A  0 -1 -1 -1
# C -1  0 -1 -1
# G -1 -1  0 -1
# T -1 -1 -1  0

pattern1 <- DNAString("ATCAAGGAACCATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAG") 
subject1 <- DNAString("ATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAG")

pattern2 <- DNAString("TCTCTGTCATAGGGACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAACCA")
subject2 <- DNAString("TCTCTGTCATAGGGACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACA")

pairwiseAlignment(pattern1, subject1, substitutionMatrix=mat)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [1] ATCAAGGAAC------------CATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAG 
# subject: [1] ATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAG 
# score: -59 

pairwiseAlignment(pattern2, subject2, substitutionMatrix=mat)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [1] TCTCTGTCATAGGGACTCTGGATCCCAGAAGGTGA...TCCCGTCGCTATCAAGGAAC------------CA 
# subject: [1] TCTCTGTCATAGGGACTCTGGATCCCAGAAGGTGA...TCCCGTCGCTATCAAGGAATTAAGAGAAGCAACA 
# score: -59

Now the score can be broken down into: score = -(cost of the deletion = 58) - (cost of the mismatch = 1). Having the mismatch occur before the deletion (C/T mismatch) or after the deletion (C/A mismatch) wouldn't change that score so pairwiseAlignment() chooses the former, because it's the alignment "with the smallest initial deletion whose mismatches occur before its insertions and deletions".

Hope this helps,

H.

score 0 · Answer 2 · 2015-09-24

Thanks Herve this is very helpful.

Indeed as I realized after posting this, the pbm is independent of the mismatch. Even after switching the mismatch back to the reference base, the indel can still be located into two places and my example code selects one or the other in each case.

My assumption was that the location of the deletion would have no impact on the alignment score, but apparently you are correct and there is something subtle going on that contradicts my assumption. mmm... I will try to figure this out.

On the downside, I usually run the alignment with qualities associated with it (ie. proper fastq files). Your solution may fix my pbm, which is great, but it would be a shame to lose the "quality aware" alignment.

In any case many thanks, this gets me on the right track.

V