Hi,
Unfortunately the default scoring scheme is quality-based and somewhat opaque so probably best avoid it. The good news is that you can take full control of the scoring scheme by supplying your own substitution matrix:
library(Biostrings)
my_alphabet <- c(LETTERS, letters, " ")
my_substmat <- diag(rep(1L, length(my_alphabet))) - 1L
dimnames(my_substmat) <- list(my_alphabet, my_alphabet)
With our scoring scheme, the best possible score is 0 and this independently of the lengths of the 2 strings:
pairwiseAlignment("TINA", "TINA", type="global",
substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA
# subject: TINA
# score: 0
pairwiseAlignment("TINA ANDERSON", "TINA ANDERSON", type="global",
substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA ANDERSON
# subject: TINA ANDERSON
# score: 0
Any difference between the 2 strings (i.e. any substitution or gap) takes the score to the negative side:
## -1 for each substitution (controlled via our subsitution matrix)
pairwiseAlignment("TINA ANDERSON", "TYNA ANDERSOM", type="global",
substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA ANDERSON
# subject: TYNA ANDERSOM
# score: -2
## -5 for a gap opening + size_of_gap * (-1) (controlled via the gapOpening
## and gapExtension arguments):
pairwiseAlignment("TINA JOY ANDERSON", "TINA ANDERSON", type="global",
substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA JOY ANDERSON
# subject: TINA ----ANDERSON
# score: -9
## With 2 substitutions and a gap of size 4:
pairwiseAlignment("TINA JOY ANDERSON", "TYNA ANDERSOM", type="global",
substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA JOY ANDERSON
# subject: TYNA ----ANDERSOM
# score: -11
Note that with this scoring scheme, the worst possible score depends on the length of the 2 strings. If we want to be able to normalize the score, we need to know the worst possible score, which can be tricky... unless we ask pairwiseAlignment()
to do that for us:
fill_with_letter <- function(x, letter)
{
stopifnot(is.character(letter), length(letter) == 1L, nchar(letter) == 1L)
x2 <- strrep(letter, nchar(x))
if (!is.character(x))
x2 <- as(x2, class(x))
x2
}
## Assuming that all substitutions have the same cost.
worst_possible_score <- function(pattern, subject, substitutionMatrix,
gapOpening=10, gapExtension=4)
{
# Change the content of pattern and subject to make them 100% different.
pattern2 <- fill_with_letter(pattern, rownames(substitutionMatrix)[[1L]])
subject2 <- fill_with_letter(subject, rownames(substitutionMatrix)[[2L]])
pairwiseAlignment(pattern2, subject2, type="global",
substitutionMatrix=substitutionMatrix,
gapOpening=gapOpening, gapExtension=gapExtension,
scoreOnly=TRUE)
}
Then we'll be able to normalize the score by applying the following generic transformation to it:
normalize_score <- function(score, best_possible_score, worst_possible_score)
{
(score - worst_possible_score) / (best_possible_score - worst_possible_score)
}
Let's wrap all this in one function that takes as input the object returned by pairwiseAlignment()
:
## Assuming we are doing global alignments and that the best possible
## score is 0!
normalizedScore <- function(pa, substitutionMatrix)
{
stopifnot(is(pa, "PairwiseAlignments"), type(pa) == "global")
pattern <- unaligned(pattern(pa))
subject <- unaligned(subject(pa))
wps <- worst_possible_score(pattern, subject, substitutionMatrix,
pa@gapOpening, pa@gapExtension)
normalize_score(score(pa), 0L, wps)
}
Then:
pa1 <- pairwiseAlignment("TINA", "TINA", type="global",
substitutionMatrix=my_substmat,
gapOpening=5, gapExtension=1)
normalizedScore(pa1, my_substmat)
# [1] 1
pa2 <- pairwiseAlignment("TINA JOY ANDERSON", "TYNA ANDERSOM", type="global",
substitutionMatrix=my_substmat,
gapOpening=5, gapExtension=1)
normalizedScore(pa2, my_substmat)
# [1] 0.5
pa3 <- pairwiseAlignment("ABCDEF", "abcd", type="global",
substitutionMatrix=my_substmat,
gapOpening=5, gapExtension=1)
normalizedScore(pa3, my_substmat)
# [1] 0
Not really as straightforward as it should be :-/
H.