Question

pairwisealignment how to normalize score?

0

Entering edit mode

marie • 0

@marie-22260

Last seen 6.2 years ago

Hi - I would like to normalize the score returned from the following so that scores are between 0 and 1

pairwiseAlignment(pattern = "TINA JOY ANDERSON", subject = "TINA ANDERSON", type = "global", gapOpening = 1, gapExtension = 1, scoreOnly = TRUE)

is this possible?

Some understanding of the formula used to get the score would also be great... I had thought this would implement something like the Affine Gap distance...

Thanks in advance. Marie

biostrings • 1.1k views

ADD COMMENT • link updated 6.2 years ago by Hervé Pagès 16k • written 6.2 years ago by marie • 0

score 0 · Answer 1 · 2019-11-13

Hi,

Unfortunately the default scoring scheme is quality-based and somewhat opaque so probably best avoid it. The good news is that you can take full control of the scoring scheme by supplying your own substitution matrix:

library(Biostrings)
my_alphabet <- c(LETTERS, letters, " ")
my_substmat <- diag(rep(1L, length(my_alphabet))) - 1L
dimnames(my_substmat) <- list(my_alphabet, my_alphabet)

With our scoring scheme, the best possible score is 0 and this independently of the lengths of the 2 strings:

pairwiseAlignment("TINA", "TINA", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA
# subject: TINA
# score: 0 

pairwiseAlignment("TINA ANDERSON", "TINA ANDERSON", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA ANDERSON
# subject: TINA ANDERSON
# score: 0

Any difference between the 2 strings (i.e. any substitution or gap) takes the score to the negative side:

## -1 for each substitution (controlled via our subsitution matrix)
pairwiseAlignment("TINA ANDERSON", "TYNA ANDERSOM", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA ANDERSON
# subject: TYNA ANDERSOM
# score: -2 

## -5 for a gap opening + size_of_gap * (-1) (controlled via the gapOpening
## and gapExtension arguments):
pairwiseAlignment("TINA JOY ANDERSON", "TINA ANDERSON", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA JOY ANDERSON
# subject: TINA ----ANDERSON
# score: -9 

## With 2 substitutions and a gap of size 4:
pairwiseAlignment("TINA JOY ANDERSON", "TYNA ANDERSOM", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA JOY ANDERSON
# subject: TYNA ----ANDERSOM
# score: -11

Note that with this scoring scheme, the worst possible score depends on the length of the 2 strings. If we want to be able to normalize the score, we need to know the worst possible score, which can be tricky... unless we ask pairwiseAlignment() to do that for us:

fill_with_letter <- function(x, letter)
{
    stopifnot(is.character(letter), length(letter) == 1L, nchar(letter) == 1L)
    x2 <- strrep(letter, nchar(x))
    if (!is.character(x))
        x2 <- as(x2, class(x))
    x2
}

## Assuming that all substitutions have the same cost.
worst_possible_score <- function(pattern, subject, substitutionMatrix,
                                 gapOpening=10, gapExtension=4)
{
    # Change the content of pattern and subject to make them 100% different.
    pattern2 <- fill_with_letter(pattern, rownames(substitutionMatrix)[[1L]])
    subject2 <- fill_with_letter(subject, rownames(substitutionMatrix)[[2L]])
    pairwiseAlignment(pattern2, subject2, type="global",
                      substitutionMatrix=substitutionMatrix,
                      gapOpening=gapOpening, gapExtension=gapExtension,
                      scoreOnly=TRUE)
}

Then we'll be able to normalize the score by applying the following generic transformation to it:

normalize_score <- function(score, best_possible_score, worst_possible_score)
{
    (score - worst_possible_score) / (best_possible_score - worst_possible_score)
}

Let's wrap all this in one function that takes as input the object returned by pairwiseAlignment():

## Assuming we are doing global alignments and that the best possible
## score is 0!
normalizedScore <- function(pa, substitutionMatrix)
{
    stopifnot(is(pa, "PairwiseAlignments"), type(pa) == "global")
    pattern <- unaligned(pattern(pa))
    subject <- unaligned(subject(pa))
    wps <- worst_possible_score(pattern, subject, substitutionMatrix,
                                pa@gapOpening, pa@gapExtension)
    normalize_score(score(pa), 0L, wps)
}

Then:

pa1 <- pairwiseAlignment("TINA", "TINA", type="global",
                         substitutionMatrix=my_substmat,
                         gapOpening=5, gapExtension=1)
normalizedScore(pa1, my_substmat)
# [1] 1
pa2 <- pairwiseAlignment("TINA JOY ANDERSON", "TYNA ANDERSOM", type="global",
                         substitutionMatrix=my_substmat,
                         gapOpening=5, gapExtension=1)
normalizedScore(pa2, my_substmat)
# [1] 0.5
pa3 <- pairwiseAlignment("ABCDEF", "abcd", type="global",
                         substitutionMatrix=my_substmat,
                         gapOpening=5, gapExtension=1)
normalizedScore(pa3, my_substmat)
# [1] 0

Not really as straightforward as it should be :-/

H.