Question

pairwiseAlignment() - Similarity NA% ?

0

Entering edit mode

boris.steipe • 0

@borissteipe-7050

Last seen 7.1 years ago

Canada

pairwiseAlignment() returns NA values for similarity length and similarity %:

MWE:

p1 <- AAString(in2seq("VDVYEFIHSTGSIMKRKKDDWVNATHILK"))
q1 <- AAString(in2seq("IPVFEYPLNGQYIMIDCETGMVHFTGIWK"))
ali <- pairwiseAlignment(p1, q1,
                         substitutionMatrix = "BLOSUM62",
                         gapOpening = 1, gapExtension = 1)
writePairwiseAlignments(ali)

----

[...]
# Length: 34
# Identity: 10/34 (29.4%)
# Similarity:    NA/34 (NA%)
# Gaps:          10/34 (29.4%)
# Score: 36
[...]

(EMBOSS needle with the same parameters gives 48% similarity.)

Is it expected that pairwiseAlignment() produces NA in this case?

Thanks!

Boris

-----

R > sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.36.4    XVector_0.8.0        IRanges_2.2.9        S4Vectors_0.6.6     
[5] BiocGenerics_0.14.0  BiocInstaller_1.18.4 seqinr_3.1-3         ade4_1.7-2          
[9] stringr_1.0.0       

loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 magrittr_1.5    tools_3.2.2     stringi_0.5-5

biostrings pairwisealignment • 1.3k views

ADD COMMENT • link updated 8.5 years ago by Hervé Pagès 16k • written 8.5 years ago by boris.steipe • 0

score 0 · Answer 1 · 2015-10-16

Hi Boris,

writePairwiseAlignments() doesn't display the percent sequence similarity because there is no consensus on how this should be calculated. Note that it's the same situation for the percent sequence identity (see ?pid) even though writePairwiseAlignments() displays it.

See ?pid for the 4 different types of percent sequence identity that are supported. Also see selectMethod("pid", "PairwiseAlignments") for the implementation. As you can see the numerator is always 100*nmatch(x) (where x is the object returned by pairwiseAlignment()) and the denominator varies:

PID1: nchar(x)
PID2: nmatch(x) + nmismatch(x)
PID3: pmin(nchar(unaligned(pattern(x))), nchar(unaligned(subject(x))))
PID4: (nchar(unaligned(pattern(x))) + nchar(unaligned(subject(x)))) / 2

It should be easy to define your own similarity function along these lines. If none of these denominators is what you want, you can always use a denominator that that you feel is more appropriate to your use case e.g. nchar(unaligned(pattern(x)))' or 'nchar(unaligned(subject(x))).

Once you have a similarity function that returns the same as EMBOSS, I would gladly take it and add it to Biostrings.

Cheers,

H.