Question

How to find multiple overlaps with pairwiseAlignment()?

0

Entering edit mode

kgorczak ▴ 10

@kgorczak-8529

Last seen 6.2 years ago

Belgium

I have two DNA sequences with two overlaps (one 557 bp long and one 140 bp long). Both overlaps are exactly the same between the DNA sequences (no mismatches between them). I would like to use the pairwiseAlignment() function in order to find them. However, that function returns always only one overlap, the first found one (in my case, the one of length 557 bp). Is it possible to find both overlapping sequences and their lengths?

For example:

DNA1: ATCGTTTAAGCCCGGTTATAC.............TAGCAAAGGTAAA...........

DNA2: ATCGTTTAAGCCCGGTTATAC..........................TAGCAAAGGTAAA

There are two overlapping sequences:

seq1: ATCGTTTAAGCCCGGTTATAC

seq2: TAGCAAAGGTAAA

pairwiseAlignment() will return ATCGTTTAAGCCCGGTTATAC (length: 21 bp). How can I get the information about the second overlapping sequence?

biostrings pairwisealignment • 1.4k views

ADD COMMENT • link updated 6.2 years ago by Hervé Pagès 16k • written 6.2 years ago by kgorczak ▴ 10

score 0 · Answer 1 · 2018-01-26

Hi Katarzyna,

I think that you need to perform a local alignment for this (type="local"). Also it looks like you'll get best results by specifying gapExtension=0:

library(Biostrings)
dna1 <-
    DNAString("NNNNNNATCGTTAAGCCCNNNNNTAGCAAAGGTAAANNNNN")
dna2 <-
    DNAString("++ATCGTTAAGCCC+++++++++++++++TAGCAAAGGTAAA+")
pa <- pairwiseAlignment(dna1, dna2, type="local",
                        gapExtension=0)
pa
# Local PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [7] ATCGTTAAGCCC---------------NNNNNTAGCAAAGGTAAA 
# subject: [3] ATCGTTAAGCCC+++++++++++++++-----TAGCAAAGGTAAA 
# score: 29.5439

See more details with:

writePairwiseAlignments(pa)
########################################
# Program: Biostrings (version 2.46.0), a Bioconductor package
# Rundate: Fri Jan 26 12:01:59 2018
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 10.0
# Extend_penalty: 0.0
#
# Length: 45
# Identity:      25/45 (55.6%)
# Similarity:    NA/45 (NA%)
# Gaps:          20/45 (44.4%)
# Score: 29.5439
#
#
#=======================================

P1                 7 ATCGTTAAGCCC---------------NNNNNTAGCAAAGGTAAA     36
                     ||||||||||||                    |||||||||||||
S1                 3 ATCGTTAAGCCC+++++++++++++++-----TAGCAAAGGTAAA     42

#---------------------------------------
#---------------------------------------

Unfortunately, programmatically getting the ranges of the overlaps is a lot trickier than one wishes:

pattern_ins <- indel(pattern(pa))[[1]]
subject_ins <- indel(subject(pa))[[1]]
relative_gap <- union(pattern_ins, subject_ins)
stopifnot(length(relative_gap) == 1)  # only one gap is expected
first_overlap_width <- start(relative_gap) - 1
pattern_gap <- IRanges(start(ranges(pattern(pa))) + first_overlap_width,
                       width=width(subject_ins))
subject_gap <- IRanges(start(ranges(subject(pa))) + first_overlap_width,
                       width=width(pattern_ins))
pattern_overlaps <- setdiff(ranges(pattern(pa)), pattern_gap)
subject_overlaps <- setdiff(ranges(subject(pa)), subject_gap)

Then:

Views(dna1, pattern_overlaps)
#   Views on a 41-letter DNAString subject
# subject: NNNNNNATCGTTAAGCCCNNNNNTAGCAAAGGTAAANNNNN
# views:
#     start end width
# [1]     7  18    12 [ATCGTTAAGCCC]
# [2]    24  36    13 [TAGCAAAGGTAAA]

Views(dna2, subject_overlaps)
#   Views on a 43-letter DNAString subject
# subject: ++ATCGTTAAGCCC+++++++++++++++TAGCAAAGGTAAA+
# views:
#     start end width
# [1]     3  14    12 [ATCGTTAAGCCC]
# [2]    30  42    13 [TAGCAAAGGTAAA]

Hope this helps,

H.