How to find multiple overlaps with pairwiseAlignment()?
1
0
Entering edit mode
kgorczak ▴ 10
@kgorczak-8529
Last seen 6.8 years ago
Belgium

I have two DNA sequences with two overlaps (one 557 bp long and one 140 bp long). Both overlaps are exactly the same between the DNA sequences (no mismatches between them). I would like to use the pairwiseAlignment() function in order to find them. However, that function returns always only one overlap, the first found one (in my case, the one of length 557 bp). Is it possible to find both overlapping sequences and their lengths?

For example:

DNA1: ATCGTTTAAGCCCGGTTATAC.............TAGCAAAGGTAAA...........

DNA2: ATCGTTTAAGCCCGGTTATAC..........................TAGCAAAGGTAAA

There are two overlapping sequences:

seq1: ATCGTTTAAGCCCGGTTATAC

seq2: TAGCAAAGGTAAA

pairwiseAlignment() will return ATCGTTTAAGCCCGGTTATAC (length: 21 bp). How can I get the information about the second overlapping sequence?

biostrings pairwisealignment • 1.6k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 16 hours ago
Seattle, WA, United States

Hi Katarzyna,

I think that you need to perform a local alignment for this (type="local"). Also it looks like you'll get best results by specifying gapExtension=0:

library(Biostrings)
dna1 <-
    DNAString("NNNNNNATCGTTAAGCCCNNNNNTAGCAAAGGTAAANNNNN")
dna2 <-
    DNAString("++ATCGTTAAGCCC+++++++++++++++TAGCAAAGGTAAA+")
pa <- pairwiseAlignment(dna1, dna2, type="local",
                        gapExtension=0)
pa
# Local PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [7] ATCGTTAAGCCC---------------NNNNNTAGCAAAGGTAAA 
# subject: [3] ATCGTTAAGCCC+++++++++++++++-----TAGCAAAGGTAAA 
# score: 29.5439

See more details with:

writePairwiseAlignments(pa)
########################################
# Program: Biostrings (version 2.46.0), a Bioconductor package
# Rundate: Fri Jan 26 12:01:59 2018
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 10.0
# Extend_penalty: 0.0
#
# Length: 45
# Identity:      25/45 (55.6%)
# Similarity:    NA/45 (NA%)
# Gaps:          20/45 (44.4%)
# Score: 29.5439
#
#
#=======================================

P1                 7 ATCGTTAAGCCC---------------NNNNNTAGCAAAGGTAAA     36
                     ||||||||||||                    |||||||||||||
S1                 3 ATCGTTAAGCCC+++++++++++++++-----TAGCAAAGGTAAA     42

#---------------------------------------
#---------------------------------------

Unfortunately, programmatically getting the ranges of the overlaps is a lot trickier than one wishes:

pattern_ins <- indel(pattern(pa))[[1]]
subject_ins <- indel(subject(pa))[[1]]
relative_gap <- union(pattern_ins, subject_ins)
stopifnot(length(relative_gap) == 1)  # only one gap is expected
first_overlap_width <- start(relative_gap) - 1
pattern_gap <- IRanges(start(ranges(pattern(pa))) + first_overlap_width,
                       width=width(subject_ins))
subject_gap <- IRanges(start(ranges(subject(pa))) + first_overlap_width,
                       width=width(pattern_ins))
pattern_overlaps <- setdiff(ranges(pattern(pa)), pattern_gap)
subject_overlaps <- setdiff(ranges(subject(pa)), subject_gap)

Then:

Views(dna1, pattern_overlaps)
#   Views on a 41-letter DNAString subject
# subject: NNNNNNATCGTTAAGCCCNNNNNTAGCAAAGGTAAANNNNN
# views:
#     start end width
# [1]     7  18    12 [ATCGTTAAGCCC]
# [2]    24  36    13 [TAGCAAAGGTAAA]

Views(dna2, subject_overlaps)
#   Views on a 43-letter DNAString subject
# subject: ++ATCGTTAAGCCC+++++++++++++++TAGCAAAGGTAAA+
# views:
#     start end width
# [1]     3  14    12 [ATCGTTAAGCCC]
# [2]    30  42    13 [TAGCAAAGGTAAA]

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Hi Hervé,

Thank you for the very clear answer. Do I understand correctly that your script will only find two overlapping regions?

Kind regards,

K.

 

ADD REPLY
0
Entering edit mode

Yes, the code I provided only works if there are only 2 overlapping regions that match with no indels (but mismatches should be ok). I'm afraid generalizing it to support N overlapping regions could get quite complicated...

Cheers,  

H.

ADD REPLY

Login before adding your answer.

Traffic: 882 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6