Hi Katarzyna,
I think that you need to perform a local alignment for this (type="local"
). Also it looks like you'll get best results by specifying gapExtension=0
:
library(Biostrings)
dna1 <-
DNAString("NNNNNNATCGTTAAGCCCNNNNNTAGCAAAGGTAAANNNNN")
dna2 <-
DNAString("++ATCGTTAAGCCC+++++++++++++++TAGCAAAGGTAAA+")
pa <- pairwiseAlignment(dna1, dna2, type="local",
gapExtension=0)
pa
# Local PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [7] ATCGTTAAGCCC---------------NNNNNTAGCAAAGGTAAA
# subject: [3] ATCGTTAAGCCC+++++++++++++++-----TAGCAAAGGTAAA
# score: 29.5439
See more details with:
writePairwiseAlignments(pa)
########################################
# Program: Biostrings (version 2.46.0), a Bioconductor package
# Rundate: Fri Jan 26 12:01:59 2018
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 10.0
# Extend_penalty: 0.0
#
# Length: 45
# Identity: 25/45 (55.6%)
# Similarity: NA/45 (NA%)
# Gaps: 20/45 (44.4%)
# Score: 29.5439
#
#
#=======================================
P1 7 ATCGTTAAGCCC---------------NNNNNTAGCAAAGGTAAA 36
|||||||||||| |||||||||||||
S1 3 ATCGTTAAGCCC+++++++++++++++-----TAGCAAAGGTAAA 42
#---------------------------------------
#---------------------------------------
Unfortunately, programmatically getting the ranges of the overlaps is a lot trickier than one wishes:
pattern_ins <- indel(pattern(pa))[[1]]
subject_ins <- indel(subject(pa))[[1]]
relative_gap <- union(pattern_ins, subject_ins)
stopifnot(length(relative_gap) == 1) # only one gap is expected
first_overlap_width <- start(relative_gap) - 1
pattern_gap <- IRanges(start(ranges(pattern(pa))) + first_overlap_width,
width=width(subject_ins))
subject_gap <- IRanges(start(ranges(subject(pa))) + first_overlap_width,
width=width(pattern_ins))
pattern_overlaps <- setdiff(ranges(pattern(pa)), pattern_gap)
subject_overlaps <- setdiff(ranges(subject(pa)), subject_gap)
Then:
Views(dna1, pattern_overlaps)
# Views on a 41-letter DNAString subject
# subject: NNNNNNATCGTTAAGCCCNNNNNTAGCAAAGGTAAANNNNN
# views:
# start end width
# [1] 7 18 12 [ATCGTTAAGCCC]
# [2] 24 36 13 [TAGCAAAGGTAAA]
Views(dna2, subject_overlaps)
# Views on a 43-letter DNAString subject
# subject: ++ATCGTTAAGCCC+++++++++++++++TAGCAAAGGTAAA+
# views:
# start end width
# [1] 3 14 12 [ATCGTTAAGCCC]
# [2] 30 42 13 [TAGCAAAGGTAAA]
Hope this helps,
H.
Hi Hervé,
Thank you for the very clear answer. Do I understand correctly that your script will only find two overlapping regions?
Kind regards,
K.
Yes, the code I provided only works if there are only 2 overlapping regions that match with no indels (but mismatches should be ok). I'm afraid generalizing it to support N overlapping regions could get quite complicated...
Cheers,
H.