Hi,
The interface for extracting useful information from the PairwiseAlignments object returned by pairwiseAlignment()
is admittedly awkward and lacks flexibility. I'm not aware of any easy way to extract the sequences (nucleotide content in your case) of the indels so below I'm providing a function that you can use for that.
indelSeqs <- function(x)
{
stopifnot(is(x, "PairwiseAlignments"), length(x) == 1L)
aligned_pattern <- pattern(x)
aligned_subject <- subject(x)
## Compute insertion ranges with respect to aligned pattern.
ins_ranges <- indel(aligned_subject)[[1L]]
shift <- c(0L, head(cumsum(width(ins_ranges)), n=-1L))
ins_ranges <- shift(ins_ranges, shift)
## Compute deletion ranges with respect to aligned subject.
del_ranges <- indel(aligned_pattern)[[1L]]
shift <- c(0L, head(cumsum(width(del_ranges)), n=-1L))
del_ranges <- shift(del_ranges, shift)
## Turn 'aligned_pattern' and 'aligned_subject' into XString
## objects.
ins_seqs_class <- paste0(seqtype(aligned_pattern), "String")
aligned_pattern <- as(as.character(aligned_pattern),
ins_seqs_class)
del_seqs_class <- paste0(seqtype(aligned_subject), "String")
aligned_subject <- as(as.character(aligned_subject),
del_seqs_class)
## Extract the indel sequences.
ins_seqs <- extractAt(aligned_pattern, ins_ranges)
del_seqs <- extractAt(aligned_subject, del_ranges)
list(insertion=ins_seqs, deletion=del_seqs)
}
The function takes the PairwiseAlignments object returned by pairwiseAlignment()
and returns a named list of 2 DNAStringSet objects (AAStringSet if the pattern and subject are amino-acid sequences):
library(Biostrings)
pattern <- DNAString("AGGGGGGTCTCATCTCGGTCTC")
subject <- DNAString("AGGGTGGGTCTCTCTCTCTC")
pa <- pairwiseAlignment(pattern, subject)
pa
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [1] AGGG-GGGTCTCATCTCGGTCTC
# subject: [1] AGGGTGGGTCTC-TCTC--TCTC
# sc ore: -8.346632
#writePairwiseAlignments(pa)
indelSeqs(pa)
# $insertion
# A DNAStringSet instance of length 2
# width seq
# [1] 1 A
# [2] 2 GG
#
# $deletion
# A DNAStringSet instance of length 1
# width seq
# [1] 1 T
By convention "insertion" and "deletion" mean "insertion in the subject" and "deletion from the subject", respectively. This is consistent with the convention used for the cigars in BAM files where I and D stand for insertion and deletion with respect to the reference.
Note that the "insertion" and "deletion" components are guaranteed to have 1 sequence per range in indel(subject(pa))[[1]]
and indel(pattern(pa))[[1]]
, respectively, and in the same order:
indel(subject(pa))[[1]]
# IRanges of length 2
# start end width
# [1] 13 13 1
# [2] 17 18 2
indelSeqs(pa)$insertion # parallel to indel(subject(pa))[[1]]
# A DNAStringSet instance of length 2
# width seq
# [1] 1 A
# [2] 2 GG
indel(pattern(pa))[[1]]
# IRanges of length 1
# start end width
# [1] 5 5 1
indelSeqs(pa)$deletion # parallel to indel(pattern(pa))[[1]]
# A DNAStringSet instance of length 1
# width seq
# [1] 1 T
Finally note that indelSeqs()
only takes a PairwiseAlignments object of length 1. If you supply more than 1 pattern in your call to pairwiseAlignment()
, you'll get a PairwiseAlignments object of length the number of patterns you supplied. Then to use indelSeqs()
you'll have to subset the PairwiseAlignments object first e.g. indelSeqs(pa[4])
to get the indel sequences between the 4-th pattern and the subject. In other words, indelSeqs()
is not vectorized.
The function was lightly tested only so please let me know if you find any problem with it.
Cheers,
H.