Question

Biostrings: problem to access indel-details (nucleotide sequence) form pairwiseAlignment()

0

Entering edit mode

naktang1 • 0

@naktang1-8825

Last seen 4.7 years ago

Thailand

How to get the nucleotide at the indel position after I use pairwise pairwiseAlignment()

It is easy way to get the position of indel of pattern or subject in pairwiseAlignment()

For example,

start(indel(pattern(align)))) ## can get indel position in pattern

start(indel(subject(align))))  ## can get indel position in subject

However, how to get the nucleotide at indel position If I have an A nucleotide as insertion occur

so I want to get

A nucleotide in pattern in one column of new dataframe and - symbol in subject in another column of a new dataframe so how to do that

I also read this link https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-July/000466.html but didn't get any idea.

bioconductor biostrings • 1.5k views

ADD COMMENT • link updated 8.6 years ago by Hervé Pagès 16k • written 8.6 years ago by naktang1 • 0

score 0 · Answer 1 · 2015-09-22

Hi,

The interface for extracting useful information from the PairwiseAlignments object returned by pairwiseAlignment() is admittedly awkward and lacks flexibility. I'm not aware of any easy way to extract the sequences (nucleotide content in your case) of the indels so below I'm providing a function that you can use for that.

indelSeqs <- function(x)
{
    stopifnot(is(x, "PairwiseAlignments"), length(x) == 1L)
    aligned_pattern <- pattern(x)
    aligned_subject <- subject(x)

    ## Compute insertion ranges with respect to aligned pattern.
    ins_ranges <- indel(aligned_subject)[[1L]]
    shift <- c(0L, head(cumsum(width(ins_ranges)), n=-1L))
    ins_ranges <- shift(ins_ranges, shift)

    ## Compute deletion ranges with respect to aligned subject.
    del_ranges <- indel(aligned_pattern)[[1L]]
    shift <- c(0L, head(cumsum(width(del_ranges)), n=-1L))
    del_ranges <- shift(del_ranges, shift)

    ## Turn 'aligned_pattern' and 'aligned_subject' into XString
    ## objects.
    ins_seqs_class <- paste0(seqtype(aligned_pattern), "String")
    aligned_pattern <- as(as.character(aligned_pattern),
                          ins_seqs_class)
    del_seqs_class <- paste0(seqtype(aligned_subject), "String")
    aligned_subject <- as(as.character(aligned_subject),
                          del_seqs_class)

    ## Extract the indel sequences.
    ins_seqs <- extractAt(aligned_pattern, ins_ranges)
    del_seqs <- extractAt(aligned_subject, del_ranges)
    list(insertion=ins_seqs, deletion=del_seqs)
}

The function takes the PairwiseAlignments object returned by pairwiseAlignment() and returns a named list of 2 DNAStringSet objects (AAStringSet if the pattern and subject are amino-acid sequences):

library(Biostrings)
pattern <- DNAString("AGGGGGGTCTCATCTCGGTCTC")
subject <- DNAString("AGGGTGGGTCTCTCTCTCTC")
pa <- pairwiseAlignment(pattern, subject)
pa
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [1] AGGG-GGGTCTCATCTCGGTCTC 
# subject: [1] AGGGTGGGTCTC-TCTC--TCTC 
# sc ore: -8.346632 

#writePairwiseAlignments(pa)

indelSeqs(pa)
# $insertion
#   A DNAStringSet instance of length 2
#     width seq
# [1]     1 A
# [2]     2 GG
#
# $deletion
#   A DNAStringSet instance of length 1
#     width seq
# [1]     1 T

By convention "insertion" and "deletion" mean "insertion in the subject" and "deletion from the subject", respectively. This is consistent with the convention used for the cigars in BAM files where I and D stand for insertion and deletion with respect to the reference.

Note that the "insertion" and "deletion" components are guaranteed to have 1 sequence per range in indel(subject(pa))[[1]] and indel(pattern(pa))[[1]], respectively, and in the same order:

indel(subject(pa))[[1]]
# IRanges of length 2
#     start end width
# [1]    13  13     1
# [2]    17  18     2

indelSeqs(pa)$insertion  # parallel to indel(subject(pa))[[1]]
#   A DNAStringSet instance of length 2
#     width seq
# [1]     1 A
# [2]     2 GG

indel(pattern(pa))[[1]]
# IRanges of length 1
#     start end width
# [1]     5   5     1

indelSeqs(pa)$deletion   # parallel to indel(pattern(pa))[[1]]
#   A DNAStringSet instance of length 1
#     width seq
# [1]     1 T

Finally note that indelSeqs() only takes a PairwiseAlignments object of length 1. If you supply more than 1 pattern in your call to pairwiseAlignment(), you'll get a PairwiseAlignments object of length the number of patterns you supplied. Then to use indelSeqs() you'll have to subset the PairwiseAlignments object first e.g. indelSeqs(pa[4]) to get the indel sequences between the 4-th pattern and the subject. In other words, indelSeqs() is not vectorized.

The function was lightly tested only so please let me know if you find any problem with it.

Cheers,

H.