Biostrings: problem to access indel-details (nucleotide sequence) form pairwiseAlignment()
1
0
Entering edit mode
naktang1 • 0
@naktang1-8825
Last seen 4.7 years ago
Thailand

How to get the nucleotide at the indel position after I use pairwise pairwiseAlignment()

It is easy way to get the position of indel  of pattern or subject in pairwiseAlignment()

For example,

start(indel(pattern(align)))) ## can get indel position in pattern
start(indel(subject(align))))  ## can get indel position in subject

 

However, how to get the nucleotide at indel position If I have an A nucleotide as insertion occur

so I want to get

A nucleotide in pattern in one column of new dataframe and - symbol in subject in another column of a new dataframe so how to do that

 

I also read this link https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-July/000466.html but didn't get any idea.

bioconductor biostrings • 1.5k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 5 hours ago
Seattle, WA, United States

Hi,

The interface for extracting useful information from the PairwiseAlignments object returned by pairwiseAlignment() is admittedly awkward and lacks flexibility. I'm not aware of any easy way to extract the sequences (nucleotide content in your case) of the indels so below I'm providing a function that you can use for that.

indelSeqs <- function(x)
{
    stopifnot(is(x, "PairwiseAlignments"), length(x) == 1L)
    aligned_pattern <- pattern(x)
    aligned_subject <- subject(x)

    ## Compute insertion ranges with respect to aligned pattern.
    ins_ranges <- indel(aligned_subject)[[1L]]
    shift <- c(0L, head(cumsum(width(ins_ranges)), n=-1L))
    ins_ranges <- shift(ins_ranges, shift)

    ## Compute deletion ranges with respect to aligned subject.
    del_ranges <- indel(aligned_pattern)[[1L]]
    shift <- c(0L, head(cumsum(width(del_ranges)), n=-1L))
    del_ranges <- shift(del_ranges, shift)

    ## Turn 'aligned_pattern' and 'aligned_subject' into XString
    ## objects.
    ins_seqs_class <- paste0(seqtype(aligned_pattern), "String")
    aligned_pattern <- as(as.character(aligned_pattern),
                          ins_seqs_class)
    del_seqs_class <- paste0(seqtype(aligned_subject), "String")
    aligned_subject <- as(as.character(aligned_subject),
                          del_seqs_class)

    ## Extract the indel sequences.
    ins_seqs <- extractAt(aligned_pattern, ins_ranges)
    del_seqs <- extractAt(aligned_subject, del_ranges)
    list(insertion=ins_seqs, deletion=del_seqs)
}

The function takes the PairwiseAlignments object returned by pairwiseAlignment() and returns a named list of 2 DNAStringSet objects (AAStringSet if the pattern and subject are amino-acid sequences):

library(Biostrings)
pattern <- DNAString("AGGGGGGTCTCATCTCGGTCTC")
subject <- DNAString("AGGGTGGGTCTCTCTCTCTC")
pa <- pairwiseAlignment(pattern, subject)
pa
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [1] AGGG-GGGTCTCATCTCGGTCTC 
# subject: [1] AGGGTGGGTCTC-TCTC--TCTC 
# sc ore: -8.346632 

#writePairwiseAlignments(pa)

indelSeqs(pa)
# $insertion
#   A DNAStringSet instance of length 2
#     width seq
# [1]     1 A
# [2]     2 GG
#
# $deletion
#   A DNAStringSet instance of length 1
#     width seq
# [1]     1 T

By convention "insertion" and "deletion" mean "insertion in the subject" and "deletion from the subject", respectively. This is consistent with the convention used for the cigars in BAM files where I and D stand for insertion and deletion with respect to the reference.

Note that the "insertion" and "deletion" components are guaranteed to have 1 sequence per range in indel(subject(pa))[[1]] and indel(pattern(pa))[[1]], respectively, and in the same order:

indel(subject(pa))[[1]]
# IRanges of length 2
#     start end width
# [1]    13  13     1
# [2]    17  18     2

indelSeqs(pa)$insertion  # parallel to indel(subject(pa))[[1]]
#   A DNAStringSet instance of length 2
#     width seq
# [1]     1 A
# [2]     2 GG

indel(pattern(pa))[[1]]
# IRanges of length 1
#     start end width
# [1]     5   5     1

indelSeqs(pa)$deletion   # parallel to indel(pattern(pa))[[1]]
#   A DNAStringSet instance of length 1
#     width seq
# [1]     1 T

Finally note that indelSeqs() only takes a PairwiseAlignments object of length 1. If you supply more than 1 pattern in your call to pairwiseAlignment(), you'll get a PairwiseAlignments object of length the number of patterns you supplied. Then to use indelSeqs() you'll have to subset the PairwiseAlignments object first e.g. indelSeqs(pa[4]) to get the indel sequences between the 4-th pattern and the subject. In other words, indelSeqs() is not vectorized.

The function was lightly tested only so please let me know if you find any problem with it.

Cheers,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 921 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6