Biostrings pairwiseAligment() output shorter than source sequence
3
0
Entering edit mode
@jeanpeccoud-8564
Last seen 8.7 years ago

 

Hi, when doing 100 000+ pairwise alignments with pairwiseAlignment(), I noticed that some (only 3) aligned sequences retrieved from the function were shorter than the input sequences. 

Here is one offending sequence pair:

> seqs1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                GL349802 
"GTCTGTACAGTCTGAACTGTACCATAGAAAATATATAGGTACTACTGTTACTAGTTACTCCAACTGTTGTAGGGTTTGTTGTATAAGTCATCGTCATGACTCATGACGAAGTATTAATTATGCCATGATGGTCAGATGTTGCACGGTATTCTTCAACTGTTATTTGTGCGCCATCTTGATCTTATGTTCTTAAATAACGCCTTAGGTACTGGCTATACACCTACTGTCTTAGCGTAAATCGTGTTGACAATCACACATAACATTTTTATCATTGAACATTATTATTTCTCTTGAGTTGTTAATTTATTGTTAAACCCAACACGGACGTTTCATCGTTAATTATCAATAACTGTTTTGAGTGACTATTTCGATTTTCGTTTACTGTACGCGTGTAACTATACGTTAACTATACGTTTTGTGGCGGACAGCAGACCACGATGGAAAACGAGAGAACGTACGTAGTACCTAGCTTGTTGCAAACAGCGGCACCGAACGCTTGCACCGAGATCGAGCGTGGCAGGTTCGACTTTCTGGAGGTACCGGACGTCGATATGCGACATTGTTGCGAGTCGCTGTTAGAAATATTAAAAATCGAGAAAAACCGAAACGAGGGCAACGAAGGACAGACCATGTCCATCAGGTTGCAGTTGACCAGCAAGGACAAGAGGTTTCTAAACATATGCGAGGAGGCAGCGTCGCATGGGCACATAAATTGCCTGAAACTCGCTCACGAGATTGGCGTCCCTTGGAACGATAAGGCGTGCAAACGGGCTGCGAGTTCGGGGCATCTGGAGTGTCTACGATACGCACGGGAAAACGGGTGCCCGTGGGACGAGAATACGTGCAGCAATGCCGCATTAAAGGGTCACATTGACTGCCTAATATACGCACGGGAAAACGGGTGCCCTTGGAATGAGAATACTTGCGGCCATGCCGCATTGAACGGTCACATGGACTGCCTGATATACGCACGGGAAAACGGGTGCCCTTGGGACCATAATACATGTACCGGTGCCGCAATGAAGGGTCGCATAGACTGCCTGATATACGCACGGGAAAACGGGTGCCCTTGGAACCTTAATACATGTAACGGTGCCGCAATGGAGGGTCACAAGGACTGCCTGGTTTATGCACGGGAAAACGGGTGCCCGTGGAACGAGTTAACGTGCAGCTTTGCCGCATTTTCCGGTCACACTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGACAAGTTTACGTGCAACGAAGCCGCATTAAAGGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGACACGGATACATGCAACTTTGCAGCGAAATACGGACACAAGGACTGCCTAGTTTACGCACGGGAAAACGGTTGCCCTTGGAATGAGAAAACTTGCAGCTATGCCGCATTGAACGGTCACATGGACTGCTTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCGAATACGTGCACCTTTGCAGCGAAGAACGGACACAAGGACTGCCTAGTTTACGCACGGGAATACGGGTGCCCTTGGAACGACTGGACATGCTTAAGTGCCGCAACAAACGGTCACATGGACTGCCTGGTATACGCCCGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCAGCTATGCCGCTTTGAATGGTCACATGGCCTGCCTGATGTACTCACGGGAAAACGGGTGCCCGTGGGACGCGGATACGTGCAACTTTGCCGCGAAGAACGGCCACAAGGACTGCCTAGTGTACGCACGGGAAAACGGGTGCCCTTGGAATGAGAAAACTTGCAGGAATGCCGCATTGAACGGTGACATAGACTGCCTGATATACGCACGGTTAAACGGGTGCCCGTGGGACGAGAAAACGTGCAGCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGTCCGTGGGACGCGGATACGTGCAACTTTTCCGC" 
> seqs2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       GL350841 
"GTCTGTACAGTCTGAACTGTACCATAGAAAATATATAGGTACTACTGTTACTAGTTACTCCAACTGTTGTAGGGTTTGTTGTATAAGTCGTCGTCATGACTCATGACGAAATATTAATTATGCCATGATGGTCAGATCTTGCACGGTTAGCTTAGGTTATTCTTCAACTGTTATTCGTGCGCCATCTTGATCTTATGTTCTTAAAAACGCCTTAGGTACTGGCTATACACCTACTGTCTTAGCGTAAATCGTGTTGACAATCACACATAACATTTTTATCATTGAACATTATTATTTCTCTTGAGTTATTAATTTATTGTTAAAGTCAACACGGACGTTTCATCGTTAATTATCAATAACTGTTTTGAGTGACTATTTCGATTTTCGTTTACTGTACGCGTGTAAGCACTCAACTATACGTTTTGTGGCGGACAGCAGACCACGATGGCAAACGAGAGAACGTACGTAGTACCTAGCTTGTTGCAAACGGCAGCACCGAACGCTTGCACCGAGATCGAGCGTGGCAGGTTCGACTTTCTGGAGGTACCGGACGTCGATATGCGACATTGTTGCGAGTCGCTGTTAGAAATATTAAAAATCGAGAAAAACCGAAACGAGGGCAACGAAGGACAGACCATGTTCATCAGGTTGCAGTTGACCAGCAAGGACAAGAGGTTTCTAAACATATGCGAGGAGGCAGCGTCGCACGGGCACATAAATTGCCTGAAACTCGCTCACGAGATTGGCGTCCCTTGGAACGATAAGGCGTGCAAACGGGCTGCGAGTTCGGGGCATCTGGAGTGTCTACGATACGCACGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCGGCCATGCCGCATTGAACGGTCACATGGACTGCCTGATATACGCACGGGAAAACGGGTGCCCTTGGAACTTTGATACATGCGCCAATGCCGCAATGGAGGGTCACAAGGACTGCCTGGTTTATGCACGGGAAAACGGGTGCCCGTGGAATGAGTATACGTGCACCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGGCAAGTATACGTGCAGCGAAGCCGCATCAAAGGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCAGATACGTGCAACTTTGCAGCGAAATACGGACACAAGGACTGCCTAGTTTACGCACGGGAAAACGGTTGCCCTTGGAATGAGAAAACTTGCAGCTATGCCGCATTGAACGGTCACATGGACTGCTTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCGAATACGTGCACCTTTGCAGCAAAGAACGGACACAAGGACTGCCTAGTTTACGCACGGGAATACGGGTGCCCTTGGAACGACTGGACATGCTTAAGTGCCGCAACAAACGGTCACATGGACTGCCTGGTATACGCCCGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCAGCTATGCCGCTTTGAATGGTCACATGGACTGCCTGATGTACTCACGGGAAAACGGGTGCCCGTGGGACGCGGATACGTGCAACTTTGCCGCGAAGAACGGCCACAAGGACTGCCTAGTTTACGCACGGGAAAACGGGTGCCCTTGGAATGAGAAAACTTGCAGGAATGCCGCATTGAACGGTGACATAGACTGCCTGATATACGCACGGTTAAACGGGTGCCCGTGGGACGAGAAAACGTGCAGCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGTCCGTGGGACGCGGATACGTGCAACTTTTCCGCGAAGAACGGCCACAAGGACTGCCTAGTTTACGCACGGGAAAACGGGTGCCCTTGGAATGAGAAAACTTGCAGCTATGCCGCATTGAACGGTCACATGGACTGCCTGATATTTGCACGGGAAAAAGGGTGCCCGTGGAACGAGAATACGTGCAGCTTTGCCGC" 

I called aln = pairwiseAlignment(seqs1,seqs2) and put the results into a string with:  char = as.character(pattern(aln)).

I then removed the deletions with char = gsub("-","",char)

Here is what I get for "char":

> char
[1] "GTCTGTACAGTCTGAACTGTACCATAGAAAATATATAGGTACTACTGTTACTAGTTACTCCAACTGTTGTAGGGTTTGTTGTATAAGTCGTCGTCATGACTCATGACGAAATATTAATTATGCCATGATGGTCAGATCTTGCACGGTTAGCTTAGGTTATTCTTCAACTGTTATTCGTGCGCCATCTTGATCTTATGTTCTTAAAAACGCCTTAGGTACTGGCTATACACCTACTGTCTTAGCGTAAATCGTGTTGACAATCACACATAACATTTTTATCATTGAACATTATTATTTCTCTTGAGTTATTAATTTATTGTTAAAGTCAACACGGACGTTTCATCGTTAATTATCAATAACTGTTTTGAGTGACTATTTCGATTTTCGTTTACTGTACGCGTGTAAGCACTCAACTATACGTTTTGTGGCGGACAGCAGACCACGATGGCAAACGAGAGAACGTACGTAGTACCTAGCTTGTTGCAAACGGCAGCACCGAACGCTTGCACCGAGATCGAGCGTGGCAGGTTCGACTTTCTGGAGGTACCGGACGTCGATATGCGACATTGTTGCGAGTCGCTGTTAGAAATATTAAAAATCGAGAAAAACCGAAACGAGGGCAACGAAGGACAGACCATGTTCATCAGGTTGCAGTTGACCAGCAAGGACAAGAGGTTTCTAAACATATGCGAGGAGGCAGCGTCGCACGGGCACATAAATTGCCTGAAACTCGCTCACGAGATTGGCGTCCCTTGGAACGATAAGGCGTGCAAACGGGCTGCGAGTTCGGGGCATCTGGAGTGTCTACGATACGCACGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCGGCCATGCCGCATTGAACGGTCACATGGACTGCCTGATATACGCACGGGAAAACGGGTGCCCTTGGAACTTTGATACATGCGCCAATGCCGCAATGGAGGGTCACAAGGACTGCCTGGTTTATGCACGGGAAAACGGGTGCCCGTGGAATGAGTATACGTGCACCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGGCAAGTATACGTGCAGCGAAGCCGCATCAAAGGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCAGATACGTGCAACTTTGCAGCGAAATACGGACACAAGGACTGCCTAGTTTACGCACGGGAAAACGGTTGCCCTTGGAATGAGAAAACTTGCAGCTATGCCGCATTGAACGGTCACATGGACTGCTTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCGAATACGTGCACCTTTGCAGCAAAGAACGGACACAAGGACTGCCTAGTTTACGCACGGGAATACGGGTGCCCTTGGAACGACTGGACATGCTTAAGTGCCGCAACAAACGGTCACATGGACTGCCTGGTATACGCCCGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCAGCTATGCCGCTTTGAATGGTCACATGGACTGCCTGATGTACTCACGGGAAAACGGGTGCCCGTGGGACGCGGATACGTGCAACTTTGCCGCGAAGAACGGCCACAAGGACTGCCTAGTTTACGCACGGGAAAACGGGTGCCCTTGGAATGAGAAAACTTGCAGGAATGCCGCATTGAACGGTGACATAGACTGCCTGATATACGCACGGTTAAACGGGTGCCCGTGGGACGAGAAAACGTGCAGCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGTCCGTGGGACGCGGATACGTGCAACTTTTCCGC"

Which omits the last 162 bases of seqs2. It appears that these bases correspond to a repeated pattern. 

Thanks for looking into it. 

Jean

Edit : the issue appears also with the "local" alignment type. And interestingly, seqs1 is not truncated. It seems that seqs2 was truncated before the alignment. 

Biostrings • 2.4k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 28 minutes ago
Seattle, WA, United States

Hi Jean,

First there is a printing issue. On a simpler example:

aln <- pairwiseAlignment("CACAGTTT", "AACACAG")
aln
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: [1] CACAG 
# subject: [3] CACAG 
# score: -30.09122 

As you can see, the aligned pattern and subject are trimmed when printed. See this old post from Patrick Aboyoun, the original author of pairwiseAlignment(), about this:

  https://stat.ethz.ch/pipermail/bioconductor/2010-February/031577.html

Even more confusing is that the pattern() and subject() accesors also do that trimming:

> pattern(aln)
[1] CACAG 
> subject(aln)
[3] CACAG 

They probably shouldn't but that's how they've behaved since the beginning.

Calling writePairwiseAlignments(aln) displays the alignment in a more user-friendly way:

########################################
# Program: Biostrings (version 2.36.2), a Bioconductor package
# Rundate: Mon Aug 10 16:07:42 2015
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 14.0
# Extend_penalty: 4.0
#
# Length: 10
# Identity:       5/10 (50.0%)
# Similarity:    NA/10 (NA%)
# Gaps:           5/10 (50.0%)
# Score: -30.09122
#
#
#=======================================

P1                 1 --CACAGTTT      8
                       |||||   
S1                 1 AACACAG---      7

#---------------------------------------
#---------------------------------------

If you want to compute the non-trimmed aligned pattern and subject from aln, a dirty workaround is to do:

Biostrings:::.makePostalignedSeqs(aln)[[1L]]
#   A BStringSet instance of length 2
#     width seq                                               names               
# [1]    10 --CACAGTTT                                        
# [2]    10 AACACAG---                    

Internal utility Biostrings:::.makePostalignedSeqs() is what writePairwiseAlignments() uses to produce the user-friendly output above.

Not a satisfactory solution though. Maybe we should add a trim.gaps argument or something like that to the pattern() and subject() accesors to let the user decide what to do with the gaps at the ends of the alignments?

H.

ADD COMMENT
0
Entering edit mode
@jeanpeccoud-8564
Last seen 8.7 years ago

Thanks for the reply. So it's because of a trailing gap? It would indeed be great to have the option to retrieve the whole aligned sequences (I think it should be the default behavior). 

Jean

ADD COMMENT
0
Entering edit mode
@jeanpeccoud-8564
Last seen 8.7 years ago

Actually, the workaround isn't ideal since Biostrings:::.makePostalignedSeqs() method only works for PairwiseAlignments object of length 1. I align pairs of sequences by batches of 1000, which his much faster. 

Is there another option for me?

Thanks.

Jean

ADD COMMENT
0
Entering edit mode

Hi Jean,

To extract the aligned sequences from a PairwiseAlignments object of length > 1 you can use the following code (which basically calls Biostrings:::.makePostalignedSeqs() in a loop):

alignedSequences <- function(x)
{
    seqs <- lapply(seq_along(x),
              function(i) Biostrings:::.makePostalignedSeqs(x[i])[[1L]])
    seqs <- XVector:::unlist_list_of_XVectorList(class(seqs[[1L]]), seqs, FALSE)
    list(aligned_patterns=seqs[c(TRUE, FALSE)],
         aligned_subjects=seqs[c(FALSE, TRUE)])
}

Then:

aln <- pairwiseAlignment(c("AA", "ACCTTT"), c("AAGG", "CCTTT"))
alignedSequences(aln)
# $aligned_patterns
#   A BStringSet instance of length 2
#     width seq                                               names               
# [1]     4 AA--                                              
# [2]     6 ACCTTT                                            
#
# $aligned_subjects
#   A BStringSet instance of length 2
#     width seq                                               names               
# [1]     4 AAGG                                              
# [2]     6 -CCTTT                                            

Unfortunately this is pretty slow (might take a couple of minutes on each of your batches of 1000 aligned pairs).

I agree this is not ideal and we will need to work on something better to let the user retrieve the whole aligned sequences.

H.

ADD REPLY
0
Entering edit mode

Thanks. For now I think I'll use other tools to align my sequences. An alternative solution would be to extract the size of the leading gaps. It doesn't seem that the indel() method does it. 

I need the size of the leading gap or else the alignment is useless of me. That's because I'm doing stuff at specific positions of the aligned sequences (which are determined before the alignment), and this won't work if the aligned sequences are truncated at the left.

Jean

ADD REPLY
0
Entering edit mode

You can get the size of the leading gap with:

patternLeadingGap <- function(x)
{
  stopifnot(is(x, "PairwiseAlignments"), type(x) == "global")
  start(subject(x)) - 1L
}
subjectLeadingGap <- function(x)
{
  stopifnot(is(x, "PairwiseAlignments"), type(x) == "global")
  start(pattern(x)) - 1L
}

For example:

p <- DNAStringSet(c("AA", "ACCTTT", "AAAAAC"))
s <- DNAStringSet(c("AAGG", "CCTTT", "GGGAAAAA"))
aln <- pairwiseAlignment(p, s, gapOpening=1, gapExtension=1)
patternLeadingGap(aln)
# [1] 0 0 3
subjectLeadingGap(aln)
# [1] 0 1 0

This is very fast.

H.

ADD REPLY
0
Entering edit mode

Great! Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 709 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6