Question

obtaining a complete global alignment via pairwiseAlignment

2

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hello, I'm interested in using the pairwiseAlignment function in Biostrings to perform simple alignments of short sequences. I noticed that even the global alignment mode returns an alignment with gapped ends trimmed off. For example: > s1 = DNAString ("AAGGAA") > s2 = DNAString ("GG") > pairwiseAlignment (s1, s2, type = "global") Global PairwiseAlignmentsSingleSubject (1 of 1) pattern: [3] GG subject: [1] GG score: -32.03649 Is it possible to obtain the complete alignment including the (possibly) gapped ends? In this case, that would be: --GG-- AAGGAA For my purposes, having the complete gapped alignment is important. My question seems related to a previous post: https://stat.ethz.ch/pipermail/bioconductor/2012-February/043577.html Thank you in advance. Best wishes, Rob -- Robert K. Bradley, Assistant Member Fred Hutchinson Cancer Research Center http://labs.fhcrc.org/bradleyr/ -- output of sessionInfo(): > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] C attached base packages: [1] parallel grid grDevices utils datasets stats graphics methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg19_1.3.19 BSgenome_1.28.0 org.Hs.eg.db_2.9.0 beanplot_1.1 [5] IlluminaHumanMethylation450kmanifest_0.4.0 goseq_1.12.0 geneLenDataBase_0.99.11 BiasedUrn_1.05 [9] GO.db_2.9.0 RSQLite_0.11.4 DBI_0.2-7 fastcluster_1.1.11 [13] minfi_1.6.0 reshape_0.8.4 plyr_1.8 lattice_0.20-15 [17] FDb.InfiniumMethylation.hg19_1.0.1 rtracklayer_1.20.2 Biostrings_2.28.0 GenomicFeatures_1.12.2 [21] AnnotationDbi_1.22.6 Biobase_2.20.0 GenomicRanges_1.12.4 IRanges_1.18.1 [25] VennDiagram_1.6.0 RColorBrewer_1.0-5 BiocGenerics_0.6.0 gplots_2.11.0.1 [29] MASS_7.3-26 KernSmooth_2.23-10 caTools_1.14 gdata_2.12.0.2 [33] gtools_2.7.1 loaded via a namespace (and not attached): [1] Matrix_1.0-12 R.methodsS3_1.4.2 RCurl_1.95-4.1 Rsamtools_1.12.3 XML_3.95-0.2 biomaRt_2.16.0 bitops_1.0-5 illuminaio_0.2.0 [9] limma_3.16.5 matrixStats_0.8.1 mclust_4.1 mgcv_1.7-24 multtest_2.16.0 nlme_3.1-109 nor1mix_1.1-4 preprocessCore_1.22.0 [17] siggenes_1.34.0 splines_3.0.1 stats4_3.0.1 survival_2.37-4 tools_3.0.1 zlibbioc_1.6.0 -- Sent via the guest posting facility at bioconductor.org.

Alignment GO Cancer BSgenome BSgenome Alignment GO Cancer BSgenome BSgenome • 1.3k views

ADD COMMENT • link updated 10.9 years ago by Martin Morgan 25k • written 10.9 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2013-06-15

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 4 days ago

United States

On 06/14/2013 07:57 PM, Robert K.Bradley [guest] wrote: > > Hello, > > I'm interested in using the pairwiseAlignment function in Biostrings to perform simple alignments of short sequences. I noticed that even the global alignment mode returns an alignment with gapped ends trimmed off. For example: > >> s1 = DNAString ("AAGGAA") >> s2 = DNAString ("GG") >> pairwiseAlignment (s1, s2, type = "global") > Global PairwiseAlignmentsSingleSubject (1 of 1) > pattern: [3] GG > subject: [1] GG > score: -32.03649 > > Is it possible to obtain the complete alignment including the (possibly) gapped ends? In this case, that would be: > --GG-- > AAGGAA > For my purposes, having the complete gapped alignment is important. Hi Robert -- Not an expert on pairwiseAlignment, but I think the return value is a class that contains more information than it is showing > xx = pairwiseAlignment(s2, s1, type="global") > class(xx) [1] "PairwiseAlignmentsSingleSubject" attr(,"package") [1] "Biostrings" > class?PairwiseAlignmentsSingleSubject > aligned(xx) A DNAStringSet instance of length 1 width seq [1] 6 --GG-- Martin > > My question seems related to a previous post: > https://stat.ethz.ch/pipermail/bioconductor/2012-February/043577.html > > Thank you in advance. > > Best wishes, > Rob > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 10.9 years ago Martin Morgan 25k

0

Entering edit mode

Thanks Martin. as.character (aligned (...)) does return a gapped string for one of the two aligned sequences, but even that command trims off unaligned characters. For example: For example: > s1 = DNAString ("AGGGG") > s2 = DNAString ("GGGGA") > xx = pairwiseAlignment (s1, s2, gapOpening = -1) > as.character (xx) [1] "GGGG-" while the whole gapped string is "AGGGG-". I apologize in advance if I'm missing something obvious here. Best, Rob On 6/15/13 6:45 AM, Martin Morgan wrote: > On 06/14/2013 07:57 PM, Robert K.Bradley [guest] wrote: >> >> Hello, >> >> I'm interested in using the pairwiseAlignment function in Biostrings >> to perform simple alignments of short sequences. I noticed that even >> the global alignment mode returns an alignment with gapped ends >> trimmed off. For example: >> >>> s1 = DNAString ("AAGGAA") >>> s2 = DNAString ("GG") >>> pairwiseAlignment (s1, s2, type = "global") >> Global PairwiseAlignmentsSingleSubject (1 of 1) >> pattern: [3] GG >> subject: [1] GG >> score: -32.03649 >> >> Is it possible to obtain the complete alignment including the >> (possibly) gapped ends? In this case, that would be: >> --GG-- >> AAGGAA >> For my purposes, having the complete gapped alignment is important. > > Hi Robert -- Not an expert on pairwiseAlignment, but I think the return > value is a class that contains more information than it is showing > > > xx = pairwiseAlignment(s2, s1, type="global") > > class(xx) > [1] "PairwiseAlignmentsSingleSubject" > attr(,"package") > [1] "Biostrings" > > class?PairwiseAlignmentsSingleSubject > > aligned(xx) > A DNAStringSet instance of length 1 > width seq > [1] 6 --GG-- > > Martin > >> >> My question seems related to a previous post: >> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043577.html >> >> Thank you in advance. >> >> Best wishes, >> Rob >> > >

ADD REPLY • link 10.9 years ago Robert K. Bradley ▴ 40

0

Entering edit mode

On 06/15/2013 09:18 AM, Robert K. Bradley wrote: > Thanks Martin. as.character (aligned (...)) does return a gapped string for one > of the two aligned sequences, but even that command trims off unaligned > characters. For example: > For example: > > s1 = DNAString ("AGGGG") > > s2 = DNAString ("GGGGA") > > xx = pairwiseAlignment (s1, s2, gapOpening = -1) > > as.character (xx) > [1] "GGGG-" > while the whole gapped string is "AGGGG-". > > I apologize in advance if I'm missing something obvious here. Not really doing much better on my end; I'll offer this up then step aside until someone with more expertise chimes in. Maybe this is closer to what you're after, padding aligned(xx) with and left or right overhang of pattern lraligned <- function(pattern, subject, ...) { xx <- pairwiseAlignment (pattern, subject, gapOpening = -1) aln <- aligned(xx) lpad <- substr(pattern, 1L, start(pattern(xx)) - 1L) rpad <- substr(pattern, end(pattern(xx)) + 1L, nchar(pattern)) paste0(lpad, aln, rpad) } > lraligned(DNAString ("AGGGG"), DNAString ("GGGGA"), type="global") [1] "AGGGG-" > lraligned(DNAString ("GGGGA"), DNAString ("AGGGG"), type="global") [1] "-GGGGA" > lraligned(DNAString("AAGGAA"), DNAString("GG")) [1] "AAGGAA" > lraligned(DNAString("GG"), DNAString("AAGGAA")) [1] "--GG--" Martin > > Best, > Rob > > On 6/15/13 6:45 AM, Martin Morgan wrote: >> On 06/14/2013 07:57 PM, Robert K.Bradley [guest] wrote: >>> >>> Hello, >>> >>> I'm interested in using the pairwiseAlignment function in Biostrings >>> to perform simple alignments of short sequences. I noticed that even >>> the global alignment mode returns an alignment with gapped ends >>> trimmed off. For example: >>> >>>> s1 = DNAString ("AAGGAA") >>>> s2 = DNAString ("GG") >>>> pairwiseAlignment (s1, s2, type = "global") >>> Global PairwiseAlignmentsSingleSubject (1 of 1) >>> pattern: [3] GG >>> subject: [1] GG >>> score: -32.03649 >>> >>> Is it possible to obtain the complete alignment including the >>> (possibly) gapped ends? In this case, that would be: >>> --GG-- >>> AAGGAA >>> For my purposes, having the complete gapped alignment is important. >> >> Hi Robert -- Not an expert on pairwiseAlignment, but I think the return >> value is a class that contains more information than it is showing >> >> > xx = pairwiseAlignment(s2, s1, type="global") >> > class(xx) >> [1] "PairwiseAlignmentsSingleSubject" >> attr(,"package") >> [1] "Biostrings" >> > class?PairwiseAlignmentsSingleSubject >> > aligned(xx) >> A DNAStringSet instance of length 1 >> width seq >> [1] 6 --GG-- >> >> Martin >> >>> >>> My question seems related to a previous post: >>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043577.html >>> >>> Thank you in advance. >>> >>> Best wishes, >>> Rob >>> >> >> -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 10.9 years ago Martin Morgan 25k

0

Entering edit mode

Hi guys, It was an early decision of the author of pairwiseAlignment() to not display the indels that occur at the ends of the aligned sequences. The consequence of this is that the alignment "looks" like it's a local alignment even though it's global. Using writePairwiseAlignments() to display the alignments disambiguates this. With your 1st example: s1 <- DNAString("AAGGAA") s2 <- DNAString("GG") xx <- pairwiseAlignment(s1, s2, type="global") Then: > xx Global PairwiseAlignmentsSingleSubject (1 of 1) pattern: [3] GG subject: [1] GG score: -32.03649 > writePairwiseAlignments(xx) ######################################## # Program: Biostrings (version 2.29.7), a Bioconductor package # Rundate: Wed Jun 19 11:15:27 2013 ######################################## #======================================= # # Aligned_sequences: 2 # 1: P1 # 2: S1 # Matrix: NA # Gap_penalty: 14.0 # Extend_penalty: 4.0 # # Length: 6 # Identity: 2/6 (33.3%) # Similarity: NA/6 (NA%) # Gaps: 4/6 (66.7%) # Score: -32.03649 # # #======================================= P1 1 AAGGAA 6 || S1 1 --GG-- 2 #--------------------------------------- #--------------------------------------- With your 2nd example: s1 <- DNAString("AGGGG") s2 <- DNAString("GGGGA") xx <- pairwiseAlignment(s1, s2, gapOpening=-1) Then: > xx Global PairwiseAlignmentsSingleSubject (1 of 1) pattern: [2] GGGG subject: [1] GGGG score: -2.072976 > writePairwiseAlignments(xx) ######################################## # Program: Biostrings (version 2.29.7), a Bioconductor package # Rundate: Wed Jun 19 11:19:40 2013 ######################################## #======================================= # # Aligned_sequences: 2 # 1: P1 # 2: S1 # Matrix: NA # Gap_penalty: 5.0 # Extend_penalty: 4.0 # # Length: 6 # Identity: 4/6 (66.7%) # Similarity: NA/6 (NA%) # Gaps: 2/6 (33.3%) # Score: -2.072976 # # #======================================= P1 1 AGGGG- 5 |||| S1 1 -GGGGA 5 #--------------------------------------- #--------------------------------------- It's sad though that 'aligned(pattern(xx))' and 'aligned(subject(xx))' also trim those indels that are at the ends because they are really part of the aligned sequences. However this behavior has been around for so long now that I'm not sure we want to touch it. What we could to is add an extra argument to the aligned() getter, say 'full.alignment', with default to FALSE, to let the user extract the whole thing. Would that be useful? Cheers, H. On 06/15/2013 10:19 AM, Martin Morgan wrote: > On 06/15/2013 09:18 AM, Robert K. Bradley wrote: >> Thanks Martin. as.character (aligned (...)) does return a gapped >> string for one >> of the two aligned sequences, but even that command trims off unaligned >> characters. For example: >> For example: >> > s1 = DNAString ("AGGGG") >> > s2 = DNAString ("GGGGA") >> > xx = pairwiseAlignment (s1, s2, gapOpening = -1) >> > as.character (xx) >> [1] "GGGG-" >> while the whole gapped string is "AGGGG-". >> >> I apologize in advance if I'm missing something obvious here. > > Not really doing much better on my end; I'll offer this up then step > aside until someone with more expertise chimes in. Maybe this is closer > to what you're after, padding aligned(xx) with and left or right > overhang of pattern > > lraligned <- function(pattern, subject, ...) { > xx <- pairwiseAlignment (pattern, subject, gapOpening = -1) > aln <- aligned(xx) > lpad <- substr(pattern, 1L, start(pattern(xx)) - 1L) > rpad <- substr(pattern, end(pattern(xx)) + 1L, nchar(pattern)) > paste0(lpad, aln, rpad) > } > > > lraligned(DNAString ("AGGGG"), DNAString ("GGGGA"), type="global") > [1] "AGGGG-" > > lraligned(DNAString ("GGGGA"), DNAString ("AGGGG"), type="global") > [1] "-GGGGA" > > lraligned(DNAString("AAGGAA"), DNAString("GG")) > [1] "AAGGAA" > > lraligned(DNAString("GG"), DNAString("AAGGAA")) > [1] "--GG--" > > Martin > >> >> Best, >> Rob >> >> On 6/15/13 6:45 AM, Martin Morgan wrote: >>> On 06/14/2013 07:57 PM, Robert K.Bradley [guest] wrote: >>>> >>>> Hello, >>>> >>>> I'm interested in using the pairwiseAlignment function in Biostrings >>>> to perform simple alignments of short sequences. I noticed that even >>>> the global alignment mode returns an alignment with gapped ends >>>> trimmed off. For example: >>>> >>>>> s1 = DNAString ("AAGGAA") >>>>> s2 = DNAString ("GG") >>>>> pairwiseAlignment (s1, s2, type = "global") >>>> Global PairwiseAlignmentsSingleSubject (1 of 1) >>>> pattern: [3] GG >>>> subject: [1] GG >>>> score: -32.03649 >>>> >>>> Is it possible to obtain the complete alignment including the >>>> (possibly) gapped ends? In this case, that would be: >>>> --GG-- >>>> AAGGAA >>>> For my purposes, having the complete gapped alignment is important. >>> >>> Hi Robert -- Not an expert on pairwiseAlignment, but I think the return >>> value is a class that contains more information than it is showing >>> >>> > xx = pairwiseAlignment(s2, s1, type="global") >>> > class(xx) >>> [1] "PairwiseAlignmentsSingleSubject" >>> attr(,"package") >>> [1] "Biostrings" >>> > class?PairwiseAlignmentsSingleSubject >>> > aligned(xx) >>> A DNAStringSet instance of length 1 >>> width seq >>> [1] 6 --GG-- >>> >>> Martin >>> >>>> >>>> My question seems related to a previous post: >>>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043577.html >>>> >>>> Thank you in advance. >>>> >>>> Best wishes, >>>> Rob >>>> >>> >>> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.9 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks Herv?. Adding such an option 'full.alignment' would be extremely useful to me, since it would let me get access to the complete global alignment with minimal string parsing. On 6/19/13 11:33 AM, Hervé Pagès wrote: > Hi guys, > > It was an early decision of the author of pairwiseAlignment() to not > display the indels that occur at the ends of the aligned sequences. > The consequence of this is that the alignment "looks" like it's a > local alignment even though it's global. > > Using writePairwiseAlignments() to display the alignments disambiguates > this. > > With your 1st example: > > s1 <- DNAString("AAGGAA") > s2 <- DNAString("GG") > xx <- pairwiseAlignment(s1, s2, type="global") > > Then: > > > xx > Global PairwiseAlignmentsSingleSubject (1 of 1) > pattern: [3] GG > subject: [1] GG > score: -32.03649 > > > writePairwiseAlignments(xx) > ######################################## > # Program: Biostrings (version 2.29.7), a Bioconductor package > # Rundate: Wed Jun 19 11:15:27 2013 > ######################################## > #======================================= > # > # Aligned_sequences: 2 > # 1: P1 > # 2: S1 > # Matrix: NA > # Gap_penalty: 14.0 > # Extend_penalty: 4.0 > # > # Length: 6 > # Identity: 2/6 (33.3%) > # Similarity: NA/6 (NA%) > # Gaps: 4/6 (66.7%) > # Score: -32.03649 > # > # > #======================================= > > P1 1 AAGGAA 6 > || > S1 1 --GG-- 2 > > > #--------------------------------------- > #--------------------------------------- > > With your 2nd example: > > s1 <- DNAString("AGGGG") > s2 <- DNAString("GGGGA") > xx <- pairwiseAlignment(s1, s2, gapOpening=-1) > > Then: > > > xx > Global PairwiseAlignmentsSingleSubject (1 of 1) > pattern: [2] GGGG > subject: [1] GGGG > score: -2.072976 > > > writePairwiseAlignments(xx) > ######################################## > # Program: Biostrings (version 2.29.7), a Bioconductor package > # Rundate: Wed Jun 19 11:19:40 2013 > ######################################## > #======================================= > # > # Aligned_sequences: 2 > # 1: P1 > # 2: S1 > # Matrix: NA > # Gap_penalty: 5.0 > # Extend_penalty: 4.0 > # > # Length: 6 > # Identity: 4/6 (66.7%) > # Similarity: NA/6 (NA%) > # Gaps: 2/6 (33.3%) > # Score: -2.072976 > # > # > #======================================= > > P1 1 AGGGG- 5 > |||| > S1 1 -GGGGA 5 > > > #--------------------------------------- > #--------------------------------------- > > It's sad though that 'aligned(pattern(xx))' and 'aligned(subject(xx))' > also trim those indels that are at the ends because they are really > part of the aligned sequences. However this behavior has been around for > so long now that I'm not sure we want to touch it. What we could to > is add an extra argument to the aligned() getter, say 'full.alignment', > with default to FALSE, to let the user extract the whole thing. > > Would that be useful? > > Cheers, > H. > > > On 06/15/2013 10:19 AM, Martin Morgan wrote: >> On 06/15/2013 09:18 AM, Robert K. Bradley wrote: >>> Thanks Martin. as.character (aligned (...)) does return a gapped >>> string for one >>> of the two aligned sequences, but even that command trims off unaligned >>> characters. For example: >>> For example: >>> > s1 = DNAString ("AGGGG") >>> > s2 = DNAString ("GGGGA") >>> > xx = pairwiseAlignment (s1, s2, gapOpening = -1) >>> > as.character (xx) >>> [1] "GGGG-" >>> while the whole gapped string is "AGGGG-". >>> >>> I apologize in advance if I'm missing something obvious here. >> >> Not really doing much better on my end; I'll offer this up then step >> aside until someone with more expertise chimes in. Maybe this is closer >> to what you're after, padding aligned(xx) with and left or right >> overhang of pattern >> >> lraligned <- function(pattern, subject, ...) { >> xx <- pairwiseAlignment (pattern, subject, gapOpening = -1) >> aln <- aligned(xx) >> lpad <- substr(pattern, 1L, start(pattern(xx)) - 1L) >> rpad <- substr(pattern, end(pattern(xx)) + 1L, nchar(pattern)) >> paste0(lpad, aln, rpad) >> } >> >> > lraligned(DNAString ("AGGGG"), DNAString ("GGGGA"), type="global") >> [1] "AGGGG-" >> > lraligned(DNAString ("GGGGA"), DNAString ("AGGGG"), type="global") >> [1] "-GGGGA" >> > lraligned(DNAString("AAGGAA"), DNAString("GG")) >> [1] "AAGGAA" >> > lraligned(DNAString("GG"), DNAString("AAGGAA")) >> [1] "--GG--" >> >> Martin >> >>> >>> Best, >>> Rob >>> >>> On 6/15/13 6:45 AM, Martin Morgan wrote: >>>> On 06/14/2013 07:57 PM, Robert K.Bradley [guest] wrote: >>>>> >>>>> Hello, >>>>> >>>>> I'm interested in using the pairwiseAlignment function in Biostrings >>>>> to perform simple alignments of short sequences. I noticed that even >>>>> the global alignment mode returns an alignment with gapped ends >>>>> trimmed off. For example: >>>>> >>>>>> s1 = DNAString ("AAGGAA") >>>>>> s2 = DNAString ("GG") >>>>>> pairwiseAlignment (s1, s2, type = "global") >>>>> Global PairwiseAlignmentsSingleSubject (1 of 1) >>>>> pattern: [3] GG >>>>> subject: [1] GG >>>>> score: -32.03649 >>>>> >>>>> Is it possible to obtain the complete alignment including the >>>>> (possibly) gapped ends? In this case, that would be: >>>>> --GG-- >>>>> AAGGAA >>>>> For my purposes, having the complete gapped alignment is important. >>>> >>>> Hi Robert -- Not an expert on pairwiseAlignment, but I think the return >>>> value is a class that contains more information than it is showing >>>> >>>> > xx = pairwiseAlignment(s2, s1, type="global") >>>> > class(xx) >>>> [1] "PairwiseAlignmentsSingleSubject" >>>> attr(,"package") >>>> [1] "Biostrings" >>>> > class?PairwiseAlignmentsSingleSubject >>>> > aligned(xx) >>>> A DNAStringSet instance of length 1 >>>> width seq >>>> [1] 6 --GG-- >>>> >>>> Martin >>>> >>>>> >>>>> My question seems related to a previous post: >>>>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043577.html >>>>> >>>>> Thank you in advance. >>>>> >>>>> Best wishes, >>>>> Rob >>>>> >>>> >>>> >> >> >

ADD REPLY • link 10.9 years ago Robert K. Bradley ▴ 40