Entering edit mode
Hi Marcin,
On 09/27/2013 07:48 AM, Marcin Imielinski wrote:
> Sorry, forgot to cc you. Thanks in advance!
>
> ---------- Forwarded message ----------
> From: *Marcin Imielinski* <marcin at="" broad.mit.edu=""> <mailto:marcin at="" broad.mit.edu="">>
> Date: Fri, Sep 27, 2013 at 10:40 AM
> Subject: BioStrings PairwiseAlignment deletion() insertion()
function
> To: bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
>
>
> Hi -
>
> I'm confused about the output of deletion(pa) and insertion(pa)
> functions for pa = pairwiseAlignment(). My understanding is that
they
> should output IRanges corresponding to gaps in the pattern
(deletion())
> and gaps in the subject (insertion()) in terms of alignment
coordinates.
>
> However, it appears that the outputted ranges can overlap. For
example,
> the alignment (below) of a 101 letter pattern and 404 letter
subject.
> The deletion ranges overlap. What sequence positions do these
ranges
> refer to (pattern, subject, or alignment)?
>
> Is this is a bug or am I misinterpreting this output?
It's a questionable design choice but the ranges describing the
deletions are reported with respect to the "original" (aka unaligned)
pattern, not to the aligned pattern. For example the 1st range you
see in 'deletion(pa)' means there is a deletion of 3 letters (when
going from subject to pattern) and that this deletion starts at
position
4 in the original pattern. With such conventions, the *ranges*
describing the deletions can overlap but the deletions don't, which
I must admit is very confusing.
As you noticed, those ranges can be shifted to make them refer to the
aligned pattern:
> offset <- c(0L, cumsum(head(width(deletion(pa)[[1]]), n=-1)))
> shift(deletion(pa)[[1]], offset)
IRanges of length 9
start end width
[1] 4 6 3
[2] 21 39 19
[3] 49 76 28
[4] 84 128 45
[5] 135 172 38
[6] 183 186 4
[7] 195 204 10
[8] 212 215 4
[9] 245 284 40
Hope this helps,
H.
>
> Thanks
> Marcin
>
> ##############################################################
>
> > str[1]
> [1]
> "TCCTTGCACATTGATAAGTTCACATCTGAAATTTGCATGACATAAACATACAGTTGAGAAGGAGAGA
ACGTATGCCCTATGGTAAATATTGACATTTTAAA"
> > str[2]
> [1]
> "CTGGGCTTTCGATGAAATAGTTCATTTATCTGTGGGTAGATATTACTTACTGGTTGAGTTAAACTGG
GTTAAACATCAATTCTATTTCCATTTTTCATTTTTATAAATAGGTACTGAGAATCTTTGTTCATATAAAT
AGATGGATAGGATTAGCCACTTCTTTGAATTTCTTTTTCAAGTTTCATGCCAAGATTCACATCATAACAC
ATGTAACTGCATGTCTGGATGGAGAACAGATGTACCTATGCAGCGGCAGGGACATCAACACTCTCACTGA
TGAATTGGCCGAGGAATGAGGAATAGCACAAATCAGCTACGGAACATTGACAAACTGGGAGCTAAACTTT
GCTTCATGCCTGTGAGGCAGTATTTTGATGAGCGGTGGATGCCCAGTGCTTCCTTGT"
> >
> > pa = pairwiseAlignment(str[1], str[2])
> > deletion(pa)
> IRangesList of length 1
> [[1]]
> IRanges of length 9
> start end width
> [1] 4 6 3
> [2] 18 36 19
> [3] 27 54 28
> [4] 34 78 45
> [5] 40 77 38
> [6] 50 53 4
> [7] 58 67 10
> [8] 65 68 4
> [9] 94 133 40
>
> > insertion(pa)
> IRangesList of length 1
> [[1]]
> IRanges of length 1
> start end width
> [1] 234 235 2
>
> > nchar(pa)
> [1] 292
>
> > as.character(aligned(pattern(pa)))
> [1]
> "TCC---TTGCACATTGATAA-------------------GTTCACATC
----------------------------TGAAATT
---------------------------------------------TGCATG
--------------------------------------ACATAAACAT----ACAGTTGA----------
GAAGGAG----AGAACGTATGCCCTATGGTAAATATTGAC
----------------------------------------ATTTTAAA"
> > as.character(aligned(subject(pa)))
> [1]
> "TCCATTTTTCATTTTTATAAATAGGTACTGAGAATCTTTGTTCATATAAATAGATGGATAGGATTAG
CCACTTCTTTGAATTTCTTTTTCAAGTTTCATGCCAAGATTCACATCATAACACATGTAACTGCATGTCT
GGATGGAGAACAGATGTACCTATGCAGCGGCAGGGACATCAACACTCTCACTGATGAATTGGCCGAGGAA
TGAGGAATAGCACAAATCAGCTACGG--
AACATTGACAAACTGGGAGCTAAACTTTGCTTCATGCCTGTGAGGCAGTATTTTGAT"
>
> ### here is what I would expect deletion(pa) to output ... notice
that
> it resembles
> ### the above deletion(pa) output with a shift corresponding to
> cumsum(width)
> ### is this a bug?
>
> > as(which(strsplit(as.character(aligned(subject(pa))), '')[[1]] ==
> "-"), 'IRanges')
> IRanges of length 9
> start end width
> [1] 4 6 3
> [2] 21 39 19
> [3] 49 76 28
> [4] 84 128 45
> [5] 135 172 38
> [6] 183 186 4
> [7] 195 204 10
> [8] 212 215 4
> [9] 245 284 40
>
>
> > sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> [8] base
>
> other attached packages:
> [1] Biostrings_2.28.0 IRanges_1.18.4 BiocGenerics_0.6.0
> [4] BiocInstaller_1.10.3 multicore_0.1-7
>
> loaded via a namespace (and not attached):
> [1] stats4_3.0.0 tools_3.0.0
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319