Hi Herve,
I ran into some issues when using readGAlignmentPairs with the
ScanBamParam
'which' argument. I included an example below. I can see how changing
'which' can result in different pairings but was wondering whether the
error can be avoided and problematic pairings dropped instead.
It looks like the issue is introduced in revision 77808 (no error in
77791). I was hoping you have an idea what causes the problem,
otherwise I
can try to forward a test case.
Thanks for your help.
Leonard
> gr
GRanges with 2 ranges and 0 metadata columns:
seqnames ranges strand
<rle> <iranges> <rle>
[1] 10 [75758072, 75758133] *
[2] 10 [75802841, 75802911] *
---
seqlengths:
10
NA
>
> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
seqnames strand : ranges -- ranges
<rle> <rle> : <iranges> -- <iranges>
---
seqlengths:
1 2 3 ... GL000247.1 GL000248.1 GL000249.1
249250621 243199373 198022430 ... 36422 39786 38502
>
> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
seqnames strand : ranges -- ranges
<rle> <rle> : <iranges> -- <iranges>
[1] 10 + : [75758115, 75802896] -- [75802892, 75830482]
[2] 10 - : [75802908, 75830498] -- [75758111, 75802892]
[3] 10 - : [75802908, 75830498] -- [75802878, 75830468]
---
seqlengths:
1 2 3 ... GL000247.1 GL000248.1 GL000249.1
249250621 243199373 198022430 ... 36422 39786 38502
>
> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols =
use.mcols) :
findMateAlignment() returned an invalid 'mate' vector
In addition: Warning message:
In findMateAlignment(x) :
4 alignments with ambiguous pairing were dumped.
Use 'getDumpedAlignments()' to retrieve them from the dump
environment.
>
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41
[4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4
loaded via a namespace (and not attached):
[1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>
[[alternative HTML version deleted]]
Dear Herve and list,
I realized the error might have been caused due to the same bam record
being read in multiple times when passing multiple ranges in the which
argument.
I don't think it is necessarily obvious to users that
readGAlignments/Pairs
returns a concatenation of records that were read in for each range in
the
which argument. Instead one might expect to obtain the union of
records
that overlap any of the ranges included in which. Do you think it
would be
helpful to issue a warning in case records are read in multiple times?
Best wishes,
Leonard
On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein
<goldstel@gene.com>wrote:
> Hi Herve,
>
>
> I ran into some issues when using readGAlignmentPairs with the
> ScanBamParam 'which' argument. I included an example below. I can
see how
> changing 'which' can result in different pairings but was wondering
whether
> the error can be avoided and problematic pairings dropped instead.
>
>
> It looks like the issue is introduced in revision 77808 (no error in
> 77791). I was hoping you have an idea what causes the problem,
otherwise I
> can try to forward a test case.
>
>
> Thanks for your help.
>
>
> Leonard
>
>
> > gr
> GRanges with 2 ranges and 0 metadata columns:
> seqnames ranges strand
> <rle> <iranges> <rle>
> [1] 10 [75758072, 75758133] *
> [2] 10 [75802841, 75802911] *
> ---
> seqlengths:
> 10
> NA
> >
> > readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
> seqnames strand : ranges -- ranges
> <rle> <rle> : <iranges> -- <iranges>
> ---
> seqlengths:
> 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
> 249250621 243199373 198022430 ... 36422 39786
38502
> >
> > readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
> GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
> seqnames strand : ranges -- ranges
> <rle> <rle> : <iranges> -- <iranges>
> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482]
> [2] 10 - : [75802908, 75830498] -- [75758111, 75802892]
> [3] 10 - : [75802908, 75830498] -- [75802878, 75830468]
> ---
> seqlengths:
> 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
> 249250621 243199373 198022430 ... 36422 39786
38502
> >
> > readGAlignmentPairs(file, param = ScanBamParam(which = gr))
> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols
=
> use.mcols) :
> findMateAlignment() returned an invalid 'mate' vector
> In addition: Warning message:
> In findMateAlignment(x) :
> 4 alignments with ambiguous pairing were dumped.
> Use 'getDumpedAlignments()' to retrieve them from the dump
environment.
> >
> > sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.13.39 Biostrings_2.29.16
GenomicRanges_1.13.41
> [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
> >
>
>
[[alternative HTML version deleted]]
There seem to be at least two "gotchas" with the 'which' argument and
readGAlignmentPairs: (1) as Leonard said, there's no easy way to
handle
cases where one record overlaps multiple which arguments and (2) any
read
pairs with one end inside which and the other out will be discarded.
Excluding per-chromosome iteration, 'which' is a bit dangerous when
dealing
with read pairs. Somehow we should make this obvious to the user.
On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
goldstein.leonard@gene.com> wrote:
> Dear Herve and list,
>
>
> I realized the error might have been caused due to the same bam
record
> being read in multiple times when passing multiple ranges in the
which
> argument.
>
>
> I don't think it is necessarily obvious to users that
readGAlignments/Pairs
> returns a concatenation of records that were read in for each range
in the
> which argument. Instead one might expect to obtain the union of
records
> that overlap any of the ranges included in which. Do you think it
would be
> helpful to issue a warning in case records are read in multiple
times?
>
>
> Best wishes,
>
>
> Leonard
>
>
>
> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein
<goldstel@gene.com> >wrote:
>
> > Hi Herve,
> >
> >
> > I ran into some issues when using readGAlignmentPairs with the
> > ScanBamParam 'which' argument. I included an example below. I can
see how
> > changing 'which' can result in different pairings but was
wondering
> whether
> > the error can be avoided and problematic pairings dropped instead.
> >
> >
> > It looks like the issue is introduced in revision 77808 (no error
in
> > 77791). I was hoping you have an idea what causes the problem,
otherwise
> I
> > can try to forward a test case.
> >
> >
> > Thanks for your help.
> >
> >
> > Leonard
> >
> >
> > > gr
> > GRanges with 2 ranges and 0 metadata columns:
> > seqnames ranges strand
> > <rle> <iranges> <rle>
> > [1] 10 [75758072, 75758133] *
> > [2] 10 [75802841, 75802911] *
> > ---
> > seqlengths:
> > 10
> > NA
> > >
> > > readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
> > GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
> > seqnames strand : ranges -- ranges
> > <rle> <rle> : <iranges> -- <iranges>
> > ---
> > seqlengths:
> > 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
> > 249250621 243199373 198022430 ... 36422 39786
38502
> > >
> > > readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
> > GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
> > seqnames strand : ranges -- ranges
> > <rle> <rle> : <iranges> -- <iranges>
> > [1] 10 + : [75758115, 75802896] -- [75802892, 75830482]
> > [2] 10 - : [75802908, 75830498] -- [75758111, 75802892]
> > [3] 10 - : [75802908, 75830498] -- [75802878, 75830468]
> > ---
> > seqlengths:
> > 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
> > 249250621 243199373 198022430 ... 36422 39786
38502
> > >
> > > readGAlignmentPairs(file, param = ScanBamParam(which = gr))
> > Error in makeGAlignmentPairs(galn, use.names = use.names,
use.mcols =
> > use.mcols) :
> > findMateAlignment() returned an invalid 'mate' vector
> > In addition: Warning message:
> > In findMateAlignment(x) :
> > 4 alignments with ambiguous pairing were dumped.
> > Use 'getDumpedAlignments()' to retrieve them from the dump
> environment.
> > >
> > > sessionInfo()
> > R version 3.0.0 (2013-04-03)
> > Platform: x86_64-unknown-linux-gnu (64-bit)
> >
> > locale:
> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> > [7] LC_PAPER=C LC_NAME=C
> > [9] LC_ADDRESS=C LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] parallel stats graphics grDevices utils datasets
methods
> > [8] base
> >
> > other attached packages:
> > [1] Rsamtools_1.13.39 Biostrings_2.29.16
GenomicRanges_1.13.41
> > [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4
> >
> > loaded via a namespace (and not attached):
> > [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
> > >
> >
> >
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
Hi guys,
Sorry for the late answer.
On 09/11/2013 11:51 AM, Michael Lawrence wrote:
> There seem to be at least two "gotchas" with the 'which' argument
and
> readGAlignmentPairs: (1) as Leonard said, there's no easy way to
handle
> cases where one record overlaps multiple which arguments
This is not just with readGAlignmentPairs, it's also with
readGAlignments and, at a lower level, with scanBam(), which
the 2 formers are based on. If a record overlaps 2 ranges in the
'which' arg, scanBam() loads it twice.
I can't think of an easy way to address this unless we change the
semantic of the 'which' argument. For example instead of loading
records that overlap we could make the choice to only load records
that have their POS (i.e. start) within the ranges specified thru
'which'. It introduces a dissymmetry (because it looks at the start
of the record and not at its end) but it has the advantage of making
sure records are only loaded once (unless the user passes a 'which'
argument that contains overlapping ranges but then s/he's really
looking for it ;-) ).
> and (2) any read
> pairs with one end inside which and the other out will be discarded.
No easy work around for that one with the current implementation of
readGAlignmentPairs which just passes its 'which' argument down to
scanBam().
> Excluding per-chromosome iteration, 'which' is a bit dangerous when
dealing
> with read pairs.
It's also dangerous when dealing with single end reads, and it has
been for a while.
Cheers,
H.
> Somehow we should make this obvious to the user.
>
>
> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
> goldstein.leonard at gene.com> wrote:
>
>> Dear Herve and list,
>>
>>
>> I realized the error might have been caused due to the same bam
record
>> being read in multiple times when passing multiple ranges in the
which
>> argument.
>>
>>
>> I don't think it is necessarily obvious to users that
readGAlignments/Pairs
>> returns a concatenation of records that were read in for each range
in the
>> which argument. Instead one might expect to obtain the union of
records
>> that overlap any of the ranges included in which. Do you think it
would be
>> helpful to issue a warning in case records are read in multiple
times?
>>
>>
>> Best wishes,
>>
>>
>> Leonard
>>
>>
>>
>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel at="" gene.com="">>> wrote:
>>
>>> Hi Herve,
>>>
>>>
>>> I ran into some issues when using readGAlignmentPairs with the
>>> ScanBamParam 'which' argument. I included an example below. I can
see how
>>> changing 'which' can result in different pairings but was
wondering
>> whether
>>> the error can be avoided and problematic pairings dropped instead.
>>>
>>>
>>> It looks like the issue is introduced in revision 77808 (no error
in
>>> 77791). I was hoping you have an idea what causes the problem,
otherwise
>> I
>>> can try to forward a test case.
>>>
>>>
>>> Thanks for your help.
>>>
>>>
>>> Leonard
>>>
>>>
>>>> gr
>>> GRanges with 2 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <rle> <iranges> <rle>
>>> [1] 10 [75758072, 75758133] *
>>> [2] 10 [75802841, 75802911] *
>>> ---
>>> seqlengths:
>>> 10
>>> NA
>>>>
>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>> seqnames strand : ranges -- ranges
>>> <rle> <rle> : <iranges> -- <iranges>
>>> ---
>>> seqlengths:
>>> 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
>>> 249250621 243199373 198022430 ... 36422 39786
38502
>>>>
>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
>>> seqnames strand : ranges --
ranges
>>> <rle> <rle> : <iranges> --
<iranges>
>>> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482]
>>> [2] 10 - : [75802908, 75830498] -- [75758111, 75802892]
>>> [3] 10 - : [75802908, 75830498] -- [75802878, 75830468]
>>> ---
>>> seqlengths:
>>> 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
>>> 249250621 243199373 198022430 ... 36422 39786
38502
>>>>
>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
>>> Error in makeGAlignmentPairs(galn, use.names = use.names,
use.mcols =
>>> use.mcols) :
>>> findMateAlignment() returned an invalid 'mate' vector
>>> In addition: Warning message:
>>> In findMateAlignment(x) :
>>> 4 alignments with ambiguous pairing were dumped.
>>> Use 'getDumpedAlignments()' to retrieve them from the dump
>> environment.
>>>>
>>>> sessionInfo()
>>> R version 3.0.0 (2013-04-03)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets
methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16
GenomicRanges_1.13.41
>>> [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>>>>
>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Yes, I knew there weren't any easy solutions. I just think that
somehow we
need to point these out to users, because it's not exactly obvious.
On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages@fhcrc.org>
wrote:
> Hi guys,
>
> Sorry for the late answer.
>
>
> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>
>> There seem to be at least two "gotchas" with the 'which' argument
and
>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to
handle
>> cases where one record overlaps multiple which arguments
>>
>
> This is not just with readGAlignmentPairs, it's also with
> readGAlignments and, at a lower level, with scanBam(), which
> the 2 formers are based on. If a record overlaps 2 ranges in the
> 'which' arg, scanBam() loads it twice.
>
> I can't think of an easy way to address this unless we change the
> semantic of the 'which' argument. For example instead of loading
> records that overlap we could make the choice to only load records
> that have their POS (i.e. start) within the ranges specified thru
> 'which'. It introduces a dissymmetry (because it looks at the start
> of the record and not at its end) but it has the advantage of making
> sure records are only loaded once (unless the user passes a 'which'
> argument that contains overlapping ranges but then s/he's really
> looking for it ;-) ).
>
>
> and (2) any read
>> pairs with one end inside which and the other out will be
discarded.
>>
>
> No easy work around for that one with the current implementation of
> readGAlignmentPairs which just passes its 'which' argument down to
> scanBam().
>
>
> Excluding per-chromosome iteration, 'which' is a bit dangerous when
>> dealing
>> with read pairs.
>>
>
> It's also dangerous when dealing with single end reads, and it has
> been for a while.
>
> Cheers,
> H.
>
>
> Somehow we should make this obvious to the user.
>>
>>
>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
>> goldstein.leonard@gene.com> wrote:
>>
>> Dear Herve and list,
>>>
>>>
>>> I realized the error might have been caused due to the same bam
record
>>> being read in multiple times when passing multiple ranges in the
which
>>> argument.
>>>
>>>
>>> I don't think it is necessarily obvious to users that
>>> readGAlignments/Pairs
>>> returns a concatenation of records that were read in for each
range in
>>> the
>>> which argument. Instead one might expect to obtain the union of
records
>>> that overlap any of the ranges included in which. Do you think it
would
>>> be
>>> helpful to issue a warning in case records are read in multiple
times?
>>>
>>>
>>> Best wishes,
>>>
>>>
>>> Leonard
>>>
>>>
>>>
>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein
<goldstel@gene.com>>>
>>>> wrote:
>>>>
>>>
>>> Hi Herve,
>>>>
>>>>
>>>> I ran into some issues when using readGAlignmentPairs with the
>>>> ScanBamParam 'which' argument. I included an example below. I can
see
>>>> how
>>>> changing 'which' can result in different pairings but was
wondering
>>>>
>>> whether
>>>
>>>> the error can be avoided and problematic pairings dropped
instead.
>>>>
>>>>
>>>> It looks like the issue is introduced in revision 77808 (no error
in
>>>> 77791). I was hoping you have an idea what causes the problem,
otherwise
>>>>
>>> I
>>>
>>>> can try to forward a test case.
>>>>
>>>>
>>>> Thanks for your help.
>>>>
>>>>
>>>> Leonard
>>>>
>>>>
>>>> gr
>>>>>
>>>> GRanges with 2 ranges and 0 metadata columns:
>>>> seqnames ranges strand
>>>> <rle> <iranges> <rle>
>>>> [1] 10 [75758072, 75758133] *
>>>> [2] 10 [75802841, 75802911] *
>>>> ---
>>>> seqlengths:
>>>> 10
>>>> NA
>>>>
>>>>>
>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
>>>>>
>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>> seqnames strand : ranges -- ranges
>>>> <rle> <rle> : <iranges> -- <iranges>
>>>> ---
>>>> seqlengths:
>>>> 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
>>>> 249250621 243199373 198022430 ... 36422 39786
38502
>>>>
>>>>>
>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
>>>>>
>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
>>>> seqnames strand : ranges --
ranges
>>>> <rle> <rle> : <iranges> --
<iranges>
>>>> [1] 10 + : [75758115, 75802896] -- [75802892,
75830482]
>>>> [2] 10 - : [75802908, 75830498] -- [75758111,
75802892]
>>>> [3] 10 - : [75802908, 75830498] -- [75802878,
75830468]
>>>> ---
>>>> seqlengths:
>>>> 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
>>>> 249250621 243199373 198022430 ... 36422 39786
38502
>>>>
>>>>>
>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
>>>>>
>>>> Error in makeGAlignmentPairs(galn, use.names = use.names,
use.mcols =
>>>> use.mcols) :
>>>> findMateAlignment() returned an invalid 'mate' vector
>>>> In addition: Warning message:
>>>> In findMateAlignment(x) :
>>>> 4 alignments with ambiguous pairing were dumped.
>>>> Use 'getDumpedAlignments()' to retrieve them from the dump
>>>>
>>> environment.
>>>
>>>>
>>>>> sessionInfo()
>>>>>
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=C LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel stats graphics grDevices utils datasets
methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16
GenomicRanges_1.13.41
>>>> [4] XVector_0.1.1 IRanges_1.19.31
BiocGenerics_0.7.4
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>>>>
>>>>>
>>>>>
>>>>
>>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor@r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor="">
>>> Search the archives:
>>> http://news.gmane.org/gmane.**science.biology.informatics.**conduc
tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>>>
>>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor="">
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
[[alternative HTML version deleted]]
The new pairing algorithm used by readGAlignmentsList() does find
mates
outside the specified range. When only one of the pairs overlaps the
'which', it looks for a potential mate in the rest of the file.
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what="flag",
which=GRanges("seq1", IRanges(1, width=40)))
readGAlignmentPairs() finds no mates in this region.
gapairs <- readGAlignmentPairs(BamFile(fl), param=param)
> gapairs
GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
seqnames strand : ranges -- ranges
<rle> <rle> : <iranges> -- <iranges>
readGAlignmentsList() finds two mate pairs and several non-mates in
this
region. The 'mates' metadata column indicates which are legitimate
mates
as per the algorithem described at ?readGAlignmentsListFromBam. More
control over what records are returned can be done by specifying flags
in ScanBamParam.
galist <- readGAlignmentsList(BamFile(fl, asMates=TRUE), param=param)
> galist
GAlignmentsList of length 17:
[[1]]
GAlignments with 2 alignments and 2 metadata columns:
seqnames strand cigar qwidth start end width ngap | mates flag
[1] seq1 - 35M 35 229 263 35 0 | 1 83
[2] seq1 + 35M 35 37 71 35 0 | 1 163
[[2]]
GAlignments with 2 alignments and 2 metadata columns:
seqnames strand cigar qwidth start end width ngap | mates flag
[1] seq1 - 35M 35 185 219 35 0 | 1 147
[2] seq1 + 35M 35 36 70 35 0 | 1 99
[[3]]
GAlignments with 1 alignment and 2 metadata columns:
seqnames strand cigar qwidth start end width ngap | mates flag
[1] seq1 + 36M 36 6 41 36 0 | 0 137
> elementLengths(galist)
[1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
The first two list elements contain mates.
> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
FALSE
[13] FALSE FALSE FALSE FALSE FALSE
Here are descriptions of the first few flags as per this site:
http://picard.sourceforge.net/explain-flags.html
Flag 137 indicates an unmapped mate which is why this record was not
mated.
83:
Summary:
read paired
read mapped in proper pair
read reverse strand
first in pair
163:
Summary:
read paired
read mapped in proper pair
mate reverse strand
second in pair
147:
Summary:
read paired
read mapped in proper pair
read reverse strand
second in pair
99:
Summary:
read paired
read mapped in proper pair
mate reverse strand
first in pair
137:
Summary:
read paired
mate unmapped
second in pair
Valerie
On 09/17/2013 01:55 PM, Michael Lawrence wrote:
> Yes, I knew there weren't any easy solutions. I just think that
somehow we
> need to point these out to users, because it's not exactly obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages at="" fhcrc.org="">
wrote:
>
>> Hi guys,
>>
>> Sorry for the late answer.
>>
>>
>> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>>
>>> There seem to be at least two "gotchas" with the 'which' argument
and
>>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to
handle
>>> cases where one record overlaps multiple which arguments
>>>
>>
>> This is not just with readGAlignmentPairs, it's also with
>> readGAlignments and, at a lower level, with scanBam(), which
>> the 2 formers are based on. If a record overlaps 2 ranges in the
>> 'which' arg, scanBam() loads it twice.
>>
>> I can't think of an easy way to address this unless we change the
>> semantic of the 'which' argument. For example instead of loading
>> records that overlap we could make the choice to only load records
>> that have their POS (i.e. start) within the ranges specified thru
>> 'which'. It introduces a dissymmetry (because it looks at the start
>> of the record and not at its end) but it has the advantage of
making
>> sure records are only loaded once (unless the user passes a 'which'
>> argument that contains overlapping ranges but then s/he's really
>> looking for it ;-) ).
>>
>>
>> and (2) any read
>>> pairs with one end inside which and the other out will be
discarded.
>>>
>>
>> No easy work around for that one with the current implementation of
>> readGAlignmentPairs which just passes its 'which' argument down to
>> scanBam().
>>
>>
>> Excluding per-chromosome iteration, 'which' is a bit dangerous
when
>>> dealing
>>> with read pairs.
>>>
>>
>> It's also dangerous when dealing with single end reads, and it has
>> been for a while.
>>
>> Cheers,
>> H.
>>
>>
>> Somehow we should make this obvious to the user.
>>>
>>>
>>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
>>> goldstein.leonard at gene.com> wrote:
>>>
>>> Dear Herve and list,
>>>>
>>>>
>>>> I realized the error might have been caused due to the same bam
record
>>>> being read in multiple times when passing multiple ranges in the
which
>>>> argument.
>>>>
>>>>
>>>> I don't think it is necessarily obvious to users that
>>>> readGAlignments/Pairs
>>>> returns a concatenation of records that were read in for each
range in
>>>> the
>>>> which argument. Instead one might expect to obtain the union of
records
>>>> that overlap any of the ranges included in which. Do you think it
would
>>>> be
>>>> helpful to issue a warning in case records are read in multiple
times?
>>>>
>>>>
>>>> Best wishes,
>>>>
>>>>
>>>> Leonard
>>>>
>>>>
>>>>
>>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel at="" gene.com="">>>>
>>>>> wrote:
>>>>>
>>>>
>>>> Hi Herve,
>>>>>
>>>>>
>>>>> I ran into some issues when using readGAlignmentPairs with the
>>>>> ScanBamParam 'which' argument. I included an example below. I
can see
>>>>> how
>>>>> changing 'which' can result in different pairings but was
wondering
>>>>>
>>>> whether
>>>>
>>>>> the error can be avoided and problematic pairings dropped
instead.
>>>>>
>>>>>
>>>>> It looks like the issue is introduced in revision 77808 (no
error in
>>>>> 77791). I was hoping you have an idea what causes the problem,
otherwise
>>>>>
>>>> I
>>>>
>>>>> can try to forward a test case.
>>>>>
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>>
>>>>> Leonard
>>>>>
>>>>>
>>>>> gr
>>>>>>
>>>>> GRanges with 2 ranges and 0 metadata columns:
>>>>> seqnames ranges strand
>>>>> <rle> <iranges> <rle>
>>>>> [1] 10 [75758072, 75758133] *
>>>>> [2] 10 [75802841, 75802911] *
>>>>> ---
>>>>> seqlengths:
>>>>> 10
>>>>> NA
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
>>>>>>
>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>>>>> seqnames strand : ranges -- ranges
>>>>> <rle> <rle> : <iranges> -- <iranges>
>>>>> ---
>>>>> seqlengths:
>>>>> 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
>>>>> 249250621 243199373 198022430 ... 36422 39786
38502
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
>>>>>>
>>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns:
>>>>> seqnames strand : ranges --
ranges
>>>>> <rle> <rle> : <iranges> --
<iranges>
>>>>> [1] 10 + : [75758115, 75802896] -- [75802892,
75830482]
>>>>> [2] 10 - : [75802908, 75830498] -- [75758111,
75802892]
>>>>> [3] 10 - : [75802908, 75830498] -- [75802878,
75830468]
>>>>> ---
>>>>> seqlengths:
>>>>> 1 2 3 ... GL000247.1 GL000248.1
GL000249.1
>>>>> 249250621 243199373 198022430 ... 36422 39786
38502
>>>>>
>>>>>>
>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
>>>>>>
>>>>> Error in makeGAlignmentPairs(galn, use.names = use.names,
use.mcols =
>>>>> use.mcols) :
>>>>> findMateAlignment() returned an invalid 'mate' vector
>>>>> In addition: Warning message:
>>>>> In findMateAlignment(x) :
>>>>> 4 alignments with ambiguous pairing were dumped.
>>>>> Use 'getDumpedAlignments()' to retrieve them from the dump
>>>>>
>>>> environment.
>>>>
>>>>>
>>>>>> sessionInfo()
>>>>>>
>>>>> R version 3.0.0 (2013-04-03)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] parallel stats graphics grDevices utils datasets
methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16
GenomicRanges_1.13.41
>>>>> [4] XVector_0.1.1 IRanges_1.19.31
BiocGenerics_0.7.4
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor="">
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.**science.biology.informatics.**condu
ctor<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>>>>
>>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor="">
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor="">
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>
> [[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
I'm a big fan of the new paired-end streaming. It's the way to go over
which-based partitioning. Nice job!
On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
<vobencha@fhcrc.org>wrote:
> The new pairing algorithm used by readGAlignmentsList() does find
mates
> outside the specified range. When only one of the pairs overlaps the
> 'which', it looks for a potential mate in the rest of the file.
>
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(what="flag",
> which=GRanges("seq1", IRanges(1, width=40)))
>
> readGAlignmentPairs() finds no mates in this region.
>
> gapairs <- readGAlignmentPairs(BamFile(**fl), param=param)
> > gapairs
>
> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
> seqnames strand : ranges -- ranges
> <rle> <rle> : <iranges> -- <iranges>
>
> readGAlignmentsList() finds two mate pairs and several non-mates in
this
> region. The 'mates' metadata column indicates which are legitimate
mates as
> per the algorithem described at ?readGAlignmentsListFromBam. More
control
> over what records are returned can be done by specifying flags in
> ScanBamParam.
>
> galist <- readGAlignmentsList(BamFile(**fl, asMates=TRUE),
param=param)
> > galist
> GAlignmentsList of length 17:
> [[1]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap | mates flag
> [1] seq1 - 35M 35 229 263 35 0 | 1 83
> [2] seq1 + 35M 35 37 71 35 0 | 1 163
>
> [[2]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap | mates flag
> [1] seq1 - 35M 35 185 219 35 0 | 1 147
> [2] seq1 + 35M 35 36 70 35 0 | 1 99
>
> [[3]]
> GAlignments with 1 alignment and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap | mates flag
> [1] seq1 + 36M 36 6 41 36 0 | 0 137
>
>
> > elementLengths(galist)
> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> The first two list elements contain mates.
>
>> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>>
> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
> FALSE
> [13] FALSE FALSE FALSE FALSE FALSE
>
>
> Here are descriptions of the first few flags as per this site:
> http://picard.sourceforge.net/**explain-
flags.html<http: picard.sourceforge.net="" explain-flags.html="">
> Flag 137 indicates an unmapped mate which is why this record was not
mated.
>
>
> 83:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> first in pair
>
> 163:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> second in pair
>
> 147:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> second in pair
>
> 99:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> first in pair
>
> 137:
> Summary:
> read paired
> mate unmapped
> second in pair
>
>
> Valerie
>
>
>
> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>
>> Yes, I knew there weren't any easy solutions. I just think that
somehow we
>> need to point these out to users, because it's not exactly obvious.
>>
>>
>> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages@fhcrc.org>
wrote:
>>
>> Hi guys,
>>>
>>> Sorry for the late answer.
>>>
>>>
>>> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>>>
>>> There seem to be at least two "gotchas" with the 'which' argument
and
>>>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to
handle
>>>> cases where one record overlaps multiple which arguments
>>>>
>>>>
>>> This is not just with readGAlignmentPairs, it's also with
>>> readGAlignments and, at a lower level, with scanBam(), which
>>> the 2 formers are based on. If a record overlaps 2 ranges in the
>>> 'which' arg, scanBam() loads it twice.
>>>
>>> I can't think of an easy way to address this unless we change the
>>> semantic of the 'which' argument. For example instead of loading
>>> records that overlap we could make the choice to only load records
>>> that have their POS (i.e. start) within the ranges specified thru
>>> 'which'. It introduces a dissymmetry (because it looks at the
start
>>> of the record and not at its end) but it has the advantage of
making
>>> sure records are only loaded once (unless the user passes a
'which'
>>> argument that contains overlapping ranges but then s/he's really
>>> looking for it ;-) ).
>>>
>>>
>>> and (2) any read
>>>
>>>> pairs with one end inside which and the other out will be
discarded.
>>>>
>>>>
>>> No easy work around for that one with the current implementation
of
>>> readGAlignmentPairs which just passes its 'which' argument down to
>>> scanBam().
>>>
>>>
>>> Excluding per-chromosome iteration, 'which' is a bit dangerous
when
>>>
>>>> dealing
>>>> with read pairs.
>>>>
>>>>
>>> It's also dangerous when dealing with single end reads, and it has
>>> been for a while.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> Somehow we should make this obvious to the user.
>>>
>>>>
>>>>
>>>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
>>>> goldstein.leonard@gene.com> wrote:
>>>>
>>>> Dear Herve and list,
>>>>
>>>>>
>>>>>
>>>>> I realized the error might have been caused due to the same bam
record
>>>>> being read in multiple times when passing multiple ranges in the
which
>>>>> argument.
>>>>>
>>>>>
>>>>> I don't think it is necessarily obvious to users that
>>>>> readGAlignments/Pairs
>>>>> returns a concatenation of records that were read in for each
range in
>>>>> the
>>>>> which argument. Instead one might expect to obtain the union of
records
>>>>> that overlap any of the ranges included in which. Do you think
it would
>>>>> be
>>>>> helpful to issue a warning in case records are read in multiple
times?
>>>>>
>>>>>
>>>>> Best wishes,
>>>>>
>>>>>
>>>>> Leonard
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein
<goldstel@gene.com>>>>>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>> Hi Herve,
>>>>>
>>>>>>
>>>>>>
>>>>>> I ran into some issues when using readGAlignmentPairs with the
>>>>>> ScanBamParam 'which' argument. I included an example below. I
can see
>>>>>> how
>>>>>> changing 'which' can result in different pairings but was
wondering
>>>>>>
>>>>>> whether
>>>>>
>>>>> the error can be avoided and problematic pairings dropped
instead.
>>>>>>
>>>>>>
>>>>>> It looks like the issue is introduced in revision 77808 (no
error in
>>>>>> 77791). I was hoping you have an idea what causes the problem,
>>>>>> otherwise
>>>>>>
>>>>>> I
>>>>>
>>>>> can try to forward a test case.
>>>>>>
>>>>>>
>>>>>> Thanks for your help.
>>>>>>
>>>>>>
>>>>>> Leonard
>>>>>>
>>>>>>
>>>>>> gr
>>>>>>
>>>>>>>
>>>>>>> GRanges with 2 ranges and 0 metadata columns:
>>>>>> seqnames ranges strand
>>>>>> <rle> <iranges> <rle>
>>>>>> [1] 10 [75758072, 75758133] *
>>>>>> [2] 10 [75802841, 75802911] *
>>>>>> ---
>>>>>> seqlengths:
>>>>>> 10
>>>>>> NA
>>>>>>
>>>>>>
>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1]))
>>>>>>>
>>>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata
columns:
>>>>>> seqnames strand : ranges -- ranges
>>>>>> <rle> <rle> : <iranges> -- <iranges>
>>>>>> ---
>>>>>> seqlengths:
>>>>>> 1 2 3 ... GL000247.1 GL000248.1
>>>>>> GL000249.1
>>>>>> 249250621 243199373 198022430 ... 36422 39786
>>>>>> 38502
>>>>>>
>>>>>>
>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2]))
>>>>>>>
>>>>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata
columns:
>>>>>> seqnames strand : ranges --
ranges
>>>>>> <rle> <rle> : <iranges> --
<iranges>
>>>>>> [1] 10 + : [75758115, 75802896] -- [75802892,
75830482]
>>>>>> [2] 10 - : [75802908, 75830498] -- [75758111,
75802892]
>>>>>> [3] 10 - : [75802908, 75830498] -- [75802878,
75830468]
>>>>>> ---
>>>>>> seqlengths:
>>>>>> 1 2 3 ... GL000247.1 GL000248.1
>>>>>> GL000249.1
>>>>>> 249250621 243199373 198022430 ... 36422 39786
>>>>>> 38502
>>>>>>
>>>>>>
>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
>>>>>>>
>>>>>>> Error in makeGAlignmentPairs(galn, use.names = use.names,
use.mcols
>>>>>> =
>>>>>> use.mcols) :
>>>>>> findMateAlignment() returned an invalid 'mate' vector
>>>>>> In addition: Warning message:
>>>>>> In findMateAlignment(x) :
>>>>>> 4 alignments with ambiguous pairing were dumped.
>>>>>> Use 'getDumpedAlignments()' to retrieve them from the
dump
>>>>>>
>>>>>> environment.
>>>>>
>>>>>
>>>>>> sessionInfo()
>>>>>>>
>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel stats graphics grDevices utils datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16
GenomicRanges_1.13.41
>>>>>> [4] XVector_0.1.1 IRanges_1.19.31
BiocGenerics_0.7.4
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________****_________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor@r-project.org
>>>>> https://stat.ethz.ch/mailman/****listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" **listinfo="" bioconductor="">
>>>>> <https: **="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>>>>> >
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.****science.biology.informatics.****
>>>>> conductor<http: news.gmane.org="" gmane.**science.biology.informat="" ics.**conductor="">
>>>>> <http: news.gmane.**org="" gmane.science.biology.**informatics.con="" ductor<http:="" news.gmane.org="" gmane.science.biology.informatics.conduct="" or="">
>>>>> >
>>>>>
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________****_________________
>>>> Bioconductor mailing list
>>>> Bioconductor@r-project.org
>>>> https://stat.ethz.ch/mailman/****listinfo/bioconductor<https: st="" at.ethz.ch="" mailman="" **listinfo="" bioconductor="">
>>>> <https: **="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" s="" tat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>>>> >
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.****conductor<http: news.gmane.**="">>>> org/gmane.science.biology.**informatics.conductor<http: news.gma="" ne.org="" gmane.science.biology.informatics.conductor="">
>>>> >
>>>>
>>>>
>>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages@fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor="">
>>
>>
>
[[alternative HTML version deleted]]
Leonard was thinking that the problem of importing the same alignment
multiple times when caught by multiple which ranges could be solved by
maintaining a hash of what reads have been seen before. This could be
done
with the read ID + pos + strand + cigar, we think, but even easier
would be
to get at the file offset. It looks like samtools keeps that hidden,
but we
could just do some copy-and-pasting of the bam_iter_t struct...
On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla@gene.com>
wrote:
> I'm a big fan of the new paired-end streaming. It's the way to go
over
> which-based partitioning. Nice job!
>
>
> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
<vobencha@fhcrc.org>wrote:
>
>> The new pairing algorithm used by readGAlignmentsList() does find
mates
>> outside the specified range. When only one of the pairs overlaps
the
>> 'which', it looks for a potential mate in the rest of the file.
>>
>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>> param <- ScanBamParam(what="flag",
>> which=GRanges("seq1", IRanges(1, width=40)))
>>
>> readGAlignmentPairs() finds no mates in this region.
>>
>> gapairs <- readGAlignmentPairs(BamFile(**fl), param=param)
>> > gapairs
>>
>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns:
>> seqnames strand : ranges -- ranges
>> <rle> <rle> : <iranges> -- <iranges>
>>
>> readGAlignmentsList() finds two mate pairs and several non-mates in
this
>> region. The 'mates' metadata column indicates which are legitimate
mates as
>> per the algorithem described at ?readGAlignmentsListFromBam. More
control
>> over what records are returned can be done by specifying flags in
>> ScanBamParam.
>>
>> galist <- readGAlignmentsList(BamFile(**fl, asMates=TRUE),
param=param)
>> > galist
>> GAlignmentsList of length 17:
>> [[1]]
>> GAlignments with 2 alignments and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap | mates
flag
>> [1] seq1 - 35M 35 229 263 35 0 | 1 83
>> [2] seq1 + 35M 35 37 71 35 0 | 1
163
>>
>> [[2]]
>> GAlignments with 2 alignments and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap | mates
flag
>> [1] seq1 - 35M 35 185 219 35 0 | 1 147
>> [2] seq1 + 35M 35 36 70 35 0 | 1
99
>>
>> [[3]]
>> GAlignments with 1 alignment and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap | mates
flag
>> [1] seq1 + 36M 36 6 41 36 0 | 0
137
>>
>>
>> > elementLengths(galist)
>> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>
>> The first two list elements contain mates.
>>
>>> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>>>
>> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
>> FALSE
>> [13] FALSE FALSE FALSE FALSE FALSE
>>
>>
>> Here are descriptions of the first few flags as per this site:
>> http://picard.sourceforge.net/**explain-
flags.html<http: picard.sourceforge.net="" explain-flags.html="">
>> Flag 137 indicates an unmapped mate which is why this record was
not
>> mated.
>>
>>
>> 83:
>> Summary:
>> read paired
>> read mapped in proper pair
>> read reverse strand
>> first in pair
>>
>> 163:
>> Summary:
>> read paired
>> read mapped in proper pair
>> mate reverse strand
>> second in pair
>>
>> 147:
>> Summary:
>> read paired
>> read mapped in proper pair
>> read reverse strand
>> second in pair
>>
>> 99:
>> Summary:
>> read paired
>> read mapped in proper pair
>> mate reverse strand
>> first in pair
>>
>> 137:
>> Summary:
>> read paired
>> mate unmapped
>> second in pair
>>
>>
>> Valerie
>>
>>
>>
>> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>>
>>> Yes, I knew there weren't any easy solutions. I just think that
somehow
>>> we
>>> need to point these out to users, because it's not exactly
obvious.
>>>
>>>
>>> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages@fhcrc.org>
wrote:
>>>
>>> Hi guys,
>>>>
>>>> Sorry for the late answer.
>>>>
>>>>
>>>> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>>>>
>>>> There seem to be at least two "gotchas" with the 'which'
argument and
>>>>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to
handle
>>>>> cases where one record overlaps multiple which arguments
>>>>>
>>>>>
>>>> This is not just with readGAlignmentPairs, it's also with
>>>> readGAlignments and, at a lower level, with scanBam(), which
>>>> the 2 formers are based on. If a record overlaps 2 ranges in the
>>>> 'which' arg, scanBam() loads it twice.
>>>>
>>>> I can't think of an easy way to address this unless we change the
>>>> semantic of the 'which' argument. For example instead of loading
>>>> records that overlap we could make the choice to only load
records
>>>> that have their POS (i.e. start) within the ranges specified thru
>>>> 'which'. It introduces a dissymmetry (because it looks at the
start
>>>> of the record and not at its end) but it has the advantage of
making
>>>> sure records are only loaded once (unless the user passes a
'which'
>>>> argument that contains overlapping ranges but then s/he's really
>>>> looking for it ;-) ).
>>>>
>>>>
>>>> and (2) any read
>>>>
>>>>> pairs with one end inside which and the other out will be
discarded.
>>>>>
>>>>>
>>>> No easy work around for that one with the current implementation
of
>>>> readGAlignmentPairs which just passes its 'which' argument down
to
>>>> scanBam().
>>>>
>>>>
>>>> Excluding per-chromosome iteration, 'which' is a bit dangerous
when
>>>>
>>>>> dealing
>>>>> with read pairs.
>>>>>
>>>>>
>>>> It's also dangerous when dealing with single end reads, and it
has
>>>> been for a while.
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>>
>>>> Somehow we should make this obvious to the user.
>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein <
>>>>> goldstein.leonard@gene.com> wrote:
>>>>>
>>>>> Dear Herve and list,
>>>>>
>>>>>>
>>>>>>
>>>>>> I realized the error might have been caused due to the same bam
record
>>>>>> being read in multiple times when passing multiple ranges in
the which
>>>>>> argument.
>>>>>>
>>>>>>
>>>>>> I don't think it is necessarily obvious to users that
>>>>>> readGAlignments/Pairs
>>>>>> returns a concatenation of records that were read in for each
range in
>>>>>> the
>>>>>> which argument. Instead one might expect to obtain the union of
>>>>>> records
>>>>>> that overlap any of the ranges included in which. Do you think
it
>>>>>> would
>>>>>> be
>>>>>> helpful to issue a warning in case records are read in multiple
times?
>>>>>>
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>>
>>>>>> Leonard
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein
<goldstel@gene.com>>>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>> Hi Herve,
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I ran into some issues when using readGAlignmentPairs with the
>>>>>>> ScanBamParam 'which' argument. I included an example below. I
can see
>>>>>>> how
>>>>>>> changing 'which' can result in different pairings but was
wondering
>>>>>>>
>>>>>>> whether
>>>>>>
>>>>>> the error can be avoided and problematic pairings dropped
instead.
>>>>>>>
>>>>>>>
>>>>>>> It looks like the issue is introduced in revision 77808 (no
error in
>>>>>>> 77791). I was hoping you have an idea what causes the problem,
>>>>>>> otherwise
>>>>>>>
>>>>>>> I
>>>>>>
>>>>>> can try to forward a test case.
>>>>>>>
>>>>>>>
>>>>>>> Thanks for your help.
>>>>>>>
>>>>>>>
>>>>>>> Leonard
>>>>>>>
>>>>>>>
>>>>>>> gr
>>>>>>>
>>>>>>>>
>>>>>>>> GRanges with 2 ranges and 0 metadata columns:
>>>>>>> seqnames ranges strand
>>>>>>> <rle> <iranges> <rle>
>>>>>>> [1] 10 [75758072, 75758133] *
>>>>>>> [2] 10 [75802841, 75802911] *
>>>>>>> ---
>>>>>>> seqlengths:
>>>>>>> 10
>>>>>>> NA
>>>>>>>
>>>>>>>
>>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which =
gr[1]))
>>>>>>>>
>>>>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata
columns:
>>>>>>> seqnames strand : ranges -- ranges
>>>>>>> <rle> <rle> : <iranges> -- <iranges>
>>>>>>> ---
>>>>>>> seqlengths:
>>>>>>> 1 2 3 ... GL000247.1 GL000248.1
>>>>>>> GL000249.1
>>>>>>> 249250621 243199373 198022430 ... 36422 39786
>>>>>>> 38502
>>>>>>>
>>>>>>>
>>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which =
gr[2]))
>>>>>>>>
>>>>>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata
columns:
>>>>>>> seqnames strand : ranges --
ranges
>>>>>>> <rle> <rle> : <iranges> --
<iranges>
>>>>>>> [1] 10 + : [75758115, 75802896] -- [75802892,
75830482]
>>>>>>> [2] 10 - : [75802908, 75830498] -- [75758111,
75802892]
>>>>>>> [3] 10 - : [75802908, 75830498] -- [75802878,
75830468]
>>>>>>> ---
>>>>>>> seqlengths:
>>>>>>> 1 2 3 ... GL000247.1 GL000248.1
>>>>>>> GL000249.1
>>>>>>> 249250621 243199373 198022430 ... 36422 39786
>>>>>>> 38502
>>>>>>>
>>>>>>>
>>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr))
>>>>>>>>
>>>>>>>> Error in makeGAlignmentPairs(galn, use.names = use.names,
>>>>>>> use.mcols =
>>>>>>> use.mcols) :
>>>>>>> findMateAlignment() returned an invalid 'mate' vector
>>>>>>> In addition: Warning message:
>>>>>>> In findMateAlignment(x) :
>>>>>>> 4 alignments with ambiguous pairing were dumped.
>>>>>>> Use 'getDumpedAlignments()' to retrieve them from the
dump
>>>>>>>
>>>>>>> environment.
>>>>>>
>>>>>>
>>>>>>> sessionInfo()
>>>>>>>>
>>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] parallel stats graphics grDevices utils datasets
>>>>>>> methods
>>>>>>> [8] base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16
GenomicRanges_1.13.41
>>>>>>> [4] XVector_0.1.1 IRanges_1.19.31
BiocGenerics_0.7.4
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________****_________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor@r-project.org
>>>>>> https://stat.ethz.ch/mailman/****listinfo/bioconductor<https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">
>>>>>> <https: **="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>>>>>> >
>>>>>> Search the archives:
>>>>>>
http://news.gmane.org/gmane.****science.biology.informatics.****
>>>>>> conductor<http: news.gmane.org="" gmane.**science.biology.informa="" tics.**conductor="">
>>>>>> <http: news.gmane.**org="" gmane.science.biology.**="">>>>>> informatics.conductor<http: news.gmane.org="" gmane.science.biolo="" gy.informatics.conductor="">
>>>>>> >
>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________****_________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor@r-project.org
>>>>> https://stat.ethz.ch/mailman/****listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" **listinfo="" bioconductor="">
>>>>> <https: **="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>>>>> >
>>>>> Search the archives: http://news.gmane.org/gmane.**
>>>>> science.biology.informatics.****conductor<http: news.gmane.**="">>>>> org/gmane.science.biology.**informatics.conductor<http: news.gm="" ane.org="" gmane.science.biology.informatics.conductor="">
>>>>> >
>>>>>
>>>>>
>>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages@fhcrc.org
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319
>>>>
>>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor@r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor="">
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor="">
>>>
>>>
>>
>
[[alternative HTML version deleted]]
Yes, good idea. We were thinking along similar lines and will look
into
this over the next few weeks.
Val
On 09/17/2013 04:15 PM, Michael Lawrence wrote:
> Leonard was thinking that the problem of importing the same
alignment
> multiple times when caught by multiple which ranges could be solved
by
> maintaining a hash of what reads have been seen before. This could
be
> done with the read ID + pos + strand + cigar, we think, but even
easier
> would be to get at the file offset. It looks like samtools keeps
that
> hidden, but we could just do some copy-and-pasting of the bam_iter_t
> struct...
>
>
>
>
> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at="" gene.com=""> <mailto:michafla at="" gene.com="">> wrote:
>
> I'm a big fan of the new paired-end streaming. It's the way to
go
> over which-based partitioning. Nice job!
>
>
> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
> <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> wrote:
>
> The new pairing algorithm used by readGAlignmentsList() does
> find mates outside the specified range. When only one of the
> pairs overlaps the 'which', it looks for a potential mate in
the
> rest of the file.
>
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(what="flag",
> which=GRanges("seq1", IRanges(1,
width=40)))
>
> readGAlignmentPairs() finds no mates in this region.
>
> gapairs <- readGAlignmentPairs(BamFile(__fl), param=param)
> > gapairs
>
> GAlignmentPairs with 0 alignment pairs and 0 metadata
columns:
> seqnames strand : ranges -- ranges
> <rle> <rle> : <iranges> -- <iranges>
>
> readGAlignmentsList() finds two mate pairs and several non-
mates
> in this region. The 'mates' metadata column indicates which
are
> legitimate mates as per the algorithem described at
> ?readGAlignmentsListFromBam. More control over what records
are
> returned can be done by specifying flags in ScanBamParam.
>
> galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE),
> param=param)
> > galist
> GAlignmentsList of length 17:
> [[1]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap |
mates
> flag
> [1] seq1 - 35M 35 229 263 35
> <tel:35%20%20%20229%20263%20%20%20%2035> 0 | 1 83
> [2] seq1 + 35M 35 37 71 35 0 |
1
> 163
>
> [[2]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap |
mates
> flag
> [1] seq1 - 35M 35 185 219 35
> <tel:35%20%20%20185%20219%20%20%20%2035> 0 | 1 147
> [2] seq1 + 35M 35 36 70 35 0 |
1
> 99
>
> [[3]]
> GAlignments with 1 alignment and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap |
mates
> flag
> [1] seq1 + 36M 36 6 41 36 0 |
0
> 137
>
>
> > elementLengths(galist)
> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> The first two list elements contain mates.
>
> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>
> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE
>
>
> Here are descriptions of the first few flags as per this
site:
> http://picard.sourceforge.net/__explain-flags.html
> <http: picard.sourceforge.net="" explain-flags.html="">
> Flag 137 indicates an unmapped mate which is why this record
was
> not mated.
>
>
> 83:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> first in pair
>
> 163:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> second in pair
>
> 147:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> second in pair
>
> 99:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> first in pair
>
> 137:
> Summary:
> read paired
> mate unmapped
> second in pair
>
>
> Valerie
>
>
>
> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>
> Yes, I knew there weren't any easy solutions. I just
think
> that somehow we
> need to point these out to users, because it's not
exactly
> obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
> <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
wrote:
>
> Hi guys,
>
> Sorry for the late answer.
>
>
> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>
> There seem to be at least two "gotchas" with the
> 'which' argument and
> readGAlignmentPairs: (1) as Leonard said,
there's no
> easy way to handle
> cases where one record overlaps multiple which
arguments
>
>
> This is not just with readGAlignmentPairs, it's also
with
> readGAlignments and, at a lower level, with
scanBam(), which
> the 2 formers are based on. If a record overlaps 2
> ranges in the
> 'which' arg, scanBam() loads it twice.
>
> I can't think of an easy way to address this unless
we
> change the
> semantic of the 'which' argument. For example
instead of
> loading
> records that overlap we could make the choice to
only
> load records
> that have their POS (i.e. start) within the ranges
> specified thru
> 'which'. It introduces a dissymmetry (because it
looks
> at the start
> of the record and not at its end) but it has the
> advantage of making
> sure records are only loaded once (unless the user
> passes a 'which'
> argument that contains overlapping ranges but then
> s/he's really
> looking for it ;-) ).
>
>
> and (2) any read
>
> pairs with one end inside which and the other
out
> will be discarded.
>
>
> No easy work around for that one with the current
> implementation of
> readGAlignmentPairs which just passes its 'which'
> argument down to
> scanBam().
>
>
> Excluding per-chromosome iteration, 'which' is a
bit
> dangerous when
>
> dealing
> with read pairs.
>
>
> It's also dangerous when dealing with single end
reads,
> and it has
> been for a while.
>
> Cheers,
> H.
>
>
> Somehow we should make this obvious to the user.
>
>
>
> On Wed, Sep 11, 2013 at 11:02 AM, Leonard
Goldstein <
> goldstein.leonard at gene.com
> <mailto:goldstein.leonard at="" gene.com="">> wrote:
>
> Dear Herve and list,
>
>
>
> I realized the error might have been caused
due
> to the same bam record
> being read in multiple times when passing
> multiple ranges in the which
> argument.
>
>
> I don't think it is necessarily obvious to
users
> that
> readGAlignments/Pairs
> returns a concatenation of records that were
> read in for each range in
> the
> which argument. Instead one might expect to
> obtain the union of records
> that overlap any of the ranges included in
> which. Do you think it would
> be
> helpful to issue a warning in case records
are
> read in multiple times?
>
>
> Best wishes,
>
>
> Leonard
>
>
>
> On Tue, Sep 10, 2013 at 5:58 PM, Leonard
> Goldstein <goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com="">
>
> wrote:
>
>
> Hi Herve,
>
>
>
> I ran into some issues when using
> readGAlignmentPairs with the
> ScanBamParam 'which' argument. I
included an
> example below. I can see
> how
> changing 'which' can result in different
> pairings but was wondering
>
> whether
>
> the error can be avoided and problematic
> pairings dropped instead.
>
>
> It looks like the issue is introduced in
> revision 77808 (no error in
> 77791). I was hoping you have an idea
what
> causes the problem, otherwise
>
> I
>
> can try to forward a test case.
>
>
> Thanks for your help.
>
>
> Leonard
>
>
> gr
>
>
> GRanges with 2 ranges and 0 metadata
columns:
> seqnames ranges
strand
> <rle> <iranges>
<rle>
> [1] 10 [75758072, 75758133]
*
> [2] 10 [75802841, 75802911]
*
> ---
> seqlengths:
> 10
> NA
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr[1]))
>
> GAlignmentPairs with 0 alignment pairs
and 0
> metadata columns:
> seqnames strand : ranges --
ranges
> <rle> <rle> : <iranges> --
<iranges>
> ---
> seqlengths:
> 1 2 3 ...
> GL000247.1 GL000248.1 GL000249.1
> 249250621 243199373 198022430 ...
> 36422 39786 38502
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr[2]))
>
> GAlignmentPairs with 3 alignment pairs
and 0
> metadata columns:
> seqnames strand :
> ranges -- ranges
> <rle> <rle> :
> <iranges> -- <iranges>
> [1] 10 + : [75758115,
75802896]
> -- [75802892, 75830482]
> [2] 10 - : [75802908,
75830498]
> -- [75758111, 75802892]
> [3] 10 - : [75802908,
75830498]
> -- [75802878, 75830468]
> ---
> seqlengths:
> 1 2 3 ...
> GL000247.1 GL000248.1 GL000249.1
> 249250621 243199373 198022430 ...
> 36422 39786 38502
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr))
>
> Error in makeGAlignmentPairs(galn,
use.names
> = use.names, use.mcols =
> use.mcols) :
> findMateAlignment() returned an
invalid
> 'mate' vector
> In addition: Warning message:
> In findMateAlignment(x) :
> 4 alignments with ambiguous
pairing
> were dumped.
> Use 'getDumpedAlignments()' to
> retrieve them from the dump
>
> environment.
>
>
> sessionInfo()
>
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu
(64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8
LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C
LC_NAME=C
> [9] LC_ADDRESS=C
> LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics
grDevices
> utils datasets methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.13.39
Biostrings_2.29.16
> GenomicRanges_1.13.41
> [4] XVector_0.1.1
IRanges_1.19.31
> BiocGenerics_0.7.4
>
> loaded via a namespace (and not
attached):
> [1] bitops_1.0-6 stats4_3.0.0
zlibbioc_1.7.0
>
>
>
>
>
> [[alternative HTML version
deleted]]
>
>
________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> <mailto:bioconductor at="" r-project.org="">
>
https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" **listinfo="" bio="" conductor=""><https: __="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>
> Search the archives:
> http://news.gmane.org/gmane.**__science.biol
ogy.informatics.**__conductor
> <http: news.gmane.org="" gmane.**science.biolo="" gy.informatics.**conductor=""><http: news.gmane.__org="" gmane.science.biol="" ogy.__informatics.conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>
>
>
> [[alternative HTML version deleted]]
>
>
________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> <mailto:bioconductor at="" r-project.org="">
>
https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" **listinfo="" biocond="" uctor=""><https: __="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>
> Search the archives:
http://news.gmane.org/gmane.**
> science.biology.informatics.**__conductor<http: news.gmane.__org="" gmane.science.biology.__informatics.conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
> [[alternative HTML version deleted]]
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
> Search the archives:
>
http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>
>
>
>
On 09/17/2013 04:15 PM, Michael Lawrence wrote:
> Leonard was thinking that the problem of importing the same
alignment
> multiple times when caught by multiple which ranges could be solved
by
> maintaining a hash of what reads have been seen before. This could
be
> done with the read ID + pos + strand + cigar, we think,
FWIW, I've seen BAM files with more than 1 record sharing the same
QNAME + POS + strand + CIGAR.
The 'Ambiguous pairing' section in the man page for
findMateAlignment()
shows such an example. It was taken from a real BAM file. IIRC the 2
records were actually indistinguible.
> but even easier
> would be to get at the file offset. It looks like samtools keeps
that
> hidden, but we could just do some copy-and-pasting of the bam_iter_t
> struct...
IIUC the common use case is processing a BAM file by chromosome
chunks.
Are you saying the record ID would be returned by scanBam or
readGAlignemnts() so the end/user has a way to discard records that
have already been processed.
Note that with a small change of semantic to the 'which' argument, you
don't need to return a record ID and the end-user needs to do nothing:
the same record cannot be loaded twice.
H.
>
>
>
>
> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at="" gene.com=""> <mailto:michafla at="" gene.com="">> wrote:
>
> I'm a big fan of the new paired-end streaming. It's the way to
go
> over which-based partitioning. Nice job!
>
>
> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
> <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> wrote:
>
> The new pairing algorithm used by readGAlignmentsList() does
> find mates outside the specified range. When only one of the
> pairs overlaps the 'which', it looks for a potential mate in
the
> rest of the file.
>
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(what="flag",
> which=GRanges("seq1", IRanges(1,
width=40)))
>
> readGAlignmentPairs() finds no mates in this region.
>
> gapairs <- readGAlignmentPairs(BamFile(__fl), param=param)
> > gapairs
>
> GAlignmentPairs with 0 alignment pairs and 0 metadata
columns:
> seqnames strand : ranges -- ranges
> <rle> <rle> : <iranges> -- <iranges>
>
> readGAlignmentsList() finds two mate pairs and several non-
mates
> in this region. The 'mates' metadata column indicates which
are
> legitimate mates as per the algorithem described at
> ?readGAlignmentsListFromBam. More control over what records
are
> returned can be done by specifying flags in ScanBamParam.
>
> galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE),
> param=param)
> > galist
> GAlignmentsList of length 17:
> [[1]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap |
mates
> flag
> [1] seq1 - 35M 35 229 263 35
> <tel:35%20%20%20229%20263%20%20%20%2035> 0 | 1 83
> [2] seq1 + 35M 35 37 71 35 0 |
1
> 163
>
> [[2]]
> GAlignments with 2 alignments and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap |
mates
> flag
> [1] seq1 - 35M 35 185 219 35
> <tel:35%20%20%20185%20219%20%20%20%2035> 0 | 1 147
> [2] seq1 + 35M 35 36 70 35 0 |
1
> 99
>
> [[3]]
> GAlignments with 1 alignment and 2 metadata columns:
> seqnames strand cigar qwidth start end width ngap |
mates
> flag
> [1] seq1 + 36M 36 6 41 36 0 |
0
> 137
>
>
> > elementLengths(galist)
> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> The first two list elements contain mates.
>
> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>
> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE
>
>
> Here are descriptions of the first few flags as per this
site:
> http://picard.sourceforge.net/__explain-flags.html
> <http: picard.sourceforge.net="" explain-flags.html="">
> Flag 137 indicates an unmapped mate which is why this record
was
> not mated.
>
>
> 83:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> first in pair
>
> 163:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> second in pair
>
> 147:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> second in pair
>
> 99:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> first in pair
>
> 137:
> Summary:
> read paired
> mate unmapped
> second in pair
>
>
> Valerie
>
>
>
> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>
> Yes, I knew there weren't any easy solutions. I just
think
> that somehow we
> need to point these out to users, because it's not
exactly
> obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
> <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
wrote:
>
> Hi guys,
>
> Sorry for the late answer.
>
>
> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>
> There seem to be at least two "gotchas" with the
> 'which' argument and
> readGAlignmentPairs: (1) as Leonard said,
there's no
> easy way to handle
> cases where one record overlaps multiple which
arguments
>
>
> This is not just with readGAlignmentPairs, it's also
with
> readGAlignments and, at a lower level, with
scanBam(), which
> the 2 formers are based on. If a record overlaps 2
> ranges in the
> 'which' arg, scanBam() loads it twice.
>
> I can't think of an easy way to address this unless
we
> change the
> semantic of the 'which' argument. For example
instead of
> loading
> records that overlap we could make the choice to
only
> load records
> that have their POS (i.e. start) within the ranges
> specified thru
> 'which'. It introduces a dissymmetry (because it
looks
> at the start
> of the record and not at its end) but it has the
> advantage of making
> sure records are only loaded once (unless the user
> passes a 'which'
> argument that contains overlapping ranges but then
> s/he's really
> looking for it ;-) ).
>
>
> and (2) any read
>
> pairs with one end inside which and the other
out
> will be discarded.
>
>
> No easy work around for that one with the current
> implementation of
> readGAlignmentPairs which just passes its 'which'
> argument down to
> scanBam().
>
>
> Excluding per-chromosome iteration, 'which' is a
bit
> dangerous when
>
> dealing
> with read pairs.
>
>
> It's also dangerous when dealing with single end
reads,
> and it has
> been for a while.
>
> Cheers,
> H.
>
>
> Somehow we should make this obvious to the user.
>
>
>
> On Wed, Sep 11, 2013 at 11:02 AM, Leonard
Goldstein <
> goldstein.leonard at gene.com
> <mailto:goldstein.leonard at="" gene.com="">> wrote:
>
> Dear Herve and list,
>
>
>
> I realized the error might have been caused
due
> to the same bam record
> being read in multiple times when passing
> multiple ranges in the which
> argument.
>
>
> I don't think it is necessarily obvious to
users
> that
> readGAlignments/Pairs
> returns a concatenation of records that were
> read in for each range in
> the
> which argument. Instead one might expect to
> obtain the union of records
> that overlap any of the ranges included in
> which. Do you think it would
> be
> helpful to issue a warning in case records
are
> read in multiple times?
>
>
> Best wishes,
>
>
> Leonard
>
>
>
> On Tue, Sep 10, 2013 at 5:58 PM, Leonard
> Goldstein <goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com="">
>
> wrote:
>
>
> Hi Herve,
>
>
>
> I ran into some issues when using
> readGAlignmentPairs with the
> ScanBamParam 'which' argument. I
included an
> example below. I can see
> how
> changing 'which' can result in different
> pairings but was wondering
>
> whether
>
> the error can be avoided and problematic
> pairings dropped instead.
>
>
> It looks like the issue is introduced in
> revision 77808 (no error in
> 77791). I was hoping you have an idea
what
> causes the problem, otherwise
>
> I
>
> can try to forward a test case.
>
>
> Thanks for your help.
>
>
> Leonard
>
>
> gr
>
>
> GRanges with 2 ranges and 0 metadata
columns:
> seqnames ranges
strand
> <rle> <iranges>
<rle>
> [1] 10 [75758072, 75758133]
*
> [2] 10 [75802841, 75802911]
*
> ---
> seqlengths:
> 10
> NA
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr[1]))
>
> GAlignmentPairs with 0 alignment pairs
and 0
> metadata columns:
> seqnames strand : ranges --
ranges
> <rle> <rle> : <iranges> --
<iranges>
> ---
> seqlengths:
> 1 2 3 ...
> GL000247.1 GL000248.1 GL000249.1
> 249250621 243199373 198022430 ...
> 36422 39786 38502
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr[2]))
>
> GAlignmentPairs with 3 alignment pairs
and 0
> metadata columns:
> seqnames strand :
> ranges -- ranges
> <rle> <rle> :
> <iranges> -- <iranges>
> [1] 10 + : [75758115,
75802896]
> -- [75802892, 75830482]
> [2] 10 - : [75802908,
75830498]
> -- [75758111, 75802892]
> [3] 10 - : [75802908,
75830498]
> -- [75802878, 75830468]
> ---
> seqlengths:
> 1 2 3 ...
> GL000247.1 GL000248.1 GL000249.1
> 249250621 243199373 198022430 ...
> 36422 39786 38502
>
>
> readGAlignmentPairs(file, param =
> ScanBamParam(which = gr))
>
> Error in makeGAlignmentPairs(galn,
use.names
> = use.names, use.mcols =
> use.mcols) :
> findMateAlignment() returned an
invalid
> 'mate' vector
> In addition: Warning message:
> In findMateAlignment(x) :
> 4 alignments with ambiguous
pairing
> were dumped.
> Use 'getDumpedAlignments()' to
> retrieve them from the dump
>
> environment.
>
>
> sessionInfo()
>
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu
(64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8
LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C
LC_NAME=C
> [9] LC_ADDRESS=C
> LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics
grDevices
> utils datasets methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.13.39
Biostrings_2.29.16
> GenomicRanges_1.13.41
> [4] XVector_0.1.1
IRanges_1.19.31
> BiocGenerics_0.7.4
>
> loaded via a namespace (and not
attached):
> [1] bitops_1.0-6 stats4_3.0.0
zlibbioc_1.7.0
>
>
>
>
>
> [[alternative HTML version
deleted]]
>
>
________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> <mailto:bioconductor at="" r-project.org="">
>
https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" **listinfo="" bio="" conductor=""><https: __="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>
> Search the archives:
> http://news.gmane.org/gmane.**__science.biol
ogy.informatics.**__conductor
> <http: news.gmane.org="" gmane.**science.biolo="" gy.informatics.**conductor=""><http: news.gmane.__org="" gmane.science.biol="" ogy.__informatics.conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>
>
>
> [[alternative HTML version deleted]]
>
>
________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> <mailto:bioconductor at="" r-project.org="">
>
https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" **listinfo="" biocond="" uctor=""><https: __="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>
> Search the archives:
http://news.gmane.org/gmane.**
> science.biology.informatics.**__conductor<http: news.gmane.__org="" gmane.science.biology.__informatics.conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
> [[alternative HTML version deleted]]
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
> Search the archives:
>
http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
On 09/17/2013 04:39 PM, Hervé Pagès wrote:
> On 09/17/2013 04:15 PM, Michael Lawrence wrote:
>> Leonard was thinking that the problem of importing the same
alignment
>> multiple times when caught by multiple which ranges could be solved
by
>> maintaining a hash of what reads have been seen before. This could
be
>> done with the read ID + pos + strand + cigar, we think,
>
> FWIW, I've seen BAM files with more than 1 record sharing the same
> QNAME + POS + strand + CIGAR.
>
> The 'Ambiguous pairing' section in the man page for
findMateAlignment()
> shows such an example. It was taken from a real BAM file. IIRC the 2
> records were actually indistinguible.
>
>> but even easier
>> would be to get at the file offset. It looks like samtools keeps
that
>> hidden, but we could just do some copy-and-pasting of the
bam_iter_t
>> struct...
>
> IIUC the common use case is processing a BAM file by chromosome
chunks.
> Are you saying the record ID would be returned by scanBam or
> readGAlignemnts() so the end/user has a way to discard records that
> have already been processed.
>
> Note that with a small change of semantic to the 'which' argument,
you
> don't need to return a record ID and the end-user needs to do
nothing:
> the same record cannot be loaded twice.
To be even clearer about this: with its current semantic, 'which'
doesn't feel like the right thing to use to process a chromosome "by
tiles". Instead of trying to fix it (by changing its semantic or by
maintaining a hash of what reads have been seen before), maybe it
would make more sense to add a 'tiles' (or 'tiling') argument to
ScanBamParam() that would also take a GRanges object. Then scanBam()
would load the records that have their POS in the tiles and
scanBam(asMates=TRUE) would load the records that have their POS
in the tiles and are *first* mate and it would try to pair them
(beyond 'tiles' if needed).
It would be enough to enforce 'tiles' to have non-overlapping ranges
but it would not be necessary to require it to be a "real" tiling
(i.e.
a set of adjacent ranges that cover the entire chromosome or genome).
In addition, deciding whether a record belongs to 'tiles' would be
cheaper (and faster) than deciding whether it overlaps with
'which' because there is no more need to compute the end of the
record based on its CIGAR.
H.
>
> H.
>
>>
>>
>>
>>
>> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at="" gene.com="">> <mailto:michafla at="" gene.com="">> wrote:
>>
>> I'm a big fan of the new paired-end streaming. It's the way to
go
>> over which-based partitioning. Nice job!
>>
>>
>> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
>> <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> wrote:
>>
>> The new pairing algorithm used by readGAlignmentsList()
does
>> find mates outside the specified range. When only one of
the
>> pairs overlaps the 'which', it looks for a potential mate
in the
>> rest of the file.
>>
>> fl <- system.file("extdata", "ex1.bam",
package="Rsamtools")
>> param <- ScanBamParam(what="flag",
>> which=GRanges("seq1", IRanges(1,
>> width=40)))
>>
>> readGAlignmentPairs() finds no mates in this region.
>>
>> gapairs <- readGAlignmentPairs(BamFile(__fl), param=param)
>> > gapairs
>>
>> GAlignmentPairs with 0 alignment pairs and 0 metadata
columns:
>> seqnames strand : ranges -- ranges
>> <rle> <rle> : <iranges> -- <iranges>
>>
>> readGAlignmentsList() finds two mate pairs and several non-
mates
>> in this region. The 'mates' metadata column indicates which
are
>> legitimate mates as per the algorithem described at
>> ?readGAlignmentsListFromBam. More control over what records
are
>> returned can be done by specifying flags in ScanBamParam.
>>
>> galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE),
>> param=param)
>> > galist
>> GAlignmentsList of length 17:
>> [[1]]
>> GAlignments with 2 alignments and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap |
mates
>> flag
>> [1] seq1 - 35M 35 229 263 35
>> <tel:35%20%20%20229%20263%20%20%20%2035> 0 | 1 83
>> [2] seq1 + 35M 35 37 71 35 0 |
1
>> 163
>>
>> [[2]]
>> GAlignments with 2 alignments and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap |
mates
>> flag
>> [1] seq1 - 35M 35 185 219 35
>> <tel:35%20%20%20185%20219%20%20%20%2035> 0 | 1 147
>> [2] seq1 + 35M 35 36 70 35 0 |
1
>> 99
>>
>> [[3]]
>> GAlignments with 1 alignment and 2 metadata columns:
>> seqnames strand cigar qwidth start end width ngap |
mates
>> flag
>> [1] seq1 + 36M 36 6 41 36 0 |
0
>> 137
>>
>>
>> > elementLengths(galist)
>> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>
>> The first two list elements contain mates.
>>
>> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>>
>> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>> FALSE FALSE FALSE
>> [13] FALSE FALSE FALSE FALSE FALSE
>>
>>
>> Here are descriptions of the first few flags as per this
site:
>> http://picard.sourceforge.net/__explain-flags.html
>> <http: picard.sourceforge.net="" explain-flags.html="">
>> Flag 137 indicates an unmapped mate which is why this
record was
>> not mated.
>>
>>
>> 83:
>> Summary:
>> read paired
>> read mapped in proper pair
>> read reverse strand
>> first in pair
>>
>> 163:
>> Summary:
>> read paired
>> read mapped in proper pair
>> mate reverse strand
>> second in pair
>>
>> 147:
>> Summary:
>> read paired
>> read mapped in proper pair
>> read reverse strand
>> second in pair
>>
>> 99:
>> Summary:
>> read paired
>> read mapped in proper pair
>> mate reverse strand
>> first in pair
>>
>> 137:
>> Summary:
>> read paired
>> mate unmapped
>> second in pair
>>
>>
>> Valerie
>>
>>
>>
>> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>>
>> Yes, I knew there weren't any easy solutions. I just
think
>> that somehow we
>> need to point these out to users, because it's not
exactly
>> obvious.
>>
>>
>> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
>> <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
wrote:
>>
>> Hi guys,
>>
>> Sorry for the late answer.
>>
>>
>> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>>
>> There seem to be at least two "gotchas" with
the
>> 'which' argument and
>> readGAlignmentPairs: (1) as Leonard said,
there's no
>> easy way to handle
>> cases where one record overlaps multiple which
>> arguments
>>
>>
>> This is not just with readGAlignmentPairs, it's
also with
>> readGAlignments and, at a lower level, with
scanBam(),
>> which
>> the 2 formers are based on. If a record overlaps 2
>> ranges in the
>> 'which' arg, scanBam() loads it twice.
>>
>> I can't think of an easy way to address this unless
we
>> change the
>> semantic of the 'which' argument. For example
instead of
>> loading
>> records that overlap we could make the choice to
only
>> load records
>> that have their POS (i.e. start) within the ranges
>> specified thru
>> 'which'. It introduces a dissymmetry (because it
looks
>> at the start
>> of the record and not at its end) but it has the
>> advantage of making
>> sure records are only loaded once (unless the user
>> passes a 'which'
>> argument that contains overlapping ranges but then
>> s/he's really
>> looking for it ;-) ).
>>
>>
>> and (2) any read
>>
>> pairs with one end inside which and the other
out
>> will be discarded.
>>
>>
>> No easy work around for that one with the current
>> implementation of
>> readGAlignmentPairs which just passes its 'which'
>> argument down to
>> scanBam().
>>
>>
>> Excluding per-chromosome iteration, 'which' is a
bit
>> dangerous when
>>
>> dealing
>> with read pairs.
>>
>>
>> It's also dangerous when dealing with single end
reads,
>> and it has
>> been for a while.
>>
>> Cheers,
>> H.
>>
>>
>> Somehow we should make this obvious to the user.
>>
>>
>>
>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard
Goldstein <
>> goldstein.leonard at gene.com
>> <mailto:goldstein.leonard at="" gene.com="">> wrote:
>>
>> Dear Herve and list,
>>
>>
>>
>> I realized the error might have been caused
due
>> to the same bam record
>> being read in multiple times when passing
>> multiple ranges in the which
>> argument.
>>
>>
>> I don't think it is necessarily obvious to
users
>> that
>> readGAlignments/Pairs
>> returns a concatenation of records that
were
>> read in for each range in
>> the
>> which argument. Instead one might expect to
>> obtain the union of records
>> that overlap any of the ranges included in
>> which. Do you think it would
>> be
>> helpful to issue a warning in case records
are
>> read in multiple times?
>>
>>
>> Best wishes,
>>
>>
>> Leonard
>>
>>
>>
>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard
>> Goldstein <goldstel at="" gene.com="">> <mailto:goldstel at="" gene.com="">
>>
>> wrote:
>>
>>
>> Hi Herve,
>>
>>
>>
>> I ran into some issues when using
>> readGAlignmentPairs with the
>> ScanBamParam 'which' argument. I
included an
>> example below. I can see
>> how
>> changing 'which' can result in
different
>> pairings but was wondering
>>
>> whether
>>
>> the error can be avoided and
problematic
>> pairings dropped instead.
>>
>>
>> It looks like the issue is introduced
in
>> revision 77808 (no error in
>> 77791). I was hoping you have an idea
what
>> causes the problem, otherwise
>>
>> I
>>
>> can try to forward a test case.
>>
>>
>> Thanks for your help.
>>
>>
>> Leonard
>>
>>
>> gr
>>
>>
>> GRanges with 2 ranges and 0 metadata
columns:
>> seqnames ranges
strand
>> <rle> <iranges>
<rle>
>> [1] 10 [75758072, 75758133]
*
>> [2] 10 [75802841, 75802911]
*
>> ---
>> seqlengths:
>> 10
>> NA
>>
>>
>> readGAlignmentPairs(file, param =
>> ScanBamParam(which = gr[1]))
>>
>> GAlignmentPairs with 0 alignment pairs
and 0
>> metadata columns:
>> seqnames strand : ranges --
ranges
>> <rle> <rle> : <iranges> --
<iranges>
>> ---
>> seqlengths:
>> 1 2 3
...
>> GL000247.1 GL000248.1 GL000249.1
>> 249250621 243199373 198022430
...
>> 36422 39786 38502
>>
>>
>> readGAlignmentPairs(file, param =
>> ScanBamParam(which = gr[2]))
>>
>> GAlignmentPairs with 3 alignment pairs
and 0
>> metadata columns:
>> seqnames strand :
>> ranges -- ranges
>> <rle> <rle> :
>> <iranges> -- <iranges>
>> [1] 10 + : [75758115,
75802896]
>> -- [75802892, 75830482]
>> [2] 10 - : [75802908,
75830498]
>> -- [75758111, 75802892]
>> [3] 10 - : [75802908,
75830498]
>> -- [75802878, 75830468]
>> ---
>> seqlengths:
>> 1 2 3
...
>> GL000247.1 GL000248.1 GL000249.1
>> 249250621 243199373 198022430
...
>> 36422 39786 38502
>>
>>
>> readGAlignmentPairs(file, param =
>> ScanBamParam(which = gr))
>>
>> Error in makeGAlignmentPairs(galn,
use.names
>> = use.names, use.mcols =
>> use.mcols) :
>> findMateAlignment() returned an
invalid
>> 'mate' vector
>> In addition: Warning message:
>> In findMateAlignment(x) :
>> 4 alignments with ambiguous
pairing
>> were dumped.
>> Use 'getDumpedAlignments()' to
>> retrieve them from the dump
>>
>> environment.
>>
>>
>> sessionInfo()
>>
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu
(64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8
>> LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8
>> LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8
>> LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C
LC_NAME=C
>> [9] LC_ADDRESS=C
>> LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics
grDevices
>> utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] Rsamtools_1.13.39
Biostrings_2.29.16
>> GenomicRanges_1.13.41
>> [4] XVector_0.1.1
IRanges_1.19.31
>> BiocGenerics_0.7.4
>>
>> loaded via a namespace (and not
attached):
>> [1] bitops_1.0-6 stats4_3.0.0
>> zlibbioc_1.7.0
>>
>>
>>
>>
>>
>> [[alternative HTML version
deleted]]
>>
>>
>> ________________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> <mailto:bioconductor at="" r-project.org="">
>>
>> https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
>>
>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor=""><https: __="" st="" at.ethz.ch="" mailman="" __listinfo="" bioconductor="">>
>>
>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.**__science.biology.informatics.**__con
ductor
>>
>> <http: news.gmane.org="" gmane.**science.biology.informatics.**conduc="" tor=""><http: news.gmane.__org="" gmane.science.biology.__informatics.condu="" ctor="">>
>>
>>
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
________________________________**_________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> <mailto:bioconductor at="" r-project.org="">
>>
>> https://stat.ethz.ch/mailman/*__*listinfo/bioconductor
>>
>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor=""><https: __="" st="" at.ethz.ch="" mailman="" __listinfo="" bioconductor="">>
>>
<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>
>> Search the archives:
http://news.gmane.org/gmane.**
>>
>> science.biology.informatics.**__conductor<http: news.gmane.__org="" g="" mane.science.biology.__informatics.conductor="">>
>>
>>
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> _________________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> <mailto:bioconductor at="" r-project.org="">
>> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the archives:
>>
>>
http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>
>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>>
>>
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
So here's a use case: get all the reads for a set of putative
transcribed
regions, numbering in the thousands. You don't want just the reads
that
start in the regions, and you want to process the reads in vectorized
fashion, ideally with a single query to the BAM, returning a
GAlignmentPairs.
There are at least two ways to do this:
a) Have readGAlignmentPairs and friends return non-redundant
alignments,
even with multiple which,
b) Query the BAM separately (like in an lapply) for each which, then
combine the results, and keep around a partitioning that records the
original 'which' for each alignment. This would be slower and require
more
work of the user, but it would work right now.
Hopefully others have additional ideas.
Michael
On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
> On 09/17/2013 04:39 PM, Hervé Pagès wrote:
>
>> On 09/17/2013 04:15 PM, Michael Lawrence wrote:
>>
>>> Leonard was thinking that the problem of importing the same
alignment
>>> multiple times when caught by multiple which ranges could be
solved by
>>> maintaining a hash of what reads have been seen before. This could
be
>>> done with the read ID + pos + strand + cigar, we think,
>>>
>>
>> FWIW, I've seen BAM files with more than 1 record sharing the same
>> QNAME + POS + strand + CIGAR.
>>
>> The 'Ambiguous pairing' section in the man page for
findMateAlignment()
>> shows such an example. It was taken from a real BAM file. IIRC the
2
>> records were actually indistinguible.
>>
>> but even easier
>>> would be to get at the file offset. It looks like samtools keeps
that
>>> hidden, but we could just do some copy-and-pasting of the
bam_iter_t
>>> struct...
>>>
>>
>> IIUC the common use case is processing a BAM file by chromosome
chunks.
>> Are you saying the record ID would be returned by scanBam or
>> readGAlignemnts() so the end/user has a way to discard records that
>> have already been processed.
>>
>> Note that with a small change of semantic to the 'which' argument,
you
>> don't need to return a record ID and the end-user needs to do
nothing:
>> the same record cannot be loaded twice.
>>
>
> To be even clearer about this: with its current semantic, 'which'
doesn't
> feel like the right thing to use to process a chromosome "by
> tiles". Instead of trying to fix it (by changing its semantic or by
> maintaining a hash of what reads have been seen before), maybe it
> would make more sense to add a 'tiles' (or 'tiling') argument to
> ScanBamParam() that would also take a GRanges object. Then scanBam()
> would load the records that have their POS in the tiles and
> scanBam(asMates=TRUE) would load the records that have their POS
> in the tiles and are *first* mate and it would try to pair them
> (beyond 'tiles' if needed).
>
> It would be enough to enforce 'tiles' to have non-overlapping ranges
> but it would not be necessary to require it to be a "real" tiling
(i.e.
> a set of adjacent ranges that cover the entire chromosome or
genome).
>
> In addition, deciding whether a record belongs to 'tiles' would be
> cheaper (and faster) than deciding whether it overlaps with
> 'which' because there is no more need to compute the end of the
> record based on its CIGAR.
>
>
> H.
>
>
>> H.
>>
>>
>>>
>>>
>>>
>>> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence
<michafla@gene.com>>> <mailto:michafla@gene.com>> wrote:
>>>
>>> I'm a big fan of the new paired-end streaming. It's the way to
go
>>> over which-based partitioning. Nice job!
>>>
>>>
>>> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
>>> <vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">> wrote:
>>>
>>> The new pairing algorithm used by readGAlignmentsList()
does
>>> find mates outside the specified range. When only one of
the
>>> pairs overlaps the 'which', it looks for a potential mate
in the
>>> rest of the file.
>>>
>>> fl <- system.file("extdata", "ex1.bam",
package="Rsamtools")
>>> param <- ScanBamParam(what="flag",
>>> which=GRanges("seq1", IRanges(1,
>>> width=40)))
>>>
>>> readGAlignmentPairs() finds no mates in this region.
>>>
>>> gapairs <- readGAlignmentPairs(BamFile(__**fl),
param=param)
>>> > gapairs
>>>
>>> GAlignmentPairs with 0 alignment pairs and 0 metadata
columns:
>>> seqnames strand : ranges -- ranges
>>> <rle> <rle> : <iranges> -- <iranges>
>>>
>>> readGAlignmentsList() finds two mate pairs and several
non-mates
>>> in this region. The 'mates' metadata column indicates
which are
>>> legitimate mates as per the algorithem described at
>>> ?readGAlignmentsListFromBam. More control over what
records are
>>> returned can be done by specifying flags in ScanBamParam.
>>>
>>> galist <- readGAlignmentsList(BamFile(__**fl,
asMates=TRUE),
>>> param=param)
>>> > galist
>>> GAlignmentsList of length 17:
>>> [[1]]
>>> GAlignments with 2 alignments and 2 metadata columns:
>>> seqnames strand cigar qwidth start end width ngap |
mates
>>> flag
>>> [1] seq1 - 35M 35 229 263 35
>>> <tel:35%20%20%20229%20263%20%**20%20%2035> 0 | 1
83
>>> [2] seq1 + 35M 35 37 71 35 0 |
1
>>> 163
>>>
>>> [[2]]
>>> GAlignments with 2 alignments and 2 metadata columns:
>>> seqnames strand cigar qwidth start end width ngap |
mates
>>> flag
>>> [1] seq1 - 35M 35 185 219 35
>>> <tel:35%20%20%20185%20219%20%**20%20%2035> 0 | 1
147
>>> [2] seq1 + 35M 35 36 70 35 0 |
1
>>> 99
>>>
>>> [[3]]
>>> GAlignments with 1 alignment and 2 metadata columns:
>>> seqnames strand cigar qwidth start end width ngap |
mates
>>> flag
>>> [1] seq1 + 36M 36 6 41 36 0 |
0
>>> 137
>>>
>>>
>>> > elementLengths(galist)
>>> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>>
>>> The first two list elements contain mates.
>>>
>>> sapply(galist, function(x) sum((mcols(x)$mates)) > 0)
>>>
>>> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
>>> FALSE FALSE FALSE
>>> [13] FALSE FALSE FALSE FALSE FALSE
>>>
>>>
>>> Here are descriptions of the first few flags as per this
site:
>>> http://picard.sourceforge.net/**__explain-
flags.html<http: picard.sourceforge.net="" __explain-flags.html="">
>>> <http: picard.sourceforge.**net="" explain-="" flags.html<http:="" picard.sourceforge.net="" explain-flags.html="">
>>> >
>>> Flag 137 indicates an unmapped mate which is why this
record was
>>> not mated.
>>>
>>>
>>> 83:
>>> Summary:
>>> read paired
>>> read mapped in proper pair
>>> read reverse strand
>>> first in pair
>>>
>>> 163:
>>> Summary:
>>> read paired
>>> read mapped in proper pair
>>> mate reverse strand
>>> second in pair
>>>
>>> 147:
>>> Summary:
>>> read paired
>>> read mapped in proper pair
>>> read reverse strand
>>> second in pair
>>>
>>> 99:
>>> Summary:
>>> read paired
>>> read mapped in proper pair
>>> mate reverse strand
>>> first in pair
>>>
>>> 137:
>>> Summary:
>>> read paired
>>> mate unmapped
>>> second in pair
>>>
>>>
>>> Valerie
>>>
>>>
>>>
>>> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>>>
>>> Yes, I knew there weren't any easy solutions. I just
think
>>> that somehow we
>>> need to point these out to users, because it's not
exactly
>>> obvious.
>>>
>>>
>>> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès
>>> <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> wrote:
>>>
>>> Hi guys,
>>>
>>> Sorry for the late answer.
>>>
>>>
>>> On 09/11/2013 11:51 AM, Michael Lawrence wrote:
>>>
>>> There seem to be at least two "gotchas" with
the
>>> 'which' argument and
>>> readGAlignmentPairs: (1) as Leonard said,
there's no
>>> easy way to handle
>>> cases where one record overlaps multiple which
>>> arguments
>>>
>>>
>>> This is not just with readGAlignmentPairs, it's
also with
>>> readGAlignments and, at a lower level, with
scanBam(),
>>> which
>>> the 2 formers are based on. If a record overlaps 2
>>> ranges in the
>>> 'which' arg, scanBam() loads it twice.
>>>
>>> I can't think of an easy way to address this
unless we
>>> change the
>>> semantic of the 'which' argument. For example
instead of
>>> loading
>>> records that overlap we could make the choice to
only
>>> load records
>>> that have their POS (i.e. start) within the ranges
>>> specified thru
>>> 'which'. It introduces a dissymmetry (because it
looks
>>> at the start
>>> of the record and not at its end) but it has the
>>> advantage of making
>>> sure records are only loaded once (unless the user
>>> passes a 'which'
>>> argument that contains overlapping ranges but then
>>> s/he's really
>>> looking for it ;-) ).
>>>
>>>
>>> and (2) any read
>>>
>>> pairs with one end inside which and the other
out
>>> will be discarded.
>>>
>>>
>>> No easy work around for that one with the current
>>> implementation of
>>> readGAlignmentPairs which just passes its 'which'
>>> argument down to
>>> scanBam().
>>>
>>>
>>> Excluding per-chromosome iteration, 'which' is
a bit
>>> dangerous when
>>>
>>> dealing
>>> with read pairs.
>>>
>>>
>>> It's also dangerous when dealing with single end
reads,
>>> and it has
>>> been for a while.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> Somehow we should make this obvious to the
user.
>>>
>>>
>>>
>>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard
Goldstein <
>>> goldstein.leonard@gene.com
>>>
<mailto:goldstein.leonard@**gene.com<goldstein.leonard@gene.com>>>
>>> wrote:
>>>
>>> Dear Herve and list,
>>>
>>>
>>>
>>> I realized the error might have been
caused due
>>> to the same bam record
>>> being read in multiple times when passing
>>> multiple ranges in the which
>>> argument.
>>>
>>>
>>> I don't think it is necessarily obvious to
users
>>> that
>>> readGAlignments/Pairs
>>> returns a concatenation of records that
were
>>> read in for each range in
>>> the
>>> which argument. Instead one might expect
to
>>> obtain the union of records
>>> that overlap any of the ranges included in
>>> which. Do you think it would
>>> be
>>> helpful to issue a warning in case records
are
>>> read in multiple times?
>>>
>>>
>>> Best wishes,
>>>
>>>
>>> Leonard
>>>
>>>
>>>
>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard
>>> Goldstein <goldstel@gene.com>>> <mailto:goldstel@gene.com>
>>>
>>> wrote:
>>>
>>>
>>> Hi Herve,
>>>
>>>
>>>
>>> I ran into some issues when using
>>> readGAlignmentPairs with the
>>> ScanBamParam 'which' argument. I
included an
>>> example below. I can see
>>> how
>>> changing 'which' can result in
different
>>> pairings but was wondering
>>>
>>> whether
>>>
>>> the error can be avoided and
problematic
>>> pairings dropped instead.
>>>
>>>
>>> It looks like the issue is introduced
in
>>> revision 77808 (no error in
>>> 77791). I was hoping you have an idea
what
>>> causes the problem, otherwise
>>>
>>> I
>>>
>>> can try to forward a test case.
>>>
>>>
>>> Thanks for your help.
>>>
>>>
>>> Leonard
>>>
>>>
>>> gr
>>>
>>>
>>> GRanges with 2 ranges and 0 metadata
columns:
>>> seqnames ranges
strand
>>> <rle> <iranges>
<rle>
>>> [1] 10 [75758072, 75758133]
*
>>> [2] 10 [75802841, 75802911]
*
>>> ---
>>> seqlengths:
>>> 10
>>> NA
>>>
>>>
>>> readGAlignmentPairs(file, param =
>>> ScanBamParam(which = gr[1]))
>>>
>>> GAlignmentPairs with 0 alignment pairs
and 0
>>> metadata columns:
>>> seqnames strand : ranges --
ranges
>>> <rle> <rle> : <iranges> --
<iranges>
>>> ---
>>> seqlengths:
>>> 1 2 3
...
>>> GL000247.1 GL000248.1 GL000249.1
>>> 249250621 243199373 198022430
...
>>> 36422 39786 38502
>>>
>>>
>>> readGAlignmentPairs(file, param =
>>> ScanBamParam(which = gr[2]))
>>>
>>> GAlignmentPairs with 3 alignment pairs
and 0
>>> metadata columns:
>>> seqnames strand :
>>> ranges -- ranges
>>> <rle> <rle> :
>>> <iranges> -- <iranges>
>>> [1] 10 + : [75758115,
75802896]
>>> -- [75802892, 75830482]
>>> [2] 10 - : [75802908,
75830498]
>>> -- [75758111, 75802892]
>>> [3] 10 - : [75802908,
75830498]
>>> -- [75802878, 75830468]
>>> ---
>>> seqlengths:
>>> 1 2 3
...
>>> GL000247.1 GL000248.1 GL000249.1
>>> 249250621 243199373 198022430
...
>>> 36422 39786 38502
>>>
>>>
>>> readGAlignmentPairs(file, param =
>>> ScanBamParam(which = gr))
>>>
>>> Error in makeGAlignmentPairs(galn,
use.names
>>> = use.names, use.mcols =
>>> use.mcols) :
>>> findMateAlignment() returned an
invalid
>>> 'mate' vector
>>> In addition: Warning message:
>>> In findMateAlignment(x) :
>>> 4 alignments with ambiguous
pairing
>>> were dumped.
>>> Use 'getDumpedAlignments()' to
>>> retrieve them from the dump
>>>
>>> environment.
>>>
>>>
>>> sessionInfo()
>>>
>>> R version 3.0.0 (2013-04-03)
>>> Platform: x86_64-unknown-linux-gnu
(64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8
>>> LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8
>>> LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8
>>> LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C
LC_NAME=C
>>> [9] LC_ADDRESS=C
>>> LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8
>>> LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel stats graphics
grDevices
>>> utils datasets methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] Rsamtools_1.13.39
Biostrings_2.29.16
>>> GenomicRanges_1.13.41
>>> [4] XVector_0.1.1
IRanges_1.19.31
>>> BiocGenerics_0.7.4
>>>
>>> loaded via a namespace (and not
attached):
>>> [1] bitops_1.0-6 stats4_3.0.0
>>> zlibbioc_1.7.0
>>>
>>>
>>>
>>>
>>>
>>> [[alternative HTML version
deleted]]
>>>
>>>
>>> ______________________________**__**_________________
>>> Bioconductor mailing list
>>> Bioconductor@r-project.org
>>>
<mailto:bioconductor@r-**project.org<bioconductor@r-project.org>
>>> >
>>>
>>> https://stat.ethz.ch/mailman/***__*listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">
>>>
>>> <https: stat.ethz.ch="" mailman="" ****listinfo="" bioconductor<https:="" st="" at.ethz.ch="" mailman="" **listinfo="" bioconductor="">
>>> ><**https:/__/stat.ethz.ch/**mailman/__listinfo/**bioconductor<htt p:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>>>
>>>
>>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat="" .ethz.ch="" mailman="" listinfo="" bioconductor="">
>>> >>
>>> Search the archives:
>>>
>>> http://news.gmane.org/gmane.****__science.biology.informatics.**
>>> **__conductor<http: news.gmane.org="" gmane.**__science.biology.info="" rmatics.**__conductor="">
>>>
>>> <http: news.gmane.org="" gmane.****science.biology.informatics.***="">>> *conductor<http: news.gmane.org="" gmane.**science.biology.informati="" cs.**conductor="">
>>> ><http: news.gmane.**__org="" gmane.science.biology.__**="">>> informatics.conductor
>>>
>>>
>>> <http: news.gmane.org="" gmane.**science.biology.informatics.**condu="" ctor<http:="" news.gmane.org="" gmane.science.biology.informatics.conductor="">
>>> >>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________**
>>> __**_________________
>>> Bioconductor mailing list
>>> Bioconductor@r-project.org
>>>
<mailto:bioconductor@r-**project.org<bioconductor@r-project.org>
>>> >
>>>
>>> https://stat.ethz.ch/mailman/***__*listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">
>>>
>>> <https: stat.ethz.ch="" mailman="" ****listinfo="" bioconductor<https:="" st="" at.ethz.ch="" mailman="" **listinfo="" bioconductor="">
>>> ><**https:/__/stat.ethz.ch/**mailman/__listinfo/**bioconductor<htt p:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>>>
>>> <https: stat.ethz.ch="" mailman="" **="">>> listinfo/bioconductor<https: stat.ethz.ch="" mailman="" listinfo="" biocon="" ductor="">
>>> >>
>>> Search the archives:
http://news.gmane.org/gmane.**
>>>
>>> science.biology.informatics.****__conductor<http: news.gmane.**="">>> __org/gmane.science.biology.__**informatics.conductor
>>>
>>>
>>> <http: news.gmane.org="" gmane.**science.biology.informatics.**condu="" ctor<http:="" news.gmane.org="" gmane.science.biology.informatics.conductor="">
>>> >>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org>
>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> ______________________________**___________________
>>> Bioconductor mailing list
>>> Bioconductor@r-project.org
>>> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org="">>
>>> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor
<https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<="" https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>>> >
>>> Search the archives:
>>>
>>> http://news.gmane.org/gmane.__**science.biology.informatics.__**
>>> conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor="">
>>>
>>> <http: news.gmane.org="" gmane.**science.biology.informatics.**condu="" ctor<http:="" news.gmane.org="" gmane.science.biology.informatics.conductor="">
>>> >
>>>
>>>
>>>
>>>
>>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
[[alternative HTML version deleted]]
Just to clarify, I was only suggesting the 'tiles' arg as a simple way
to process a chromosome by chunks/tiles, which seems to be a common
use
case. 'tiles' is also trivial to split over multiple cpus. 'which' is
harder, because of the risk to process twice the same read. Using a
hash
to work around the problem is probably very hard in the multiple-cpu
context.
H.
On 09/17/2013 09:10 PM, Michael Lawrence wrote:
> So here's a use case: get all the reads for a set of putative
> transcribed regions, numbering in the thousands. You don't want just
the
> reads that start in the regions, and you want to process the reads
in
> vectorized fashion, ideally with a single query to the BAM,
returning a
> GAlignmentPairs.
>
> There are at least two ways to do this:
> a) Have readGAlignmentPairs and friends return non-redundant
alignments,
> even with multiple which,
> b) Query the BAM separately (like in an lapply) for each which, then
> combine the results, and keep around a partitioning that records the
> original 'which' for each alignment. This would be slower and
require
> more work of the user, but it would work right now.
>
> Hopefully others have additional ideas.
>
> Michael
>
>
>
>
>
>
>
> On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote:
>
> On 09/17/2013 04:39 PM, Hervé Pagès wrote:
>
> On 09/17/2013 04:15 PM, Michael Lawrence wrote:
>
> Leonard was thinking that the problem of importing the
same
> alignment
> multiple times when caught by multiple which ranges
could be
> solved by
> maintaining a hash of what reads have been seen before.
This
> could be
> done with the read ID + pos + strand + cigar, we think,
>
>
> FWIW, I've seen BAM files with more than 1 record sharing
the same
> QNAME + POS + strand + CIGAR.
>
> The 'Ambiguous pairing' section in the man page for
> findMateAlignment()
> shows such an example. It was taken from a real BAM file.
IIRC the 2
> records were actually indistinguible.
>
> but even easier
> would be to get at the file offset. It looks like
samtools
> keeps that
> hidden, but we could just do some copy-and-pasting of
the
> bam_iter_t
> struct...
>
>
> IIUC the common use case is processing a BAM file by
chromosome
> chunks.
> Are you saying the record ID would be returned by scanBam or
> readGAlignemnts() so the end/user has a way to discard
records that
> have already been processed.
>
> Note that with a small change of semantic to the 'which'
> argument, you
> don't need to return a record ID and the end-user needs to
do
> nothing:
> the same record cannot be loaded twice.
>
>
> To be even clearer about this: with its current semantic,
'which'
> doesn't feel like the right thing to use to process a chromosome
"by
> tiles". Instead of trying to fix it (by changing its semantic or
by
> maintaining a hash of what reads have been seen before), maybe
it
> would make more sense to add a 'tiles' (or 'tiling') argument to
> ScanBamParam() that would also take a GRanges object. Then
scanBam()
> would load the records that have their POS in the tiles and
> scanBam(asMates=TRUE) would load the records that have their POS
> in the tiles and are *first* mate and it would try to pair them
> (beyond 'tiles' if needed).
>
> It would be enough to enforce 'tiles' to have non-overlapping
ranges
> but it would not be necessary to require it to be a "real"
tiling (i.e.
> a set of adjacent ranges that cover the entire chromosome or
genome).
>
> In addition, deciding whether a record belongs to 'tiles' would
be
> cheaper (and faster) than deciding whether it overlaps with
> 'which' because there is no more need to compute the end of the
> record based on its CIGAR.
>
>
> H.
>
>
> H.
>
>
>
>
>
> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence
> <michafla at="" gene.com="" <mailto:michafla="" at="" gene.com="">
> <mailto:michafla at="" gene.com="" <mailto:michafla="" at="" gene.com="">>> wrote:
>
> I'm a big fan of the new paired-end streaming. It's
the
> way to go
> over which-based partitioning. Nice job!
>
>
> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
> <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">
> <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">>> wrote:
>
> The new pairing algorithm used by
> readGAlignmentsList() does
> find mates outside the specified range. When
only
> one of the
> pairs overlaps the 'which', it looks for a
> potential mate in the
> rest of the file.
>
> fl <- system.file("extdata", "ex1.bam",
> package="Rsamtools")
> param <- ScanBamParam(what="flag",
> which=GRanges("seq1",
IRanges(1,
> width=40)))
>
> readGAlignmentPairs() finds no mates in this
region.
>
> gapairs <- readGAlignmentPairs(BamFile(____fl),
> param=param)
> > gapairs
>
> GAlignmentPairs with 0 alignment pairs and 0
> metadata columns:
> seqnames strand : ranges -- ranges
> <rle> <rle> : <iranges> -- <iranges>
>
> readGAlignmentsList() finds two mate pairs and
> several non-mates
> in this region. The 'mates' metadata column
> indicates which are
> legitimate mates as per the algorithem
described at
> ?readGAlignmentsListFromBam. More control over
what
> records are
> returned can be done by specifying flags in
> ScanBamParam.
>
> galist <- readGAlignmentsList(BamFile(____fl,
> asMates=TRUE),
> param=param)
> > galist
> GAlignmentsList of length 17:
> [[1]]
> GAlignments with 2 alignments and 2 metadata
columns:
> seqnames strand cigar qwidth start end
width
> ngap | mates
> flag
> [1] seq1 - 35M 35 229 263 35
> <tel:35%20229%20263%2035>
> <tel:35%20%20%20229%20263%20%__20%20%2035> 0
|
> 1 83
> [2] seq1 + 35M 35 37 71
35
> 0 | 1
> 163
>
> [[2]]
> GAlignments with 2 alignments and 2 metadata
columns:
> seqnames strand cigar qwidth start end
width
> ngap | mates
> flag
> [1] seq1 - 35M 35 185 219 35
> <tel:35%20185%20219%2035>
> <tel:35%20%20%20185%20219%20%__20%20%2035> 0
|
> 1 147
> [2] seq1 + 35M 35 36 70
35
> 0 | 1
> 99
>
> [[3]]
> GAlignments with 1 alignment and 2 metadata
columns:
> seqnames strand cigar qwidth start end
width
> ngap | mates
> flag
> [1] seq1 + 36M 36 6 41
36
> 0 | 0
> 137
>
>
> > elementLengths(galist)
> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> The first two list elements contain mates.
>
> sapply(galist, function(x)
> sum((mcols(x)$mates)) > 0)
>
> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
> FALSE FALSE
> FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE
>
>
> Here are descriptions of the first few flags as
per
> this site:
> http://picard.sourceforge.net/____explain-flags.html
> <http: picard.sourceforge.net="" __explain-flags.html="">
> <http: picard.sourceforge.__net="" explain-="" flags.html=""> <http: picard.sourceforge.net="" explain-flags.html="">>
> Flag 137 indicates an unmapped mate which is
why
> this record was
> not mated.
>
>
> 83:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> first in pair
>
> 163:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> second in pair
>
> 147:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> second in pair
>
> 99:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> first in pair
>
> 137:
> Summary:
> read paired
> mate unmapped
> second in pair
>
>
> Valerie
>
>
>
> On 09/17/2013 01:55 PM, Michael Lawrence wrote:
>
> Yes, I knew there weren't any easy
solutions. I
> just think
> that somehow we
> need to point these out to users, because
it's
> not exactly
> obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM, Hervé
Pagès
> <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> wrote:
>
> Hi guys,
>
> Sorry for the late answer.
>
>
> On 09/11/2013 11:51 AM, Michael
Lawrence wrote:
>
> There seem to be at least two
"gotchas"
> with the
> 'which' argument and
> readGAlignmentPairs: (1) as Leonard
> said, there's no
> easy way to handle
> cases where one record overlaps
> multiple which
> arguments
>
>
> This is not just with
readGAlignmentPairs,
> it's also with
> readGAlignments and, at a lower level,
with
> scanBam(),
> which
> the 2 formers are based on. If a record
> overlaps 2
> ranges in the
> 'which' arg, scanBam() loads it twice.
>
> I can't think of an easy way to address
> this unless we
> change the
> semantic of the 'which' argument. For
> example instead of
> loading
> records that overlap we could make the
> choice to only
> load records
> that have their POS (i.e. start) within
the
> ranges
> specified thru
> 'which'. It introduces a dissymmetry
> (because it looks
> at the start
> of the record and not at its end) but
it
> has the
> advantage of making
> sure records are only loaded once
(unless
> the user
> passes a 'which'
> argument that contains overlapping
ranges
> but then
> s/he's really
> looking for it ;-) ).
>
>
> and (2) any read
>
> pairs with one end inside which and
the
> other out
> will be discarded.
>
>
> No easy work around for that one with
the
> current
> implementation of
> readGAlignmentPairs which just passes
its
> 'which'
> argument down to
> scanBam().
>
>
> Excluding per-chromosome iteration,
> 'which' is a bit
> dangerous when
>
> dealing
> with read pairs.
>
>
> It's also dangerous when dealing with
> single end reads,
> and it has
> been for a while.
>
> Cheers,
> H.
>
>
> Somehow we should make this obvious
to
> the user.
>
>
>
> On Wed, Sep 11, 2013 at 11:02 AM,
> Leonard Goldstein <
> goldstein.leonard at gene.com <mailto:goldstein.leonard at="" gene.com="">
> <mailto:goldstein.leonard at="" __gene.com=""> <mailto:goldstein.leonard at="" gene.com="">>> wrote:
>
> Dear Herve and list,
>
>
>
> I realized the error might have
> been caused due
> to the same bam record
> being read in multiple times
when
> passing
> multiple ranges in the which
> argument.
>
>
> I don't think it is necessarily
> obvious to users
> that
> readGAlignments/Pairs
> returns a concatenation of
records
> that were
> read in for each range in
> the
> which argument. Instead one
might
> expect to
> obtain the union of records
> that overlap any of the ranges
> included in
> which. Do you think it would
> be
> helpful to issue a warning in
case
> records are
> read in multiple times?
>
>
> Best wishes,
>
>
> Leonard
>
>
>
> On Tue, Sep 10, 2013 at 5:58
PM,
> Leonard
> Goldstein <goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com="">
> <mailto:goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com="">>
>
> wrote:
>
>
> Hi Herve,
>
>
>
> I ran into some issues when
using
> readGAlignmentPairs with
the
> ScanBamParam 'which'
argument.
> I included an
> example below. I can see
> how
> changing 'which' can result
in
> different
> pairings but was wondering
>
> whether
>
> the error can be avoided
and
> problematic
> pairings dropped instead.
>
>
> It looks like the issue is
> introduced in
> revision 77808 (no error in
> 77791). I was hoping you
have
> an idea what
> causes the problem,
otherwise
>
> I
>
> can try to forward a test
case.
>
>
> Thanks for your help.
>
>
> Leonard
>
>
> gr
>
>
> GRanges with 2 ranges and 0
> metadata columns:
> seqnames
> ranges strand
> <rle>
> <iranges> <rle>
> [1] 10
[75758072,
> 75758133] *
> [2] 10
[75802841,
> 75802911] *
> ---
> seqlengths:
> 10
> NA
>
>
>
readGAlignmentPairs(file,
> param =
> ScanBamParam(which =
gr[1]))
>
> GAlignmentPairs with 0
> alignment pairs and 0
> metadata columns:
> seqnames strand :
ranges
> -- ranges
> <rle> <rle> :
<iranges>
> -- <iranges>
> ---
> seqlengths:
> 1 2
> 3 ...
> GL000247.1 GL000248.1
GL000249.1
> 249250621 243199373
> 198022430 ...
> 36422 39786
38502
>
>
>
readGAlignmentPairs(file,
> param =
> ScanBamParam(which =
gr[2]))
>
> GAlignmentPairs with 3
> alignment pairs and 0
> metadata columns:
> seqnames strand :
> ranges --
ranges
> <rle> <rle> :
> <iranges> --
<iranges>
> [1] 10 + :
> [75758115, 75802896]
> -- [75802892, 75830482]
> [2] 10 - :
> [75802908, 75830498]
> -- [75758111, 75802892]
> [3] 10 - :
> [75802908, 75830498]
> -- [75802878, 75830468]
> ---
> seqlengths:
> 1 2
> 3 ...
> GL000247.1 GL000248.1
GL000249.1
> 249250621 243199373
> 198022430 ...
> 36422 39786
38502
>
>
>
readGAlignmentPairs(file,
> param =
> ScanBamParam(which =
gr))
>
> Error in
> makeGAlignmentPairs(galn, use.names
> = use.names, use.mcols =
> use.mcols) :
> findMateAlignment()
> returned an invalid
> 'mate' vector
> In addition: Warning
message:
> In findMateAlignment(x) :
> 4 alignments with
> ambiguous pairing
> were dumped.
> Use
> 'getDumpedAlignments()' to
> retrieve them from the dump
>
> environment.
>
>
> sessionInfo()
>
> R version 3.0.0
(2013-04-03)
> Platform:
> x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1]
LC_CTYPE=en_US.UTF-8
> LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> [5]
LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C
> LC_NAME=C
> [9] LC_ADDRESS=C
> LC_TELEPHONE=C
> [11]
LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats
> graphics grDevices
> utils datasets methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.13.39
> Biostrings_2.29.16
> GenomicRanges_1.13.41
> [4] XVector_0.1.1
> IRanges_1.19.31
> BiocGenerics_0.7.4
>
> loaded via a namespace (and
not
> attached):
> [1] bitops_1.0-6
stats4_3.0.0
> zlibbioc_1.7.0
>
>
>
>
>
> [[alternative HTML
> version deleted]]
>
>
> __________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
>
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>
>
> https://stat.ethz.ch/mailman/*____*listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">><
__https:/__/stat.ethz.ch/__mailman/__listinfo/__bioconductor
> <http: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>
>
> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>>
> Search the archives:
>
> http://news.gmane.org/gmane.**____science.biology.inform
atics.__**__conductor
> <http: news.gmane.org="" gmane.**__science.biology.informa="" tics.**__conductor="">
>
> <http: news.gmane.org="" gmane.*__*science.biology.informa="" tics.*__*conductor=""> <http: news.gmane.org="" gmane.**science.biology.informati="" cs.**conductor="">><http: news.gmane.____org="" gmane.science.biology.____i="" nformatics.conductor="">
>
>
<http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>>
>
>
> [[alternative HTML
version
> deleted]]
>
>
> __________________________________**_________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>
>
> https://stat.ethz.ch/mailman/*____*listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">><
__https:/__/stat.ethz.ch/__mailman/__listinfo/__bioconductor
> <http: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>
>
> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>>
> Search the archives:
> http://news.gmane.org/gmane.**
>
> science.biology.informatics.**____conductor<http: news.="" gmane.____org="" gmane.science.biology.____informatics.conductor="">
>
>
<http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">>
> Phone: (206) 667-5791
> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319
> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
>
>
> [[alternative HTML version
deleted]]
>
>
>
>
___________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>
> https://stat.ethz.ch/mailman/____listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>
> Search the archives:
>
> http://news.gmane.org/gmane.____science.biology.informat
ics.____conductor
> <http: news.gmane.org="" gmane.__science.biology.informati="" cs.__conductor="">
>
>
<http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>
>
>
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
For processing an entire BAM, I think the yieldSize streaming is the
way to
go. Each read is read only once, and there is now the option of
guaranteed
pairs.
On Tue, Sep 17, 2013 at 10:43 PM, Hervé Pagès <hpages@fhcrc.org>
wrote:
> Just to clarify, I was only suggesting the 'tiles' arg as a simple
way
> to process a chromosome by chunks/tiles, which seems to be a common
use
> case. 'tiles' is also trivial to split over multiple cpus. 'which'
is
> harder, because of the risk to process twice the same read. Using a
hash
> to work around the problem is probably very hard in the multiple-cpu
> context.
>
> H.
>
>
>
> On 09/17/2013 09:10 PM, Michael Lawrence wrote:
>
>> So here's a use case: get all the reads for a set of putative
>> transcribed regions, numbering in the thousands. You don't want
just the
>> reads that start in the regions, and you want to process the reads
in
>> vectorized fashion, ideally with a single query to the BAM,
returning a
>> GAlignmentPairs.
>>
>> There are at least two ways to do this:
>> a) Have readGAlignmentPairs and friends return non-redundant
alignments,
>> even with multiple which,
>> b) Query the BAM separately (like in an lapply) for each which,
then
>> combine the results, and keep around a partitioning that records
the
>> original 'which' for each alignment. This would be slower and
require
>> more work of the user, but it would work right now.
>>
>> Hopefully others have additional ideas.
>>
>> Michael
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> wrote:
>>
>> On 09/17/2013 04:39 PM, Hervé Pagès wrote:
>>
>> On 09/17/2013 04:15 PM, Michael Lawrence wrote:
>>
>> Leonard was thinking that the problem of importing the
same
>> alignment
>> multiple times when caught by multiple which ranges
could be
>> solved by
>> maintaining a hash of what reads have been seen before.
This
>> could be
>> done with the read ID + pos + strand + cigar, we think,
>>
>>
>> FWIW, I've seen BAM files with more than 1 record sharing
the same
>> QNAME + POS + strand + CIGAR.
>>
>> The 'Ambiguous pairing' section in the man page for
>> findMateAlignment()
>> shows such an example. It was taken from a real BAM file.
IIRC
>> the 2
>> records were actually indistinguible.
>>
>> but even easier
>> would be to get at the file offset. It looks like
samtools
>> keeps that
>> hidden, but we could just do some copy-and-pasting of
the
>> bam_iter_t
>> struct...
>>
>>
>> IIUC the common use case is processing a BAM file by
chromosome
>> chunks.
>> Are you saying the record ID would be returned by scanBam
or
>> readGAlignemnts() so the end/user has a way to discard
records
>> that
>> have already been processed.
>>
>> Note that with a small change of semantic to the 'which'
>> argument, you
>> don't need to return a record ID and the end-user needs to
do
>> nothing:
>> the same record cannot be loaded twice.
>>
>>
>> To be even clearer about this: with its current semantic,
'which'
>> doesn't feel like the right thing to use to process a
chromosome "by
>> tiles". Instead of trying to fix it (by changing its semantic
or by
>> maintaining a hash of what reads have been seen before), maybe
it
>> would make more sense to add a 'tiles' (or 'tiling') argument
to
>> ScanBamParam() that would also take a GRanges object. Then
scanBam()
>> would load the records that have their POS in the tiles and
>> scanBam(asMates=TRUE) would load the records that have their
POS
>> in the tiles and are *first* mate and it would try to pair them
>> (beyond 'tiles' if needed).
>>
>> It would be enough to enforce 'tiles' to have non-overlapping
ranges
>> but it would not be necessary to require it to be a "real"
tiling
>> (i.e.
>> a set of adjacent ranges that cover the entire chromosome or
genome).
>>
>> In addition, deciding whether a record belongs to 'tiles' would
be
>> cheaper (and faster) than deciding whether it overlaps with
>> 'which' because there is no more need to compute the end of the
>> record based on its CIGAR.
>>
>>
>> H.
>>
>>
>> H.
>>
>>
>>
>>
>>
>> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence
>> <michafla@gene.com <mailto:michafla@gene.com="">
>> <mailto:michafla@gene.com <mailto:michafla@gene.com="">>>
wrote:
>>
>> I'm a big fan of the new paired-end streaming.
It's the
>> way to go
>> over which-based partitioning. Nice job!
>>
>>
>> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain
>> <vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">
>> <mailto:vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">>>
>> wrote:
>>
>> The new pairing algorithm used by
>> readGAlignmentsList() does
>> find mates outside the specified range. When
only
>> one of the
>> pairs overlaps the 'which', it looks for a
>> potential mate in the
>> rest of the file.
>>
>> fl <- system.file("extdata", "ex1.bam",
>> package="Rsamtools")
>> param <- ScanBamParam(what="flag",
>> which=GRanges("seq1",
>> IRanges(1,
>> width=40)))
>>
>> readGAlignmentPairs() finds no mates in this
region.
>>
>> gapairs <-
readGAlignmentPairs(BamFile(__**__fl),
>>
>> param=param)
>> > gapairs
>>
>> GAlignmentPairs with 0 alignment pairs and 0
>> metadata columns:
>> seqnames strand : ranges -- ranges
>> <rle> <rle> : <iranges> -- <iranges>
>>
>> readGAlignmentsList() finds two mate pairs and
>> several non-mates
>> in this region. The 'mates' metadata column
>> indicates which are
>> legitimate mates as per the algorithem
described at
>> ?readGAlignmentsListFromBam. More control over
what
>> records are
>> returned can be done by specifying flags in
>> ScanBamParam.
>>
>> galist <-
readGAlignmentsList(BamFile(__**__fl,
>>
>> asMates=TRUE),
>> param=param)
>> > galist
>> GAlignmentsList of length 17:
>> [[1]]
>> GAlignments with 2 alignments and 2 metadata
columns:
>> seqnames strand cigar qwidth start end
width
>> ngap | mates
>> flag
>> [1] seq1 - 35M 35 229 263 35
>> <tel:35%20229%20263%2035>
>> <tel:35%20%20%20229%20263%20%_**_20%20%2035>
0 |
>>
>> 1 83
>> [2] seq1 + 35M 35 37 71
35
>> 0 | 1
>> 163
>>
>> [[2]]
>> GAlignments with 2 alignments and 2 metadata
columns:
>> seqnames strand cigar qwidth start end
width
>> ngap | mates
>> flag
>> [1] seq1 - 35M 35 185 219 35
>> <tel:35%20185%20219%2035>
>> <tel:35%20%20%20185%20219%20%_**_20%20%2035>
0 |
>>
>> 1 147
>> [2] seq1 + 35M 35 36 70
35
>> 0 | 1
>> 99
>>
>> [[3]]
>> GAlignments with 1 alignment and 2 metadata
columns:
>> seqnames strand cigar qwidth start end
width
>> ngap | mates
>> flag
>> [1] seq1 + 36M 36 6 41
36
>> 0 | 0
>> 137
>>
>>
>> > elementLengths(galist)
>> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>
>> The first two list elements contain mates.
>>
>> sapply(galist, function(x)
>> sum((mcols(x)$mates)) > 0)
>>
>> [1] TRUE TRUE FALSE FALSE FALSE FALSE
FALSE
>> FALSE FALSE
>> FALSE FALSE FALSE
>> [13] FALSE FALSE FALSE FALSE FALSE
>>
>>
>> Here are descriptions of the first few flags
as per
>> this site:
>> http://picard.sourceforge.net/**____explain-
flags.html<http: picard.sourceforge.net="" ____explain-flags.html="">
>> <http: picard.sourceforge.**net="" __explain-="" flags.html<http:="" picard.sourceforge.net="" __explain-flags.html="">
>> >
>> <http: picard.sourceforge.__**="">> net/explain-flags.html
>>
>> <http: picard.sourceforge.**net="" explain-="" flags.html<http:="" picard.sourceforge.net="" explain-flags.html="">
>> >>
>> Flag 137 indicates an unmapped mate which is
why
>> this record was
>> not mated.
>>
>>
>> 83:
>> Summary:
>> read paired
>> read mapped in proper pair
>> read reverse strand
>> first in pair
>>
>> 163:
>> Summary:
>> read paired
>> read mapped in proper pair
>> mate reverse strand
>> second in pair
>>
>> 147:
>> Summary:
>> read paired
>> read mapped in proper pair
>> read reverse strand
>> second in pair
>>
>> 99:
>> Summary:
>> read paired
>> read mapped in proper pair
>> mate reverse strand
>> first in pair
>>
>> 137:
>> Summary:
>> read paired
>> mate unmapped
>> second in pair
>>
>>
>> Valerie
>>
>>
>>
>> On 09/17/2013 01:55 PM, Michael Lawrence
wrote:
>>
>> Yes, I knew there weren't any easy
solutions. I
>> just think
>> that somehow we
>> need to point these out to users, because
it's
>> not exactly
>> obvious.
>>
>>
>> On Tue, Sep 17, 2013 at 11:52 AM, Hervé
Pagès
>> <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">
>> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>>
wrote:
>>
>> Hi guys,
>>
>> Sorry for the late answer.
>>
>>
>> On 09/11/2013 11:51 AM, Michael
Lawrence
>> wrote:
>>
>> There seem to be at least two
"gotchas"
>> with the
>> 'which' argument and
>> readGAlignmentPairs: (1) as
Leonard
>> said, there's no
>> easy way to handle
>> cases where one record overlaps
>> multiple which
>> arguments
>>
>>
>> This is not just with
readGAlignmentPairs,
>> it's also with
>> readGAlignments and, at a lower level,
with
>> scanBam(),
>> which
>> the 2 formers are based on. If a
record
>> overlaps 2
>> ranges in the
>> 'which' arg, scanBam() loads it twice.
>>
>> I can't think of an easy way to
address
>> this unless we
>> change the
>> semantic of the 'which' argument. For
>> example instead of
>> loading
>> records that overlap we could make the
>> choice to only
>> load records
>> that have their POS (i.e. start)
within the
>> ranges
>> specified thru
>> 'which'. It introduces a dissymmetry
>> (because it looks
>> at the start
>> of the record and not at its end) but
it
>> has the
>> advantage of making
>> sure records are only loaded once
(unless
>> the user
>> passes a 'which'
>> argument that contains overlapping
ranges
>> but then
>> s/he's really
>> looking for it ;-) ).
>>
>>
>> and (2) any read
>>
>> pairs with one end inside which
and the
>> other out
>> will be discarded.
>>
>>
>> No easy work around for that one with
the
>> current
>> implementation of
>> readGAlignmentPairs which just passes
its
>> 'which'
>> argument down to
>> scanBam().
>>
>>
>> Excluding per-chromosome iteration,
>> 'which' is a bit
>> dangerous when
>>
>> dealing
>> with read pairs.
>>
>>
>> It's also dangerous when dealing with
>> single end reads,
>> and it has
>> been for a while.
>>
>> Cheers,
>> H.
>>
>>
>> Somehow we should make this obvious
to
>> the user.
>>
>>
>>
>> On Wed, Sep 11, 2013 at 11:02 AM,
>> Leonard Goldstein <
>> goldstein.leonard@gene.com <mailto:goldstein.leonard@**>> gene.com <goldstein.leonard@gene.com>>
>>
<mailto:goldstein.leonard@__ge**ne.com<http: gene.com="">
>>
>>
<mailto:goldstein.leonard@**gene.com<goldstein.leonard@gene.com>>>>
>> wrote:
>>
>> Dear Herve and list,
>>
>>
>>
>> I realized the error might
have
>> been caused due
>> to the same bam record
>> being read in multiple times
when
>> passing
>> multiple ranges in the which
>> argument.
>>
>>
>> I don't think it is
necessarily
>> obvious to users
>> that
>> readGAlignments/Pairs
>> returns a concatenation of
records
>> that were
>> read in for each range in
>> the
>> which argument. Instead one
might
>> expect to
>> obtain the union of records
>> that overlap any of the ranges
>> included in
>> which. Do you think it would
>> be
>> helpful to issue a warning in
case
>> records are
>> read in multiple times?
>>
>>
>> Best wishes,
>>
>>
>> Leonard
>>
>>
>>
>> On Tue, Sep 10, 2013 at 5:58
PM,
>> Leonard
>> Goldstein <goldstel@gene.com>> <mailto:goldstel@gene.com>
>> <mailto:goldstel@gene.com>>
>> <mailto:goldstel@gene.com>>
>>
>> wrote:
>>
>>
>> Hi Herve,
>>
>>
>>
>> I ran into some issues
when using
>> readGAlignmentPairs with
the
>> ScanBamParam 'which'
argument.
>> I included an
>> example below. I can see
>> how
>> changing 'which' can
result in
>> different
>> pairings but was wondering
>>
>> whether
>>
>> the error can be avoided
and
>> problematic
>> pairings dropped instead.
>>
>>
>> It looks like the issue is
>> introduced in
>> revision 77808 (no error
in
>> 77791). I was hoping you
have
>> an idea what
>> causes the problem,
otherwise
>>
>> I
>>
>> can try to forward a test
case.
>>
>>
>> Thanks for your help.
>>
>>
>> Leonard
>>
>>
>> gr
>>
>>
>> GRanges with 2 ranges and
0
>> metadata columns:
>> seqnames
>> ranges strand
>> <rle>
>> <iranges> <rle>
>> [1] 10
[75758072,
>> 75758133] *
>> [2] 10
[75802841,
>> 75802911] *
>> ---
>> seqlengths:
>> 10
>> NA
>>
>>
>>
readGAlignmentPairs(file,
>> param =
>> ScanBamParam(which =
gr[1]))
>>
>> GAlignmentPairs with 0
>> alignment pairs and 0
>> metadata columns:
>> seqnames strand :
ranges
>> -- ranges
>> <rle> <rle> :
<iranges>
>> -- <iranges>
>> ---
>> seqlengths:
>> 1 2
>> 3 ...
>> GL000247.1 GL000248.1
GL000249.1
>> 249250621 243199373
>> 198022430 ...
>> 36422 39786
38502
>>
>>
>>
readGAlignmentPairs(file,
>> param =
>> ScanBamParam(which =
gr[2]))
>>
>> GAlignmentPairs with 3
>> alignment pairs and 0
>> metadata columns:
>> seqnames strand :
>> ranges --
ranges
>> <rle> <rle> :
>> <iranges> --
>> <iranges>
>> [1] 10 + :
>> [75758115, 75802896]
>> -- [75802892, 75830482]
>> [2] 10 - :
>> [75802908, 75830498]
>> -- [75758111, 75802892]
>> [3] 10 - :
>> [75802908, 75830498]
>> -- [75802878, 75830468]
>> ---
>> seqlengths:
>> 1 2
>> 3 ...
>> GL000247.1 GL000248.1
GL000249.1
>> 249250621 243199373
>> 198022430 ...
>> 36422 39786
38502
>>
>>
>>
readGAlignmentPairs(file,
>> param =
>> ScanBamParam(which =
gr))
>>
>> Error in
>> makeGAlignmentPairs(galn, use.names
>> = use.names, use.mcols =
>> use.mcols) :
>> findMateAlignment()
>> returned an invalid
>> 'mate' vector
>> In addition: Warning
message:
>> In findMateAlignment(x) :
>> 4 alignments with
>> ambiguous pairing
>> were dumped.
>> Use
>> 'getDumpedAlignments()' to
>> retrieve them from the
dump
>>
>> environment.
>>
>>
>> sessionInfo()
>>
>> R version 3.0.0
(2013-04-03)
>> Platform:
>> x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1]
LC_CTYPE=en_US.UTF-8
>> LC_NUMERIC=C
>> [3]
LC_TIME=en_US.UTF-8
>> LC_COLLATE=en_US.UTF-8
>> [5]
LC_MONETARY=en_US.UTF-8
>> LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C
>> LC_NAME=C
>> [9] LC_ADDRESS=C
>> LC_TELEPHONE=C
>> [11]
LC_MEASUREMENT=en_US.UTF-8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats
>> graphics grDevices
>> utils datasets
methods
>> [8] base
>>
>> other attached packages:
>> [1] Rsamtools_1.13.39
>> Biostrings_2.29.16
>> GenomicRanges_1.13.41
>> [4] XVector_0.1.1
>> IRanges_1.19.31
>> BiocGenerics_0.7.4
>>
>> loaded via a namespace
(and not
>> attached):
>> [1] bitops_1.0-6
stats4_3.0.0
>> zlibbioc_1.7.0
>>
>>
>>
>>
>>
>> [[alternative HTML
>> version deleted]]
>>
>>
>> ______________________________**____**_________________
>>
>> Bioconductor mailing list
>> Bioconductor@r-project.org <mailto:bioconductor@r-**>> project.org <bioconductor@r-project.org>>
>>
>>
<mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org>
>>
<mailto:bioconductor@r-**project.org<bioconductor@r-project.org>
>> >>
>>
>> https://stat.ethz.ch/mailman/***____*listinfo/bioconduc
tor<https: stat.ethz.ch="" mailman="" *____*listinfo="" bioconductor="">
>> <https: stat.ethz.ch="" mailman="" ***__*listinfo="" bioconduct="" or<https:="" stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">
>> >
>>
>> <https: stat.ethz.ch="" mailman="" **__**listinfo="" bioconduct="" or<https:="" stat.ethz.ch="" mailman="" __**listinfo="" bioconductor="">
>> <https: stat.ethz.ch="" mailman="" ****listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">
>> >><__**https:/__/stat.ethz.ch/__**mailman/__listinfo/__**bioconduct
or<http: stat.ethz.ch="" __mailman="" __listinfo="" __bioconductor="">
>> <http: stat.ethz.ch="" mailman="" _**_listinfo="" bioconductor<="" http:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>> >
>>
>>
>>
>> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<h="" ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>> >>>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.****____science.biology.**
>> informatics.__**__conductor<http: news.gmane.org="" gmane.**____scien="" ce.biology.informatics.__**__conductor="">
>> <http: news.gmane.org="" gmane.****__science.biology.**="">> informatics.**__conductor<http: news.gmane.org="" gmane.**__science.b="" iology.informatics.**__conductor="">
>> >
>>
>> <http: news.gmane.org="" gmane.***__*science.biology.**="">> informatics.*__*conductor<http: news.gmane.org="" gmane.*__*science.b="" iology.informatics.*__*conductor="">
>> <http: news.gmane.org="" gmane.***="">> *science.biology.informatics.****conductor<http: news.gmane.org="" gm="" ane.**science.biology.informatics.**conductor="">
>> >><http: news.**gmane.="" <http:="" news.gmane.="">____org/gmane.science.**
>> biology.____informatics.**conductor
>>
>>
>> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor="">
>>
>>
<http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor="">
>> >>>
>>
>>
>> [[alternative HTML
version
>> deleted]]
>>
>>
>> ______________________________**____**_________________
>>
>> Bioconductor mailing list
>> Bioconductor@r-project.org <mailto:bioconductor@r-**>> project.org <bioconductor@r-project.org>>
>>
<mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org>
>>
<mailto:bioconductor@r-**project.org<bioconductor@r-project.org>
>> >>
>>
>> https://stat.ethz.ch/mailman/***____*listinfo/bioconduc
tor<https: stat.ethz.ch="" mailman="" *____*listinfo="" bioconductor="">
>> <https: stat.ethz.ch="" mailman="" ***__*listinfo="" bioconduct="" or<https:="" stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">
>> >
>>
>> <https: stat.ethz.ch="" mailman="" **__**listinfo="" bioconduct="" or<https:="" stat.ethz.ch="" mailman="" __**listinfo="" bioconductor="">
>> <https: stat.ethz.ch="" mailman="" ****listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">
>> >><__**https:/__/stat.ethz.ch/__**mailman/__listinfo/__**bioconduct
or<http: stat.ethz.ch="" __mailman="" __listinfo="" __bioconductor="">
>> <http: stat.ethz.ch="" mailman="" _**_listinfo="" bioconductor<="" http:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>> >
>>
>>
>>
>> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<h="" ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>> >>>
>> Search the archives:
>> http://news.gmane.org/gmane.**
>>
>>
science.biology.informatics.****____conductor<http: news.**="">> gmane. <http: news.gmane.="">____org/gmane.science.**
>> biology.____informatics.**conductor
>>
>>
>>
>> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor="">
>>
<http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor="">
>> >>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages@fhcrc.org
>> <mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org>>
>> <mailto:hpages@fhcrc.org>>
>> Phone: (206) 667-5791
>> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
>> Fax: (206) 667-1319
>> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
>>
>>
>> [[alternative HTML version
deleted]]
>>
>>
>>
>> ______________________________**
>> _____________________
>>
>> Bioconductor mailing list
>> Bioconductor@r-project.org <mailto:bioconductor@r-**>> project.org <bioconductor@r-project.org>>
>>
<mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org>
>>
<mailto:bioconductor@r-**project.org<bioconductor@r-project.org>
>> >>
>> https://stat.ethz.ch/mailman/_**___listinfo/bioconducto
r<https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor="">
>> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>> >
>>
>>
>> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<h="" ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>> >>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.__**
>> __science.biology.informatics.**____conductor<http: news.gmane.org="" gmane.____science.biology.informatics.____conductor="">
>> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor="">
>> >
>>
>>
>> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor="">
>>
<http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor="">
>> >>
>>
>>
>>
>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org>
>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
[[alternative HTML version deleted]]
Yes yieldSize is the way to go for streaming of course (that's what it
has been designed for). But for simple parallelization schemes, it's
nice to have a simple way to partition your BAM file *ahead* of time,
which neither yieldSize or 'which' provide.
H.
On 09/18/2013 09:22 AM, Michael Lawrence wrote:
> For processing an entire BAM, I think the yieldSize streaming is the
way
> to go. Each read is read only once, and there is now the option of
> guaranteed pairs.
>
>
>
>
> On Tue, Sep 17, 2013 at 10:43 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote:
>
> Just to clarify, I was only suggesting the 'tiles' arg as a
simple way
> to process a chromosome by chunks/tiles, which seems to be a
common use
> case. 'tiles' is also trivial to split over multiple cpus.
'which' is
> harder, because of the risk to process twice the same read.
Using a hash
> to work around the problem is probably very hard in the
multiple-cpu
> context.
>
> H.
>
>
>
> On 09/17/2013 09:10 PM, Michael Lawrence wrote:
>
> So here's a use case: get all the reads for a set of
putative
> transcribed regions, numbering in the thousands. You don't
want
> just the
> reads that start in the regions, and you want to process the
> reads in
> vectorized fashion, ideally with a single query to the BAM,
> returning a
> GAlignmentPairs.
>
> There are at least two ways to do this:
> a) Have readGAlignmentPairs and friends return non-redundant
> alignments,
> even with multiple which,
> b) Query the BAM separately (like in an lapply) for each
which, then
> combine the results, and keep around a partitioning that
records the
> original 'which' for each alignment. This would be slower
and
> require
> more work of the user, but it would work right now.
>
> Hopefully others have additional ideas.
>
> Michael
>
>
>
>
>
>
>
> On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>
wrote:
>
> On 09/17/2013 04:39 PM, Hervé Pagès wrote:
>
> On 09/17/2013 04:15 PM, Michael Lawrence wrote:
>
> Leonard was thinking that the problem of
importing
> the same
> alignment
> multiple times when caught by multiple which
ranges
> could be
> solved by
> maintaining a hash of what reads have been seen
> before. This
> could be
> done with the read ID + pos + strand + cigar,
we think,
>
>
> FWIW, I've seen BAM files with more than 1 record
> sharing the same
> QNAME + POS + strand + CIGAR.
>
> The 'Ambiguous pairing' section in the man page for
> findMateAlignment()
> shows such an example. It was taken from a real BAM
> file. IIRC the 2
> records were actually indistinguible.
>
> but even easier
> would be to get at the file offset. It looks
like
> samtools
> keeps that
> hidden, but we could just do some copy-and-
pasting
> of the
> bam_iter_t
> struct...
>
>
> IIUC the common use case is processing a BAM file
by
> chromosome
> chunks.
> Are you saying the record ID would be returned by
> scanBam or
> readGAlignemnts() so the end/user has a way to
discard
> records that
> have already been processed.
>
> Note that with a small change of semantic to the
'which'
> argument, you
> don't need to return a record ID and the end-user
needs
> to do
> nothing:
> the same record cannot be loaded twice.
>
>
> To be even clearer about this: with its current
semantic,
> 'which'
> doesn't feel like the right thing to use to process a
> chromosome "by
> tiles". Instead of trying to fix it (by changing its
> semantic or by
> maintaining a hash of what reads have been seen
before),
> maybe it
> would make more sense to add a 'tiles' (or 'tiling')
> argument to
> ScanBamParam() that would also take a GRanges object.
Then
> scanBam()
> would load the records that have their POS in the tiles
and
> scanBam(asMates=TRUE) would load the records that have
> their POS
> in the tiles and are *first* mate and it would try to
pair them
> (beyond 'tiles' if needed).
>
> It would be enough to enforce 'tiles' to have
> non-overlapping ranges
> but it would not be necessary to require it to be a
"real"
> tiling (i.e.
> a set of adjacent ranges that cover the entire
chromosome
> or genome).
>
> In addition, deciding whether a record belongs to
'tiles'
> would be
> cheaper (and faster) than deciding whether it overlaps
with
> 'which' because there is no more need to compute the
end of the
> record based on its CIGAR.
>
>
> H.
>
>
> H.
>
>
>
>
>
> On Tue, Sep 17, 2013 at 3:29 PM, Michael
Lawrence
> <michafla at="" gene.com="" <mailto:michafla="" at="" gene.com="">
> <mailto:michafla at="" gene.com="" <mailto:michafla="" at="" gene.com="">>
> <mailto:michafla at="" gene.com=""> <mailto:michafla at="" gene.com=""> <mailto:michafla at="" gene.com=""> <mailto:michafla at="" gene.com="">>>> wrote:
>
> I'm a big fan of the new paired-end
streaming.
> It's the
> way to go
> over which-based partitioning. Nice job!
>
>
> On Tue, Sep 17, 2013 at 2:46 PM, Valerie
Obenchain
> <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">>
> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">>>> wrote:
>
> The new pairing algorithm used by
> readGAlignmentsList() does
> find mates outside the specified
range.
> When only
> one of the
> pairs overlaps the 'which', it looks
for a
> potential mate in the
> rest of the file.
>
> fl <- system.file("extdata",
"ex1.bam",
> package="Rsamtools")
> param <- ScanBamParam(what="flag",
>
> which=GRanges("seq1", IRanges(1,
> width=40)))
>
> readGAlignmentPairs() finds no mates
in
> this region.
>
> gapairs <-
> readGAlignmentPairs(BamFile(______fl),
>
> param=param)
> > gapairs
>
> GAlignmentPairs with 0 alignment pairs
and 0
> metadata columns:
> seqnames strand : ranges --
ranges
> <rle> <rle> : <iranges> --
<iranges>
>
> readGAlignmentsList() finds two mate
pairs and
> several non-mates
> in this region. The 'mates' metadata
column
> indicates which are
> legitimate mates as per the algorithem
> described at
> ?readGAlignmentsListFromBam. More
control
> over what
> records are
> returned can be done by specifying
flags in
> ScanBamParam.
>
> galist <-
> readGAlignmentsList(BamFile(______fl,
>
> asMates=TRUE),
> param=param)
> > galist
> GAlignmentsList of length 17:
> [[1]]
> GAlignments with 2 alignments and 2
> metadata columns:
> seqnames strand cigar qwidth
start
> end width
> ngap | mates
> flag
> [1] seq1 - 35M 35 229
263 35
> <tel:35%20229%20263%2035>
> <tel:35%20229%20263%2035>
>
> <tel:35%20%20%20229%20263%20%____20%20%2035> 0 |
>
> 1 83
> [2] seq1 + 35M 35
37
> 71 35
> 0 | 1
> 163
>
> [[2]]
> GAlignments with 2 alignments and 2
> metadata columns:
> seqnames strand cigar qwidth
start
> end width
> ngap | mates
> flag
> [1] seq1 - 35M 35 185
219 35
> <tel:35%20185%20219%2035>
> <tel:35%20185%20219%2035>
>
> <tel:35%20%20%20185%20219%20%____20%20%2035> 0 |
>
> 1 147
> [2] seq1 + 35M 35
36
> 70 35
> 0 | 1
> 99
>
> [[3]]
> GAlignments with 1 alignment and 2
> metadata columns:
> seqnames strand cigar qwidth
start
> end width
> ngap | mates
> flag
> [1] seq1 + 36M 36
6
> 41 36
> 0 | 0
> 137
>
>
> > elementLengths(galist)
> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1
>
> The first two list elements contain
mates.
>
> sapply(galist, function(x)
> sum((mcols(x)$mates)) > 0)
>
> [1] TRUE TRUE FALSE FALSE FALSE
FALSE
> FALSE
> FALSE FALSE
> FALSE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE
>
>
> Here are descriptions of the first few
> flags as per
> this site:
> http://picard.sourceforge.net/______explain-flags.html
> <http: picard.sourceforge.net="" ____explain-flags.html="">
>
> <http: picard.sourceforge.__net="" __explain-flags.html=""> <http: picard.sourceforge.net="" __explain-flags.html="">>
>
> <http: picard.sourceforge.____net="" explain-flags.html="">
> <http: picard.sourceforge.__net="" explain-="" flags.html=""> <http: picard.sourceforge.net="" explain-flags.html="">>>
> Flag 137 indicates an unmapped mate
which
> is why
> this record was
> not mated.
>
>
> 83:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> first in pair
>
> 163:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> second in pair
>
> 147:
> Summary:
> read paired
> read mapped in proper pair
> read reverse strand
> second in pair
>
> 99:
> Summary:
> read paired
> read mapped in proper pair
> mate reverse strand
> first in pair
>
> 137:
> Summary:
> read paired
> mate unmapped
> second in pair
>
>
> Valerie
>
>
>
> On 09/17/2013 01:55 PM, Michael
Lawrence
> wrote:
>
> Yes, I knew there weren't any easy
> solutions. I
> just think
> that somehow we
> need to point these out to users,
> because it's
> not exactly
> obvious.
>
>
> On Tue, Sep 17, 2013 at 11:52 AM,
> Hervé Pagès
> <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">>
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>>
wrote:
>
> Hi guys,
>
> Sorry for the late answer.
>
>
> On 09/11/2013 11:51 AM,
Michael
> Lawrence wrote:
>
> There seem to be at least
two
> "gotchas"
> with the
> 'which' argument and
> readGAlignmentPairs: (1)
as
> Leonard
> said, there's no
> easy way to handle
> cases where one record
overlaps
> multiple which
> arguments
>
>
> This is not just with
> readGAlignmentPairs,
> it's also with
> readGAlignments and, at a
lower
> level, with
> scanBam(),
> which
> the 2 formers are based on. If
a
> record
> overlaps 2
> ranges in the
> 'which' arg, scanBam() loads
it twice.
>
> I can't think of an easy way
to
> address
> this unless we
> change the
> semantic of the 'which'
argument. For
> example instead of
> loading
> records that overlap we could
make the
> choice to only
> load records
> that have their POS (i.e.
start)
> within the
> ranges
> specified thru
> 'which'. It introduces a
dissymmetry
> (because it looks
> at the start
> of the record and not at its
end)
> but it
> has the
> advantage of making
> sure records are only loaded
once
> (unless
> the user
> passes a 'which'
> argument that contains
overlapping
> ranges
> but then
> s/he's really
> looking for it ;-) ).
>
>
> and (2) any read
>
> pairs with one end inside
> which and the
> other out
> will be discarded.
>
>
> No easy work around for that
one
> with the
> current
> implementation of
> readGAlignmentPairs which just
> passes its
> 'which'
> argument down to
> scanBam().
>
>
> Excluding per-chromosome
iteration,
> 'which' is a bit
> dangerous when
>
> dealing
> with read pairs.
>
>
> It's also dangerous when
dealing with
> single end reads,
> and it has
> been for a while.
>
> Cheers,
> H.
>
>
> Somehow we should make this
> obvious to
> the user.
>
>
>
> On Wed, Sep 11, 2013 at
11:02 AM,
> Leonard Goldstein <
> goldstein.leonard at gene.com <mailto:goldstein.leonard at="" gene.com="">
> <mailto:goldstein.leonard at="" __gene.com=""> <mailto:goldstein.leonard at="" gene.com="">>
> <mailto:goldstein.leonard@> <mailto:goldstein.leonard@>__ge__ne.com <http: gene.com="">
>
> <mailto:goldstein.leonard at="" __gene.com=""> <mailto:goldstein.leonard at="" gene.com="">>>> wrote:
>
> Dear Herve and list,
>
>
>
> I realized the error
might
> have
> been caused due
> to the same bam record
> being read in multiple
> times when
> passing
> multiple ranges in the
which
> argument.
>
>
> I don't think it is
> necessarily
> obvious to users
> that
> readGAlignments/Pairs
> returns a
concatenation of
> records
> that were
> read in for each range
in
> the
> which argument.
Instead
> one might
> expect to
> obtain the union of
records
> that overlap any of
the ranges
> included in
> which. Do you think it
would
> be
> helpful to issue a
warning
> in case
> records are
> read in multiple
times?
>
>
> Best wishes,
>
>
> Leonard
>
>
>
> On Tue, Sep 10, 2013
at
> 5:58 PM,
> Leonard
> Goldstein
> <goldstel at="" gene.com="" <mailto:goldstel="" at="" gene.com="">
> <mailto:goldstel at="" gene.com="" <mailto:goldstel="" at="" gene.com="">>
> <mailto:goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com="">
>
> <mailto:goldstel at="" gene.com="" <mailto:goldstel="" at="" gene.com="">>>
>
> wrote:
>
>
> Hi Herve,
>
>
>
> I ran into some
issues
> when using
>
readGAlignmentPairs
> with the
> ScanBamParam
'which'
> argument.
> I included an
> example below. I
can see
> how
> changing 'which'
can
> result in
> different
> pairings but was
wondering
>
> whether
>
> the error can be
> avoided and
> problematic
> pairings dropped
instead.
>
>
> It looks like the
issue is
> introduced in
> revision 77808 (no
> error in
> 77791). I was
hoping
> you have
> an idea what
> causes the
problem,
> otherwise
>
> I
>
> can try to forward
a
> test case.
>
>
> Thanks for your
help.
>
>
> Leonard
>
>
> gr
>
>
> GRanges with 2
ranges
> and 0
> metadata columns:
> seqnames
> ranges strand
> <rle>
> <iranges> <rle>
> [1] 10
> [75758072,
> 75758133] *
> [2] 10
> [75802841,
> 75802911] *
> ---
> seqlengths:
> 10
> NA
>
>
>
> readGAlignmentPairs(file,
> param =
>
ScanBamParam(which
> = gr[1]))
>
> GAlignmentPairs
with 0
> alignment pairs and 0
> metadata columns:
> seqnames
strand :
> ranges
> -- ranges
> <rle>
<rle> :
> <iranges>
> -- <iranges>
> ---
> seqlengths:
> 1
2
> 3 ...
> GL000247.1
GL000248.1
> GL000249.1
> 249250621
243199373
> 198022430 ...
> 36422 39786
> 38502
>
>
>
> readGAlignmentPairs(file,
> param =
>
ScanBamParam(which
> = gr[2]))
>
> GAlignmentPairs
with 3
> alignment pairs and 0
> metadata columns:
> seqnames
strand :
> ranges --
> ranges
> <rle>
<rle> :
> <iranges> --
> <iranges>
> [1] 10
+ :
> [75758115, 75802896]
> -- [75802892,
75830482]
> [2] 10
- :
> [75802908, 75830498]
> -- [75758111,
75802892]
> [3] 10
- :
> [75802908, 75830498]
> -- [75802878,
75830468]
> ---
> seqlengths:
> 1
2
> 3 ...
> GL000247.1
GL000248.1
> GL000249.1
> 249250621
243199373
> 198022430 ...
> 36422 39786
> 38502
>
>
>
> readGAlignmentPairs(file,
> param =
>
ScanBamParam(which
> = gr))
>
> Error in
> makeGAlignmentPairs(galn, use.names
> = use.names,
use.mcols =
> use.mcols) :
>
findMateAlignment()
> returned an invalid
> 'mate' vector
> In addition:
Warning
> message:
> In
findMateAlignment(x) :
> 4
alignments with
> ambiguous pairing
> were dumped.
> Use
> 'getDumpedAlignments()' to
> retrieve them from
the
> dump
>
> environment.
>
>
> sessionInfo()
>
> R version 3.0.0
> (2013-04-03)
> Platform:
> x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1]
> LC_CTYPE=en_US.UTF-8
> LC_NUMERIC=C
> [3]
> LC_TIME=en_US.UTF-8
>
LC_COLLATE=en_US.UTF-8
> [5]
> LC_MONETARY=en_US.UTF-8
>
LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C
> LC_NAME=C
> [9]
LC_ADDRESS=C
> LC_TELEPHONE=C
> [11]
> LC_MEASUREMENT=en_US.UTF-8
>
LC_IDENTIFICATION=C
>
> attached base
packages:
> [1] parallel
stats
> graphics grDevices
> utils datasets
> methods
> [8] base
>
> other attached
packages:
> [1]
Rsamtools_1.13.39
> Biostrings_2.29.16
>
GenomicRanges_1.13.41
> [4] XVector_0.1.1
> IRanges_1.19.31
>
BiocGenerics_0.7.4
>
> loaded via a
namespace
> (and not
> attached):
> [1] bitops_1.0-6
> stats4_3.0.0
> zlibbioc_1.7.0
>
>
>
>
>
>
[[alternative HTML
> version deleted]]
>
>
>
____________________________________**_________________
>
> Bioconductor mailing
list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>
>
> <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org="">
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>>
>
> https://stat.ethz.ch/mailman/*______*listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" *____*listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __*__*listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">>
>
>
> <https: stat.ethz.ch="" mailman="" ____**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">>><___
_https:/__/stat.ethz.ch/____mailman/__listinfo/____bioconductor
> <http: stat.ethz.ch="" __mailman="" __listinfo="" __bioconductor="">
>
> <http: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <http: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">>
>
>
>
>
> <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>>>
> Search the archives:
>
> http://news.gmane.org/gmane.**______science.biology.__inform
atics.__**__conductor
> <http: news.gmane.org="" gmane.**____science.biology.informati="" cs.__**__conductor="">
>
> <http: news.gmane.org="" gmane.*__*__science.biology.__informa="" tics.**__conductor=""> <http: news.gmane.org="" gmane.**__science.biology.informatics="" .**__conductor="">>
>
>
> <http: news.gmane.org="" gmane.*____*science.biology.__informa="" tics.*__*conductor=""> <http: news.gmane.org="" gmane.*__*science.biology.informatics="" .*__*conductor="">
>
> <http: news.gmane.org="" gmane.*__*science.biology.informatics="" .*__*conductor=""> <http: news.gmane.org="" gmane.**science.biology.informatics.*="" *conductor="">>><http: news.__gmane.=""> <http: news.gmane.="">____org/gmane.science.__biology.____info
rmatics.__conductor
>
>
>
> <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor="">
>
>
>
<http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>>>
>
>
> [[alternative
HTML
> version
> deleted]]
>
>
>
____________________________________**_________________
>
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>
>
> <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org="">
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>>
>
> https://stat.ethz.ch/mailman/*______*listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" *____*listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __*__*listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">>
>
>
> <https: stat.ethz.ch="" mailman="" ____**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">>><___
_https:/__/stat.ethz.ch/____mailman/__listinfo/____bioconductor
> <http: stat.ethz.ch="" __mailman="" __listinfo="" __bioconductor="">
>
> <http: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <http: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">>
>
>
>
>
> <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>>>
> Search the archives:
> http://news.gmane.org/gmane.**
>
>
>
science.biology.informatics.**______conductor<http: news.__gmane.=""> <http: news.gmane.="">____org/gmane.science.__biology.____info
rmatics.__conductor
>
>
>
>
> <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor="">
>
>
<http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>>>
>
>
> --
> Hervé Pagès
>
> Program in Computational
Biology
> Division of Public Health
Sciences
> Fred Hutchinson Cancer
Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> <mailto:hpages at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">
>
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>
> Phone: (206) 667-5791
> <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
<tel:%28206%29%20667-5791>
> Fax: (206) 667-1319
> <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
<tel:%28206%29%20667-1319>
>
>
> [[alternative HTML
version
> deleted]]
>
>
>
>
> _____________________________________________________
>
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org="">
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>
> <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org="">
> <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>>
> https://stat.ethz.ch/mailman/______listinfo/bioconductor
> <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">>
>
>
>
> <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">
>
> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>>
> Search the archives:
>
> http://news.gmane.org/gmane.______science.biology.informatic
s.______conductor
> <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor="">
>
> <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor="">>
>
>
>
> <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor="">
>
>
<http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">
<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>>
>
>
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org="">
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Hi Leonard,
Good timing. I was actually going to get back to you today about this
because yesterday I fixed a bug in findMateAlignment() that I think
was
causing the problem you originally reported on the mailing list. It's
also probably what is causing this new problem you are reporting
below.
Totally agree with you that even if using a 'which' argument can alter
the pairing, in any case it should produce an error. More generally
the
intended behavior of readGAlignmentPairs() is to *drop* records that
don't have a mate and to *dump* ambiguous pairings. Not to choke on
them. The difference between "drop" and "dump" here is that what's
dumped can be examined latter with getDumpedAlignments(). See
?getDumpedAlignments for the details.
Please let me know if you still run into those issues with the latest
Rsamtools (1.13.45). Works fine for me on your test.bam file (thanks
for providing the file).
Cheers,
H.
On 09/27/2013 04:23 PM, Leonard Goldstein wrote:
> Hi Herv?
>
> Thanks for your and Valerie's comments on using 'which' with
> readGAlignmentPairs.
>
> I actually encountered the same error again without using 'which'
(see
> below). I attached test.bam, which narrows down the problematic
> alignment(s) to around ~600. If readGAlignmentPairs could drop any
> unusual alignments rather than generating an error that would be
really
> helpful.
>
> Many thanks for your help.
>
> Leonard
>
>
> > readGAlignmentPairs(file = "test.bam")
> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols
=
> use.mcols) :
> findMateAlignment() returned an invalid 'mate' vector
> In addition: Warning message:
> In findMateAlignment(x) :
> 2 alignments with ambiguous pairing were dumped.
> Use 'getDumpedAlignments()' to retrieve them from the dump
environment.
> >
> > sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> [8] base
>
> other attached packages:
> [1] Rsamtools_1.13.44 Biostrings_2.29.19
GenomicRanges_1.13.45
> [4] XVector_0.1.4 IRanges_1.19.38 BiocGenerics_0.7.5
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
> >
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Hi Hervé,
Yes the latest version of Rsamtools works fine for both the original
and
the new problem.
Many thanks for fixing this bug, and the quick reply.
Best wishes
Leonard
On Fri, Sep 27, 2013 at 4:55 PM, Hervé Pagès <hpages@fhcrc.org> wrote:
> Hi Leonard,
>
> Good timing. I was actually going to get back to you today about
this
> because yesterday I fixed a bug in findMateAlignment() that I think
was
> causing the problem you originally reported on the mailing list.
It's
> also probably what is causing this new problem you are reporting
below.
> Totally agree with you that even if using a 'which' argument can
alter
> the pairing, in any case it should produce an error. More generally
the
> intended behavior of readGAlignmentPairs() is to *drop* records that
> don't have a mate and to *dump* ambiguous pairings. Not to choke on
> them. The difference between "drop" and "dump" here is that what's
> dumped can be examined latter with getDumpedAlignments(). See
> ?getDumpedAlignments for the details.
>
> Please let me know if you still run into those issues with the
latest
> Rsamtools (1.13.45). Works fine for me on your test.bam file (thanks
> for providing the file).
>
> Cheers,
> H.
>
>
>
> On 09/27/2013 04:23 PM, Leonard Goldstein wrote:
>
>> Hi Hervé
>>
>> Thanks for your and Valerie's comments on using 'which' with
>> readGAlignmentPairs.
>>
>> I actually encountered the same error again without using 'which'
(see
>> below). I attached test.bam, which narrows down the problematic
>> alignment(s) to around ~600. If readGAlignmentPairs could drop any
>> unusual alignments rather than generating an error that would be
really
>> helpful.
>>
>> Many thanks for your help.
>>
>> Leonard
>>
>>
>> > readGAlignmentPairs(file = "test.bam")
>> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols
=
>> use.mcols) :
>> findMateAlignment() returned an invalid 'mate' vector
>> In addition: Warning message:
>> In findMateAlignment(x) :
>> 2 alignments with ambiguous pairing were dumped.
>> Use 'getDumpedAlignments()' to retrieve them from the dump
>> environment.
>> >
>> > sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets
methods
>> [8] base
>>
>> other attached packages:
>> [1] Rsamtools_1.13.44 Biostrings_2.29.19
GenomicRanges_1.13.45
>> [4] XVector_0.1.4 IRanges_1.19.38 BiocGenerics_0.7.5
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0
>> >
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
[[alternative HTML version deleted]]