Search
Question: readGAlignmentPairs error
0
gravatar for Leonard Goldstein
5.2 years ago by
United States
Leonard Goldstein80 wrote:
Hi Herve, I ran into some issues when using readGAlignmentPairs with the ScanBamParam 'which' argument. I included an example below. I can see how changing 'which' can result in different pairings but was wondering whether the error can be avoided and problematic pairings dropped instead. It looks like the issue is introduced in revision 77808 (no error in 77791). I was hoping you have an idea what causes the problem, otherwise I can try to forward a test case. Thanks for your help. Leonard > gr GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] 10 [75758072, 75758133] * [2] 10 [75802841, 75802911] * --- seqlengths: 10 NA > > readGAlignmentPairs(file, param = ScanBamParam(which = gr[1])) GAlignmentPairs with 0 alignment pairs and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> --- seqlengths: 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 249250621 243199373 198022430 ... 36422 39786 38502 > > readGAlignmentPairs(file, param = ScanBamParam(which = gr[2])) GAlignmentPairs with 3 alignment pairs and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482] [2] 10 - : [75802908, 75830498] -- [75758111, 75802892] [3] 10 - : [75802908, 75830498] -- [75802878, 75830468] --- seqlengths: 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 249250621 243199373 198022430 ... 36422 39786 38502 > > readGAlignmentPairs(file, param = ScanBamParam(which = gr)) Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols = use.mcols) : findMateAlignment() returned an invalid 'mate' vector In addition: Warning message: In findMateAlignment(x) : 4 alignments with ambiguous pairing were dumped. Use 'getDumpedAlignments()' to retrieve them from the dump environment. > > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41 [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4 loaded via a namespace (and not attached): [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 > [[alternative HTML version deleted]]
ADD COMMENTlink modified 5.1 years ago by Hervé Pagès ♦♦ 13k • written 5.2 years ago by Leonard Goldstein80
0
gravatar for Leonard Goldstein
5.2 years ago by
United States
Leonard Goldstein80 wrote:
Dear Herve and list, I realized the error might have been caused due to the same bam record being read in multiple times when passing multiple ranges in the which argument. I don't think it is necessarily obvious to users that readGAlignments/Pairs returns a concatenation of records that were read in for each range in the which argument. Instead one might expect to obtain the union of records that overlap any of the ranges included in which. Do you think it would be helpful to issue a warning in case records are read in multiple times? Best wishes, Leonard On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel@gene.com>wrote: > Hi Herve, > > > I ran into some issues when using readGAlignmentPairs with the > ScanBamParam 'which' argument. I included an example below. I can see how > changing 'which' can result in different pairings but was wondering whether > the error can be avoided and problematic pairings dropped instead. > > > It looks like the issue is introduced in revision 77808 (no error in > 77791). I was hoping you have an idea what causes the problem, otherwise I > can try to forward a test case. > > > Thanks for your help. > > > Leonard > > > > gr > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 10 [75758072, 75758133] * > [2] 10 [75802841, 75802911] * > --- > seqlengths: > 10 > NA > > > > readGAlignmentPairs(file, param = ScanBamParam(which = gr[1])) > GAlignmentPairs with 0 alignment pairs and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > --- > seqlengths: > 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 > 249250621 243199373 198022430 ... 36422 39786 38502 > > > > readGAlignmentPairs(file, param = ScanBamParam(which = gr[2])) > GAlignmentPairs with 3 alignment pairs and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > [1] 10 + : [75758115, 75802896] -- [75802892, 75830482] > [2] 10 - : [75802908, 75830498] -- [75758111, 75802892] > [3] 10 - : [75802908, 75830498] -- [75802878, 75830468] > --- > seqlengths: > 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 > 249250621 243199373 198022430 ... 36422 39786 38502 > > > > readGAlignmentPairs(file, param = ScanBamParam(which = gr)) > Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols = > use.mcols) : > findMateAlignment() returned an invalid 'mate' vector > In addition: Warning message: > In findMateAlignment(x) : > 4 alignments with ambiguous pairing were dumped. > Use 'getDumpedAlignments()' to retrieve them from the dump environment. > > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41 > [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 > > > > [[alternative HTML version deleted]]
ADD COMMENTlink written 5.2 years ago by Leonard Goldstein80
There seem to be at least two "gotchas" with the 'which' argument and readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle cases where one record overlaps multiple which arguments and (2) any read pairs with one end inside which and the other out will be discarded. Excluding per-chromosome iteration, 'which' is a bit dangerous when dealing with read pairs. Somehow we should make this obvious to the user. On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < goldstein.leonard@gene.com> wrote: > Dear Herve and list, > > > I realized the error might have been caused due to the same bam record > being read in multiple times when passing multiple ranges in the which > argument. > > > I don't think it is necessarily obvious to users that readGAlignments/Pairs > returns a concatenation of records that were read in for each range in the > which argument. Instead one might expect to obtain the union of records > that overlap any of the ranges included in which. Do you think it would be > helpful to issue a warning in case records are read in multiple times? > > > Best wishes, > > > Leonard > > > > On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel@gene.com> >wrote: > > > Hi Herve, > > > > > > I ran into some issues when using readGAlignmentPairs with the > > ScanBamParam 'which' argument. I included an example below. I can see how > > changing 'which' can result in different pairings but was wondering > whether > > the error can be avoided and problematic pairings dropped instead. > > > > > > It looks like the issue is introduced in revision 77808 (no error in > > 77791). I was hoping you have an idea what causes the problem, otherwise > I > > can try to forward a test case. > > > > > > Thanks for your help. > > > > > > Leonard > > > > > > > gr > > GRanges with 2 ranges and 0 metadata columns: > > seqnames ranges strand > > <rle> <iranges> <rle> > > [1] 10 [75758072, 75758133] * > > [2] 10 [75802841, 75802911] * > > --- > > seqlengths: > > 10 > > NA > > > > > > readGAlignmentPairs(file, param = ScanBamParam(which = gr[1])) > > GAlignmentPairs with 0 alignment pairs and 0 metadata columns: > > seqnames strand : ranges -- ranges > > <rle> <rle> : <iranges> -- <iranges> > > --- > > seqlengths: > > 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 > > 249250621 243199373 198022430 ... 36422 39786 38502 > > > > > > readGAlignmentPairs(file, param = ScanBamParam(which = gr[2])) > > GAlignmentPairs with 3 alignment pairs and 0 metadata columns: > > seqnames strand : ranges -- ranges > > <rle> <rle> : <iranges> -- <iranges> > > [1] 10 + : [75758115, 75802896] -- [75802892, 75830482] > > [2] 10 - : [75802908, 75830498] -- [75758111, 75802892] > > [3] 10 - : [75802908, 75830498] -- [75802878, 75830468] > > --- > > seqlengths: > > 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 > > 249250621 243199373 198022430 ... 36422 39786 38502 > > > > > > readGAlignmentPairs(file, param = ScanBamParam(which = gr)) > > Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols = > > use.mcols) : > > findMateAlignment() returned an invalid 'mate' vector > > In addition: Warning message: > > In findMateAlignment(x) : > > 4 alignments with ambiguous pairing were dumped. > > Use 'getDumpedAlignments()' to retrieve them from the dump > environment. > > > > > > sessionInfo() > > R version 3.0.0 (2013-04-03) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=C LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > [8] base > > > > other attached packages: > > [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41 > > [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4 > > > > loaded via a namespace (and not attached): > > [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 > > > > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLYlink written 5.2 years ago by Michael Lawrence10k
Hi guys, Sorry for the late answer. On 09/11/2013 11:51 AM, Michael Lawrence wrote: > There seem to be at least two "gotchas" with the 'which' argument and > readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle > cases where one record overlaps multiple which arguments This is not just with readGAlignmentPairs, it's also with readGAlignments and, at a lower level, with scanBam(), which the 2 formers are based on. If a record overlaps 2 ranges in the 'which' arg, scanBam() loads it twice. I can't think of an easy way to address this unless we change the semantic of the 'which' argument. For example instead of loading records that overlap we could make the choice to only load records that have their POS (i.e. start) within the ranges specified thru 'which'. It introduces a dissymmetry (because it looks at the start of the record and not at its end) but it has the advantage of making sure records are only loaded once (unless the user passes a 'which' argument that contains overlapping ranges but then s/he's really looking for it ;-) ). > and (2) any read > pairs with one end inside which and the other out will be discarded. No easy work around for that one with the current implementation of readGAlignmentPairs which just passes its 'which' argument down to scanBam(). > Excluding per-chromosome iteration, 'which' is a bit dangerous when dealing > with read pairs. It's also dangerous when dealing with single end reads, and it has been for a while. Cheers, H. > Somehow we should make this obvious to the user. > > > On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < > goldstein.leonard at gene.com> wrote: > >> Dear Herve and list, >> >> >> I realized the error might have been caused due to the same bam record >> being read in multiple times when passing multiple ranges in the which >> argument. >> >> >> I don't think it is necessarily obvious to users that readGAlignments/Pairs >> returns a concatenation of records that were read in for each range in the >> which argument. Instead one might expect to obtain the union of records >> that overlap any of the ranges included in which. Do you think it would be >> helpful to issue a warning in case records are read in multiple times? >> >> >> Best wishes, >> >> >> Leonard >> >> >> >> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel at="" gene.com="">>> wrote: >> >>> Hi Herve, >>> >>> >>> I ran into some issues when using readGAlignmentPairs with the >>> ScanBamParam 'which' argument. I included an example below. I can see how >>> changing 'which' can result in different pairings but was wondering >> whether >>> the error can be avoided and problematic pairings dropped instead. >>> >>> >>> It looks like the issue is introduced in revision 77808 (no error in >>> 77791). I was hoping you have an idea what causes the problem, otherwise >> I >>> can try to forward a test case. >>> >>> >>> Thanks for your help. >>> >>> >>> Leonard >>> >>> >>>> gr >>> GRanges with 2 ranges and 0 metadata columns: >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> [1] 10 [75758072, 75758133] * >>> [2] 10 [75802841, 75802911] * >>> --- >>> seqlengths: >>> 10 >>> NA >>>> >>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1])) >>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >>> seqnames strand : ranges -- ranges >>> <rle> <rle> : <iranges> -- <iranges> >>> --- >>> seqlengths: >>> 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 >>> 249250621 243199373 198022430 ... 36422 39786 38502 >>>> >>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2])) >>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns: >>> seqnames strand : ranges -- ranges >>> <rle> <rle> : <iranges> -- <iranges> >>> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482] >>> [2] 10 - : [75802908, 75830498] -- [75758111, 75802892] >>> [3] 10 - : [75802908, 75830498] -- [75802878, 75830468] >>> --- >>> seqlengths: >>> 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 >>> 249250621 243199373 198022430 ... 36422 39786 38502 >>>> >>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr)) >>> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols = >>> use.mcols) : >>> findMateAlignment() returned an invalid 'mate' vector >>> In addition: Warning message: >>> In findMateAlignment(x) : >>> 4 alignments with ambiguous pairing were dumped. >>> Use 'getDumpedAlignments()' to retrieve them from the dump >> environment. >>>> >>>> sessionInfo() >>> R version 3.0.0 (2013-04-03) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41 >>> [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 >>>> >>> >>> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 5.2 years ago by Hervé Pagès ♦♦ 13k
Yes, I knew there weren't any easy solutions. I just think that somehow we need to point these out to users, because it's not exactly obvious. On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi guys, > > Sorry for the late answer. > > > On 09/11/2013 11:51 AM, Michael Lawrence wrote: > >> There seem to be at least two "gotchas" with the 'which' argument and >> readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle >> cases where one record overlaps multiple which arguments >> > > This is not just with readGAlignmentPairs, it's also with > readGAlignments and, at a lower level, with scanBam(), which > the 2 formers are based on. If a record overlaps 2 ranges in the > 'which' arg, scanBam() loads it twice. > > I can't think of an easy way to address this unless we change the > semantic of the 'which' argument. For example instead of loading > records that overlap we could make the choice to only load records > that have their POS (i.e. start) within the ranges specified thru > 'which'. It introduces a dissymmetry (because it looks at the start > of the record and not at its end) but it has the advantage of making > sure records are only loaded once (unless the user passes a 'which' > argument that contains overlapping ranges but then s/he's really > looking for it ;-) ). > > > and (2) any read >> pairs with one end inside which and the other out will be discarded. >> > > No easy work around for that one with the current implementation of > readGAlignmentPairs which just passes its 'which' argument down to > scanBam(). > > > Excluding per-chromosome iteration, 'which' is a bit dangerous when >> dealing >> with read pairs. >> > > It's also dangerous when dealing with single end reads, and it has > been for a while. > > Cheers, > H. > > > Somehow we should make this obvious to the user. >> >> >> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < >> goldstein.leonard@gene.com> wrote: >> >> Dear Herve and list, >>> >>> >>> I realized the error might have been caused due to the same bam record >>> being read in multiple times when passing multiple ranges in the which >>> argument. >>> >>> >>> I don't think it is necessarily obvious to users that >>> readGAlignments/Pairs >>> returns a concatenation of records that were read in for each range in >>> the >>> which argument. Instead one might expect to obtain the union of records >>> that overlap any of the ranges included in which. Do you think it would >>> be >>> helpful to issue a warning in case records are read in multiple times? >>> >>> >>> Best wishes, >>> >>> >>> Leonard >>> >>> >>> >>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel@gene.com>>> >>>> wrote: >>>> >>> >>> Hi Herve, >>>> >>>> >>>> I ran into some issues when using readGAlignmentPairs with the >>>> ScanBamParam 'which' argument. I included an example below. I can see >>>> how >>>> changing 'which' can result in different pairings but was wondering >>>> >>> whether >>> >>>> the error can be avoided and problematic pairings dropped instead. >>>> >>>> >>>> It looks like the issue is introduced in revision 77808 (no error in >>>> 77791). I was hoping you have an idea what causes the problem, otherwise >>>> >>> I >>> >>>> can try to forward a test case. >>>> >>>> >>>> Thanks for your help. >>>> >>>> >>>> Leonard >>>> >>>> >>>> gr >>>>> >>>> GRanges with 2 ranges and 0 metadata columns: >>>> seqnames ranges strand >>>> <rle> <iranges> <rle> >>>> [1] 10 [75758072, 75758133] * >>>> [2] 10 [75802841, 75802911] * >>>> --- >>>> seqlengths: >>>> 10 >>>> NA >>>> >>>>> >>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1])) >>>>> >>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >>>> seqnames strand : ranges -- ranges >>>> <rle> <rle> : <iranges> -- <iranges> >>>> --- >>>> seqlengths: >>>> 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 >>>> 249250621 243199373 198022430 ... 36422 39786 38502 >>>> >>>>> >>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2])) >>>>> >>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns: >>>> seqnames strand : ranges -- ranges >>>> <rle> <rle> : <iranges> -- <iranges> >>>> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482] >>>> [2] 10 - : [75802908, 75830498] -- [75758111, 75802892] >>>> [3] 10 - : [75802908, 75830498] -- [75802878, 75830468] >>>> --- >>>> seqlengths: >>>> 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 >>>> 249250621 243199373 198022430 ... 36422 39786 38502 >>>> >>>>> >>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr)) >>>>> >>>> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols = >>>> use.mcols) : >>>> findMateAlignment() returned an invalid 'mate' vector >>>> In addition: Warning message: >>>> In findMateAlignment(x) : >>>> 4 alignments with ambiguous pairing were dumped. >>>> Use 'getDumpedAlignments()' to retrieve them from the dump >>>> >>> environment. >>> >>>> >>>>> sessionInfo() >>>>> >>>> R version 3.0.0 (2013-04-03) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets methods >>>> [8] base >>>> >>>> other attached packages: >>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41 >>>> [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 >>>> >>>>> >>>>> >>>> >>>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.**science.biology.informatics.**conduc tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD REPLYlink written 5.2 years ago by Michael Lawrence10k
The new pairing algorithm used by readGAlignmentsList() does find mates outside the specified range. When only one of the pairs overlaps the 'which', it looks for a potential mate in the rest of the file. fl <- system.file("extdata", "ex1.bam", package="Rsamtools") param <- ScanBamParam(what="flag", which=GRanges("seq1", IRanges(1, width=40))) readGAlignmentPairs() finds no mates in this region. gapairs <- readGAlignmentPairs(BamFile(fl), param=param) > gapairs GAlignmentPairs with 0 alignment pairs and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> readGAlignmentsList() finds two mate pairs and several non-mates in this region. The 'mates' metadata column indicates which are legitimate mates as per the algorithem described at ?readGAlignmentsListFromBam. More control over what records are returned can be done by specifying flags in ScanBamParam. galist <- readGAlignmentsList(BamFile(fl, asMates=TRUE), param=param) > galist GAlignmentsList of length 17: [[1]] GAlignments with 2 alignments and 2 metadata columns: seqnames strand cigar qwidth start end width ngap | mates flag [1] seq1 - 35M 35 229 263 35 0 | 1 83 [2] seq1 + 35M 35 37 71 35 0 | 1 163 [[2]] GAlignments with 2 alignments and 2 metadata columns: seqnames strand cigar qwidth start end width ngap | mates flag [1] seq1 - 35M 35 185 219 35 0 | 1 147 [2] seq1 + 35M 35 36 70 35 0 | 1 99 [[3]] GAlignments with 1 alignment and 2 metadata columns: seqnames strand cigar qwidth start end width ngap | mates flag [1] seq1 + 36M 36 6 41 36 0 | 0 137 > elementLengths(galist) [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 The first two list elements contain mates. > sapply(galist, function(x) sum((mcols(x)$mates)) > 0) [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE Here are descriptions of the first few flags as per this site: http://picard.sourceforge.net/explain-flags.html Flag 137 indicates an unmapped mate which is why this record was not mated. 83: Summary: read paired read mapped in proper pair read reverse strand first in pair 163: Summary: read paired read mapped in proper pair mate reverse strand second in pair 147: Summary: read paired read mapped in proper pair read reverse strand second in pair 99: Summary: read paired read mapped in proper pair mate reverse strand first in pair 137: Summary: read paired mate unmapped second in pair Valerie On 09/17/2013 01:55 PM, Michael Lawrence wrote: > Yes, I knew there weren't any easy solutions. I just think that somehow we > need to point these out to users, because it's not exactly obvious. > > > On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > >> Hi guys, >> >> Sorry for the late answer. >> >> >> On 09/11/2013 11:51 AM, Michael Lawrence wrote: >> >>> There seem to be at least two "gotchas" with the 'which' argument and >>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle >>> cases where one record overlaps multiple which arguments >>> >> >> This is not just with readGAlignmentPairs, it's also with >> readGAlignments and, at a lower level, with scanBam(), which >> the 2 formers are based on. If a record overlaps 2 ranges in the >> 'which' arg, scanBam() loads it twice. >> >> I can't think of an easy way to address this unless we change the >> semantic of the 'which' argument. For example instead of loading >> records that overlap we could make the choice to only load records >> that have their POS (i.e. start) within the ranges specified thru >> 'which'. It introduces a dissymmetry (because it looks at the start >> of the record and not at its end) but it has the advantage of making >> sure records are only loaded once (unless the user passes a 'which' >> argument that contains overlapping ranges but then s/he's really >> looking for it ;-) ). >> >> >> and (2) any read >>> pairs with one end inside which and the other out will be discarded. >>> >> >> No easy work around for that one with the current implementation of >> readGAlignmentPairs which just passes its 'which' argument down to >> scanBam(). >> >> >> Excluding per-chromosome iteration, 'which' is a bit dangerous when >>> dealing >>> with read pairs. >>> >> >> It's also dangerous when dealing with single end reads, and it has >> been for a while. >> >> Cheers, >> H. >> >> >> Somehow we should make this obvious to the user. >>> >>> >>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < >>> goldstein.leonard at gene.com> wrote: >>> >>> Dear Herve and list, >>>> >>>> >>>> I realized the error might have been caused due to the same bam record >>>> being read in multiple times when passing multiple ranges in the which >>>> argument. >>>> >>>> >>>> I don't think it is necessarily obvious to users that >>>> readGAlignments/Pairs >>>> returns a concatenation of records that were read in for each range in >>>> the >>>> which argument. Instead one might expect to obtain the union of records >>>> that overlap any of the ranges included in which. Do you think it would >>>> be >>>> helpful to issue a warning in case records are read in multiple times? >>>> >>>> >>>> Best wishes, >>>> >>>> >>>> Leonard >>>> >>>> >>>> >>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel at="" gene.com="">>>> >>>>> wrote: >>>>> >>>> >>>> Hi Herve, >>>>> >>>>> >>>>> I ran into some issues when using readGAlignmentPairs with the >>>>> ScanBamParam 'which' argument. I included an example below. I can see >>>>> how >>>>> changing 'which' can result in different pairings but was wondering >>>>> >>>> whether >>>> >>>>> the error can be avoided and problematic pairings dropped instead. >>>>> >>>>> >>>>> It looks like the issue is introduced in revision 77808 (no error in >>>>> 77791). I was hoping you have an idea what causes the problem, otherwise >>>>> >>>> I >>>> >>>>> can try to forward a test case. >>>>> >>>>> >>>>> Thanks for your help. >>>>> >>>>> >>>>> Leonard >>>>> >>>>> >>>>> gr >>>>>> >>>>> GRanges with 2 ranges and 0 metadata columns: >>>>> seqnames ranges strand >>>>> <rle> <iranges> <rle> >>>>> [1] 10 [75758072, 75758133] * >>>>> [2] 10 [75802841, 75802911] * >>>>> --- >>>>> seqlengths: >>>>> 10 >>>>> NA >>>>> >>>>>> >>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1])) >>>>>> >>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >>>>> seqnames strand : ranges -- ranges >>>>> <rle> <rle> : <iranges> -- <iranges> >>>>> --- >>>>> seqlengths: >>>>> 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 >>>>> 249250621 243199373 198022430 ... 36422 39786 38502 >>>>> >>>>>> >>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2])) >>>>>> >>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns: >>>>> seqnames strand : ranges -- ranges >>>>> <rle> <rle> : <iranges> -- <iranges> >>>>> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482] >>>>> [2] 10 - : [75802908, 75830498] -- [75758111, 75802892] >>>>> [3] 10 - : [75802908, 75830498] -- [75802878, 75830468] >>>>> --- >>>>> seqlengths: >>>>> 1 2 3 ... GL000247.1 GL000248.1 GL000249.1 >>>>> 249250621 243199373 198022430 ... 36422 39786 38502 >>>>> >>>>>> >>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr)) >>>>>> >>>>> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols = >>>>> use.mcols) : >>>>> findMateAlignment() returned an invalid 'mate' vector >>>>> In addition: Warning message: >>>>> In findMateAlignment(x) : >>>>> 4 alignments with ambiguous pairing were dumped. >>>>> Use 'getDumpedAlignments()' to retrieve them from the dump >>>>> >>>> environment. >>>> >>>>> >>>>>> sessionInfo() >>>>>> >>>>> R version 3.0.0 (2013-04-03) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>> [7] LC_PAPER=C LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>> [8] base >>>>> >>>>> other attached packages: >>>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41 >>>>> [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 >>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________**_________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> Search the archives: >>>> http://news.gmane.org/gmane.**science.biology.informatics.**condu ctor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>>> >>>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLYlink written 5.2 years ago by Valerie Obenchain ♦♦ 6.6k
I'm a big fan of the new paired-end streaming. It's the way to go over which-based partitioning. Nice job! On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > The new pairing algorithm used by readGAlignmentsList() does find mates > outside the specified range. When only one of the pairs overlaps the > 'which', it looks for a potential mate in the rest of the file. > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what="flag", > which=GRanges("seq1", IRanges(1, width=40))) > > readGAlignmentPairs() finds no mates in this region. > > gapairs <- readGAlignmentPairs(BamFile(**fl), param=param) > > gapairs > > GAlignmentPairs with 0 alignment pairs and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > > readGAlignmentsList() finds two mate pairs and several non-mates in this > region. The 'mates' metadata column indicates which are legitimate mates as > per the algorithem described at ?readGAlignmentsListFromBam. More control > over what records are returned can be done by specifying flags in > ScanBamParam. > > galist <- readGAlignmentsList(BamFile(**fl, asMates=TRUE), param=param) > > galist > GAlignmentsList of length 17: > [[1]] > GAlignments with 2 alignments and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates flag > [1] seq1 - 35M 35 229 263 35 0 | 1 83 > [2] seq1 + 35M 35 37 71 35 0 | 1 163 > > [[2]] > GAlignments with 2 alignments and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates flag > [1] seq1 - 35M 35 185 219 35 0 | 1 147 > [2] seq1 + 35M 35 36 70 35 0 | 1 99 > > [[3]] > GAlignments with 1 alignment and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates flag > [1] seq1 + 36M 36 6 41 36 0 | 0 137 > > > > elementLengths(galist) > [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > > The first two list elements contain mates. > >> sapply(galist, function(x) sum((mcols(x)$mates)) > 0) >> > [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE > [13] FALSE FALSE FALSE FALSE FALSE > > > Here are descriptions of the first few flags as per this site: > http://picard.sourceforge.net/**explain- flags.html<http: picard.sourceforge.net="" explain-flags.html=""> > Flag 137 indicates an unmapped mate which is why this record was not mated. > > > 83: > Summary: > read paired > read mapped in proper pair > read reverse strand > first in pair > > 163: > Summary: > read paired > read mapped in proper pair > mate reverse strand > second in pair > > 147: > Summary: > read paired > read mapped in proper pair > read reverse strand > second in pair > > 99: > Summary: > read paired > read mapped in proper pair > mate reverse strand > first in pair > > 137: > Summary: > read paired > mate unmapped > second in pair > > > Valerie > > > > On 09/17/2013 01:55 PM, Michael Lawrence wrote: > >> Yes, I knew there weren't any easy solutions. I just think that somehow we >> need to point these out to users, because it's not exactly obvious. >> >> >> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages@fhcrc.org> wrote: >> >> Hi guys, >>> >>> Sorry for the late answer. >>> >>> >>> On 09/11/2013 11:51 AM, Michael Lawrence wrote: >>> >>> There seem to be at least two "gotchas" with the 'which' argument and >>>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle >>>> cases where one record overlaps multiple which arguments >>>> >>>> >>> This is not just with readGAlignmentPairs, it's also with >>> readGAlignments and, at a lower level, with scanBam(), which >>> the 2 formers are based on. If a record overlaps 2 ranges in the >>> 'which' arg, scanBam() loads it twice. >>> >>> I can't think of an easy way to address this unless we change the >>> semantic of the 'which' argument. For example instead of loading >>> records that overlap we could make the choice to only load records >>> that have their POS (i.e. start) within the ranges specified thru >>> 'which'. It introduces a dissymmetry (because it looks at the start >>> of the record and not at its end) but it has the advantage of making >>> sure records are only loaded once (unless the user passes a 'which' >>> argument that contains overlapping ranges but then s/he's really >>> looking for it ;-) ). >>> >>> >>> and (2) any read >>> >>>> pairs with one end inside which and the other out will be discarded. >>>> >>>> >>> No easy work around for that one with the current implementation of >>> readGAlignmentPairs which just passes its 'which' argument down to >>> scanBam(). >>> >>> >>> Excluding per-chromosome iteration, 'which' is a bit dangerous when >>> >>>> dealing >>>> with read pairs. >>>> >>>> >>> It's also dangerous when dealing with single end reads, and it has >>> been for a while. >>> >>> Cheers, >>> H. >>> >>> >>> Somehow we should make this obvious to the user. >>> >>>> >>>> >>>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < >>>> goldstein.leonard@gene.com> wrote: >>>> >>>> Dear Herve and list, >>>> >>>>> >>>>> >>>>> I realized the error might have been caused due to the same bam record >>>>> being read in multiple times when passing multiple ranges in the which >>>>> argument. >>>>> >>>>> >>>>> I don't think it is necessarily obvious to users that >>>>> readGAlignments/Pairs >>>>> returns a concatenation of records that were read in for each range in >>>>> the >>>>> which argument. Instead one might expect to obtain the union of records >>>>> that overlap any of the ranges included in which. Do you think it would >>>>> be >>>>> helpful to issue a warning in case records are read in multiple times? >>>>> >>>>> >>>>> Best wishes, >>>>> >>>>> >>>>> Leonard >>>>> >>>>> >>>>> >>>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel@gene.com>>>>> >>>>> wrote: >>>>>> >>>>>> >>>>> Hi Herve, >>>>> >>>>>> >>>>>> >>>>>> I ran into some issues when using readGAlignmentPairs with the >>>>>> ScanBamParam 'which' argument. I included an example below. I can see >>>>>> how >>>>>> changing 'which' can result in different pairings but was wondering >>>>>> >>>>>> whether >>>>> >>>>> the error can be avoided and problematic pairings dropped instead. >>>>>> >>>>>> >>>>>> It looks like the issue is introduced in revision 77808 (no error in >>>>>> 77791). I was hoping you have an idea what causes the problem, >>>>>> otherwise >>>>>> >>>>>> I >>>>> >>>>> can try to forward a test case. >>>>>> >>>>>> >>>>>> Thanks for your help. >>>>>> >>>>>> >>>>>> Leonard >>>>>> >>>>>> >>>>>> gr >>>>>> >>>>>>> >>>>>>> GRanges with 2 ranges and 0 metadata columns: >>>>>> seqnames ranges strand >>>>>> <rle> <iranges> <rle> >>>>>> [1] 10 [75758072, 75758133] * >>>>>> [2] 10 [75802841, 75802911] * >>>>>> --- >>>>>> seqlengths: >>>>>> 10 >>>>>> NA >>>>>> >>>>>> >>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1])) >>>>>>> >>>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >>>>>> seqnames strand : ranges -- ranges >>>>>> <rle> <rle> : <iranges> -- <iranges> >>>>>> --- >>>>>> seqlengths: >>>>>> 1 2 3 ... GL000247.1 GL000248.1 >>>>>> GL000249.1 >>>>>> 249250621 243199373 198022430 ... 36422 39786 >>>>>> 38502 >>>>>> >>>>>> >>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2])) >>>>>>> >>>>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns: >>>>>> seqnames strand : ranges -- ranges >>>>>> <rle> <rle> : <iranges> -- <iranges> >>>>>> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482] >>>>>> [2] 10 - : [75802908, 75830498] -- [75758111, 75802892] >>>>>> [3] 10 - : [75802908, 75830498] -- [75802878, 75830468] >>>>>> --- >>>>>> seqlengths: >>>>>> 1 2 3 ... GL000247.1 GL000248.1 >>>>>> GL000249.1 >>>>>> 249250621 243199373 198022430 ... 36422 39786 >>>>>> 38502 >>>>>> >>>>>> >>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr)) >>>>>>> >>>>>>> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols >>>>>> = >>>>>> use.mcols) : >>>>>> findMateAlignment() returned an invalid 'mate' vector >>>>>> In addition: Warning message: >>>>>> In findMateAlignment(x) : >>>>>> 4 alignments with ambiguous pairing were dumped. >>>>>> Use 'getDumpedAlignments()' to retrieve them from the dump >>>>>> >>>>>> environment. >>>>> >>>>> >>>>>> sessionInfo() >>>>>>> >>>>>>> R version 3.0.0 (2013-04-03) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>> methods >>>>>> [8] base >>>>>> >>>>>> other attached packages: >>>>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41 >>>>>> [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> [[alternative HTML version deleted]] >>>>> >>>>> ______________________________****_________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/****listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" **listinfo="" bioconductor=""> >>>>> <https: **="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> > >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.****science.biology.informatics.**** >>>>> conductor<http: news.gmane.org="" gmane.**science.biology.informat="" ics.**conductor=""> >>>>> <http: news.gmane.**org="" gmane.science.biology.**informatics.con="" ductor<http:="" news.gmane.org="" gmane.science.biology.informatics.conduct="" or=""> >>>>> > >>>>> >>>>> >>>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________****_________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/****listinfo/bioconductor<https: st="" at.ethz.ch="" mailman="" **listinfo="" bioconductor=""> >>>> <https: **="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" s="" tat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> > >>>> Search the archives: http://news.gmane.org/gmane.** >>>> science.biology.informatics.****conductor<http: news.gmane.**="">>>> org/gmane.science.biology.**informatics.conductor<http: news.gma="" ne.org="" gmane.science.biology.informatics.conductor=""> >>>> > >>>> >>>> >>>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages@fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> >> [[alternative HTML version deleted]] >> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > [[alternative HTML version deleted]]
ADD REPLYlink written 5.2 years ago by Michael Lawrence10k
Leonard was thinking that the problem of importing the same alignment multiple times when caught by multiple which ranges could be solved by maintaining a hash of what reads have been seen before. This could be done with the read ID + pos + strand + cigar, we think, but even easier would be to get at the file offset. It looks like samtools keeps that hidden, but we could just do some copy-and-pasting of the bam_iter_t struct... On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla@gene.com> wrote: > I'm a big fan of the new paired-end streaming. It's the way to go over > which-based partitioning. Nice job! > > > On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > >> The new pairing algorithm used by readGAlignmentsList() does find mates >> outside the specified range. When only one of the pairs overlaps the >> 'which', it looks for a potential mate in the rest of the file. >> >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> param <- ScanBamParam(what="flag", >> which=GRanges("seq1", IRanges(1, width=40))) >> >> readGAlignmentPairs() finds no mates in this region. >> >> gapairs <- readGAlignmentPairs(BamFile(**fl), param=param) >> > gapairs >> >> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >> seqnames strand : ranges -- ranges >> <rle> <rle> : <iranges> -- <iranges> >> >> readGAlignmentsList() finds two mate pairs and several non-mates in this >> region. The 'mates' metadata column indicates which are legitimate mates as >> per the algorithem described at ?readGAlignmentsListFromBam. More control >> over what records are returned can be done by specifying flags in >> ScanBamParam. >> >> galist <- readGAlignmentsList(BamFile(**fl, asMates=TRUE), param=param) >> > galist >> GAlignmentsList of length 17: >> [[1]] >> GAlignments with 2 alignments and 2 metadata columns: >> seqnames strand cigar qwidth start end width ngap | mates flag >> [1] seq1 - 35M 35 229 263 35 0 | 1 83 >> [2] seq1 + 35M 35 37 71 35 0 | 1 163 >> >> [[2]] >> GAlignments with 2 alignments and 2 metadata columns: >> seqnames strand cigar qwidth start end width ngap | mates flag >> [1] seq1 - 35M 35 185 219 35 0 | 1 147 >> [2] seq1 + 35M 35 36 70 35 0 | 1 99 >> >> [[3]] >> GAlignments with 1 alignment and 2 metadata columns: >> seqnames strand cigar qwidth start end width ngap | mates flag >> [1] seq1 + 36M 36 6 41 36 0 | 0 137 >> >> >> > elementLengths(galist) >> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 >> >> The first two list elements contain mates. >> >>> sapply(galist, function(x) sum((mcols(x)$mates)) > 0) >>> >> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE >> FALSE >> [13] FALSE FALSE FALSE FALSE FALSE >> >> >> Here are descriptions of the first few flags as per this site: >> http://picard.sourceforge.net/**explain- flags.html<http: picard.sourceforge.net="" explain-flags.html=""> >> Flag 137 indicates an unmapped mate which is why this record was not >> mated. >> >> >> 83: >> Summary: >> read paired >> read mapped in proper pair >> read reverse strand >> first in pair >> >> 163: >> Summary: >> read paired >> read mapped in proper pair >> mate reverse strand >> second in pair >> >> 147: >> Summary: >> read paired >> read mapped in proper pair >> read reverse strand >> second in pair >> >> 99: >> Summary: >> read paired >> read mapped in proper pair >> mate reverse strand >> first in pair >> >> 137: >> Summary: >> read paired >> mate unmapped >> second in pair >> >> >> Valerie >> >> >> >> On 09/17/2013 01:55 PM, Michael Lawrence wrote: >> >>> Yes, I knew there weren't any easy solutions. I just think that somehow >>> we >>> need to point these out to users, because it's not exactly obvious. >>> >>> >>> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès <hpages@fhcrc.org> wrote: >>> >>> Hi guys, >>>> >>>> Sorry for the late answer. >>>> >>>> >>>> On 09/11/2013 11:51 AM, Michael Lawrence wrote: >>>> >>>> There seem to be at least two "gotchas" with the 'which' argument and >>>>> readGAlignmentPairs: (1) as Leonard said, there's no easy way to handle >>>>> cases where one record overlaps multiple which arguments >>>>> >>>>> >>>> This is not just with readGAlignmentPairs, it's also with >>>> readGAlignments and, at a lower level, with scanBam(), which >>>> the 2 formers are based on. If a record overlaps 2 ranges in the >>>> 'which' arg, scanBam() loads it twice. >>>> >>>> I can't think of an easy way to address this unless we change the >>>> semantic of the 'which' argument. For example instead of loading >>>> records that overlap we could make the choice to only load records >>>> that have their POS (i.e. start) within the ranges specified thru >>>> 'which'. It introduces a dissymmetry (because it looks at the start >>>> of the record and not at its end) but it has the advantage of making >>>> sure records are only loaded once (unless the user passes a 'which' >>>> argument that contains overlapping ranges but then s/he's really >>>> looking for it ;-) ). >>>> >>>> >>>> and (2) any read >>>> >>>>> pairs with one end inside which and the other out will be discarded. >>>>> >>>>> >>>> No easy work around for that one with the current implementation of >>>> readGAlignmentPairs which just passes its 'which' argument down to >>>> scanBam(). >>>> >>>> >>>> Excluding per-chromosome iteration, 'which' is a bit dangerous when >>>> >>>>> dealing >>>>> with read pairs. >>>>> >>>>> >>>> It's also dangerous when dealing with single end reads, and it has >>>> been for a while. >>>> >>>> Cheers, >>>> H. >>>> >>>> >>>> Somehow we should make this obvious to the user. >>>> >>>>> >>>>> >>>>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < >>>>> goldstein.leonard@gene.com> wrote: >>>>> >>>>> Dear Herve and list, >>>>> >>>>>> >>>>>> >>>>>> I realized the error might have been caused due to the same bam record >>>>>> being read in multiple times when passing multiple ranges in the which >>>>>> argument. >>>>>> >>>>>> >>>>>> I don't think it is necessarily obvious to users that >>>>>> readGAlignments/Pairs >>>>>> returns a concatenation of records that were read in for each range in >>>>>> the >>>>>> which argument. Instead one might expect to obtain the union of >>>>>> records >>>>>> that overlap any of the ranges included in which. Do you think it >>>>>> would >>>>>> be >>>>>> helpful to issue a warning in case records are read in multiple times? >>>>>> >>>>>> >>>>>> Best wishes, >>>>>> >>>>>> >>>>>> Leonard >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard Goldstein <goldstel@gene.com>>>>>> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>> Hi Herve, >>>>>> >>>>>>> >>>>>>> >>>>>>> I ran into some issues when using readGAlignmentPairs with the >>>>>>> ScanBamParam 'which' argument. I included an example below. I can see >>>>>>> how >>>>>>> changing 'which' can result in different pairings but was wondering >>>>>>> >>>>>>> whether >>>>>> >>>>>> the error can be avoided and problematic pairings dropped instead. >>>>>>> >>>>>>> >>>>>>> It looks like the issue is introduced in revision 77808 (no error in >>>>>>> 77791). I was hoping you have an idea what causes the problem, >>>>>>> otherwise >>>>>>> >>>>>>> I >>>>>> >>>>>> can try to forward a test case. >>>>>>> >>>>>>> >>>>>>> Thanks for your help. >>>>>>> >>>>>>> >>>>>>> Leonard >>>>>>> >>>>>>> >>>>>>> gr >>>>>>> >>>>>>>> >>>>>>>> GRanges with 2 ranges and 0 metadata columns: >>>>>>> seqnames ranges strand >>>>>>> <rle> <iranges> <rle> >>>>>>> [1] 10 [75758072, 75758133] * >>>>>>> [2] 10 [75802841, 75802911] * >>>>>>> --- >>>>>>> seqlengths: >>>>>>> 10 >>>>>>> NA >>>>>>> >>>>>>> >>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[1])) >>>>>>>> >>>>>>>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >>>>>>> seqnames strand : ranges -- ranges >>>>>>> <rle> <rle> : <iranges> -- <iranges> >>>>>>> --- >>>>>>> seqlengths: >>>>>>> 1 2 3 ... GL000247.1 GL000248.1 >>>>>>> GL000249.1 >>>>>>> 249250621 243199373 198022430 ... 36422 39786 >>>>>>> 38502 >>>>>>> >>>>>>> >>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr[2])) >>>>>>>> >>>>>>>> GAlignmentPairs with 3 alignment pairs and 0 metadata columns: >>>>>>> seqnames strand : ranges -- ranges >>>>>>> <rle> <rle> : <iranges> -- <iranges> >>>>>>> [1] 10 + : [75758115, 75802896] -- [75802892, 75830482] >>>>>>> [2] 10 - : [75802908, 75830498] -- [75758111, 75802892] >>>>>>> [3] 10 - : [75802908, 75830498] -- [75802878, 75830468] >>>>>>> --- >>>>>>> seqlengths: >>>>>>> 1 2 3 ... GL000247.1 GL000248.1 >>>>>>> GL000249.1 >>>>>>> 249250621 243199373 198022430 ... 36422 39786 >>>>>>> 38502 >>>>>>> >>>>>>> >>>>>>>> readGAlignmentPairs(file, param = ScanBamParam(which = gr)) >>>>>>>> >>>>>>>> Error in makeGAlignmentPairs(galn, use.names = use.names, >>>>>>> use.mcols = >>>>>>> use.mcols) : >>>>>>> findMateAlignment() returned an invalid 'mate' vector >>>>>>> In addition: Warning message: >>>>>>> In findMateAlignment(x) : >>>>>>> 4 alignments with ambiguous pairing were dumped. >>>>>>> Use 'getDumpedAlignments()' to retrieve them from the dump >>>>>>> >>>>>>> environment. >>>>>> >>>>>> >>>>>>> sessionInfo() >>>>>>>> >>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>> methods >>>>>>> [8] base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] Rsamtools_1.13.39 Biostrings_2.29.16 GenomicRanges_1.13.41 >>>>>>> [4] XVector_0.1.1 IRanges_1.19.31 BiocGenerics_0.7.4 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>> >>>>>> ______________________________****_________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor@r-project.org >>>>>> https://stat.ethz.ch/mailman/****listinfo/bioconductor<https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor=""> >>>>>> <https: **="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>>> > >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.****science.biology.informatics.**** >>>>>> conductor<http: news.gmane.org="" gmane.**science.biology.informa="" tics.**conductor=""> >>>>>> <http: news.gmane.**org="" gmane.science.biology.**="">>>>>> informatics.conductor<http: news.gmane.org="" gmane.science.biolo="" gy.informatics.conductor=""> >>>>>> > >>>>>> >>>>>> >>>>>> [[alternative HTML version deleted]] >>>>> >>>>> ______________________________****_________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/****listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" **listinfo="" bioconductor=""> >>>>> <https: **="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> > >>>>> Search the archives: http://news.gmane.org/gmane.** >>>>> science.biology.informatics.****conductor<http: news.gmane.**="">>>>> org/gmane.science.biology.**informatics.conductor<http: news.gm="" ane.org="" gmane.science.biology.informatics.conductor=""> >>>>> > >>>>> >>>>> >>>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages@fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>>> >>>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >>> >> > [[alternative HTML version deleted]]
ADD REPLYlink written 5.2 years ago by Michael Lawrence10k
Yes, good idea. We were thinking along similar lines and will look into this over the next few weeks. Val On 09/17/2013 04:15 PM, Michael Lawrence wrote: > Leonard was thinking that the problem of importing the same alignment > multiple times when caught by multiple which ranges could be solved by > maintaining a hash of what reads have been seen before. This could be > done with the read ID + pos + strand + cigar, we think, but even easier > would be to get at the file offset. It looks like samtools keeps that > hidden, but we could just do some copy-and-pasting of the bam_iter_t > struct... > > > > > On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at="" gene.com=""> <mailto:michafla at="" gene.com="">> wrote: > > I'm a big fan of the new paired-end streaming. It's the way to go > over which-based partitioning. Nice job! > > > On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain > <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> wrote: > > The new pairing algorithm used by readGAlignmentsList() does > find mates outside the specified range. When only one of the > pairs overlaps the 'which', it looks for a potential mate in the > rest of the file. > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what="flag", > which=GRanges("seq1", IRanges(1, width=40))) > > readGAlignmentPairs() finds no mates in this region. > > gapairs <- readGAlignmentPairs(BamFile(__fl), param=param) > > gapairs > > GAlignmentPairs with 0 alignment pairs and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > > readGAlignmentsList() finds two mate pairs and several non- mates > in this region. The 'mates' metadata column indicates which are > legitimate mates as per the algorithem described at > ?readGAlignmentsListFromBam. More control over what records are > returned can be done by specifying flags in ScanBamParam. > > galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE), > param=param) > > galist > GAlignmentsList of length 17: > [[1]] > GAlignments with 2 alignments and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates > flag > [1] seq1 - 35M 35 229 263 35 > <tel:35%20%20%20229%20263%20%20%20%2035> 0 | 1 83 > [2] seq1 + 35M 35 37 71 35 0 | 1 > 163 > > [[2]] > GAlignments with 2 alignments and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates > flag > [1] seq1 - 35M 35 185 219 35 > <tel:35%20%20%20185%20219%20%20%20%2035> 0 | 1 147 > [2] seq1 + 35M 35 36 70 35 0 | 1 > 99 > > [[3]] > GAlignments with 1 alignment and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates > flag > [1] seq1 + 36M 36 6 41 36 0 | 0 > 137 > > > > elementLengths(galist) > [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > > The first two list elements contain mates. > > sapply(galist, function(x) sum((mcols(x)$mates)) > 0) > > [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE FALSE FALSE > [13] FALSE FALSE FALSE FALSE FALSE > > > Here are descriptions of the first few flags as per this site: > http://picard.sourceforge.net/__explain-flags.html > <http: picard.sourceforge.net="" explain-flags.html=""> > Flag 137 indicates an unmapped mate which is why this record was > not mated. > > > 83: > Summary: > read paired > read mapped in proper pair > read reverse strand > first in pair > > 163: > Summary: > read paired > read mapped in proper pair > mate reverse strand > second in pair > > 147: > Summary: > read paired > read mapped in proper pair > read reverse strand > second in pair > > 99: > Summary: > read paired > read mapped in proper pair > mate reverse strand > first in pair > > 137: > Summary: > read paired > mate unmapped > second in pair > > > Valerie > > > > On 09/17/2013 01:55 PM, Michael Lawrence wrote: > > Yes, I knew there weren't any easy solutions. I just think > that somehow we > need to point these out to users, because it's not exactly > obvious. > > > On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès > <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> wrote: > > Hi guys, > > Sorry for the late answer. > > > On 09/11/2013 11:51 AM, Michael Lawrence wrote: > > There seem to be at least two "gotchas" with the > 'which' argument and > readGAlignmentPairs: (1) as Leonard said, there's no > easy way to handle > cases where one record overlaps multiple which arguments > > > This is not just with readGAlignmentPairs, it's also with > readGAlignments and, at a lower level, with scanBam(), which > the 2 formers are based on. If a record overlaps 2 > ranges in the > 'which' arg, scanBam() loads it twice. > > I can't think of an easy way to address this unless we > change the > semantic of the 'which' argument. For example instead of > loading > records that overlap we could make the choice to only > load records > that have their POS (i.e. start) within the ranges > specified thru > 'which'. It introduces a dissymmetry (because it looks > at the start > of the record and not at its end) but it has the > advantage of making > sure records are only loaded once (unless the user > passes a 'which' > argument that contains overlapping ranges but then > s/he's really > looking for it ;-) ). > > > and (2) any read > > pairs with one end inside which and the other out > will be discarded. > > > No easy work around for that one with the current > implementation of > readGAlignmentPairs which just passes its 'which' > argument down to > scanBam(). > > > Excluding per-chromosome iteration, 'which' is a bit > dangerous when > > dealing > with read pairs. > > > It's also dangerous when dealing with single end reads, > and it has > been for a while. > > Cheers, > H. > > > Somehow we should make this obvious to the user. > > > > On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < > goldstein.leonard at gene.com > <mailto:goldstein.leonard at="" gene.com="">> wrote: > > Dear Herve and list, > > > > I realized the error might have been caused due > to the same bam record > being read in multiple times when passing > multiple ranges in the which > argument. > > > I don't think it is necessarily obvious to users > that > readGAlignments/Pairs > returns a concatenation of records that were > read in for each range in > the > which argument. Instead one might expect to > obtain the union of records > that overlap any of the ranges included in > which. Do you think it would > be > helpful to issue a warning in case records are > read in multiple times? > > > Best wishes, > > > Leonard > > > > On Tue, Sep 10, 2013 at 5:58 PM, Leonard > Goldstein <goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com=""> > > wrote: > > > Hi Herve, > > > > I ran into some issues when using > readGAlignmentPairs with the > ScanBamParam 'which' argument. I included an > example below. I can see > how > changing 'which' can result in different > pairings but was wondering > > whether > > the error can be avoided and problematic > pairings dropped instead. > > > It looks like the issue is introduced in > revision 77808 (no error in > 77791). I was hoping you have an idea what > causes the problem, otherwise > > I > > can try to forward a test case. > > > Thanks for your help. > > > Leonard > > > gr > > > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 10 [75758072, 75758133] * > [2] 10 [75802841, 75802911] * > --- > seqlengths: > 10 > NA > > > readGAlignmentPairs(file, param = > ScanBamParam(which = gr[1])) > > GAlignmentPairs with 0 alignment pairs and 0 > metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > --- > seqlengths: > 1 2 3 ... > GL000247.1 GL000248.1 GL000249.1 > 249250621 243199373 198022430 ... > 36422 39786 38502 > > > readGAlignmentPairs(file, param = > ScanBamParam(which = gr[2])) > > GAlignmentPairs with 3 alignment pairs and 0 > metadata columns: > seqnames strand : > ranges -- ranges > <rle> <rle> : > <iranges> -- <iranges> > [1] 10 + : [75758115, 75802896] > -- [75802892, 75830482] > [2] 10 - : [75802908, 75830498] > -- [75758111, 75802892] > [3] 10 - : [75802908, 75830498] > -- [75802878, 75830468] > --- > seqlengths: > 1 2 3 ... > GL000247.1 GL000248.1 GL000249.1 > 249250621 243199373 198022430 ... > 36422 39786 38502 > > > readGAlignmentPairs(file, param = > ScanBamParam(which = gr)) > > Error in makeGAlignmentPairs(galn, use.names > = use.names, use.mcols = > use.mcols) : > findMateAlignment() returned an invalid > 'mate' vector > In addition: Warning message: > In findMateAlignment(x) : > 4 alignments with ambiguous pairing > were dumped. > Use 'getDumpedAlignments()' to > retrieve them from the dump > > environment. > > > sessionInfo() > > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 > LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C > LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices > utils datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.13.39 Biostrings_2.29.16 > GenomicRanges_1.13.41 > [4] XVector_0.1.1 IRanges_1.19.31 > BiocGenerics_0.7.4 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 > > > > > > [[alternative HTML version deleted]] > > ________________________________**_________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/*__*listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" **listinfo="" bio="" conductor=""><https: __="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > http://news.gmane.org/gmane.**__science.biol ogy.informatics.**__conductor > <http: news.gmane.org="" gmane.**science.biolo="" gy.informatics.**conductor=""><http: news.gmane.__org="" gmane.science.biol="" ogy.__informatics.conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > [[alternative HTML version deleted]] > > ________________________________**_________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/*__*listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" **listinfo="" biocond="" uctor=""><https: __="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**__conductor<http: news.gmane.__org="" gmane.science.biology.__informatics.conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > [[alternative HTML version deleted]] > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > >
ADD REPLYlink written 5.2 years ago by Valerie Obenchain ♦♦ 6.6k
On 09/17/2013 04:15 PM, Michael Lawrence wrote: > Leonard was thinking that the problem of importing the same alignment > multiple times when caught by multiple which ranges could be solved by > maintaining a hash of what reads have been seen before. This could be > done with the read ID + pos + strand + cigar, we think, FWIW, I've seen BAM files with more than 1 record sharing the same QNAME + POS + strand + CIGAR. The 'Ambiguous pairing' section in the man page for findMateAlignment() shows such an example. It was taken from a real BAM file. IIRC the 2 records were actually indistinguible. > but even easier > would be to get at the file offset. It looks like samtools keeps that > hidden, but we could just do some copy-and-pasting of the bam_iter_t > struct... IIUC the common use case is processing a BAM file by chromosome chunks. Are you saying the record ID would be returned by scanBam or readGAlignemnts() so the end/user has a way to discard records that have already been processed. Note that with a small change of semantic to the 'which' argument, you don't need to return a record ID and the end-user needs to do nothing: the same record cannot be loaded twice. H. > > > > > On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at="" gene.com=""> <mailto:michafla at="" gene.com="">> wrote: > > I'm a big fan of the new paired-end streaming. It's the way to go > over which-based partitioning. Nice job! > > > On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain > <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> wrote: > > The new pairing algorithm used by readGAlignmentsList() does > find mates outside the specified range. When only one of the > pairs overlaps the 'which', it looks for a potential mate in the > rest of the file. > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what="flag", > which=GRanges("seq1", IRanges(1, width=40))) > > readGAlignmentPairs() finds no mates in this region. > > gapairs <- readGAlignmentPairs(BamFile(__fl), param=param) > > gapairs > > GAlignmentPairs with 0 alignment pairs and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > > readGAlignmentsList() finds two mate pairs and several non- mates > in this region. The 'mates' metadata column indicates which are > legitimate mates as per the algorithem described at > ?readGAlignmentsListFromBam. More control over what records are > returned can be done by specifying flags in ScanBamParam. > > galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE), > param=param) > > galist > GAlignmentsList of length 17: > [[1]] > GAlignments with 2 alignments and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates > flag > [1] seq1 - 35M 35 229 263 35 > <tel:35%20%20%20229%20263%20%20%20%2035> 0 | 1 83 > [2] seq1 + 35M 35 37 71 35 0 | 1 > 163 > > [[2]] > GAlignments with 2 alignments and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates > flag > [1] seq1 - 35M 35 185 219 35 > <tel:35%20%20%20185%20219%20%20%20%2035> 0 | 1 147 > [2] seq1 + 35M 35 36 70 35 0 | 1 > 99 > > [[3]] > GAlignments with 1 alignment and 2 metadata columns: > seqnames strand cigar qwidth start end width ngap | mates > flag > [1] seq1 + 36M 36 6 41 36 0 | 0 > 137 > > > > elementLengths(galist) > [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > > The first two list elements contain mates. > > sapply(galist, function(x) sum((mcols(x)$mates)) > 0) > > [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE FALSE FALSE > [13] FALSE FALSE FALSE FALSE FALSE > > > Here are descriptions of the first few flags as per this site: > http://picard.sourceforge.net/__explain-flags.html > <http: picard.sourceforge.net="" explain-flags.html=""> > Flag 137 indicates an unmapped mate which is why this record was > not mated. > > > 83: > Summary: > read paired > read mapped in proper pair > read reverse strand > first in pair > > 163: > Summary: > read paired > read mapped in proper pair > mate reverse strand > second in pair > > 147: > Summary: > read paired > read mapped in proper pair > read reverse strand > second in pair > > 99: > Summary: > read paired > read mapped in proper pair > mate reverse strand > first in pair > > 137: > Summary: > read paired > mate unmapped > second in pair > > > Valerie > > > > On 09/17/2013 01:55 PM, Michael Lawrence wrote: > > Yes, I knew there weren't any easy solutions. I just think > that somehow we > need to point these out to users, because it's not exactly > obvious. > > > On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès > <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> wrote: > > Hi guys, > > Sorry for the late answer. > > > On 09/11/2013 11:51 AM, Michael Lawrence wrote: > > There seem to be at least two "gotchas" with the > 'which' argument and > readGAlignmentPairs: (1) as Leonard said, there's no > easy way to handle > cases where one record overlaps multiple which arguments > > > This is not just with readGAlignmentPairs, it's also with > readGAlignments and, at a lower level, with scanBam(), which > the 2 formers are based on. If a record overlaps 2 > ranges in the > 'which' arg, scanBam() loads it twice. > > I can't think of an easy way to address this unless we > change the > semantic of the 'which' argument. For example instead of > loading > records that overlap we could make the choice to only > load records > that have their POS (i.e. start) within the ranges > specified thru > 'which'. It introduces a dissymmetry (because it looks > at the start > of the record and not at its end) but it has the > advantage of making > sure records are only loaded once (unless the user > passes a 'which' > argument that contains overlapping ranges but then > s/he's really > looking for it ;-) ). > > > and (2) any read > > pairs with one end inside which and the other out > will be discarded. > > > No easy work around for that one with the current > implementation of > readGAlignmentPairs which just passes its 'which' > argument down to > scanBam(). > > > Excluding per-chromosome iteration, 'which' is a bit > dangerous when > > dealing > with read pairs. > > > It's also dangerous when dealing with single end reads, > and it has > been for a while. > > Cheers, > H. > > > Somehow we should make this obvious to the user. > > > > On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < > goldstein.leonard at gene.com > <mailto:goldstein.leonard at="" gene.com="">> wrote: > > Dear Herve and list, > > > > I realized the error might have been caused due > to the same bam record > being read in multiple times when passing > multiple ranges in the which > argument. > > > I don't think it is necessarily obvious to users > that > readGAlignments/Pairs > returns a concatenation of records that were > read in for each range in > the > which argument. Instead one might expect to > obtain the union of records > that overlap any of the ranges included in > which. Do you think it would > be > helpful to issue a warning in case records are > read in multiple times? > > > Best wishes, > > > Leonard > > > > On Tue, Sep 10, 2013 at 5:58 PM, Leonard > Goldstein <goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com=""> > > wrote: > > > Hi Herve, > > > > I ran into some issues when using > readGAlignmentPairs with the > ScanBamParam 'which' argument. I included an > example below. I can see > how > changing 'which' can result in different > pairings but was wondering > > whether > > the error can be avoided and problematic > pairings dropped instead. > > > It looks like the issue is introduced in > revision 77808 (no error in > 77791). I was hoping you have an idea what > causes the problem, otherwise > > I > > can try to forward a test case. > > > Thanks for your help. > > > Leonard > > > gr > > > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] 10 [75758072, 75758133] * > [2] 10 [75802841, 75802911] * > --- > seqlengths: > 10 > NA > > > readGAlignmentPairs(file, param = > ScanBamParam(which = gr[1])) > > GAlignmentPairs with 0 alignment pairs and 0 > metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > --- > seqlengths: > 1 2 3 ... > GL000247.1 GL000248.1 GL000249.1 > 249250621 243199373 198022430 ... > 36422 39786 38502 > > > readGAlignmentPairs(file, param = > ScanBamParam(which = gr[2])) > > GAlignmentPairs with 3 alignment pairs and 0 > metadata columns: > seqnames strand : > ranges -- ranges > <rle> <rle> : > <iranges> -- <iranges> > [1] 10 + : [75758115, 75802896] > -- [75802892, 75830482] > [2] 10 - : [75802908, 75830498] > -- [75758111, 75802892] > [3] 10 - : [75802908, 75830498] > -- [75802878, 75830468] > --- > seqlengths: > 1 2 3 ... > GL000247.1 GL000248.1 GL000249.1 > 249250621 243199373 198022430 ... > 36422 39786 38502 > > > readGAlignmentPairs(file, param = > ScanBamParam(which = gr)) > > Error in makeGAlignmentPairs(galn, use.names > = use.names, use.mcols = > use.mcols) : > findMateAlignment() returned an invalid > 'mate' vector > In addition: Warning message: > In findMateAlignment(x) : > 4 alignments with ambiguous pairing > were dumped. > Use 'getDumpedAlignments()' to > retrieve them from the dump > > environment. > > > sessionInfo() > > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 > LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C > LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices > utils datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.13.39 Biostrings_2.29.16 > GenomicRanges_1.13.41 > [4] XVector_0.1.1 IRanges_1.19.31 > BiocGenerics_0.7.4 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 > > > > > > [[alternative HTML version deleted]] > > ________________________________**_________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/*__*listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" **listinfo="" bio="" conductor=""><https: __="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > http://news.gmane.org/gmane.**__science.biol ogy.informatics.**__conductor > <http: news.gmane.org="" gmane.**science.biolo="" gy.informatics.**conductor=""><http: news.gmane.__org="" gmane.science.biol="" ogy.__informatics.conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > [[alternative HTML version deleted]] > > ________________________________**_________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/*__*listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" **listinfo="" biocond="" uctor=""><https: __="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**__conductor<http: news.gmane.__org="" gmane.science.biology.__informatics.conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > [[alternative HTML version deleted]] > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 5.2 years ago by Hervé Pagès ♦♦ 13k
On 09/17/2013 04:39 PM, Hervé Pagès wrote: > On 09/17/2013 04:15 PM, Michael Lawrence wrote: >> Leonard was thinking that the problem of importing the same alignment >> multiple times when caught by multiple which ranges could be solved by >> maintaining a hash of what reads have been seen before. This could be >> done with the read ID + pos + strand + cigar, we think, > > FWIW, I've seen BAM files with more than 1 record sharing the same > QNAME + POS + strand + CIGAR. > > The 'Ambiguous pairing' section in the man page for findMateAlignment() > shows such an example. It was taken from a real BAM file. IIRC the 2 > records were actually indistinguible. > >> but even easier >> would be to get at the file offset. It looks like samtools keeps that >> hidden, but we could just do some copy-and-pasting of the bam_iter_t >> struct... > > IIUC the common use case is processing a BAM file by chromosome chunks. > Are you saying the record ID would be returned by scanBam or > readGAlignemnts() so the end/user has a way to discard records that > have already been processed. > > Note that with a small change of semantic to the 'which' argument, you > don't need to return a record ID and the end-user needs to do nothing: > the same record cannot be loaded twice. To be even clearer about this: with its current semantic, 'which' doesn't feel like the right thing to use to process a chromosome "by tiles". Instead of trying to fix it (by changing its semantic or by maintaining a hash of what reads have been seen before), maybe it would make more sense to add a 'tiles' (or 'tiling') argument to ScanBamParam() that would also take a GRanges object. Then scanBam() would load the records that have their POS in the tiles and scanBam(asMates=TRUE) would load the records that have their POS in the tiles and are *first* mate and it would try to pair them (beyond 'tiles' if needed). It would be enough to enforce 'tiles' to have non-overlapping ranges but it would not be necessary to require it to be a "real" tiling (i.e. a set of adjacent ranges that cover the entire chromosome or genome). In addition, deciding whether a record belongs to 'tiles' would be cheaper (and faster) than deciding whether it overlaps with 'which' because there is no more need to compute the end of the record based on its CIGAR. H. > > H. > >> >> >> >> >> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla at="" gene.com="">> <mailto:michafla at="" gene.com="">> wrote: >> >> I'm a big fan of the new paired-end streaming. It's the way to go >> over which-based partitioning. Nice job! >> >> >> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain >> <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> wrote: >> >> The new pairing algorithm used by readGAlignmentsList() does >> find mates outside the specified range. When only one of the >> pairs overlaps the 'which', it looks for a potential mate in the >> rest of the file. >> >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> param <- ScanBamParam(what="flag", >> which=GRanges("seq1", IRanges(1, >> width=40))) >> >> readGAlignmentPairs() finds no mates in this region. >> >> gapairs <- readGAlignmentPairs(BamFile(__fl), param=param) >> > gapairs >> >> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >> seqnames strand : ranges -- ranges >> <rle> <rle> : <iranges> -- <iranges> >> >> readGAlignmentsList() finds two mate pairs and several non- mates >> in this region. The 'mates' metadata column indicates which are >> legitimate mates as per the algorithem described at >> ?readGAlignmentsListFromBam. More control over what records are >> returned can be done by specifying flags in ScanBamParam. >> >> galist <- readGAlignmentsList(BamFile(__fl, asMates=TRUE), >> param=param) >> > galist >> GAlignmentsList of length 17: >> [[1]] >> GAlignments with 2 alignments and 2 metadata columns: >> seqnames strand cigar qwidth start end width ngap | mates >> flag >> [1] seq1 - 35M 35 229 263 35 >> <tel:35%20%20%20229%20263%20%20%20%2035> 0 | 1 83 >> [2] seq1 + 35M 35 37 71 35 0 | 1 >> 163 >> >> [[2]] >> GAlignments with 2 alignments and 2 metadata columns: >> seqnames strand cigar qwidth start end width ngap | mates >> flag >> [1] seq1 - 35M 35 185 219 35 >> <tel:35%20%20%20185%20219%20%20%20%2035> 0 | 1 147 >> [2] seq1 + 35M 35 36 70 35 0 | 1 >> 99 >> >> [[3]] >> GAlignments with 1 alignment and 2 metadata columns: >> seqnames strand cigar qwidth start end width ngap | mates >> flag >> [1] seq1 + 36M 36 6 41 36 0 | 0 >> 137 >> >> >> > elementLengths(galist) >> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 >> >> The first two list elements contain mates. >> >> sapply(galist, function(x) sum((mcols(x)$mates)) > 0) >> >> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE >> FALSE FALSE FALSE >> [13] FALSE FALSE FALSE FALSE FALSE >> >> >> Here are descriptions of the first few flags as per this site: >> http://picard.sourceforge.net/__explain-flags.html >> <http: picard.sourceforge.net="" explain-flags.html=""> >> Flag 137 indicates an unmapped mate which is why this record was >> not mated. >> >> >> 83: >> Summary: >> read paired >> read mapped in proper pair >> read reverse strand >> first in pair >> >> 163: >> Summary: >> read paired >> read mapped in proper pair >> mate reverse strand >> second in pair >> >> 147: >> Summary: >> read paired >> read mapped in proper pair >> read reverse strand >> second in pair >> >> 99: >> Summary: >> read paired >> read mapped in proper pair >> mate reverse strand >> first in pair >> >> 137: >> Summary: >> read paired >> mate unmapped >> second in pair >> >> >> Valerie >> >> >> >> On 09/17/2013 01:55 PM, Michael Lawrence wrote: >> >> Yes, I knew there weren't any easy solutions. I just think >> that somehow we >> need to point these out to users, because it's not exactly >> obvious. >> >> >> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès >> <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> wrote: >> >> Hi guys, >> >> Sorry for the late answer. >> >> >> On 09/11/2013 11:51 AM, Michael Lawrence wrote: >> >> There seem to be at least two "gotchas" with the >> 'which' argument and >> readGAlignmentPairs: (1) as Leonard said, there's no >> easy way to handle >> cases where one record overlaps multiple which >> arguments >> >> >> This is not just with readGAlignmentPairs, it's also with >> readGAlignments and, at a lower level, with scanBam(), >> which >> the 2 formers are based on. If a record overlaps 2 >> ranges in the >> 'which' arg, scanBam() loads it twice. >> >> I can't think of an easy way to address this unless we >> change the >> semantic of the 'which' argument. For example instead of >> loading >> records that overlap we could make the choice to only >> load records >> that have their POS (i.e. start) within the ranges >> specified thru >> 'which'. It introduces a dissymmetry (because it looks >> at the start >> of the record and not at its end) but it has the >> advantage of making >> sure records are only loaded once (unless the user >> passes a 'which' >> argument that contains overlapping ranges but then >> s/he's really >> looking for it ;-) ). >> >> >> and (2) any read >> >> pairs with one end inside which and the other out >> will be discarded. >> >> >> No easy work around for that one with the current >> implementation of >> readGAlignmentPairs which just passes its 'which' >> argument down to >> scanBam(). >> >> >> Excluding per-chromosome iteration, 'which' is a bit >> dangerous when >> >> dealing >> with read pairs. >> >> >> It's also dangerous when dealing with single end reads, >> and it has >> been for a while. >> >> Cheers, >> H. >> >> >> Somehow we should make this obvious to the user. >> >> >> >> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < >> goldstein.leonard at gene.com >> <mailto:goldstein.leonard at="" gene.com="">> wrote: >> >> Dear Herve and list, >> >> >> >> I realized the error might have been caused due >> to the same bam record >> being read in multiple times when passing >> multiple ranges in the which >> argument. >> >> >> I don't think it is necessarily obvious to users >> that >> readGAlignments/Pairs >> returns a concatenation of records that were >> read in for each range in >> the >> which argument. Instead one might expect to >> obtain the union of records >> that overlap any of the ranges included in >> which. Do you think it would >> be >> helpful to issue a warning in case records are >> read in multiple times? >> >> >> Best wishes, >> >> >> Leonard >> >> >> >> On Tue, Sep 10, 2013 at 5:58 PM, Leonard >> Goldstein <goldstel at="" gene.com="">> <mailto:goldstel at="" gene.com=""> >> >> wrote: >> >> >> Hi Herve, >> >> >> >> I ran into some issues when using >> readGAlignmentPairs with the >> ScanBamParam 'which' argument. I included an >> example below. I can see >> how >> changing 'which' can result in different >> pairings but was wondering >> >> whether >> >> the error can be avoided and problematic >> pairings dropped instead. >> >> >> It looks like the issue is introduced in >> revision 77808 (no error in >> 77791). I was hoping you have an idea what >> causes the problem, otherwise >> >> I >> >> can try to forward a test case. >> >> >> Thanks for your help. >> >> >> Leonard >> >> >> gr >> >> >> GRanges with 2 ranges and 0 metadata columns: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] 10 [75758072, 75758133] * >> [2] 10 [75802841, 75802911] * >> --- >> seqlengths: >> 10 >> NA >> >> >> readGAlignmentPairs(file, param = >> ScanBamParam(which = gr[1])) >> >> GAlignmentPairs with 0 alignment pairs and 0 >> metadata columns: >> seqnames strand : ranges -- ranges >> <rle> <rle> : <iranges> -- <iranges> >> --- >> seqlengths: >> 1 2 3 ... >> GL000247.1 GL000248.1 GL000249.1 >> 249250621 243199373 198022430 ... >> 36422 39786 38502 >> >> >> readGAlignmentPairs(file, param = >> ScanBamParam(which = gr[2])) >> >> GAlignmentPairs with 3 alignment pairs and 0 >> metadata columns: >> seqnames strand : >> ranges -- ranges >> <rle> <rle> : >> <iranges> -- <iranges> >> [1] 10 + : [75758115, 75802896] >> -- [75802892, 75830482] >> [2] 10 - : [75802908, 75830498] >> -- [75758111, 75802892] >> [3] 10 - : [75802908, 75830498] >> -- [75802878, 75830468] >> --- >> seqlengths: >> 1 2 3 ... >> GL000247.1 GL000248.1 GL000249.1 >> 249250621 243199373 198022430 ... >> 36422 39786 38502 >> >> >> readGAlignmentPairs(file, param = >> ScanBamParam(which = gr)) >> >> Error in makeGAlignmentPairs(galn, use.names >> = use.names, use.mcols = >> use.mcols) : >> findMateAlignment() returned an invalid >> 'mate' vector >> In addition: Warning message: >> In findMateAlignment(x) : >> 4 alignments with ambiguous pairing >> were dumped. >> Use 'getDumpedAlignments()' to >> retrieve them from the dump >> >> environment. >> >> >> sessionInfo() >> >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 >> LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 >> LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 >> LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C >> LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 >> LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices >> utils datasets methods >> [8] base >> >> other attached packages: >> [1] Rsamtools_1.13.39 Biostrings_2.29.16 >> GenomicRanges_1.13.41 >> [4] XVector_0.1.1 IRanges_1.19.31 >> BiocGenerics_0.7.4 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-6 stats4_3.0.0 >> zlibbioc_1.7.0 >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> >> ________________________________**_________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> <mailto:bioconductor at="" r-project.org=""> >> >> https://stat.ethz.ch/mailman/*__*listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor=""><https: __="" st="" at.ethz.ch="" mailman="" __listinfo="" bioconductor="">> >> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> >> Search the archives: >> >> http://news.gmane.org/gmane.**__science.biology.informatics.**__con ductor >> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**conduc="" tor=""><http: news.gmane.__org="" gmane.science.biology.__informatics.condu="" ctor="">> >> >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> >> >> >> [[alternative HTML version deleted]] >> >> ________________________________**_________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> <mailto:bioconductor at="" r-project.org=""> >> >> https://stat.ethz.ch/mailman/*__*listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor=""><https: __="" st="" at.ethz.ch="" mailman="" __listinfo="" bioconductor="">> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> >> Search the archives: http://news.gmane.org/gmane.** >> >> science.biology.informatics.**__conductor<http: news.gmane.__org="" g="" mane.science.biology.__informatics.conductor="">> >> >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> [[alternative HTML version deleted]] >> >> >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >> >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> >> >> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 5.2 years ago by Hervé Pagès ♦♦ 13k
So here's a use case: get all the reads for a set of putative transcribed regions, numbering in the thousands. You don't want just the reads that start in the regions, and you want to process the reads in vectorized fashion, ideally with a single query to the BAM, returning a GAlignmentPairs. There are at least two ways to do this: a) Have readGAlignmentPairs and friends return non-redundant alignments, even with multiple which, b) Query the BAM separately (like in an lapply) for each which, then combine the results, and keep around a partitioning that records the original 'which' for each alignment. This would be slower and require more work of the user, but it would work right now. Hopefully others have additional ideas. Michael On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > On 09/17/2013 04:39 PM, Hervé Pagès wrote: > >> On 09/17/2013 04:15 PM, Michael Lawrence wrote: >> >>> Leonard was thinking that the problem of importing the same alignment >>> multiple times when caught by multiple which ranges could be solved by >>> maintaining a hash of what reads have been seen before. This could be >>> done with the read ID + pos + strand + cigar, we think, >>> >> >> FWIW, I've seen BAM files with more than 1 record sharing the same >> QNAME + POS + strand + CIGAR. >> >> The 'Ambiguous pairing' section in the man page for findMateAlignment() >> shows such an example. It was taken from a real BAM file. IIRC the 2 >> records were actually indistinguible. >> >> but even easier >>> would be to get at the file offset. It looks like samtools keeps that >>> hidden, but we could just do some copy-and-pasting of the bam_iter_t >>> struct... >>> >> >> IIUC the common use case is processing a BAM file by chromosome chunks. >> Are you saying the record ID would be returned by scanBam or >> readGAlignemnts() so the end/user has a way to discard records that >> have already been processed. >> >> Note that with a small change of semantic to the 'which' argument, you >> don't need to return a record ID and the end-user needs to do nothing: >> the same record cannot be loaded twice. >> > > To be even clearer about this: with its current semantic, 'which' doesn't > feel like the right thing to use to process a chromosome "by > tiles". Instead of trying to fix it (by changing its semantic or by > maintaining a hash of what reads have been seen before), maybe it > would make more sense to add a 'tiles' (or 'tiling') argument to > ScanBamParam() that would also take a GRanges object. Then scanBam() > would load the records that have their POS in the tiles and > scanBam(asMates=TRUE) would load the records that have their POS > in the tiles and are *first* mate and it would try to pair them > (beyond 'tiles' if needed). > > It would be enough to enforce 'tiles' to have non-overlapping ranges > but it would not be necessary to require it to be a "real" tiling (i.e. > a set of adjacent ranges that cover the entire chromosome or genome). > > In addition, deciding whether a record belongs to 'tiles' would be > cheaper (and faster) than deciding whether it overlaps with > 'which' because there is no more need to compute the end of the > record based on its CIGAR. > > > H. > > >> H. >> >> >>> >>> >>> >>> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence <michafla@gene.com>>> <mailto:michafla@gene.com>> wrote: >>> >>> I'm a big fan of the new paired-end streaming. It's the way to go >>> over which-based partitioning. Nice job! >>> >>> >>> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain >>> <vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">> wrote: >>> >>> The new pairing algorithm used by readGAlignmentsList() does >>> find mates outside the specified range. When only one of the >>> pairs overlaps the 'which', it looks for a potential mate in the >>> rest of the file. >>> >>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >>> param <- ScanBamParam(what="flag", >>> which=GRanges("seq1", IRanges(1, >>> width=40))) >>> >>> readGAlignmentPairs() finds no mates in this region. >>> >>> gapairs <- readGAlignmentPairs(BamFile(__**fl), param=param) >>> > gapairs >>> >>> GAlignmentPairs with 0 alignment pairs and 0 metadata columns: >>> seqnames strand : ranges -- ranges >>> <rle> <rle> : <iranges> -- <iranges> >>> >>> readGAlignmentsList() finds two mate pairs and several non-mates >>> in this region. The 'mates' metadata column indicates which are >>> legitimate mates as per the algorithem described at >>> ?readGAlignmentsListFromBam. More control over what records are >>> returned can be done by specifying flags in ScanBamParam. >>> >>> galist <- readGAlignmentsList(BamFile(__**fl, asMates=TRUE), >>> param=param) >>> > galist >>> GAlignmentsList of length 17: >>> [[1]] >>> GAlignments with 2 alignments and 2 metadata columns: >>> seqnames strand cigar qwidth start end width ngap | mates >>> flag >>> [1] seq1 - 35M 35 229 263 35 >>> <tel:35%20%20%20229%20263%20%**20%20%2035> 0 | 1 83 >>> [2] seq1 + 35M 35 37 71 35 0 | 1 >>> 163 >>> >>> [[2]] >>> GAlignments with 2 alignments and 2 metadata columns: >>> seqnames strand cigar qwidth start end width ngap | mates >>> flag >>> [1] seq1 - 35M 35 185 219 35 >>> <tel:35%20%20%20185%20219%20%**20%20%2035> 0 | 1 147 >>> [2] seq1 + 35M 35 36 70 35 0 | 1 >>> 99 >>> >>> [[3]] >>> GAlignments with 1 alignment and 2 metadata columns: >>> seqnames strand cigar qwidth start end width ngap | mates >>> flag >>> [1] seq1 + 36M 36 6 41 36 0 | 0 >>> 137 >>> >>> >>> > elementLengths(galist) >>> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 >>> >>> The first two list elements contain mates. >>> >>> sapply(galist, function(x) sum((mcols(x)$mates)) > 0) >>> >>> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE >>> FALSE FALSE FALSE >>> [13] FALSE FALSE FALSE FALSE FALSE >>> >>> >>> Here are descriptions of the first few flags as per this site: >>> http://picard.sourceforge.net/**__explain- flags.html<http: picard.sourceforge.net="" __explain-flags.html=""> >>> <http: picard.sourceforge.**net="" explain-="" flags.html<http:="" picard.sourceforge.net="" explain-flags.html=""> >>> > >>> Flag 137 indicates an unmapped mate which is why this record was >>> not mated. >>> >>> >>> 83: >>> Summary: >>> read paired >>> read mapped in proper pair >>> read reverse strand >>> first in pair >>> >>> 163: >>> Summary: >>> read paired >>> read mapped in proper pair >>> mate reverse strand >>> second in pair >>> >>> 147: >>> Summary: >>> read paired >>> read mapped in proper pair >>> read reverse strand >>> second in pair >>> >>> 99: >>> Summary: >>> read paired >>> read mapped in proper pair >>> mate reverse strand >>> first in pair >>> >>> 137: >>> Summary: >>> read paired >>> mate unmapped >>> second in pair >>> >>> >>> Valerie >>> >>> >>> >>> On 09/17/2013 01:55 PM, Michael Lawrence wrote: >>> >>> Yes, I knew there weren't any easy solutions. I just think >>> that somehow we >>> need to point these out to users, because it's not exactly >>> obvious. >>> >>> >>> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès >>> <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> wrote: >>> >>> Hi guys, >>> >>> Sorry for the late answer. >>> >>> >>> On 09/11/2013 11:51 AM, Michael Lawrence wrote: >>> >>> There seem to be at least two "gotchas" with the >>> 'which' argument and >>> readGAlignmentPairs: (1) as Leonard said, there's no >>> easy way to handle >>> cases where one record overlaps multiple which >>> arguments >>> >>> >>> This is not just with readGAlignmentPairs, it's also with >>> readGAlignments and, at a lower level, with scanBam(), >>> which >>> the 2 formers are based on. If a record overlaps 2 >>> ranges in the >>> 'which' arg, scanBam() loads it twice. >>> >>> I can't think of an easy way to address this unless we >>> change the >>> semantic of the 'which' argument. For example instead of >>> loading >>> records that overlap we could make the choice to only >>> load records >>> that have their POS (i.e. start) within the ranges >>> specified thru >>> 'which'. It introduces a dissymmetry (because it looks >>> at the start >>> of the record and not at its end) but it has the >>> advantage of making >>> sure records are only loaded once (unless the user >>> passes a 'which' >>> argument that contains overlapping ranges but then >>> s/he's really >>> looking for it ;-) ). >>> >>> >>> and (2) any read >>> >>> pairs with one end inside which and the other out >>> will be discarded. >>> >>> >>> No easy work around for that one with the current >>> implementation of >>> readGAlignmentPairs which just passes its 'which' >>> argument down to >>> scanBam(). >>> >>> >>> Excluding per-chromosome iteration, 'which' is a bit >>> dangerous when >>> >>> dealing >>> with read pairs. >>> >>> >>> It's also dangerous when dealing with single end reads, >>> and it has >>> been for a while. >>> >>> Cheers, >>> H. >>> >>> >>> Somehow we should make this obvious to the user. >>> >>> >>> >>> On Wed, Sep 11, 2013 at 11:02 AM, Leonard Goldstein < >>> goldstein.leonard@gene.com >>> <mailto:goldstein.leonard@**gene.com<goldstein.leonard@gene.com>>> >>> wrote: >>> >>> Dear Herve and list, >>> >>> >>> >>> I realized the error might have been caused due >>> to the same bam record >>> being read in multiple times when passing >>> multiple ranges in the which >>> argument. >>> >>> >>> I don't think it is necessarily obvious to users >>> that >>> readGAlignments/Pairs >>> returns a concatenation of records that were >>> read in for each range in >>> the >>> which argument. Instead one might expect to >>> obtain the union of records >>> that overlap any of the ranges included in >>> which. Do you think it would >>> be >>> helpful to issue a warning in case records are >>> read in multiple times? >>> >>> >>> Best wishes, >>> >>> >>> Leonard >>> >>> >>> >>> On Tue, Sep 10, 2013 at 5:58 PM, Leonard >>> Goldstein <goldstel@gene.com>>> <mailto:goldstel@gene.com> >>> >>> wrote: >>> >>> >>> Hi Herve, >>> >>> >>> >>> I ran into some issues when using >>> readGAlignmentPairs with the >>> ScanBamParam 'which' argument. I included an >>> example below. I can see >>> how >>> changing 'which' can result in different >>> pairings but was wondering >>> >>> whether >>> >>> the error can be avoided and problematic >>> pairings dropped instead. >>> >>> >>> It looks like the issue is introduced in >>> revision 77808 (no error in >>> 77791). I was hoping you have an idea what >>> causes the problem, otherwise >>> >>> I >>> >>> can try to forward a test case. >>> >>> >>> Thanks for your help. >>> >>> >>> Leonard >>> >>> >>> gr >>> >>> >>> GRanges with 2 ranges and 0 metadata columns: >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> [1] 10 [75758072, 75758133] * >>> [2] 10 [75802841, 75802911] * >>> --- >>> seqlengths: >>> 10 >>> NA >>> >>> >>> readGAlignmentPairs(file, param = >>> ScanBamParam(which = gr[1])) >>> >>> GAlignmentPairs with 0 alignment pairs and 0 >>> metadata columns: >>> seqnames strand : ranges -- ranges >>> <rle> <rle> : <iranges> -- <iranges> >>> --- >>> seqlengths: >>> 1 2 3 ... >>> GL000247.1 GL000248.1 GL000249.1 >>> 249250621 243199373 198022430 ... >>> 36422 39786 38502 >>> >>> >>> readGAlignmentPairs(file, param = >>> ScanBamParam(which = gr[2])) >>> >>> GAlignmentPairs with 3 alignment pairs and 0 >>> metadata columns: >>> seqnames strand : >>> ranges -- ranges >>> <rle> <rle> : >>> <iranges> -- <iranges> >>> [1] 10 + : [75758115, 75802896] >>> -- [75802892, 75830482] >>> [2] 10 - : [75802908, 75830498] >>> -- [75758111, 75802892] >>> [3] 10 - : [75802908, 75830498] >>> -- [75802878, 75830468] >>> --- >>> seqlengths: >>> 1 2 3 ... >>> GL000247.1 GL000248.1 GL000249.1 >>> 249250621 243199373 198022430 ... >>> 36422 39786 38502 >>> >>> >>> readGAlignmentPairs(file, param = >>> ScanBamParam(which = gr)) >>> >>> Error in makeGAlignmentPairs(galn, use.names >>> = use.names, use.mcols = >>> use.mcols) : >>> findMateAlignment() returned an invalid >>> 'mate' vector >>> In addition: Warning message: >>> In findMateAlignment(x) : >>> 4 alignments with ambiguous pairing >>> were dumped. >>> Use 'getDumpedAlignments()' to >>> retrieve them from the dump >>> >>> environment. >>> >>> >>> sessionInfo() >>> >>> R version 3.0.0 (2013-04-03) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 >>> LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 >>> LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 >>> LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C >>> LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 >>> LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices >>> utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] Rsamtools_1.13.39 Biostrings_2.29.16 >>> GenomicRanges_1.13.41 >>> [4] XVector_0.1.1 IRanges_1.19.31 >>> BiocGenerics_0.7.4 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 stats4_3.0.0 >>> zlibbioc_1.7.0 >>> >>> >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> ______________________________**__**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >>> > >>> >>> https://stat.ethz.ch/mailman/***__*listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" *__*listinfo="" bioconductor=""> >>> >>> <https: stat.ethz.ch="" mailman="" ****listinfo="" bioconductor<https:="" st="" at.ethz.ch="" mailman="" **listinfo="" bioconductor=""> >>> ><**https:/__/stat.ethz.ch/**mailman/__listinfo/**bioconductor<htt p:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>> >>> >>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat="" .ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> >> >>> Search the archives: >>> >>> http://news.gmane.org/gmane.****__science.biology.informatics.** >>> **__conductor<http: news.gmane.org="" gmane.**__science.biology.info="" rmatics.**__conductor=""> >>> >>> <http: news.gmane.org="" gmane.****science.biology.informatics.***="">>> *conductor<http: news.gmane.org="" gmane.**science.biology.informati="" cs.**conductor=""> >>> ><http: news.gmane.**__org="" gmane.science.biology.__**="">>> informatics.conductor >>> >>> >>> <http: news.gmane.org="" gmane.**science.biology.informatics.**condu="" ctor<http:="" news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________** >>> __**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >>> > >>> >>> https://stat.ethz.ch/mailman/***__*listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" *__*listinfo="" bioconductor=""> >>> >>> <https: stat.ethz.ch="" mailman="" ****listinfo="" bioconductor<https:="" st="" at.ethz.ch="" mailman="" **listinfo="" bioconductor=""> >>> ><**https:/__/stat.ethz.ch/**mailman/__listinfo/**bioconductor<htt p:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>> >>> <https: stat.ethz.ch="" mailman="" **="">>> listinfo/bioconductor<https: stat.ethz.ch="" mailman="" listinfo="" biocon="" ductor=""> >>> >> >>> Search the archives: http://news.gmane.org/gmane.** >>> >>> science.biology.informatics.****__conductor<http: news.gmane.**="">>> __org/gmane.science.biology.__**informatics.conductor >>> >>> >>> <http: news.gmane.org="" gmane.**science.biology.informatics.**condu="" ctor<http:="" news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >> >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> ______________________________**___________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org="">> >>> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<="" https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> > >>> Search the archives: >>> >>> http://news.gmane.org/gmane.__**science.biology.informatics.__** >>> conductor<http: news.gmane.org="" gmane.__science.biology.informatic="" s.__conductor=""> >>> >>> <http: news.gmane.org="" gmane.**science.biology.informatics.**condu="" ctor<http:="" news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> > >>> >>> >>> >>> >>> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD REPLYlink written 5.2 years ago by Michael Lawrence10k
Just to clarify, I was only suggesting the 'tiles' arg as a simple way to process a chromosome by chunks/tiles, which seems to be a common use case. 'tiles' is also trivial to split over multiple cpus. 'which' is harder, because of the risk to process twice the same read. Using a hash to work around the problem is probably very hard in the multiple-cpu context. H. On 09/17/2013 09:10 PM, Michael Lawrence wrote: > So here's a use case: get all the reads for a set of putative > transcribed regions, numbering in the thousands. You don't want just the > reads that start in the regions, and you want to process the reads in > vectorized fashion, ideally with a single query to the BAM, returning a > GAlignmentPairs. > > There are at least two ways to do this: > a) Have readGAlignmentPairs and friends return non-redundant alignments, > even with multiple which, > b) Query the BAM separately (like in an lapply) for each which, then > combine the results, and keep around a partitioning that records the > original 'which' for each alignment. This would be slower and require > more work of the user, but it would work right now. > > Hopefully others have additional ideas. > > Michael > > > > > > > > On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > On 09/17/2013 04:39 PM, Hervé Pagès wrote: > > On 09/17/2013 04:15 PM, Michael Lawrence wrote: > > Leonard was thinking that the problem of importing the same > alignment > multiple times when caught by multiple which ranges could be > solved by > maintaining a hash of what reads have been seen before. This > could be > done with the read ID + pos + strand + cigar, we think, > > > FWIW, I've seen BAM files with more than 1 record sharing the same > QNAME + POS + strand + CIGAR. > > The 'Ambiguous pairing' section in the man page for > findMateAlignment() > shows such an example. It was taken from a real BAM file. IIRC the 2 > records were actually indistinguible. > > but even easier > would be to get at the file offset. It looks like samtools > keeps that > hidden, but we could just do some copy-and-pasting of the > bam_iter_t > struct... > > > IIUC the common use case is processing a BAM file by chromosome > chunks. > Are you saying the record ID would be returned by scanBam or > readGAlignemnts() so the end/user has a way to discard records that > have already been processed. > > Note that with a small change of semantic to the 'which' > argument, you > don't need to return a record ID and the end-user needs to do > nothing: > the same record cannot be loaded twice. > > > To be even clearer about this: with its current semantic, 'which' > doesn't feel like the right thing to use to process a chromosome "by > tiles". Instead of trying to fix it (by changing its semantic or by > maintaining a hash of what reads have been seen before), maybe it > would make more sense to add a 'tiles' (or 'tiling') argument to > ScanBamParam() that would also take a GRanges object. Then scanBam() > would load the records that have their POS in the tiles and > scanBam(asMates=TRUE) would load the records that have their POS > in the tiles and are *first* mate and it would try to pair them > (beyond 'tiles' if needed). > > It would be enough to enforce 'tiles' to have non-overlapping ranges > but it would not be necessary to require it to be a "real" tiling (i.e. > a set of adjacent ranges that cover the entire chromosome or genome). > > In addition, deciding whether a record belongs to 'tiles' would be > cheaper (and faster) than deciding whether it overlaps with > 'which' because there is no more need to compute the end of the > record based on its CIGAR. > > > H. > > > H. > > > > > > On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence > <michafla at="" gene.com="" <mailto:michafla="" at="" gene.com=""> > <mailto:michafla at="" gene.com="" <mailto:michafla="" at="" gene.com="">>> wrote: > > I'm a big fan of the new paired-end streaming. It's the > way to go > over which-based partitioning. Nice job! > > > On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain > <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">>> wrote: > > The new pairing algorithm used by > readGAlignmentsList() does > find mates outside the specified range. When only > one of the > pairs overlaps the 'which', it looks for a > potential mate in the > rest of the file. > > fl <- system.file("extdata", "ex1.bam", > package="Rsamtools") > param <- ScanBamParam(what="flag", > which=GRanges("seq1", IRanges(1, > width=40))) > > readGAlignmentPairs() finds no mates in this region. > > gapairs <- readGAlignmentPairs(BamFile(____fl), > param=param) > > gapairs > > GAlignmentPairs with 0 alignment pairs and 0 > metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > > readGAlignmentsList() finds two mate pairs and > several non-mates > in this region. The 'mates' metadata column > indicates which are > legitimate mates as per the algorithem described at > ?readGAlignmentsListFromBam. More control over what > records are > returned can be done by specifying flags in > ScanBamParam. > > galist <- readGAlignmentsList(BamFile(____fl, > asMates=TRUE), > param=param) > > galist > GAlignmentsList of length 17: > [[1]] > GAlignments with 2 alignments and 2 metadata columns: > seqnames strand cigar qwidth start end width > ngap | mates > flag > [1] seq1 - 35M 35 229 263 35 > <tel:35%20229%20263%2035> > <tel:35%20%20%20229%20263%20%__20%20%2035> 0 | > 1 83 > [2] seq1 + 35M 35 37 71 35 > 0 | 1 > 163 > > [[2]] > GAlignments with 2 alignments and 2 metadata columns: > seqnames strand cigar qwidth start end width > ngap | mates > flag > [1] seq1 - 35M 35 185 219 35 > <tel:35%20185%20219%2035> > <tel:35%20%20%20185%20219%20%__20%20%2035> 0 | > 1 147 > [2] seq1 + 35M 35 36 70 35 > 0 | 1 > 99 > > [[3]] > GAlignments with 1 alignment and 2 metadata columns: > seqnames strand cigar qwidth start end width > ngap | mates > flag > [1] seq1 + 36M 36 6 41 36 > 0 | 0 > 137 > > > > elementLengths(galist) > [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > > The first two list elements contain mates. > > sapply(galist, function(x) > sum((mcols(x)$mates)) > 0) > > [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE > FALSE FALSE > FALSE FALSE FALSE > [13] FALSE FALSE FALSE FALSE FALSE > > > Here are descriptions of the first few flags as per > this site: > http://picard.sourceforge.net/____explain-flags.html > <http: picard.sourceforge.net="" __explain-flags.html=""> > <http: picard.sourceforge.__net="" explain-="" flags.html=""> <http: picard.sourceforge.net="" explain-flags.html="">> > Flag 137 indicates an unmapped mate which is why > this record was > not mated. > > > 83: > Summary: > read paired > read mapped in proper pair > read reverse strand > first in pair > > 163: > Summary: > read paired > read mapped in proper pair > mate reverse strand > second in pair > > 147: > Summary: > read paired > read mapped in proper pair > read reverse strand > second in pair > > 99: > Summary: > read paired > read mapped in proper pair > mate reverse strand > first in pair > > 137: > Summary: > read paired > mate unmapped > second in pair > > > Valerie > > > > On 09/17/2013 01:55 PM, Michael Lawrence wrote: > > Yes, I knew there weren't any easy solutions. I > just think > that somehow we > need to point these out to users, because it's > not exactly > obvious. > > > On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès > <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> wrote: > > Hi guys, > > Sorry for the late answer. > > > On 09/11/2013 11:51 AM, Michael Lawrence wrote: > > There seem to be at least two "gotchas" > with the > 'which' argument and > readGAlignmentPairs: (1) as Leonard > said, there's no > easy way to handle > cases where one record overlaps > multiple which > arguments > > > This is not just with readGAlignmentPairs, > it's also with > readGAlignments and, at a lower level, with > scanBam(), > which > the 2 formers are based on. If a record > overlaps 2 > ranges in the > 'which' arg, scanBam() loads it twice. > > I can't think of an easy way to address > this unless we > change the > semantic of the 'which' argument. For > example instead of > loading > records that overlap we could make the > choice to only > load records > that have their POS (i.e. start) within the > ranges > specified thru > 'which'. It introduces a dissymmetry > (because it looks > at the start > of the record and not at its end) but it > has the > advantage of making > sure records are only loaded once (unless > the user > passes a 'which' > argument that contains overlapping ranges > but then > s/he's really > looking for it ;-) ). > > > and (2) any read > > pairs with one end inside which and the > other out > will be discarded. > > > No easy work around for that one with the > current > implementation of > readGAlignmentPairs which just passes its > 'which' > argument down to > scanBam(). > > > Excluding per-chromosome iteration, > 'which' is a bit > dangerous when > > dealing > with read pairs. > > > It's also dangerous when dealing with > single end reads, > and it has > been for a while. > > Cheers, > H. > > > Somehow we should make this obvious to > the user. > > > > On Wed, Sep 11, 2013 at 11:02 AM, > Leonard Goldstein < > goldstein.leonard at gene.com <mailto:goldstein.leonard at="" gene.com=""> > <mailto:goldstein.leonard at="" __gene.com=""> <mailto:goldstein.leonard at="" gene.com="">>> wrote: > > Dear Herve and list, > > > > I realized the error might have > been caused due > to the same bam record > being read in multiple times when > passing > multiple ranges in the which > argument. > > > I don't think it is necessarily > obvious to users > that > readGAlignments/Pairs > returns a concatenation of records > that were > read in for each range in > the > which argument. Instead one might > expect to > obtain the union of records > that overlap any of the ranges > included in > which. Do you think it would > be > helpful to issue a warning in case > records are > read in multiple times? > > > Best wishes, > > > Leonard > > > > On Tue, Sep 10, 2013 at 5:58 PM, > Leonard > Goldstein <goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com=""> > <mailto:goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com="">> > > wrote: > > > Hi Herve, > > > > I ran into some issues when using > readGAlignmentPairs with the > ScanBamParam 'which' argument. > I included an > example below. I can see > how > changing 'which' can result in > different > pairings but was wondering > > whether > > the error can be avoided and > problematic > pairings dropped instead. > > > It looks like the issue is > introduced in > revision 77808 (no error in > 77791). I was hoping you have > an idea what > causes the problem, otherwise > > I > > can try to forward a test case. > > > Thanks for your help. > > > Leonard > > > gr > > > GRanges with 2 ranges and 0 > metadata columns: > seqnames > ranges strand > <rle> > <iranges> <rle> > [1] 10 [75758072, > 75758133] * > [2] 10 [75802841, > 75802911] * > --- > seqlengths: > 10 > NA > > > readGAlignmentPairs(file, > param = > ScanBamParam(which = gr[1])) > > GAlignmentPairs with 0 > alignment pairs and 0 > metadata columns: > seqnames strand : ranges > -- ranges > <rle> <rle> : <iranges> > -- <iranges> > --- > seqlengths: > 1 2 > 3 ... > GL000247.1 GL000248.1 GL000249.1 > 249250621 243199373 > 198022430 ... > 36422 39786 38502 > > > readGAlignmentPairs(file, > param = > ScanBamParam(which = gr[2])) > > GAlignmentPairs with 3 > alignment pairs and 0 > metadata columns: > seqnames strand : > ranges -- ranges > <rle> <rle> : > <iranges> -- <iranges> > [1] 10 + : > [75758115, 75802896] > -- [75802892, 75830482] > [2] 10 - : > [75802908, 75830498] > -- [75758111, 75802892] > [3] 10 - : > [75802908, 75830498] > -- [75802878, 75830468] > --- > seqlengths: > 1 2 > 3 ... > GL000247.1 GL000248.1 GL000249.1 > 249250621 243199373 > 198022430 ... > 36422 39786 38502 > > > readGAlignmentPairs(file, > param = > ScanBamParam(which = gr)) > > Error in > makeGAlignmentPairs(galn, use.names > = use.names, use.mcols = > use.mcols) : > findMateAlignment() > returned an invalid > 'mate' vector > In addition: Warning message: > In findMateAlignment(x) : > 4 alignments with > ambiguous pairing > were dumped. > Use > 'getDumpedAlignments()' to > retrieve them from the dump > > environment. > > > sessionInfo() > > R version 3.0.0 (2013-04-03) > Platform: > x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 > LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 > LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C > LC_NAME=C > [9] LC_ADDRESS=C > LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats > graphics grDevices > utils datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.13.39 > Biostrings_2.29.16 > GenomicRanges_1.13.41 > [4] XVector_0.1.1 > IRanges_1.19.31 > BiocGenerics_0.7.4 > > loaded via a namespace (and not > attached): > [1] bitops_1.0-6 stats4_3.0.0 > zlibbioc_1.7.0 > > > > > > [[alternative HTML > version deleted]] > > > __________________________________**_________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > https://stat.ethz.ch/mailman/*____*listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">>< __https:/__/stat.ethz.ch/__mailman/__listinfo/__bioconductor > <http: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>> > Search the archives: > > http://news.gmane.org/gmane.**____science.biology.inform atics.__**__conductor > <http: news.gmane.org="" gmane.**__science.biology.informa="" tics.**__conductor=""> > > <http: news.gmane.org="" gmane.*__*science.biology.informa="" tics.*__*conductor=""> <http: news.gmane.org="" gmane.**science.biology.informati="" cs.**conductor="">><http: news.gmane.____org="" gmane.science.biology.____i="" nformatics.conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>> > > > [[alternative HTML version > deleted]] > > > __________________________________**_________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > https://stat.ethz.ch/mailman/*____*listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">>< __https:/__/stat.ethz.ch/__mailman/__listinfo/__bioconductor > <http: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>> > Search the archives: > http://news.gmane.org/gmane.** > > science.biology.informatics.**____conductor<http: news.="" gmane.____org="" gmane.science.biology.____informatics.conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> > Phone: (206) 667-5791 > <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 > <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> > > > [[alternative HTML version deleted]] > > > > ___________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > > http://news.gmane.org/gmane.____science.biology.informat ics.____conductor > <http: news.gmane.org="" gmane.__science.biology.informati="" cs.__conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 5.2 years ago by Hervé Pagès ♦♦ 13k
For processing an entire BAM, I think the yieldSize streaming is the way to go. Each read is read only once, and there is now the option of guaranteed pairs. On Tue, Sep 17, 2013 at 10:43 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Just to clarify, I was only suggesting the 'tiles' arg as a simple way > to process a chromosome by chunks/tiles, which seems to be a common use > case. 'tiles' is also trivial to split over multiple cpus. 'which' is > harder, because of the risk to process twice the same read. Using a hash > to work around the problem is probably very hard in the multiple-cpu > context. > > H. > > > > On 09/17/2013 09:10 PM, Michael Lawrence wrote: > >> So here's a use case: get all the reads for a set of putative >> transcribed regions, numbering in the thousands. You don't want just the >> reads that start in the regions, and you want to process the reads in >> vectorized fashion, ideally with a single query to the BAM, returning a >> GAlignmentPairs. >> >> There are at least two ways to do this: >> a) Have readGAlignmentPairs and friends return non-redundant alignments, >> even with multiple which, >> b) Query the BAM separately (like in an lapply) for each which, then >> combine the results, and keep around a partitioning that records the >> original 'which' for each alignment. This would be slower and require >> more work of the user, but it would work right now. >> >> Hopefully others have additional ideas. >> >> Michael >> >> >> >> >> >> >> >> On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> wrote: >> >> On 09/17/2013 04:39 PM, Hervé Pagès wrote: >> >> On 09/17/2013 04:15 PM, Michael Lawrence wrote: >> >> Leonard was thinking that the problem of importing the same >> alignment >> multiple times when caught by multiple which ranges could be >> solved by >> maintaining a hash of what reads have been seen before. This >> could be >> done with the read ID + pos + strand + cigar, we think, >> >> >> FWIW, I've seen BAM files with more than 1 record sharing the same >> QNAME + POS + strand + CIGAR. >> >> The 'Ambiguous pairing' section in the man page for >> findMateAlignment() >> shows such an example. It was taken from a real BAM file. IIRC >> the 2 >> records were actually indistinguible. >> >> but even easier >> would be to get at the file offset. It looks like samtools >> keeps that >> hidden, but we could just do some copy-and-pasting of the >> bam_iter_t >> struct... >> >> >> IIUC the common use case is processing a BAM file by chromosome >> chunks. >> Are you saying the record ID would be returned by scanBam or >> readGAlignemnts() so the end/user has a way to discard records >> that >> have already been processed. >> >> Note that with a small change of semantic to the 'which' >> argument, you >> don't need to return a record ID and the end-user needs to do >> nothing: >> the same record cannot be loaded twice. >> >> >> To be even clearer about this: with its current semantic, 'which' >> doesn't feel like the right thing to use to process a chromosome "by >> tiles". Instead of trying to fix it (by changing its semantic or by >> maintaining a hash of what reads have been seen before), maybe it >> would make more sense to add a 'tiles' (or 'tiling') argument to >> ScanBamParam() that would also take a GRanges object. Then scanBam() >> would load the records that have their POS in the tiles and >> scanBam(asMates=TRUE) would load the records that have their POS >> in the tiles and are *first* mate and it would try to pair them >> (beyond 'tiles' if needed). >> >> It would be enough to enforce 'tiles' to have non-overlapping ranges >> but it would not be necessary to require it to be a "real" tiling >> (i.e. >> a set of adjacent ranges that cover the entire chromosome or genome). >> >> In addition, deciding whether a record belongs to 'tiles' would be >> cheaper (and faster) than deciding whether it overlaps with >> 'which' because there is no more need to compute the end of the >> record based on its CIGAR. >> >> >> H. >> >> >> H. >> >> >> >> >> >> On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence >> <michafla@gene.com <mailto:michafla@gene.com=""> >> <mailto:michafla@gene.com <mailto:michafla@gene.com="">>> wrote: >> >> I'm a big fan of the new paired-end streaming. It's the >> way to go >> over which-based partitioning. Nice job! >> >> >> On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain >> <vobencha@fhcrc.org <mailto:vobencha@fhcrc.org=""> >> <mailto:vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">>> >> wrote: >> >> The new pairing algorithm used by >> readGAlignmentsList() does >> find mates outside the specified range. When only >> one of the >> pairs overlaps the 'which', it looks for a >> potential mate in the >> rest of the file. >> >> fl <- system.file("extdata", "ex1.bam", >> package="Rsamtools") >> param <- ScanBamParam(what="flag", >> which=GRanges("seq1", >> IRanges(1, >> width=40))) >> >> readGAlignmentPairs() finds no mates in this region. >> >> gapairs <- readGAlignmentPairs(BamFile(__**__fl), >> >> param=param) >> > gapairs >> >> GAlignmentPairs with 0 alignment pairs and 0 >> metadata columns: >> seqnames strand : ranges -- ranges >> <rle> <rle> : <iranges> -- <iranges> >> >> readGAlignmentsList() finds two mate pairs and >> several non-mates >> in this region. The 'mates' metadata column >> indicates which are >> legitimate mates as per the algorithem described at >> ?readGAlignmentsListFromBam. More control over what >> records are >> returned can be done by specifying flags in >> ScanBamParam. >> >> galist <- readGAlignmentsList(BamFile(__**__fl, >> >> asMates=TRUE), >> param=param) >> > galist >> GAlignmentsList of length 17: >> [[1]] >> GAlignments with 2 alignments and 2 metadata columns: >> seqnames strand cigar qwidth start end width >> ngap | mates >> flag >> [1] seq1 - 35M 35 229 263 35 >> <tel:35%20229%20263%2035> >> <tel:35%20%20%20229%20263%20%_**_20%20%2035> 0 | >> >> 1 83 >> [2] seq1 + 35M 35 37 71 35 >> 0 | 1 >> 163 >> >> [[2]] >> GAlignments with 2 alignments and 2 metadata columns: >> seqnames strand cigar qwidth start end width >> ngap | mates >> flag >> [1] seq1 - 35M 35 185 219 35 >> <tel:35%20185%20219%2035> >> <tel:35%20%20%20185%20219%20%_**_20%20%2035> 0 | >> >> 1 147 >> [2] seq1 + 35M 35 36 70 35 >> 0 | 1 >> 99 >> >> [[3]] >> GAlignments with 1 alignment and 2 metadata columns: >> seqnames strand cigar qwidth start end width >> ngap | mates >> flag >> [1] seq1 + 36M 36 6 41 36 >> 0 | 0 >> 137 >> >> >> > elementLengths(galist) >> [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 >> >> The first two list elements contain mates. >> >> sapply(galist, function(x) >> sum((mcols(x)$mates)) > 0) >> >> [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE >> FALSE FALSE >> FALSE FALSE FALSE >> [13] FALSE FALSE FALSE FALSE FALSE >> >> >> Here are descriptions of the first few flags as per >> this site: >> http://picard.sourceforge.net/**____explain- flags.html<http: picard.sourceforge.net="" ____explain-flags.html=""> >> <http: picard.sourceforge.**net="" __explain-="" flags.html<http:="" picard.sourceforge.net="" __explain-flags.html=""> >> > >> <http: picard.sourceforge.__**="">> net/explain-flags.html >> >> <http: picard.sourceforge.**net="" explain-="" flags.html<http:="" picard.sourceforge.net="" explain-flags.html=""> >> >> >> Flag 137 indicates an unmapped mate which is why >> this record was >> not mated. >> >> >> 83: >> Summary: >> read paired >> read mapped in proper pair >> read reverse strand >> first in pair >> >> 163: >> Summary: >> read paired >> read mapped in proper pair >> mate reverse strand >> second in pair >> >> 147: >> Summary: >> read paired >> read mapped in proper pair >> read reverse strand >> second in pair >> >> 99: >> Summary: >> read paired >> read mapped in proper pair >> mate reverse strand >> first in pair >> >> 137: >> Summary: >> read paired >> mate unmapped >> second in pair >> >> >> Valerie >> >> >> >> On 09/17/2013 01:55 PM, Michael Lawrence wrote: >> >> Yes, I knew there weren't any easy solutions. I >> just think >> that somehow we >> need to point these out to users, because it's >> not exactly >> obvious. >> >> >> On Tue, Sep 17, 2013 at 11:52 AM, Hervé Pagès >> <hpages@fhcrc.org <mailto:hpages@fhcrc.org=""> >> <mailto:hpages@fhcrc.org <mailto:hpages@fhcrc.org="">>> wrote: >> >> Hi guys, >> >> Sorry for the late answer. >> >> >> On 09/11/2013 11:51 AM, Michael Lawrence >> wrote: >> >> There seem to be at least two "gotchas" >> with the >> 'which' argument and >> readGAlignmentPairs: (1) as Leonard >> said, there's no >> easy way to handle >> cases where one record overlaps >> multiple which >> arguments >> >> >> This is not just with readGAlignmentPairs, >> it's also with >> readGAlignments and, at a lower level, with >> scanBam(), >> which >> the 2 formers are based on. If a record >> overlaps 2 >> ranges in the >> 'which' arg, scanBam() loads it twice. >> >> I can't think of an easy way to address >> this unless we >> change the >> semantic of the 'which' argument. For >> example instead of >> loading >> records that overlap we could make the >> choice to only >> load records >> that have their POS (i.e. start) within the >> ranges >> specified thru >> 'which'. It introduces a dissymmetry >> (because it looks >> at the start >> of the record and not at its end) but it >> has the >> advantage of making >> sure records are only loaded once (unless >> the user >> passes a 'which' >> argument that contains overlapping ranges >> but then >> s/he's really >> looking for it ;-) ). >> >> >> and (2) any read >> >> pairs with one end inside which and the >> other out >> will be discarded. >> >> >> No easy work around for that one with the >> current >> implementation of >> readGAlignmentPairs which just passes its >> 'which' >> argument down to >> scanBam(). >> >> >> Excluding per-chromosome iteration, >> 'which' is a bit >> dangerous when >> >> dealing >> with read pairs. >> >> >> It's also dangerous when dealing with >> single end reads, >> and it has >> been for a while. >> >> Cheers, >> H. >> >> >> Somehow we should make this obvious to >> the user. >> >> >> >> On Wed, Sep 11, 2013 at 11:02 AM, >> Leonard Goldstein < >> goldstein.leonard@gene.com <mailto:goldstein.leonard@**>> gene.com <goldstein.leonard@gene.com>> >> <mailto:goldstein.leonard@__ge**ne.com<http: gene.com=""> >> >> <mailto:goldstein.leonard@**gene.com<goldstein.leonard@gene.com>>>> >> wrote: >> >> Dear Herve and list, >> >> >> >> I realized the error might have >> been caused due >> to the same bam record >> being read in multiple times when >> passing >> multiple ranges in the which >> argument. >> >> >> I don't think it is necessarily >> obvious to users >> that >> readGAlignments/Pairs >> returns a concatenation of records >> that were >> read in for each range in >> the >> which argument. Instead one might >> expect to >> obtain the union of records >> that overlap any of the ranges >> included in >> which. Do you think it would >> be >> helpful to issue a warning in case >> records are >> read in multiple times? >> >> >> Best wishes, >> >> >> Leonard >> >> >> >> On Tue, Sep 10, 2013 at 5:58 PM, >> Leonard >> Goldstein <goldstel@gene.com>> <mailto:goldstel@gene.com> >> <mailto:goldstel@gene.com>> >> <mailto:goldstel@gene.com>> >> >> wrote: >> >> >> Hi Herve, >> >> >> >> I ran into some issues when using >> readGAlignmentPairs with the >> ScanBamParam 'which' argument. >> I included an >> example below. I can see >> how >> changing 'which' can result in >> different >> pairings but was wondering >> >> whether >> >> the error can be avoided and >> problematic >> pairings dropped instead. >> >> >> It looks like the issue is >> introduced in >> revision 77808 (no error in >> 77791). I was hoping you have >> an idea what >> causes the problem, otherwise >> >> I >> >> can try to forward a test case. >> >> >> Thanks for your help. >> >> >> Leonard >> >> >> gr >> >> >> GRanges with 2 ranges and 0 >> metadata columns: >> seqnames >> ranges strand >> <rle> >> <iranges> <rle> >> [1] 10 [75758072, >> 75758133] * >> [2] 10 [75802841, >> 75802911] * >> --- >> seqlengths: >> 10 >> NA >> >> >> readGAlignmentPairs(file, >> param = >> ScanBamParam(which = gr[1])) >> >> GAlignmentPairs with 0 >> alignment pairs and 0 >> metadata columns: >> seqnames strand : ranges >> -- ranges >> <rle> <rle> : <iranges> >> -- <iranges> >> --- >> seqlengths: >> 1 2 >> 3 ... >> GL000247.1 GL000248.1 GL000249.1 >> 249250621 243199373 >> 198022430 ... >> 36422 39786 38502 >> >> >> readGAlignmentPairs(file, >> param = >> ScanBamParam(which = gr[2])) >> >> GAlignmentPairs with 3 >> alignment pairs and 0 >> metadata columns: >> seqnames strand : >> ranges -- ranges >> <rle> <rle> : >> <iranges> -- >> <iranges> >> [1] 10 + : >> [75758115, 75802896] >> -- [75802892, 75830482] >> [2] 10 - : >> [75802908, 75830498] >> -- [75758111, 75802892] >> [3] 10 - : >> [75802908, 75830498] >> -- [75802878, 75830468] >> --- >> seqlengths: >> 1 2 >> 3 ... >> GL000247.1 GL000248.1 GL000249.1 >> 249250621 243199373 >> 198022430 ... >> 36422 39786 38502 >> >> >> readGAlignmentPairs(file, >> param = >> ScanBamParam(which = gr)) >> >> Error in >> makeGAlignmentPairs(galn, use.names >> = use.names, use.mcols = >> use.mcols) : >> findMateAlignment() >> returned an invalid >> 'mate' vector >> In addition: Warning message: >> In findMateAlignment(x) : >> 4 alignments with >> ambiguous pairing >> were dumped. >> Use >> 'getDumpedAlignments()' to >> retrieve them from the dump >> >> environment. >> >> >> sessionInfo() >> >> R version 3.0.0 (2013-04-03) >> Platform: >> x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 >> LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 >> LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 >> LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C >> LC_NAME=C >> [9] LC_ADDRESS=C >> LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 >> LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats >> graphics grDevices >> utils datasets methods >> [8] base >> >> other attached packages: >> [1] Rsamtools_1.13.39 >> Biostrings_2.29.16 >> GenomicRanges_1.13.41 >> [4] XVector_0.1.1 >> IRanges_1.19.31 >> BiocGenerics_0.7.4 >> >> loaded via a namespace (and not >> attached): >> [1] bitops_1.0-6 stats4_3.0.0 >> zlibbioc_1.7.0 >> >> >> >> >> >> [[alternative HTML >> version deleted]] >> >> >> ______________________________**____**_________________ >> >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**>> project.org <bioconductor@r-project.org>> >> >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> >> >> >> https://stat.ethz.ch/mailman/***____*listinfo/bioconduc tor<https: stat.ethz.ch="" mailman="" *____*listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" ***__*listinfo="" bioconduct="" or<https:="" stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor=""> >> > >> >> <https: stat.ethz.ch="" mailman="" **__**listinfo="" bioconduct="" or<https:="" stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" ****listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor=""> >> >><__**https:/__/stat.ethz.ch/__**mailman/__listinfo/__**bioconduct or<http: stat.ethz.ch="" __mailman="" __listinfo="" __bioconductor=""> >> <http: stat.ethz.ch="" mailman="" _**_listinfo="" bioconductor<="" http:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> > >> >> >> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<h="" ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> >>> >> Search the archives: >> >> http://news.gmane.org/gmane.****____science.biology.** >> informatics.__**__conductor<http: news.gmane.org="" gmane.**____scien="" ce.biology.informatics.__**__conductor=""> >> <http: news.gmane.org="" gmane.****__science.biology.**="">> informatics.**__conductor<http: news.gmane.org="" gmane.**__science.b="" iology.informatics.**__conductor=""> >> > >> >> <http: news.gmane.org="" gmane.***__*science.biology.**="">> informatics.*__*conductor<http: news.gmane.org="" gmane.*__*science.b="" iology.informatics.*__*conductor=""> >> <http: news.gmane.org="" gmane.***="">> *science.biology.informatics.****conductor<http: news.gmane.org="" gm="" ane.**science.biology.informatics.**conductor=""> >> >><http: news.**gmane.="" <http:="" news.gmane.="">____org/gmane.science.** >> biology.____informatics.**conductor >> >> >> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor=""> >> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >>> >> >> >> [[alternative HTML version >> deleted]] >> >> >> ______________________________**____**_________________ >> >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**>> project.org <bioconductor@r-project.org>> >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> >> >> >> https://stat.ethz.ch/mailman/***____*listinfo/bioconduc tor<https: stat.ethz.ch="" mailman="" *____*listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" ***__*listinfo="" bioconduct="" or<https:="" stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor=""> >> > >> >> <https: stat.ethz.ch="" mailman="" **__**listinfo="" bioconduct="" or<https:="" stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" ****listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" **listinfo="" bioconductor=""> >> >><__**https:/__/stat.ethz.ch/__**mailman/__listinfo/__**bioconduct or<http: stat.ethz.ch="" __mailman="" __listinfo="" __bioconductor=""> >> <http: stat.ethz.ch="" mailman="" _**_listinfo="" bioconductor<="" http:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> > >> >> >> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<h="" ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> >>> >> Search the archives: >> http://news.gmane.org/gmane.** >> >> science.biology.informatics.****____conductor<http: news.**="">> gmane. <http: news.gmane.="">____org/gmane.science.** >> biology.____informatics.**conductor >> >> >> >> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >>> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org >> <mailto:hpages@fhcrc.org> <mailto:hpages@fhcrc.org>> >> <mailto:hpages@fhcrc.org>> >> Phone: (206) 667-5791 >> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 >> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> >> >> >> [[alternative HTML version deleted]] >> >> >> >> ______________________________** >> _____________________ >> >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**>> project.org <bioconductor@r-project.org>> >> <mailto:bioconductor@r-__**project.org<bioconductor@r-__project.org> >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> >> >> https://stat.ethz.ch/mailman/_**___listinfo/bioconducto r<https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> > >> >> >> <https: stat.ethz.ch="" mailman="" **__listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<h="" ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> >> >> Search the archives: >> >> http://news.gmane.org/gmane.__** >> __science.biology.informatics.**____conductor<http: news.gmane.org="" gmane.____science.biology.informatics.____conductor=""> >> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor=""> >> > >> >> >> <http: news.gmane.org="" gmane._**="">> _science.biology.informatics._**_conductor<http: news.gmane.org="" gm="" ane.__science.biology.informatics.__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> >> >> >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD REPLYlink written 5.2 years ago by Michael Lawrence10k
Yes yieldSize is the way to go for streaming of course (that's what it has been designed for). But for simple parallelization schemes, it's nice to have a simple way to partition your BAM file *ahead* of time, which neither yieldSize or 'which' provide. H. On 09/18/2013 09:22 AM, Michael Lawrence wrote: > For processing an entire BAM, I think the yieldSize streaming is the way > to go. Each read is read only once, and there is now the option of > guaranteed pairs. > > > > > On Tue, Sep 17, 2013 at 10:43 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Just to clarify, I was only suggesting the 'tiles' arg as a simple way > to process a chromosome by chunks/tiles, which seems to be a common use > case. 'tiles' is also trivial to split over multiple cpus. 'which' is > harder, because of the risk to process twice the same read. Using a hash > to work around the problem is probably very hard in the multiple-cpu > context. > > H. > > > > On 09/17/2013 09:10 PM, Michael Lawrence wrote: > > So here's a use case: get all the reads for a set of putative > transcribed regions, numbering in the thousands. You don't want > just the > reads that start in the regions, and you want to process the > reads in > vectorized fashion, ideally with a single query to the BAM, > returning a > GAlignmentPairs. > > There are at least two ways to do this: > a) Have readGAlignmentPairs and friends return non-redundant > alignments, > even with multiple which, > b) Query the BAM separately (like in an lapply) for each which, then > combine the results, and keep around a partitioning that records the > original 'which' for each alignment. This would be slower and > require > more work of the user, but it would work right now. > > Hopefully others have additional ideas. > > Michael > > > > > > > > On Tue, Sep 17, 2013 at 5:46 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> wrote: > > On 09/17/2013 04:39 PM, Hervé Pagès wrote: > > On 09/17/2013 04:15 PM, Michael Lawrence wrote: > > Leonard was thinking that the problem of importing > the same > alignment > multiple times when caught by multiple which ranges > could be > solved by > maintaining a hash of what reads have been seen > before. This > could be > done with the read ID + pos + strand + cigar, we think, > > > FWIW, I've seen BAM files with more than 1 record > sharing the same > QNAME + POS + strand + CIGAR. > > The 'Ambiguous pairing' section in the man page for > findMateAlignment() > shows such an example. It was taken from a real BAM > file. IIRC the 2 > records were actually indistinguible. > > but even easier > would be to get at the file offset. It looks like > samtools > keeps that > hidden, but we could just do some copy-and- pasting > of the > bam_iter_t > struct... > > > IIUC the common use case is processing a BAM file by > chromosome > chunks. > Are you saying the record ID would be returned by > scanBam or > readGAlignemnts() so the end/user has a way to discard > records that > have already been processed. > > Note that with a small change of semantic to the 'which' > argument, you > don't need to return a record ID and the end-user needs > to do > nothing: > the same record cannot be loaded twice. > > > To be even clearer about this: with its current semantic, > 'which' > doesn't feel like the right thing to use to process a > chromosome "by > tiles". Instead of trying to fix it (by changing its > semantic or by > maintaining a hash of what reads have been seen before), > maybe it > would make more sense to add a 'tiles' (or 'tiling') > argument to > ScanBamParam() that would also take a GRanges object. Then > scanBam() > would load the records that have their POS in the tiles and > scanBam(asMates=TRUE) would load the records that have > their POS > in the tiles and are *first* mate and it would try to pair them > (beyond 'tiles' if needed). > > It would be enough to enforce 'tiles' to have > non-overlapping ranges > but it would not be necessary to require it to be a "real" > tiling (i.e. > a set of adjacent ranges that cover the entire chromosome > or genome). > > In addition, deciding whether a record belongs to 'tiles' > would be > cheaper (and faster) than deciding whether it overlaps with > 'which' because there is no more need to compute the end of the > record based on its CIGAR. > > > H. > > > H. > > > > > > On Tue, Sep 17, 2013 at 3:29 PM, Michael Lawrence > <michafla at="" gene.com="" <mailto:michafla="" at="" gene.com=""> > <mailto:michafla at="" gene.com="" <mailto:michafla="" at="" gene.com="">> > <mailto:michafla at="" gene.com=""> <mailto:michafla at="" gene.com=""> <mailto:michafla at="" gene.com=""> <mailto:michafla at="" gene.com="">>>> wrote: > > I'm a big fan of the new paired-end streaming. > It's the > way to go > over which-based partitioning. Nice job! > > > On Tue, Sep 17, 2013 at 2:46 PM, Valerie Obenchain > <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> > <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">>>> wrote: > > The new pairing algorithm used by > readGAlignmentsList() does > find mates outside the specified range. > When only > one of the > pairs overlaps the 'which', it looks for a > potential mate in the > rest of the file. > > fl <- system.file("extdata", "ex1.bam", > package="Rsamtools") > param <- ScanBamParam(what="flag", > > which=GRanges("seq1", IRanges(1, > width=40))) > > readGAlignmentPairs() finds no mates in > this region. > > gapairs <- > readGAlignmentPairs(BamFile(______fl), > > param=param) > > gapairs > > GAlignmentPairs with 0 alignment pairs and 0 > metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > > readGAlignmentsList() finds two mate pairs and > several non-mates > in this region. The 'mates' metadata column > indicates which are > legitimate mates as per the algorithem > described at > ?readGAlignmentsListFromBam. More control > over what > records are > returned can be done by specifying flags in > ScanBamParam. > > galist <- > readGAlignmentsList(BamFile(______fl, > > asMates=TRUE), > param=param) > > galist > GAlignmentsList of length 17: > [[1]] > GAlignments with 2 alignments and 2 > metadata columns: > seqnames strand cigar qwidth start > end width > ngap | mates > flag > [1] seq1 - 35M 35 229 263 35 > <tel:35%20229%20263%2035> > <tel:35%20229%20263%2035> > > <tel:35%20%20%20229%20263%20%____20%20%2035> 0 | > > 1 83 > [2] seq1 + 35M 35 37 > 71 35 > 0 | 1 > 163 > > [[2]] > GAlignments with 2 alignments and 2 > metadata columns: > seqnames strand cigar qwidth start > end width > ngap | mates > flag > [1] seq1 - 35M 35 185 219 35 > <tel:35%20185%20219%2035> > <tel:35%20185%20219%2035> > > <tel:35%20%20%20185%20219%20%____20%20%2035> 0 | > > 1 147 > [2] seq1 + 35M 35 36 > 70 35 > 0 | 1 > 99 > > [[3]] > GAlignments with 1 alignment and 2 > metadata columns: > seqnames strand cigar qwidth start > end width > ngap | mates > flag > [1] seq1 + 36M 36 6 > 41 36 > 0 | 0 > 137 > > > > elementLengths(galist) > [1] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > > The first two list elements contain mates. > > sapply(galist, function(x) > sum((mcols(x)$mates)) > 0) > > [1] TRUE TRUE FALSE FALSE FALSE FALSE > FALSE > FALSE FALSE > FALSE FALSE FALSE > [13] FALSE FALSE FALSE FALSE FALSE > > > Here are descriptions of the first few > flags as per > this site: > http://picard.sourceforge.net/______explain-flags.html > <http: picard.sourceforge.net="" ____explain-flags.html=""> > > <http: picard.sourceforge.__net="" __explain-flags.html=""> <http: picard.sourceforge.net="" __explain-flags.html="">> > > <http: picard.sourceforge.____net="" explain-flags.html=""> > <http: picard.sourceforge.__net="" explain-="" flags.html=""> <http: picard.sourceforge.net="" explain-flags.html="">>> > Flag 137 indicates an unmapped mate which > is why > this record was > not mated. > > > 83: > Summary: > read paired > read mapped in proper pair > read reverse strand > first in pair > > 163: > Summary: > read paired > read mapped in proper pair > mate reverse strand > second in pair > > 147: > Summary: > read paired > read mapped in proper pair > read reverse strand > second in pair > > 99: > Summary: > read paired > read mapped in proper pair > mate reverse strand > first in pair > > 137: > Summary: > read paired > mate unmapped > second in pair > > > Valerie > > > > On 09/17/2013 01:55 PM, Michael Lawrence > wrote: > > Yes, I knew there weren't any easy > solutions. I > just think > that somehow we > need to point these out to users, > because it's > not exactly > obvious. > > > On Tue, Sep 17, 2013 at 11:52 AM, > Hervé Pagès > <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>>> wrote: > > Hi guys, > > Sorry for the late answer. > > > On 09/11/2013 11:51 AM, Michael > Lawrence wrote: > > There seem to be at least two > "gotchas" > with the > 'which' argument and > readGAlignmentPairs: (1) as > Leonard > said, there's no > easy way to handle > cases where one record overlaps > multiple which > arguments > > > This is not just with > readGAlignmentPairs, > it's also with > readGAlignments and, at a lower > level, with > scanBam(), > which > the 2 formers are based on. If a > record > overlaps 2 > ranges in the > 'which' arg, scanBam() loads it twice. > > I can't think of an easy way to > address > this unless we > change the > semantic of the 'which' argument. For > example instead of > loading > records that overlap we could make the > choice to only > load records > that have their POS (i.e. start) > within the > ranges > specified thru > 'which'. It introduces a dissymmetry > (because it looks > at the start > of the record and not at its end) > but it > has the > advantage of making > sure records are only loaded once > (unless > the user > passes a 'which' > argument that contains overlapping > ranges > but then > s/he's really > looking for it ;-) ). > > > and (2) any read > > pairs with one end inside > which and the > other out > will be discarded. > > > No easy work around for that one > with the > current > implementation of > readGAlignmentPairs which just > passes its > 'which' > argument down to > scanBam(). > > > Excluding per-chromosome iteration, > 'which' is a bit > dangerous when > > dealing > with read pairs. > > > It's also dangerous when dealing with > single end reads, > and it has > been for a while. > > Cheers, > H. > > > Somehow we should make this > obvious to > the user. > > > > On Wed, Sep 11, 2013 at 11:02 AM, > Leonard Goldstein < > goldstein.leonard at gene.com <mailto:goldstein.leonard at="" gene.com=""> > <mailto:goldstein.leonard at="" __gene.com=""> <mailto:goldstein.leonard at="" gene.com="">> > <mailto:goldstein.leonard@> <mailto:goldstein.leonard@>__ge__ne.com <http: gene.com=""> > > <mailto:goldstein.leonard at="" __gene.com=""> <mailto:goldstein.leonard at="" gene.com="">>>> wrote: > > Dear Herve and list, > > > > I realized the error might > have > been caused due > to the same bam record > being read in multiple > times when > passing > multiple ranges in the which > argument. > > > I don't think it is > necessarily > obvious to users > that > readGAlignments/Pairs > returns a concatenation of > records > that were > read in for each range in > the > which argument. Instead > one might > expect to > obtain the union of records > that overlap any of the ranges > included in > which. Do you think it would > be > helpful to issue a warning > in case > records are > read in multiple times? > > > Best wishes, > > > Leonard > > > > On Tue, Sep 10, 2013 at > 5:58 PM, > Leonard > Goldstein > <goldstel at="" gene.com="" <mailto:goldstel="" at="" gene.com=""> > <mailto:goldstel at="" gene.com="" <mailto:goldstel="" at="" gene.com="">> > <mailto:goldstel at="" gene.com=""> <mailto:goldstel at="" gene.com=""> > > <mailto:goldstel at="" gene.com="" <mailto:goldstel="" at="" gene.com="">>> > > wrote: > > > Hi Herve, > > > > I ran into some issues > when using > readGAlignmentPairs > with the > ScanBamParam 'which' > argument. > I included an > example below. I can see > how > changing 'which' can > result in > different > pairings but was wondering > > whether > > the error can be > avoided and > problematic > pairings dropped instead. > > > It looks like the issue is > introduced in > revision 77808 (no > error in > 77791). I was hoping > you have > an idea what > causes the problem, > otherwise > > I > > can try to forward a > test case. > > > Thanks for your help. > > > Leonard > > > gr > > > GRanges with 2 ranges > and 0 > metadata columns: > seqnames > ranges strand > <rle> > <iranges> <rle> > [1] 10 > [75758072, > 75758133] * > [2] 10 > [75802841, > 75802911] * > --- > seqlengths: > 10 > NA > > > > readGAlignmentPairs(file, > param = > ScanBamParam(which > = gr[1])) > > GAlignmentPairs with 0 > alignment pairs and 0 > metadata columns: > seqnames strand : > ranges > -- ranges > <rle> <rle> : > <iranges> > -- <iranges> > --- > seqlengths: > 1 2 > 3 ... > GL000247.1 GL000248.1 > GL000249.1 > 249250621 243199373 > 198022430 ... > 36422 39786 > 38502 > > > > readGAlignmentPairs(file, > param = > ScanBamParam(which > = gr[2])) > > GAlignmentPairs with 3 > alignment pairs and 0 > metadata columns: > seqnames strand : > ranges -- > ranges > <rle> <rle> : > <iranges> -- > <iranges> > [1] 10 + : > [75758115, 75802896] > -- [75802892, 75830482] > [2] 10 - : > [75802908, 75830498] > -- [75758111, 75802892] > [3] 10 - : > [75802908, 75830498] > -- [75802878, 75830468] > --- > seqlengths: > 1 2 > 3 ... > GL000247.1 GL000248.1 > GL000249.1 > 249250621 243199373 > 198022430 ... > 36422 39786 > 38502 > > > > readGAlignmentPairs(file, > param = > ScanBamParam(which > = gr)) > > Error in > makeGAlignmentPairs(galn, use.names > = use.names, use.mcols = > use.mcols) : > findMateAlignment() > returned an invalid > 'mate' vector > In addition: Warning > message: > In findMateAlignment(x) : > 4 alignments with > ambiguous pairing > were dumped. > Use > 'getDumpedAlignments()' to > retrieve them from the > dump > > environment. > > > sessionInfo() > > R version 3.0.0 > (2013-04-03) > Platform: > x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] > LC_CTYPE=en_US.UTF-8 > LC_NUMERIC=C > [3] > LC_TIME=en_US.UTF-8 > LC_COLLATE=en_US.UTF-8 > [5] > LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C > LC_NAME=C > [9] LC_ADDRESS=C > LC_TELEPHONE=C > [11] > LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats > graphics grDevices > utils datasets > methods > [8] base > > other attached packages: > [1] Rsamtools_1.13.39 > Biostrings_2.29.16 > GenomicRanges_1.13.41 > [4] XVector_0.1.1 > IRanges_1.19.31 > BiocGenerics_0.7.4 > > loaded via a namespace > (and not > attached): > [1] bitops_1.0-6 > stats4_3.0.0 > zlibbioc_1.7.0 > > > > > > [[alternative HTML > version deleted]] > > > ____________________________________**_________________ > > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>> > > https://stat.ethz.ch/mailman/*______*listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" *____*listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __*__*listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">> > > > <https: stat.ethz.ch="" mailman="" ____**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">>><___ _https:/__/stat.ethz.ch/____mailman/__listinfo/____bioconductor > <http: stat.ethz.ch="" __mailman="" __listinfo="" __bioconductor=""> > > <http: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <http: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">> > > > > > <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>>> > Search the archives: > > http://news.gmane.org/gmane.**______science.biology.__inform atics.__**__conductor > <http: news.gmane.org="" gmane.**____science.biology.informati="" cs.__**__conductor=""> > > <http: news.gmane.org="" gmane.*__*__science.biology.__informa="" tics.**__conductor=""> <http: news.gmane.org="" gmane.**__science.biology.informatics="" .**__conductor="">> > > > <http: news.gmane.org="" gmane.*____*science.biology.__informa="" tics.*__*conductor=""> <http: news.gmane.org="" gmane.*__*science.biology.informatics="" .*__*conductor=""> > > <http: news.gmane.org="" gmane.*__*science.biology.informatics="" .*__*conductor=""> <http: news.gmane.org="" gmane.**science.biology.informatics.*="" *conductor="">>><http: news.__gmane.=""> <http: news.gmane.="">____org/gmane.science.__biology.____info rmatics.__conductor > > > > <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>>> > > > [[alternative HTML > version > deleted]] > > > ____________________________________**_________________ > > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>> > > https://stat.ethz.ch/mailman/*______*listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" *____*listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __*__*listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" *__*listinfo="" bioconductor="">> > > > <https: stat.ethz.ch="" mailman="" ____**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __**listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor="">>><___ _https:/__/stat.ethz.ch/____mailman/__listinfo/____bioconductor > <http: stat.ethz.ch="" __mailman="" __listinfo="" __bioconductor=""> > > <http: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <http: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">> > > > > > <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>>> > Search the archives: > http://news.gmane.org/gmane.** > > > science.biology.informatics.**______conductor<http: news.__gmane.=""> <http: news.gmane.="">____org/gmane.science.__biology.____info rmatics.__conductor > > > > > <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>>> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org=""> > > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">>> > Phone: (206) 667-5791 > <tel:%28206%29%20667-5791> > <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 > <tel:%28206%29%20667-1319> > <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319> > > > [[alternative HTML version > deleted]] > > > > > _____________________________________________________ > > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>> > https://stat.ethz.ch/mailman/______listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">> > > > > <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>> > Search the archives: > > http://news.gmane.org/gmane.______science.biology.informatic s.______conductor > <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> > > <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor="">> > > > > <http: news.gmane.org="" gmane.____science.biology.informatics="" .____conductor=""> <http: news.gmane.org="" gmane.__science.biology.informatics._="" _conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>> > > > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > <mailto:hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > <tel:%28206%29%20667-1319> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 5.2 years ago by Hervé Pagès ♦♦ 13k
0
gravatar for Hervé Pagès
5.1 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:
Hi Leonard, Good timing. I was actually going to get back to you today about this because yesterday I fixed a bug in findMateAlignment() that I think was causing the problem you originally reported on the mailing list. It's also probably what is causing this new problem you are reporting below. Totally agree with you that even if using a 'which' argument can alter the pairing, in any case it should produce an error. More generally the intended behavior of readGAlignmentPairs() is to *drop* records that don't have a mate and to *dump* ambiguous pairings. Not to choke on them. The difference between "drop" and "dump" here is that what's dumped can be examined latter with getDumpedAlignments(). See ?getDumpedAlignments for the details. Please let me know if you still run into those issues with the latest Rsamtools (1.13.45). Works fine for me on your test.bam file (thanks for providing the file). Cheers, H. On 09/27/2013 04:23 PM, Leonard Goldstein wrote: > Hi Herv? > > Thanks for your and Valerie's comments on using 'which' with > readGAlignmentPairs. > > I actually encountered the same error again without using 'which' (see > below). I attached test.bam, which narrows down the problematic > alignment(s) to around ~600. If readGAlignmentPairs could drop any > unusual alignments rather than generating an error that would be really > helpful. > > Many thanks for your help. > > Leonard > > > > readGAlignmentPairs(file = "test.bam") > Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols = > use.mcols) : > findMateAlignment() returned an invalid 'mate' vector > In addition: Warning message: > In findMateAlignment(x) : > 2 alignments with ambiguous pairing were dumped. > Use 'getDumpedAlignments()' to retrieve them from the dump environment. > > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.13.44 Biostrings_2.29.19 GenomicRanges_1.13.45 > [4] XVector_0.1.4 IRanges_1.19.38 BiocGenerics_0.7.5 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENTlink written 5.1 years ago by Hervé Pagès ♦♦ 13k
Hi Hervé, Yes the latest version of Rsamtools works fine for both the original and the new problem. Many thanks for fixing this bug, and the quick reply. Best wishes Leonard On Fri, Sep 27, 2013 at 4:55 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Leonard, > > Good timing. I was actually going to get back to you today about this > because yesterday I fixed a bug in findMateAlignment() that I think was > causing the problem you originally reported on the mailing list. It's > also probably what is causing this new problem you are reporting below. > Totally agree with you that even if using a 'which' argument can alter > the pairing, in any case it should produce an error. More generally the > intended behavior of readGAlignmentPairs() is to *drop* records that > don't have a mate and to *dump* ambiguous pairings. Not to choke on > them. The difference between "drop" and "dump" here is that what's > dumped can be examined latter with getDumpedAlignments(). See > ?getDumpedAlignments for the details. > > Please let me know if you still run into those issues with the latest > Rsamtools (1.13.45). Works fine for me on your test.bam file (thanks > for providing the file). > > Cheers, > H. > > > > On 09/27/2013 04:23 PM, Leonard Goldstein wrote: > >> Hi Hervé >> >> Thanks for your and Valerie's comments on using 'which' with >> readGAlignmentPairs. >> >> I actually encountered the same error again without using 'which' (see >> below). I attached test.bam, which narrows down the problematic >> alignment(s) to around ~600. If readGAlignmentPairs could drop any >> unusual alignments rather than generating an error that would be really >> helpful. >> >> Many thanks for your help. >> >> Leonard >> >> >> > readGAlignmentPairs(file = "test.bam") >> Error in makeGAlignmentPairs(galn, use.names = use.names, use.mcols = >> use.mcols) : >> findMateAlignment() returned an invalid 'mate' vector >> In addition: Warning message: >> In findMateAlignment(x) : >> 2 alignments with ambiguous pairing were dumped. >> Use 'getDumpedAlignments()' to retrieve them from the dump >> environment. >> > >> > sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] Rsamtools_1.13.44 Biostrings_2.29.19 GenomicRanges_1.13.45 >> [4] XVector_0.1.4 IRanges_1.19.38 BiocGenerics_0.7.5 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-6 stats4_3.0.0 zlibbioc_1.7.0 >> > >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD REPLYlink written 5.1 years ago by Leonard Goldstein80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 403 users visited in the last hour