Rsamtools: Realloc integer overflow?
@michael-lawrence-3846
Last seen 2.4 years ago
United States
Hey guys,

Whenever I try to calculate the coverage for a BAM file with more than, say, 500 million reads, I get this error:

    Error in coverage(readBamGappedAlignments(x, param = param), shift = shift, :
      error in evaluating the argument 'x' in selecting a method for function 'coverage': Error in value[[3L]](cond) (from #2) :
      'Realloc' could not re-allocate memory (18446744065128005632 bytes)

This looks like integer overflow, possibly within _grow_SCAN_BAM_DATA(). Could we just use long there?

Michael
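The overflow diagnosis can be sanity-checked with plain arithmetic on the reported byte count. The base R sketch below shows that the number sits just under 2^64, which is what a negative signed value looks like after conversion to an unsigned 64-bit size. The 4-byte element size and all variable names are assumptions made for illustration; nothing here is taken from the Rsamtools source.

```r
# 2^64 and the reported byte count share the leading digits "184467440",
# so only the trailing 11 digits are compared (exact in double precision).
two64_tail    <- 73709551616   # 2^64     = 184467440 73709551616
reported_tail <- 65128005632   # reported = 184467440 65128005632

# How far the reported size falls short of 2^64.
deficit <- two64_tail - reported_tail
stopifnot(deficit == 8581545984)

# Assumption: the buffer holds 4-byte elements, so the signed element
# count that was converted to an unsigned size would have been:
n_elements <- -deficit / 4
stopifnot(n_elements == -2145386496,
          abs(n_elements) <= .Machine$integer.max)

# A 32-bit int wraps modulo 2^32, so under that assumption the count
# that was meant to be requested is:
intended <- 2^32 + n_elements
stopifnot(intended == 2^31 + 2^21,
          intended > .Machine$integer.max)
```

Under the 4-byte assumption, the requested element count is slightly above 2^31, just past what a 32-bit signed int can hold.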
Coverage Coverage • 1.8k views
@martin-morgan-1513
Last seen 3 days ago
United States
On 06/03/2013 05:27 PM, Michael Lawrence wrote:
> This looks like integer overflow, possibly within _grow_SCAN_BAM_DATA().
> Could we just use long there?

I wonder if it would be more sensible, if less convenient, to do this (under Bioc-devel)?

    bf <- open(BamFile(fl, yieldSize=100000000))
    cvg <- coverage(readGAlignmentsFromBam(bf))
    while (length(aln <- readGAlignmentsFromBam(bf)))
        cvg <- cvg + coverage(aln)
    close(bf)

It opens the door for better memory management and parallel evaluation.

I'm concerned that using size_t (Realloc casts to this) or ptrdiff_t (the size of R long vectors) would only get us through the C code; the representation of this in R would require R long vectors, and Rsamtools does not (yet?) support that.

Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
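The chunked loop above works because coverage is additive: summing the coverage of each chunk gives the same result as computing it over all reads at once. The base R sketch below illustrates this with a toy stand-in, toy_coverage(), which is a hypothetical helper and not the Bioconductor coverage() function. The simulated reads and the chunk size are also made up for illustration.

```r
# Toy stand-in for coverage(): for each position 1..len, count how many
# closed intervals [start, end] overlap it.
toy_coverage <- function(start, end, len) {
  # +1 where an interval begins, -1 just past where it ends
  delta <- tabulate(start, nbins = len + 1L) -
           tabulate(end + 1L, nbins = len + 1L)
  cumsum(delta)[seq_len(len)]
}

set.seed(1)
len   <- 1000L                                    # toy "chromosome" length
n     <- 5000L                                    # number of simulated reads
start <- sample.int(len - 50L, n, replace = TRUE)
end   <- start + 49L                              # fixed 50-base reads

# All reads at once.
all_at_once <- toy_coverage(start, end, len)

# Same reads, processed in chunks and accumulated, as in the loop above.
chunk_size <- 1000L
chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
cvg <- integer(len)
for (idx in chunks) {
  cvg <- cvg + toy_coverage(start[idx], end[idx], len)
}

stopifnot(identical(cvg, all_at_once))
```

Because each chunk is reduced to a per-position vector before the next one is read, peak memory depends on the chunk size rather than on the total number of reads.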
Hi Martin,

On 06/03/2013 06:26 PM, Martin Morgan wrote:
> I'm concerned that using size_t (Realloc casts to this) or ptrdiff_t
> (the size of R long vectors) would only get us through the C code; the
> representation of this in R would require R long vectors, and Rsamtools
> does not (yet?) support that.

Sorry if I'm missing something obvious, but why would the representation of 500 million reads (either as a GappedAlignments object or as a plain list as returned by scanBam()) require R long vectors?

Thanks,
H.
--
Hervé Pagès
Program in Computational Biology, Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
On 06/03/2013 07:33 PM, Hervé Pagès wrote:
> Sorry if I'm missing something obvious but why would the representation
> of 500 million reads (either as a GappedAlignments object or as a plain
> list as returned by scanBam()) require R long vectors?

Not that 500 million would, but that going for 'more' will eventually (when Michael gets 5 times more ambitious than he is now).

At the least, the software should go up to the limit of R vectors gracefully; as you point out, it shouldn't be having problems with 500 million reads.

Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
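The "5 times more ambitious" remark can be checked against the length limit of ordinary R vectors. The sketch below assumes the limit in question is the 2^31 - 1 element cap on non-long vectors, available as .Machine$integer.max, and that one vector element is needed per read. The variable names are illustrative.

```r
# Largest length of an ordinary (non-long) R vector, and largest R integer.
max_len <- .Machine$integer.max      # 2^31 - 1

reads_now  <- 500e6                  # the file size discussed in the thread
reads_more <- 5 * reads_now          # "5 times more ambitious"

fits_now  <- reads_now  <= max_len   # one element per read still fits
fits_more <- reads_more <= max_len   # 2.5 billion elements do not

# Integer arithmetic past the limit does not wrap in R;
# it yields NA with a warning.
overflowed <- suppressWarnings(max_len + 1L)

stopifnot(fits_now, !fits_more, is.na(overflowed))
```

So 500 million reads are well inside what an ordinary vector can index, while 2.5 billion would need long vectors.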
I'm working on getting you more information with gdb.

In fact I have been processing this data iteratively and in parallel; but sometimes I get lazy and just want coverage() to work ;)

While it's true we'll encounter issues above 2 billion, I'm not sure why I would ever need that many reads. 500 million reads is equivalent to 65X coverage WGS for the human genome. Doubling that to 1 billion would be nice for heterogeneous samples; but then returns start diminishing. And reads are getting longer, so we'll need fewer of them.

Thanks for dealing with my extreme requests,
Michael

On Mon, Jun 3, 2013 at 7:48 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote:
> At the least the software should go up to the limit of R vectors
> gracefully; as you point out it shouldn't be having problems with 500
> million reads.