processing BamFiles - streaming vs not streaming
Stefanie (@stefanie-5192)
Dear list,

I have a question regarding the processing of large BAM files (such as reading them in via readGAlignments or computing the coverage). I know about the option of iterative processing, as shown in the example below:

    library(GenomicAlignments)

    mybam <- open(BamFile("bamfile", yieldSize = 2000000))
    gAln <- GAlignments()
    while (length(chunk <- readGAlignmentsFromBam(mybam))) {
        gAln <- c(gAln, chunk)
    }
    close(mybam)

Obviously, the efficiency of iterating depends on (i) the file size of the BAM file and (ii) the available memory. Can I somehow pinpoint (e.g., by file size, number of alignments, or memory requirements) when it is more efficient (i.e., faster, with feasible memory requirements) to process the BAM file in one batch rather than iteratively?

Best,
Stefanie
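One rough way to pinpoint this empirically is a pilot read: load a single chunk, measure its in-memory size, and extrapolate by the total record count. A minimal sketch ("example.bam" and the yieldSize are placeholders, and object.size() gives only an approximation):

    library(GenomicAlignments)

    # Read one chunk and measure its memory footprint.
    bf <- open(BamFile("example.bam", yieldSize = 1000000))
    chunk <- readGAlignmentsFromBam(bf)
    close(bf)

    bytes_per_record <- as.numeric(object.size(chunk)) / length(chunk)

    # Extrapolate to the whole file to estimate the cost of a
    # one-batch read; countBam() reports the total record count.
    total_records <- countBam("example.bam")$records
    cat("Estimated GB for a one-batch read:",
        bytes_per_record * total_records / 1024^3, "\n")

If that estimate fits comfortably in RAM, a one-batch read is usually faster; otherwise, iterate with a yieldSize whose chunks do fit.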
Michael Lawrence (@michael-lawrence-3846)
Note that this example largely defeats the purpose of iteration: there is very little reduction and, depending on how much duplication is caused by c(), it is probably less efficient than reading the data all at once.

The point of iteration is to restrict resource consumption at any one point in time, with each iteration summarizing the data so that the end result is of manageable size. There is overhead to each iteration, mostly due to I/O and other system calls, as well as the R evaluator. Thus, one strategy is to increase the size of each iteration (and reduce the number of iterations) until resource consumption is maximized without exceeding the limits.

It would be interesting to see how much memory is consumed per GAlignments record. This would probably vary mostly with the complexity of the alignments, i.e., the length of the CIGAR, so the biggest difference is probably between DNA-seq and RNA-seq. I'll actually perform this analysis over the next couple of days, because I'm working on a paper related to this.
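To make the "iterate and reduce" pattern concrete, here is a minimal sketch (the file name and yieldSize are placeholders) that streams the file chunk-wise and keeps only a running coverage, so peak memory is bounded by the chunk size rather than the file size:

    library(GenomicAlignments)

    bf <- open(BamFile("example.bam", yieldSize = 2000000))
    cov <- NULL
    while (length(chunk <- readGAlignmentsFromBam(bf))) {
        # Summarize the chunk immediately; coverage() returns an RleList
        # (one run-length-encoded vector per chromosome), and RleLists
        # built from the same BAM header add element-wise.
        chunkCov <- coverage(chunk)
        cov <- if (is.null(cov)) chunkCov else cov + chunkCov
    }
    close(bf)

Unlike the c() accumulation in the question, each chunk's raw alignments can be garbage-collected as soon as coverage() has summarized them, and the running total is a compressed RleList that stays small. Increasing yieldSize then trades memory for fewer iterations, as described above.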