The main parameters to summarizeOverlaps are features and reads. I would like to know what one can do to tune the memory consumption of summarizeOverlaps. One could limit the number of features in play, or could define a ScanBamParam to limit the scope of reads being processed, or one could set a yieldSize in the bam file reference. Does anyone have data on the options here? Are details of the reads such as length, or size of bam files, additional determinants of memory consumption?
summarizeOverlaps is basically two steps: scanBam and findOverlaps.
- The scanBam step is run in parallel, by file, with bplapply. If you have many large files I would reduce the number of workers in BPPARAM so you aren't maxed out, which is the default.
- If the files are big (i.e., generally > 1,000,000 records) it pays to use yieldSize, so that each file is read in chunks rather than all at once.
- ScanBamParam is useful if you are after a subset of records, but if you want to count everything it doesn't provide an advantage. The code already reads in only the minimal information needed to compute overlaps (i.e., it doesn't bring in other fields, flags, etc.).
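A minimal sketch of the two knobs from the list above, yieldSize and the number of workers. It assumes the small example BAM that ships with Rsamtools; the two features and the yieldSize of 1000 are made up for illustration and only make sense for that tiny file. Registering a back-end with register() is one way to control the workers that bplapply will use.

```r
library(GenomicAlignments)  # summarizeOverlaps(); also attaches Rsamtools and GenomicRanges
library(BiocParallel)       # register(), SerialParam(), MulticoreParam()

# Example BAM shipped with Rsamtools; substitute your own indexed files.
bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# yieldSize caps the number of records read per chunk, so only one
# chunk per file is held in memory at a time.
# 1000 is illustrative; something like 1e6 is more typical for real files.
bfl <- BamFileList(bam, yieldSize = 1000)

# Hypothetical features: one range covering each reference sequence in ex1.bam.
features <- GRanges(c("seq1", "seq2"),
                    IRanges(start = c(1, 1), end = c(1575, 1584)))

# Files are processed in parallel, one file per worker, so fewer workers
# means fewer chunks in memory at once. SerialParam() is the extreme case;
# on Linux/macOS, MulticoreParam(workers = 2) would be a moderate setting.
register(SerialParam())

se <- summarizeOverlaps(features, bfl, mode = "Union", ignore.strand = TRUE)
assay(se)  # one row per feature, one column per BAM file
```

Peak memory should scale roughly with yieldSize times the number of workers, so the two settings are best tuned together.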
The overlap step will be faster with a smaller number of features, so if you really don't need the full annotation then yes, subset it. The new NCList algorithm counts at the C level and the hits are not kept. This has reduced memory use considerably when there are many hits. While a smaller annotation may improve speed slightly, I don't think it will affect memory much.
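If you do subset the annotation, it can be done with ordinary GRanges indexing before calling summarizeOverlaps. A small sketch, with a made-up four-feature annotation and a hypothetical gene_id column:

```r
library(GenomicRanges)

# Hypothetical annotation: four features across two sequences.
features <- GRanges(
    seqnames = c("seq1", "seq1", "seq2", "seq2"),
    ranges   = IRanges(start = c(1, 500, 1, 800), width = 300),
    gene_id  = c("a", "b", "c", "d")
)

# Keep only the features of interest (here, everything on seq1)
# and pass the smaller object as 'features' when counting.
keep <- as.character(seqnames(features)) == "seq1"
features_small <- features[keep]
features_small
```

The same pattern works for any logical filter, e.g. on a metadata column such as gene_id.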
Because we aren't reading or manipulating sequences, I can't imagine read length plays a role here. Read positions are stored as 'start' and 'end' (or width), so essentially just two integers per read. I don't know if there is much of a difference between finding overlaps on small vs large ranges. I believe Herve saw a performance difference with many small nested ranges vs non-nested ranges, but not simply with large vs small.
In my experience, with multiple large files, moderating yieldSize and the number of workers has been the most effective way to control memory.
I'm sure Martin and Herve have more to add.