Question: tuning memory consumption of summarizeOverlaps?
Vincent J. Carey, Jr. wrote:

The main parameters to summarizeOverlaps are features and reads. I would like to know what one can do to tune the memory consumption of summarizeOverlaps. One could limit the number of features in play, define a ScanBamParam to limit the scope of reads being processed, or set a yieldSize on the BAM file reference. Does anyone have data on these options? Are details of the reads, such as read length or BAM file size, additional determinants of memory consumption?

Valerie Obenchain wrote:

Hi Vince,

summarizeOverlaps is basically two steps: scanBam and findOverlaps.

scanBam:

- This step is run in parallel, by file, with bplapply. If you have many large files, I would reduce the number of workers in BPPARAM; the default uses all available cores, so memory use can max out.

- If the files are big (i.e., generally more than 1,000,000 records), it pays to use yieldSize; see the sketch after this list.

- ScanBamParam is useful if you are after a subset of records, but if you want to count everything it doesn't provide an advantage. The code already reads in only the minimal information needed to perform overlaps (i.e., it doesn't bring in other fields, flags, etc.).
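
Here is a minimal sketch of the yieldSize/workers approach. The BAM file names and the hg19 knownGene annotation are just placeholders for illustration, not from your setup:

    library(GenomicAlignments)
    library(Rsamtools)
    library(BiocParallel)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)

    ## Features: exons grouped by gene.
    features <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")

    ## yieldSize makes each file be read in chunks of 1e6 records
    ## rather than all at once.
    bfl <- BamFileList(c("sample1.bam", "sample2.bam"), yieldSize = 1000000)

    ## Cap the workers so peak memory is roughly
    ## (number of workers) x (per-chunk footprint).
    se <- summarizeOverlaps(features, bfl, mode = "Union",
                            BPPARAM = MulticoreParam(workers = 2))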


findOverlaps:

The overlap step will be faster with a smaller number of features, so if you really don't need the full annotation then yes, subset it (a small sketch follows). The new NCList algorithm counts at the C level and the hits are not kept; this has reduced memory considerably when there are many hits. While a smaller annotation may improve performance slightly, I don't think it will affect memory much.
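
For example, assuming 'features' is the GRangesList from the sketch above and 'genes_of_interest' is a hypothetical vector of gene IDs:

    ## Restrict counting to the genes you actually care about.
    features_sub <- features[names(features) %in% genes_of_interest]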

Because we aren't reading or manipulating sequences, I can't imagine read length plays a role here. Read positions are stored as 'start' and 'end' (or width), so each read is essentially just two integers. I don't know if there is much of a difference in finding overlaps on small vs. large ranges. I believe Herve saw a performance difference with many small nested ranges vs. non-nested ranges, but not simply between large and small ranges.


In my experience with multiple large files, moderating yieldSize and the number of workers has been the most effective way to control memory.
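
If memory is very tight, the most conservative option is to drop parallelism entirely and process one file at a time (again assuming the objects from the sketch above):

    ## Serial processing: only one file's chunk in memory at a time.
    se <- summarizeOverlaps(features, bfl, mode = "Union",
                            BPPARAM = SerialParam())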

I'm sure Martin and Herve have more to add.

Val
