Fast callback functions for scanTabix?
Tim Triche ★ 4.2k
@tim-triche-3561
Last seen 3.6 years ago
United States

This is one of those things I imagine *somebody* must have done, but I'm not finding an example of it. In order to side-step compatibility issues across Mac/Windows/Linux, and to allow reading enormous datasets (hundreds of millions to billions of loci from hundreds of subjects) from (wait for it) tabix'ed files, I'd like to use a C or C++ callback function for each line (or for each million lines, which would probably be faster, to be honest).
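
To make that concrete, here is roughly the kind of per-chunk C++ "callback" I have in mind, compiled inline with Rcpp (a toy sketch: count_fields() is a made-up stand-in for a real parser):

```r
## Toy sketch only: a C++-level "callback" applied to a whole chunk of
## lines, compiled inline with Rcpp. count_fields() is a made-up stand-in
## for a real parser; it just counts tab-delimited fields per line.
library(Rcpp)

cppFunction("
IntegerVector count_fields(CharacterVector lines) {
    IntegerVector out(lines.size());
    for (int i = 0; i < lines.size(); ++i) {
        String line = lines[i];
        const char *s = line.get_cstring();
        int nfields = 1;
        for (; *s; ++s)
            if (*s == '\\t') ++nfields;
        out[i] = nfields;
    }
    return out;
}")
```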


I tried something like this with an R function, and my desktop machine (basically an HPC node with a graphics card and a mouse) was still grinding away when I came back two hours later. So that's right out. Most of the rows, especially after merging, are sparse, so memory usage isn't really the problem. In principle, if I can parse things cleverly, I can just rbind() a million-row sparse Matrix with each pass, and the entire object won't be very big at all. But first I need to parse it prior to the heat death of the universe. GenomicFiles isn't precisely what I'm looking for, and parsing each `elt` with strsplit() is, having tried it, *definitely* not what I'm looking for.


The current solution, which is an ugly kludge, is to use TabixFile to extract the headers, seqnames, and nrows, then use data.table to read in the actual data. But this breaks beyond a few hundred million rows, because zcat fills up the temporary drive (I'm aware there are kludges for that, too, but probably not on Windows or Macs, and certainly none I'd expect users to rely upon). So it would be great if I could quickly and incrementally load chunks of big files using TabixFile with a sensible yieldSize. At present, I can't.
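
What I'd like, if I understand the TabixFile API correctly, is roughly the following (untested sketch; the filename is made up, and parseChunk() is hypothetical -- ideally the fast C/C++ parser above, returning a sparse Matrix per chunk):

```r
## Sketch of the incremental loading I'm after, assuming a bgzipped,
## tabix-indexed file. parseChunk() is a hypothetical fast parser that
## returns, e.g., a sparse Matrix per million-line chunk.
library(Rsamtools)

tbx <- TabixFile("huge_calls.tsv.bgz", yieldSize = 1e6)
open(tbx)
pieces <- list()
repeat {
    lines <- scanTabix(tbx)[[1]]     # next yieldSize records, as character
    if (length(lines) == 0L) break   # end of file
    pieces[[length(pieces) + 1L]] <- parseChunk(lines)
}
close(tbx)
result <- do.call(rbind, pieces)     # bind the sparse pieces once, at the end
```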


Any experiences, suggestions ("run screaming in the opposite direction"), pointers, etc. are much appreciated.

@martin-morgan-1513
Last seen 4 days ago
United States

My approach would be to chunk through the tabix file in memory-sized pieces -- 10 million lines? -- and process each chunk in a vectorized way. A secondary approach would be to chunk through and then call into C on each chunk. GenomicFiles::reduceByYield may provide the necessary iteration infrastructure (possibly in parallel).
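
For instance, something along these lines (a sketch only; the filename is made up, and parseChunk() is a placeholder for your vectorized or C-level parsing of one chunk into, say, a sparse Matrix):

```r
## Sketch only; parseChunk() is a placeholder for the vectorized / C-level
## parsing of one chunk of records into, e.g., a sparse Matrix.
library(GenomicFiles)
library(Rsamtools)

tbx <- TabixFile("huge_calls.tsv.bgz", yieldSize = 1e6)

result <- reduceByYield(
    tbx,
    YIELD  = function(x) scanTabix(x)[[1]],        # next chunk of records
    MAP    = function(lines) parseChunk(lines),    # vectorized / C parsing
    REDUCE = rbind,                                # accumulate the pieces
    DONE   = function(value) length(value) == 0L   # stop at end of file
)
```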
