
Question: Fast callback functions for scanTabix?
Tim Triche (4.2k, United States) wrote 10 months ago:

This is one of those things I imagine *somebody* must have done, but I'm not finding an example of it. To side-step compatibility issues between Mac/Windows/Linux, and to allow reading enormous files (hundreds of millions to billions of loci from hundreds of subjects) from (wait for it) tabix'ed files, I'd like to use a C or C++ callback function for each line (or for each million lines, which would probably be faster, to be honest).

I tried something like this with an R function, and my desktop machine (basically an HPC node with a graphics card and a mouse) was still grinding away when I came back 2 hours later. So, that's right out. Most of the rows, especially upon merging, are sparse, so the memory usage isn't really that much. In principle, if I can parse things cleverly, I can just rbind() a million-row Matrix() with each pass, and the entire object won't be very big at all. But first I need to parse it prior to the heat death of the universe. GenomicFiles isn't precisely what I'm looking for, and parsing each element with strsplit() is *definitely* not what I am looking for, after trying it.

The current solution, which is an ugly kludge, is to use TabixFile to extract the headers, seqnames, and nrows, then use data.table to read in the actual data. But this breaks with more than a few hundred million rows, because zcat fills up the temporary drive (I'm aware that there are kludges for this, too, but probably not on Windows or Macs, and certainly none that I would expect users to rely upon). So it would be great if I could quickly and incrementally load chunks of big files by using TabixFile with a sensible yieldSize. At present, I can't.

Any experiences, suggestions ("run screaming in the opposite direction"), pointers, etc. are much appreciated.

tabix • 149 views
modified 10 months ago by Martin Morgan ♦♦ 22k • written 10 months ago by Tim Triche4.2k
Answer: Fast callback functions for scanTabix?
Martin Morgan (22k, United States) wrote 10 months ago:

My approach would chunk through the tabix file in memory-sized pieces (10M lines?), processing each chunk in a vectorized way. A secondary approach would chunk through the file the same way and then call C on each chunk. GenomicFiles::reduceByYield may provide the necessary iteration infrastructure (possibly in parallel).
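
As a rough illustration of the "call C on the chunk" idea, here is a minimal C++ sketch that parses one chunk of tab-delimited records in a single pass and collects the non-missing, non-zero values as sparse triplets. Everything specific in it is an assumption made for illustration, not something taken from the thread: the column layout (three coordinate columns followed by one numeric column per subject), the use of "." or an empty field for missing data, and the names `Triplets` and `parse_chunk`. The glue that would hand it a chunk from R (for example via Rcpp or `.Call`) is not shown.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Sparse triplet storage: (row, column, value) for every stored entry.
// Indices are 0-based here; an R caller would add 1 before building a Matrix.
struct Triplets {
    std::vector<int> row;
    std::vector<int> col;
    std::vector<double> value;
};

// Parse one chunk of tab-delimited records without allocating per-field strings.
// ASSUMPTION (hypothetical layout): the first `skip` columns are coordinates
// (e.g. chrom, start, end); each later column holds one subject's numeric
// value, with "." or an empty field meaning "no data".
// Missing fields and exact zeros are not stored, which is what keeps it sparse.
Triplets parse_chunk(const std::vector<std::string>& lines, int skip = 3) {
    Triplets out;
    for (std::size_t r = 0; r < lines.size(); ++r) {
        const std::string& line = lines[r];
        const char* base = line.c_str();
        std::size_t begin = 0;
        int field = 0;
        while (begin <= line.size()) {
            // Locate the end of the current field (next tab or end of line).
            std::size_t end = line.find('\t', begin);
            if (end == std::string::npos) end = line.size();

            const bool empty = (end == begin);
            const bool dot = (end - begin == 1 && line[begin] == '.');
            if (field >= skip && !empty && !dot) {
                char* stop = nullptr;
                const double v = std::strtod(base + begin, &stop);
                // Accept only if a number was read and it ended inside this field.
                if (stop > base + begin && stop <= base + end && v != 0.0) {
                    out.row.push_back(static_cast<int>(r));
                    out.col.push_back(field - skip);
                    out.value.push_back(v);
                }
            }
            begin = end + 1;
            ++field;
        }
    }
    return out;
}
```

Under these assumptions, each chunk of records yielded by the TabixFile would go through `parse_chunk` once, and the caller would add a running row offset to the returned row indices before appending the triplets to the growing sparse matrix. The point of the design is that the per-line work happens in compiled code, with no strsplit() and no intermediate character vectors per field.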
