Question: Fast callback functions for scanTabix?
Tim Triche (United States) wrote, 18 months ago:

This is one of those things I imagine *somebody* must have done, but I'm not finding an example of it. In order to sidestep compatibility issues between Mac/Windows/Linux, and also to allow reading of enormous tabix'ed files (hundreds of millions to billions of loci from hundreds of subjects), I'd like to use a C or C++ callback function for each line (or each million lines, which would probably be faster, to be honest).


I tried something like this with an R function, and my desktop machine (basically an HPC node with a graphics card and a mouse) was still grinding away when I came back 2 hours later. So, that's right out. Most of the rows, especially upon merging, are sparse, so the memory usage isn't really that much. In principle, if I can parse things cleverly, I can just rbind() a million-row sparse Matrix with each pass, and the entire object won't be very big at all. But first I need to parse it prior to the heat death of the universe. GenomicFiles isn't precisely what I'm looking for, and parsing each `elt` with strsplit() is *definitely* not what I'm looking for, after trying it.
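To illustrate the incremental-rbind idea: sparse Matrix objects only store their nonzero entries, so stacking million-row chunks stays cheap. A minimal sketch (the chunk contents are made up for illustration):

```r
library(Matrix)

## Two hypothetical million-row chunks, each with a couple of nonzero entries
chunk1 <- sparseMatrix(i = c(1, 3), j = c(2, 5), x = c(0.8, 0.2),
                       dims = c(1e6, 100))
chunk2 <- sparseMatrix(i = c(2, 4), j = c(1, 7), x = c(0.5, 0.9),
                       dims = c(1e6, 100))

## 2e6 x 100 result, but only 4 nonzero entries are actually stored
big <- rbind(chunk1, chunk2)
```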


The current solution, which is an ugly kludge, is to use TabixFile to extract the headers, seqnames, and row counts, then use data.table to read in the actual data. But this breaks with more than a few hundred million rows, because zcat fills up the temporary drive (I'm aware there are kludges for this, too, but probably not on Windows or Macs, and certainly not ones I'd expect users to rely upon). So it would be great if I could quickly and incrementally load chunks of big files by using TabixFile with a sensible yieldSize. At present, I can't.
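The kludge described above looks roughly like this sketch (the filename is hypothetical). Note that the shell decompression is exactly what stages the whole file on the temporary drive:

```r
library(Rsamtools)
library(data.table)

tbx <- TabixFile("big.bed.gz")
hdr <- headerTabix(tbx)    # seqnames, index columns, header lines

## fread() shells out to zcat, so the decompressed file hits the temp drive --
## the failure mode described above once rows number in the hundreds of millions
dt <- fread(cmd = "zcat big.bed.gz", skip = length(hdr$header))
```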


Any experiences, suggestions ("run screaming in the opposite direction"), pointers, etc. are much appreciated.



tabix
Answer: Fast callback functions for scanTabix?
Martin Morgan (United States) wrote, 18 months ago:

My approach would be to chunk through the tabix file in memory-sized pieces -- 10M lines, say -- processing each chunk in a vectorized way. A secondary approach would chunk through and then call into C on each chunk. GenomicFiles::reduceByYield() may provide the necessary iteration (possibly parallel) infrastructure.
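A sketch of that chunked, vectorized approach, assuming a bgzipped, tabix-indexed file "big.bed.gz" (hypothetical) whose fourth column is a numeric score. With yieldSize set and no param given, scanTabix() returns successive chunks of raw lines, and the MAP step parses each chunk as a whole rather than line by line:

```r
library(Rsamtools)
library(GenomicFiles)

tbx <- TabixFile("big.bed.gz", yieldSize = 1e6)
open(tbx)

yield <- function(x) scanTabix(x)[[1]]     # next chunk of raw lines
map <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  chunk <- read.delim(con, header = FALSE) # vectorized parse of the chunk
  sum(chunk[[4]])                          # toy per-chunk summary
}

total <- reduceByYield(tbx, yield, map, REDUCE = `+`,
                       DONE = function(x) length(x) == 0L)
close(tbx)
```

For the asker's use case, the MAP step could instead build a sparse Matrix per chunk and REDUCE could rbind() them.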



Powered by Biostar version 16.09