This is one of those things I imagine *somebody* must have done, but I'm not finding an example of it. To side-step compatibility issues between Mac/Windows/Linux, and also to allow reading of enormous files (hundreds of millions to billions of loci from hundreds of subjects) stored as (wait for it) tabix'ed files, I'd like to use a C or C++ callback function for each line (or each million lines, which would probably be faster, to be honest).
I tried something like this with an R function and my desktop machine (basically an HPC node with a graphics card and a mouse) was still grinding away when I came back 2 hours later. So, that's right out. Most of the rows, especially upon merging, are sparse, so memory usage isn't really the problem. In principle, if I can parse things cleverly, I can just rbind() a million-row sparse Matrix with each pass, and the entire object won't be very big at all. But first I need to parse it prior to the heat death of the universe. GenomicFiles isn't precisely what I'm looking for, and parsing each `elt` with strsplit() is *definitely* not what I am looking for, after trying it.
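To make the accumulation plan concrete, here's a toy version of the rbind()-a-sparse-chunk-per-pass idea (dimensions and values are obviously made up; the parsing step is the actual problem, this just shows the stacking part stays cheap):

```r
library(Matrix)

# Each pass over the file would yield one mostly-zero chunk as a
# sparse Matrix; rbind() stacks chunks without ever densifying.
# The i/j/x triplets below stand in for one parsed chunk.
chunk1 <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = c(1, 2),
                       dims = c(5, 4))
chunk2 <- sparseMatrix(i = c(2, 5), j = c(3, 4), x = c(7, 9),
                       dims = c(5, 4))

big <- rbind(chunk1, chunk2)  # 10 x 4, still sparse

# Storage grows with the number of nonzeros, not the nominal
# dimensions, so the merged object stays small even at genome scale.
```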
The current solution, which is an ugly kludge, is to use TabixFile to extract the headers, seqnames, and nrows, then use data.table to read in the actual data. But this breaks with more than a few hundred million rows, because zcat fills up the temporary drive (I'm aware that there are kludges for this, too, but probably not on Windows or Macs, and certainly none that I'd expect users to rely upon). So it would be great if I could quickly and incrementally load chunks of big files by using TabixFile with a sensible yieldSize. At present, I can't.
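For reference, the incremental pattern I'm after would look something like the sketch below, assuming yieldSize iteration worked for TabixFile the way it does for BamFile (the function name and the callback are placeholders of my own invention; the callback is where the chunk-to-sparse-Matrix parsing would go):

```r
library(Rsamtools)

# Hypothetical driver: open a TabixFile once with a yieldSize, then
# pull successive chunks of raw tabix lines until the file is
# exhausted, handing each chunk to a user-supplied callback.
read_tabix_in_chunks <- function(path, yield = 1e6, callback) {
  tbx <- TabixFile(path, yieldSize = yield)
  open(tbx)
  on.exit(close(tbx))
  repeat {
    chunk <- scanTabix(tbx)[[1]]  # up to `yield` lines per call
    if (length(chunk) == 0L) break
    callback(chunk)
  }
  invisible(NULL)
}
```

In my case callback() would parse each chunk into a sparse Matrix and rbind() it onto the running result, so only one chunk of raw text is ever in memory at a time.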
Any experiences, suggestions ("run screaming in the opposite direction"), pointers, etc. are much appreciated.