Fast callback functions for scanTabix?
Tim Triche ★ 4.2k
@tim-triche-3561
Last seen 4.2 years ago
United States

This is one of those things I imagine *somebody* must have done, but I'm not finding an example of it. In order to side-step compatibility issues across Mac/Windows/Linux, and also to allow reading of enormous (hundreds of millions to billions of loci from hundreds of subjects) files that are (wait for it) tabix'ed, I'd like to use a C or C++ callback function for each line (or for each million lines, which would probably be faster, to be honest).

I tried something like this with an R function, and my desktop machine (basically an HPC node with a graphics card and a mouse) was still grinding away when I came back 2 hours later. So that's right out. Most of the rows, especially upon merging, are sparse, so the memory usage isn't really the issue. In principle, if I can parse things cleverly, I can just rbind() a million-row sparse Matrix() with each pass, and the entire object won't be very big at all. But first I need to parse it prior to the heat death of the universe. GenomicFiles isn't precisely what I'm looking for, and parsing each `elt` with strsplit() is *definitely* not what I'm looking for, after trying it.
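For concreteness, what I have in mind is roughly the sketch below. The file name (calls.tsv.bgz) and layout (chrom/start/end in columns 1-3, numeric values after that) are invented for illustration:

library(Rsamtools)
library(Matrix)
library(data.table)

## hypothetical tabix'ed file; yieldSize reads a million lines per pass
tbx <- TabixFile("calls.tsv.bgz", yieldSize = 1e6)
open(tbx)
chunks <- list()
repeat {
    elt <- scanTabix(tbx)[[1]]                      # next chunk of raw lines
    if (length(elt) == 0L) break                    # file exhausted
    dt <- fread(text = elt, header = FALSE)         # vectorized parse, no strsplit()
    vals <- as.matrix(dt[, -(1:3), with = FALSE])   # drop chrom/start/end
    chunks[[length(chunks) + 1L]] <- Matrix(vals, sparse = TRUE)
}
close(tbx)
big <- do.call(rbind, chunks)                       # mostly zeros, so not very big

That's the shape of it; the per-chunk parse is where all the time goes.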

The current solution, which is an ugly kludge, is to use TabixFile to extract the header, seqnames, and row count, then use data.table to read in the actual data. But this breaks with more than a few hundred million rows, because zcat fills up the temporary drive (I'm aware that there are kludges for this too, but probably not on Windows or Macs, and certainly not ones I expect users to rely upon). So it would be great if I could quickly and incrementally load chunks of big files by using TabixFile with a sensible yieldSize. At present, I can't.
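For reference, the kludge looks roughly like this (same invented file name; I pull the metadata with headerTabix() and shell out for the bulk read):

library(Rsamtools)
library(data.table)

hdr <- headerTabix("calls.tsv.bgz")    # header lines, seqnames, etc.
seqs <- hdr$seqnames
## zcat decompresses through temporary space -- this is what breaks at scale
dat <- fread(cmd = "zcat calls.tsv.bgz", skip = length(hdr$header))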

Any experiences, suggestions ("run screaming in the opposite direction"), pointers, etc. are much appreciated.

tabix

@martin-morgan-1513
Last seen 4 months ago
United States

My approach would be to chunk through the tabix file in memory-sized pieces -- 10M lines? -- and process each chunk in a vectorized way. A secondary approach would be to chunk through and then call C on each chunk. GenomicFiles::reduceByYield() may provide the necessary iteration (possibly parallel) infrastructure.
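A minimal sketch of the first approach, reusing the invented file layout from the question (chrom/start/end plus numeric value columns):

library(Rsamtools)
library(GenomicFiles)

tbx <- TabixFile("calls.tsv.bgz", yieldSize = 1e7)   # ~10M lines per yield

YIELD <- function(x) scanTabix(x)[[1]]               # next chunk of raw lines
MAP <- function(elt) {
    ## vectorized parse of the whole chunk at once
    dt <- data.table::fread(text = elt, header = FALSE)
    colSums(dt[, -(1:3), with = FALSE])              # toy per-chunk summary
}
DONE <- function(elt) length(elt) == 0L              # stop on an empty yield

res <- reduceByYield(tbx, YIELD, MAP, REDUCE = `+`, DONE = DONE)

Swap MAP (and REDUCE) for whatever per-chunk computation makes sense -- including a .Call() into C -- and add parallel=TRUE if the chunks are independent.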
