Question: append to bgzipped VCF file
gravatar for TimothéeFlutre
4.0 years ago by
TimothéeFlutre70 wrote:

I have a (large) bgzipped VCF file. I would like to read through it by chunks of records (say, 1000), doing something on these, and saving them back to another bgzipped VCF file.

Following the examples at the bottom of ?TabixFile and ?readVCF, here is what I'm doing for the moment:

tabix.file <- TabixFile(file="input.vcf.gz", yieldSize=1000)
con <- file("output.vcf", open="a")
while(nrow(vcf <- readVcf(tabix.file, "mygenome"))){
    ... # do something on the "vcf" object
    writeVcf(obj=vcf, filename=con)
indexTabix("output.vcf.bgz", "vcf")

It doesn't seem very efficient to first append iteratively to an uncompressed file, and then to compress it. But is there a way to directly append to a bgzipped VCF file? Or maybe there is a completely different way?

variantannotation tabix • 1.3k views
ADD COMMENTlink modified 4.0 years ago by Michael Lawrence11k • written 4.0 years ago by TimothéeFlutre70
Answer: append to bgzipped VCF file
gravatar for Michael Lawrence
4.0 years ago by
United States
Michael Lawrence11k wrote:

In theory, one could use a gzfile()  in append mode to add gzip-compressed blocks to the file (assuming a bgzip is essentially a set of concatenated gzip files, which I think it is). You would need to pass index=FALSE to writeVcf() and index explicitly when finished.

ADD COMMENTlink written 4.0 years ago by Michael Lawrence11k

bgzip is indeed close to gzip, but it has an indexing scheme to allow random access (more here). This means that using gzfile() would not have the indexing scheme, unfortunately. In fact, after looking at the source code of writeVcf(), it looks like there is no other way than what I wrote in my question.

ADD REPLYlink written 4.0 years ago by TimothéeFlutre70

It looks like all you would need is a BC field in the gzip header that indicates the size of the block. Since that depends only on the block, blocks are independent, and it should be possible to concatenate bgzf streams. In theory, Rsamtools could support writing bgzf to an anonymous pipe and return the data to R as a raw vector or connection. The data could then be written in append mode. It's also conceivable that one could use gzcon to write a gzip block into a raw vector, and then stream over the gzip header in R, and insert the BC field before writing. The header can't be that long.

ADD REPLYlink written 4.0 years ago by Michael Lawrence11k

I've a hard time concretely understanding what you suggest. Inside the while loop, I have a VCF object to write, but bgzip requires a file as input. So do you mean I should hack into the Rsamtools pkg to allow bgzip to take a connection as input? If so, by looking at the current C code, I guess, I should add a bgzip_con function, taking a pointer to a BGZF object as input?

ADD REPLYlink written 4.0 years ago by TimothéeFlutre70

Maybe it would be as simple as opening the bgzf file in append mode? From looking at the code, it looks like the bgzf_open function in samtools does not support appending, but since Rsamtools has already modified that function for Windows support, perhaps we could make another simple change.

ADD REPLYlink written 4.0 years ago by Michael Lawrence11k

That would do the trick, indeed. Do you want me to ask first the SAMtools team to add support for "append"? By looking at the latest version of htslib (master branch, here), it seems that bgzf_open does now support "a". Thus, is it possible to update Rsamtools with htslib v1.2.1?

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by TimothéeFlutre70

I think that there is some movement afoot to port Rsamtools to use htslib (via Rhtslib), but that has not yet happened.

ADD REPLYlink written 4.0 years ago by Michael Lawrence11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 245 users visited in the last hour