Question: append to bgzipped VCF file
0
4.0 years ago by
France
TimothéeFlutre70 wrote:

I have a (large) bgzipped VCF file. I would like to read through it by chunks of records (say, 1000), doing something on these, and saving them back to another bgzipped VCF file.

Following the examples at the bottom of ?TabixFile and ?readVCF, here is what I'm doing for the moment:

library(VariantAnnotation)
tabix.file <- TabixFile(file="input.vcf.gz", yieldSize=1000)
open(tabix.file)
con <- file("output.vcf", open="a")
... # do something on the "vcf" object
writeVcf(obj=vcf, filename=con)
}
close(con)
close(tabix.file)
bgzip("output.vcf")
indexTabix("output.vcf.bgz", "vcf")


It doesn't seem very efficient to first append iteratively to an uncompressed file, and then to compress it. But is there a way to directly append to a bgzipped VCF file? Or maybe there is a completely different way?

variantannotation tabix • 1.3k views
modified 4.0 years ago by Michael Lawrence11k • written 4.0 years ago by TimothéeFlutre70
Answer: append to bgzipped VCF file
0
4.0 years ago by
United States
Michael Lawrence11k wrote:

In theory, one could use a gzfile()  in append mode to add gzip-compressed blocks to the file (assuming a bgzip is essentially a set of concatenated gzip files, which I think it is). You would need to pass index=FALSE to writeVcf() and index explicitly when finished.

bgzip is indeed close to gzip, but it has an indexing scheme to allow random access (more here). This means that using gzfile() would not have the indexing scheme, unfortunately. In fact, after looking at the source code of writeVcf(), it looks like there is no other way than what I wrote in my question.

It looks like all you would need is a BC field in the gzip header that indicates the size of the block. Since that depends only on the block, blocks are independent, and it should be possible to concatenate bgzf streams. In theory, Rsamtools could support writing bgzf to an anonymous pipe and return the data to R as a raw vector or connection. The data could then be written in append mode. It's also conceivable that one could use gzcon to write a gzip block into a raw vector, and then stream over the gzip header in R, and insert the BC field before writing. The header can't be that long.

I've a hard time concretely understanding what you suggest. Inside the while loop, I have a VCF object to write, but bgzip requires a file as input. So do you mean I should hack into the Rsamtools pkg to allow bgzip to take a connection as input? If so, by looking at the current C code, I guess, I should add a bgzip_con function, taking a pointer to a BGZF object as input?

Maybe it would be as simple as opening the bgzf file in append mode? From looking at the code, it looks like the bgzf_open function in samtools does not support appending, but since Rsamtools has already modified that function for Windows support, perhaps we could make another simple change.

That would do the trick, indeed. Do you want me to ask first the SAMtools team to add support for "append"? By looking at the latest version of htslib (master branch, here), it seems that bgzf_open does now support "a". Thus, is it possible to update Rsamtools with htslib v1.2.1?