append to bgzipped VCF file
1
0
Entering edit mode
@timotheeflutre-6727
Last seen 5.7 years ago
France

I have a (large) bgzipped VCF file. I would like to read through it by chunks of records (say, 1000), doing something on these, and saving them back to another bgzipped VCF file.

Following the examples at the bottom of ?TabixFile and ?readVCF, here is what I'm doing for the moment:

library(VariantAnnotation)
tabix.file <- TabixFile(file="input.vcf.gz", yieldSize=1000)
open(tabix.file)
con <- file("output.vcf", open="a")
while(nrow(vcf <- readVcf(tabix.file, "mygenome"))){
    ... # do something on the "vcf" object
    writeVcf(obj=vcf, filename=con)
}
close(con)
close(tabix.file)
bgzip("output.vcf")
indexTabix("output.vcf.bgz", "vcf")

It doesn't seem very efficient to first append iteratively to an uncompressed file, and then to compress it. But is there a way to directly append to a bgzipped VCF file? Or maybe there is a completely different way?

variantannotation tabix • 3.2k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States

In theory, one could use a gzfile()  in append mode to add gzip-compressed blocks to the file (assuming a bgzip is essentially a set of concatenated gzip files, which I think it is). You would need to pass index=FALSE to writeVcf() and index explicitly when finished.

ADD COMMENT
0
Entering edit mode

bgzip is indeed close to gzip, but it has an indexing scheme to allow random access (more here). This means that using gzfile() would not have the indexing scheme, unfortunately. In fact, after looking at the source code of writeVcf(), it looks like there is no other way than what I wrote in my question.

ADD REPLY
0
Entering edit mode

It looks like all you would need is a BC field in the gzip header that indicates the size of the block. Since that depends only on the block, blocks are independent, and it should be possible to concatenate bgzf streams. In theory, Rsamtools could support writing bgzf to an anonymous pipe and return the data to R as a raw vector or connection. The data could then be written in append mode. It's also conceivable that one could use gzcon to write a gzip block into a raw vector, and then stream over the gzip header in R, and insert the BC field before writing. The header can't be that long.
 

ADD REPLY
0
Entering edit mode

I've a hard time concretely understanding what you suggest. Inside the while loop, I have a VCF object to write, but bgzip requires a file as input. So do you mean I should hack into the Rsamtools pkg to allow bgzip to take a connection as input? If so, by looking at the current C code, I guess, I should add a bgzip_con function, taking a pointer to a BGZF object as input?

ADD REPLY
0
Entering edit mode

Maybe it would be as simple as opening the bgzf file in append mode? From looking at the code, it looks like the bgzf_open function in samtools does not support appending, but since Rsamtools has already modified that function for Windows support, perhaps we could make another simple change.

ADD REPLY
0
Entering edit mode

That would do the trick, indeed. Do you want me to ask first the SAMtools team to add support for "append"? By looking at the latest version of htslib (master branch, here), it seems that bgzf_open does now support "a". Thus, is it possible to update Rsamtools with htslib v1.2.1?

ADD REPLY
0
Entering edit mode

I think that there is some movement afoot to port Rsamtools to use htslib (via Rhtslib), but that has not yet happened.

ADD REPLY

Login before adding your answer.

Traffic: 629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6