Hi,
I have a set of VCF files with `tabix` indices, 1 per chromosome. I also have a `GRanges` object detailing loci of interest.
In the VariantAnnotation introduction PDF, the recommendation for only loading that subset of the data which overlap loci of interest is to first create a `TabixFile` and then pass that to `readVcf()`:
vcf.fl <- list.files("path/to/vcfs", full.names=TRUE, pattern=".vcf.gz$") tab <- TabixFile(vcf.fl, index=paste(vcf.fl, "tbi", sep=".")) vcf.rng <- readVcf(tab, "hg19", param=GRObject) Error: scanVcf: 'filename' must be character(1)
As `vcf.fl` is a list of files instead of 1 file as in the introduction, I attempted to use `TabixFileList`:
tab <- TabixFile(vcf.fl, index=paste(vcf.fl, "tbi", sep=".")) vcf.rng <- readVcf(tab, "hg19", param=GRObject) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readVcf’ for signature ‘"TabixFileList", "character", "GRanges"’
Clearly I am misunderstanding something simple, but Google does not suggest how to proceed. What is the right way to load in a subset of my data?
Good point about TabixFileLIst:
However, when I try your suggestion, I get the following:
The files are sorted as "chr10", "chr11" ... "chr1" ... and so perhaps there is some sort of mismatch occurring? Not so - directly subsetting both to the VCF for chr1 and the GRanges for chr1 yields a similar error:
I am marking your answer as correct bc it pointed me in the right direction - embarrassingly the VCF files are all under the NCBI format for chr names, while my GRanges object was using the UCSC format :(
Nonetheless, I have benefitted from this discussion, as I was wholly unaware of VcfStack and related methods.
On a more general note, you say:
It is not a challenging development problem with available infrastructure. A question is whether tabix-indexed vcf is durable enough to warrant the investment. Or will cloud-oriented or other more modern storage disciplines make this approach obsolete?
we should be able to encapsulate a vcf collection like 1000 genomes using GenomicFiles, and "efficiently" interrogate the collection for content occupying regions given by a GRanges without manually pairing the GRanges to the files.
While, like many users, I have not come up against this problem previously, I expect that it will not go away as integrative analyses become more and more in vogue. Rather, I think it will become more important, and while implementation details may change from *.tbi files to something else, the concept of 'genomic iteration' warrants some further thought, as the current approach of querying for all regions in a GRanges object every time a chromosome-specific file is opened leaves something to be desired ... Not that I have any suggestions lol
What does
headerTabix(vcf.fl[[11]])$seqnames
have to say? It should include "chr1". If not, can you independently verify that the index does include "chr1"?