Question

Best way to import from multiple VCF files?

0

Entering edit mode

Kipper Fletez-Brant ▴ 150

@kipper-fletez-brant-6421

Last seen 6.8 years ago

United States

Hi,

I have a set of VCF files with `tabix` indices, 1 per chromosome. I also have a `GRanges` object detailing loci of interest.

In the VariantAnnotation introduction PDF, the recommendation for only loading that subset of the data which overlap loci of interest is to first create a `TabixFile` and then pass that to `readVcf()`:

vcf.fl <- list.files("path/to/vcfs", full.names=TRUE, pattern=".vcf.gz$")

tab <- TabixFile(vcf.fl, index=paste(vcf.fl, "tbi", sep="."))

vcf.rng <- readVcf(tab, "hg19", param=GRObject)

Error: scanVcf: 'filename' must be character(1)

As `vcf.fl` is a list of files instead of 1 file as in the introduction, I attempted to use `TabixFileList`:

tab <- TabixFile(vcf.fl, index=paste(vcf.fl, "tbi", sep="."))

vcf.rng <- readVcf(tab, "hg19", param=GRObject)
 Error in (function (classes, fdef, mtable)  :
   unable to find an inherited method for function ‘readVcf’ for signature ‘"TabixFileList", "character", "GRanges"’

Clearly I am misunderstanding something simple, but Google does not suggest how to proceed. What is the right way to load in a subset of my data?

variantannotation • 3.4k views

ADD COMMENT • link updated 8.4 years ago by shepherl 4.1k • written 8.4 years ago by Kipper Fletez-Brant ▴ 150

1

Entering edit mode

shepherl 4.1k

@lshep

Last seen 1 day ago

United States

You might try using a VcfStack() object from the GenomicFiles pacakge. The class helps manage related VCF files as a group.

Something along the lines of

vs <- VcfStack(vcf.fl).

The VcfStack class allows subsetting by GRanges objects utilizing

subVS <- vs[GRObject,]

The class also utilizes a function to read all the VCF files in the object or a subset of the files

vcfinfo <- readVcfStack(subVS)

or

vcfinfo <- readVcfStack(vs, i=GRObject)

The updated functionality of this class is currently available in the devel version of the package. See GenomicFiles

ADD COMMENT • link 8.4 years ago shepherl 4.1k

0

Entering edit mode

I am working with the VcfStack/RangedVcfStack ideas now ... I just wanted to point out a discrepancy between the docs and the current package:

The following:

slens <- seqlengths(BSgenome.Hsapiens.UCSC.hg19)[1:24]
slens <- Seqinfo(seqnames=names(slens), seqlengths=slens, genome="hg19")
names(vcf.fl) <- gsub(".*.chr([0-9XY]+).*", "\\1", vcf.fl)

vs <- RangedVcfStack(VcfStack(vcf.fl, slens), GRObject)

vcfinfo <- readVcfStack(vs, i=GRObject)

Yields the following error:

 Error in readVcfStack(vs, unique(GRObject)) :
seqnames(i) must have one unique seqname

However, the following works:

rvs <- mclapply(unique(seqnames(GRObject)), function(chr){
    print(paste0("chr: ", chr))
    readVcfStack(vs, i=subset(GRObject, seqnames(GRObject) == chr))
}, mc.cores=10)

I get similarly unexpected errors attempting other, more intuitive, attempts at subsetting:

 head(assay(vs, GRObject))
 Error in (function (classes, fdef, mtable)  :
   unable to find an inherited method for function ‘assay’ for signature ‘"VcfStack", "GRanges"’

vs[GRObject,]
 Error: length(unique(querseq)) == 1 is not TRUE

vs[c("7", "X"),]
 Error in vs[c("7", "X"), ] : object of type 'S4' is not subsettable

While I know that this is a moving target presently, I am putting this here in part so others trying to Google for errors will find it.

ADD REPLY • link 8.4 years ago Kipper Fletez-Brant ▴ 150

0

Entering edit mode

Thank you for the comments.

Are you sure you are using the devel version of Bioconductor: see here

Also I am having trouble reproducing some of your errors. Could you share how you created your GRObject

ADD REPLY • link 8.4 years ago shepherl 4.1k

0

Entering edit mode

Hmm... re: devel, I suppose I am not using it. Will try that specifically. As far the GRObject goes:

library(AnnotationHub)
ah <- subset(AnnotationHub(), species == "Homo sapiens")
qhs <- query(query(ah, "CTCF"), "GM12878")
ctcf <- subset(qhs, title == "wgEncodeAwgTfbsUwGm12878CtcfUniPk.narrowPeak.gz")[[1]]

ADD REPLY • link 8.4 years ago Kipper Fletez-Brant ▴ 150

0

Entering edit mode

Please see if some of the errors are corrected when using the devel version. I just made a few changes so the most current version of the package is GenomicFiles_1.9.12

ADD REPLY • link 8.4 years ago shepherl 4.1k

score 1 · Accepted Answer · 2016-07-26

1

Entering edit mode

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 3 days ago

United States

The documentation for readVcf includes

file: A ‘TabixFile’ instance or character() name of the VCF file to

be processed. When ranges are specified in ‘param’, ‘file’

must be a ‘TabixFile’.

Use of the ‘TabixFile’ methods are encouraged as they are

more efficient than the character() methods. See ?‘TabixFile’

and ?‘indexTabix’ for help creating a ‘TabixFile’.

There is no accommodation for TabixFileList. Your first error comes from passing a TabixFile instance

that has multiple paths. I think this may be a bug in TabixFile. TabixFile() should fail when handed a vector of filenames.

I think the way you triggered the second error is not correctly given in your text. You do not show how

you create the TabixFileList. However I believe that your call will succeed if you did make a TabixFileList

as the signature error indicates, and you use readVcf(tab[[1]], "hg19", param.=GRObject). You use readVcf

one file at a time.

GenomicFiles package should provide tools that help with such family-of-files operations but I have a feeling that

all the connections have not yet been made.

ADD COMMENT • link 8.4 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Good point about TabixFileLIst:

tab <- TabixFileList(vcf.fl, index=paste(vcf.fl, "tbi", sep=".")) ## vcf.fl is the same as above

However, when I try your suggestion, I get the following:

vcf.rng <- readVcf(tab[[1]], "hg19", param=ctcf)
Error: scanVcf: scanVcf: scanTabix: 'chr1' not present in tabix index
  path: /path/to/chr10.vcf.gz
  index: /path/to/chr10.vcf.gz.tbi
   path: /path/to/chr1.vcf.gz

The files are sorted as "chr10", "chr11" ... "chr1" ... and so perhaps there is some sort of mismatch occurring? Not so - directly subsetting both to the VCF for chr1 and the GRanges for chr1 yields a similar error:

chr1 <- vcf.fl[[11]] ## this is chr1

chr1.rng <- readVcf(chr1, "hg19", param=subset(GRObject, seqnames=="chr1"))

 Error: scanVcf: scanVcf: scanTabix: 'chr1' not present in tabix index
  path: /path/to/chr1.vcf.gz
  index: /path/to/chr1.vcf.gz.tbi
   path: /path/to/chr1.vcf.gz

ADD REPLY • link 8.4 years ago Kipper Fletez-Brant ▴ 150

1

Entering edit mode

On Wed, Jul 27, 2016 at 7:03 AM, Kipper Fletez-Brant [bioc] < noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User Kipper Fletez-Brant <https: support.bioconductor.org="" u="" 6421=""/> wrote Comment: > Best way to import from multiple VCF files? > <https: support.bioconductor.org="" p="" 85458="" #85476="">: > > Good point about TabixFileLIst: > > tab <- TabixFileList(vcf.fl, index=paste(vcf.fl, "tbi", sep=".")) ## vcf.fl is the same as above > > However, when I try your suggestion, I get the following: > > vcf.rng <- readVcf(tab[[1]], "hg19", param=ctcf) > Error: scanVcf: scanVcf: scanTabix: 'chr1' not present in tabix index > path: /path/to/chr10.vcf.gz > index: /path/to/chr10.vcf.gz.tbi > path: /path/to/chr1.vcf.gz > > if the (a?) seqname in your ScanVcfParam 'which' component is not present in the tabix index you will get this error. If your vcfs are indeed chromosome specific (one chromosome at most per vcf) you will probably benefit from the VcfStack discipline noted by Lori Shepherd, but you will have to set up your selection parameters to match the Vcf content to which they are applied. This whole process deserves a higher-level solution but there do not seem to be many bioconductor users confronting it thus far. In essence, we should be able to encapsulate a vcf collection like 1000 genomes using GenomicFiles, and "efficiently" interrogate the collection for content occupying regions given by a GRanges without manually pairing the GRanges to the files. We would need some concept like "genomic iteration" underlying this because a holistic solution is not likely to work in available environments owing to the likely volumes involved. tileGenome would be a useful tool for setting up the iteration scheme. At a recent meeting it was mentioned that there is some vcf-oriented indexing strategy that allows sample-based extraction in addition to range-based extraction and that would be worth following up. I can't recall the details. It is not a challenging development problem with available infrastructure. A question is whether tabix-indexed vcf is durable enough to warrant the investment. Or will cloud-oriented or other more modern storage disciplines make this approach obsolete? > The files are sorted as "chr10", "chr11" ... "chr1" ... and so perhaps there is some sort of mismatch occurring? Not so - directly subsetting both to the VCF for chr1 and the GRanges for chr1 yields a similar error: > > chr1 <- vcf.fl[[11]] ## this is chr1 > > chr1.rng <- readVcf(chr1, "hg19", param=grSub(GRObject, "chr1")) > > Error: scanVcf: scanVcf: scanTabix: 'chr1' not present in tabix index > path: /path/to/chr1.vcf.gz > index: /path/to/chr1.vcf.gz.tbi > path: /path/to/chr1.vcf.gz > > > > ------------------------------ > > Post tags: variantannotation > > You may reply via email or visit > C: Best way to import from multiple VCF files? >

ADD REPLY • link 8.4 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

I am marking your answer as correct bc it pointed me in the right direction - embarrassingly the VCF files are all under the NCBI format for chr names, while my GRanges object was using the UCSC format :(

Nonetheless, I have benefitted from this discussion, as I was wholly unaware of VcfStack and related methods.

On a more general note, you say:

It is not a challenging development problem with available infrastructure. A question is whether tabix-indexed vcf is durable enough to warrant the investment. Or will cloud-oriented or other more modern storage disciplines make this approach obsolete?

we should be able to encapsulate a vcf collection like 1000 genomes using GenomicFiles, and "efficiently" interrogate the collection for content occupying regions given by a GRanges without manually pairing the GRanges to the files.

While, like many users, I have not come up against this problem previously, I expect that it will not go away as integrative analyses become more and more in vogue. Rather, I think it will become more important, and while implementation details may change from *.tbi files to something else, the concept of 'genomic iteration' warrants some further thought, as the current approach of querying for all regions in a GRanges object every time a chromosome-specific file is opened leaves something to be desired ... Not that I have any suggestions lol

ADD REPLY • link 8.4 years ago Kipper Fletez-Brant ▴ 150

1

Entering edit mode

What does headerTabix(vcf.fl[[11]])$seqnames have to say? It should include "chr1". If not, can you independently verify that the index does include "chr1"?

ADD REPLY • link 8.4 years ago Martin Morgan 25k