Question: need additional sanity checks in TabixFile or readVcf
0
gravatar for Jeremy Leipzig
5.9 years ago by
Jeremy Leipzig70 wrote:
I think the Bioconductor workflows for variant studies are great, but TabixFile or readVcf might need some additional sanity checks. If a careless user (not me of course - I'm asking for a friend) mistakenly assumes a VCF is the "TabixFile" and the vcf.bgz.idx is the Tabix index, they will get this cryptic error when attempting to load certain ranges from a VCF file. > tab<-TabixFile(vcfFile,paste(vcfFile,"bgz.tbi",sep=".")) > vcf<-readVcf(tab,genome="b37",param=some_param) Error in lapply(names(vcf[[1]]), function(elt) { : error in evaluating the argument 'X' in selecting a method for function 'lapply': Error in vcf[[1]] : subscript out of bounds This will lead to a lot of futile debugging with the assumption that either the ranges or the vcf itself are corrupt, since loading the vcf without ranges will not rely on Tabix. The Tabix setup is especially prone to error since compressing the VCF just seems like an intermediate step and is often performed in the shell instead of within R. Something that tests whether a Tabix index is really associated with a "TabixFile" would be helpful. Thanks, Jeremy [[alternative HTML version deleted]]
• 1.5k views
ADD COMMENTlink modified 5.9 years ago by Martin Morgan ♦♦ 23k • written 5.9 years ago by Jeremy Leipzig70
Answer: need additional sanity checks in TabixFile or readVcf
0
gravatar for Martin Morgan
5.9 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:
On 10/01/2013 07:25 AM, Jeremy Leipzig wrote: > I think the Bioconductor workflows for variant studies are great, but > TabixFile or readVcf might need some additional sanity checks. > > If a careless user (not me of course - I'm asking for a friend) mistakenly > assumes a VCF is the "TabixFile" and the vcf.bgz.idx is the Tabix index, > they will get this cryptic error when attempting to load certain ranges > from a VCF file. > >> tab<-TabixFile(vcfFile,paste(vcfFile,"bgz.tbi",sep=".")) >> vcf<-readVcf(tab,genome="b37",param=some_param) > Error in lapply(names(vcf[[1]]), function(elt) { : > error in evaluating the argument 'X' in selecting a method for > function 'lapply': Error in vcf[[1]] : subscript out of bounds > > > This will lead to a lot of futile debugging with the assumption that either > the ranges or the vcf itself are corrupt, since loading the vcf without > ranges will not rely on Tabix. > > The Tabix setup is especially prone to error since compressing the VCF just > seems like an intermediate step and is often performed in the shell instead > of within R. Something that tests whether a Tabix index is really > associated with a "TabixFile" would be helpful. Hi Jeremy -- I'm not 100% sure what your 'friend' did, could you get him / her to illustrate starting with say system.file(package="VariantAnnotation", "extdata", "ex2.vcf") ? Thanks, Martin > > Thanks, > Jeremy > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENTlink written 5.9 years ago by Martin Morgan ♦♦ 23k
well that's odd. managed to get the same error using the most kosher workflow I could find: > vcfFile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > from <- vcfFile > to <- tempfile() > compressVcf <- bgzip(from, to) > idx <- indexTabix(compressVcf, "vcf") > tab <- TabixFile(compressVcf, idx) > rng <- GRanges(seqnames="20", ranges=IRanges(start=c(250000, 500000),end=c(300000,600000))) > myparam<-ScanVcfParam(which=rng) > vcf <- readVcf(tab, "hg19",myparam) Error in lapply(names(vcf[[1]]), function(elt) { : error in evaluating the argument 'X' in selecting a method for function 'lapply': Error in vcf[[1]] : subscript out of bounds > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics utils datasets grDevices methods [8] base other attached packages: [1] VariantAnnotation_1.6.7 Rsamtools_1.12.4 Biostrings_2.28.0 [4] GenomicRanges_1.12.5 IRanges_1.18.4 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.22.6 Biobase_2.20.1 biomaRt_2.16.0 [4] bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 [7] GenomicFeatures_1.12.4 RCurl_1.95-4.1 RSQLite_0.11.4 [10] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1 [13] XML_3.98-1.1 zlibbioc_1.6.0 [[alternative HTML version deleted]]
ADD REPLYlink written 5.9 years ago by Jeremy Leipzig70
Hi Jeremy, The problem here is that the positions specified in 'rng' are not in the bam file. In this case readVcf() should return an empty VCF - this was fixed in devel some time ago but not in release. This is now fixed in release v 1.6.8. > vcf <- readVcf(tab, "hg19",myparam) > vcf class: CollapsedVCF dim: 0 3 ... In your previous post you asked about supplying a VCF with corresponding .tbi file as a TabixFile. This should be fine. If the .tbi file does not exist an error will be thrown, > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > tab<-TabixFile(fl, paste(fl,"bgz.tbi",sep=".")) Error: TabixFile: file(s) do not exist: '/home/vobencha/R/R-rel/R-3-0-branch/library/VariantAnnotation/extdata /ex2.vcf' '/home/vobencha/R/R-rel/R-3-0-branch/library/VariantAnnotation/extdata /ex2.vcf.bgz.tbi' Because the error was the same in both situations my guess is that your 'param' was again out of range. If this wasn't the case please provide a reproducible example of the TabixFile case and I'll look into it. Valerie On 10/01/2013 09:03 AM, Jeremy Leipzig wrote: > well that's odd. managed to get the same error using the most kosher > workflow I could find: >> vcfFile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") >> from <- vcfFile >> to <- tempfile() >> compressVcf <- bgzip(from, to) >> idx <- indexTabix(compressVcf, "vcf") >> tab <- TabixFile(compressVcf, idx) >> rng <- GRanges(seqnames="20", ranges=IRanges(start=c(250000, > 500000),end=c(300000,600000))) >> myparam<-ScanVcfParam(which=rng) >> vcf <- readVcf(tab, "hg19",myparam) > Error in lapply(names(vcf[[1]]), function(elt) { : > error in evaluating the argument 'X' in selecting a method for function > 'lapply': Error in vcf[[1]] : subscript out of bounds >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics utils datasets grDevices methods > [8] base > > other attached packages: > [1] VariantAnnotation_1.6.7 Rsamtools_1.12.4 Biostrings_2.28.0 > [4] GenomicRanges_1.12.5 IRanges_1.18.4 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.22.6 Biobase_2.20.1 biomaRt_2.16.0 > [4] bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 > [7] GenomicFeatures_1.12.4 RCurl_1.95-4.1 RSQLite_0.11.4 > [10] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1 > [13] XML_3.98-1.1 zlibbioc_1.6.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLYlink written 5.9 years ago by Valerie Obenchain6.7k
Valerie, Thanks. You will get the same cryptic error if you use the wrong file as the TabixFile. vcfFile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") from <- vcfFile to <- tempfile() compressVcf <- bgzip(from, to) idx <- indexTabix(compressVcf, "vcf") rng <- GRanges(seqnames="20", ranges=IRanges(start=c(14000, 1230000),end=c(15000,1240000))) myparam<-ScanVcfParam(which=rng) #incorrect - the idx file does not match up with the vcf here, it matches up with the compressed one tab <- TabixFile(vcfFile,idx) vcf <- readVcf(tab, "hg19",myparam) Error in lapply(names(vcf[[1]]), function(elt) { : error in evaluating the argument 'X' in selecting a method for function 'lapply': Error in vcf[[1]] : subscript out of bounds #correct tab <- TabixFile(compressVcf,idx) vcf <- readVcf(tab, "hg19",myparam) Regards, Jeremy [[alternative HTML version deleted]]
ADD REPLYlink written 5.9 years ago by Jeremy Leipzig70
Thanks for the example. In 1.7.52 (devel) an error will now be thrown if the supplied file is not compressed. Ideally we'd confirm the file is compressed when creating the TabixFile object but there isn't a fast easy way to check that (not that I know of). The next best thing is to capture the error when scanTabix() tries to read an uncompressed file. Here is the error you will see, > vcf <- readVcf(tab, "hg19",myparam) Error: scanVcf: scanTabix: read line failed, corrupt or invalid file? path: /home/vobencha/R/library/VariantAnnotation/extdata/ex2.vcf Valerie On 10/01/2013 10:46 AM, Jeremy Leipzig wrote: > Valerie, > Thanks. You will get the same cryptic error if you use the wrong file as > the TabixFile. > > vcfFile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > from <- vcfFile > to <- tempfile() > compressVcf <- bgzip(from, to) > idx <- indexTabix(compressVcf, "vcf") > rng <- GRanges(seqnames="20", ranges=IRanges(start=c(14000, > 1230000),end=c(15000,1240000))) > myparam<-ScanVcfParam(which=rng) > > #incorrect - the idx file does not match up with the vcf here, it > matches up with the compressed one > tab <- TabixFile(vcfFile,idx) > vcf <- readVcf(tab, "hg19",myparam) > Error in lapply(names(vcf[[1]]), function(elt) { : > error in evaluating the argument 'X' in selecting a method for > function 'lapply': Error in vcf[[1]] : subscript out of bounds > > #correct > tab <- TabixFile(compressVcf,idx) > vcf <- readVcf(tab, "hg19",myparam) > > Regards, > Jeremy >
ADD REPLYlink written 5.9 years ago by Valerie Obenchain6.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 274 users visited in the last hour