need additional sanity checks in TabixFile or readVcf
1
0
Entering edit mode
@jeremy-leipzig-4924
Last seen 6.0 years ago
I think the Bioconductor workflows for variant studies are great, but TabixFile or readVcf might need some additional sanity checks. If a careless user (not me of course - I'm asking for a friend) mistakenly assumes a VCF is the "TabixFile" and the vcf.bgz.idx is the Tabix index, they will get this cryptic error when attempting to load certain ranges from a VCF file. > tab<-TabixFile(vcfFile,paste(vcfFile,"bgz.tbi",sep=".")) > vcf<-readVcf(tab,genome="b37",param=some_param) Error in lapply(names(vcf[[1]]), function(elt) { : error in evaluating the argument 'X' in selecting a method for function 'lapply': Error in vcf[[1]] : subscript out of bounds This will lead to a lot of futile debugging with the assumption that either the ranges or the vcf itself are corrupt, since loading the vcf without ranges will not rely on Tabix. The Tabix setup is especially prone to error since compressing the VCF just seems like an intermediate step and is often performed in the shell instead of within R. Something that tests whether a Tabix index is really associated with a "TabixFile" would be helpful. Thanks, Jeremy [[alternative HTML version deleted]]
• 2.8k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
On 10/01/2013 07:25 AM, Jeremy Leipzig wrote: > I think the Bioconductor workflows for variant studies are great, but > TabixFile or readVcf might need some additional sanity checks. > > If a careless user (not me of course - I'm asking for a friend) mistakenly > assumes a VCF is the "TabixFile" and the vcf.bgz.idx is the Tabix index, > they will get this cryptic error when attempting to load certain ranges > from a VCF file. > >> tab<-TabixFile(vcfFile,paste(vcfFile,"bgz.tbi",sep=".")) >> vcf<-readVcf(tab,genome="b37",param=some_param) > Error in lapply(names(vcf[[1]]), function(elt) { : > error in evaluating the argument 'X' in selecting a method for > function 'lapply': Error in vcf[[1]] : subscript out of bounds > > > This will lead to a lot of futile debugging with the assumption that either > the ranges or the vcf itself are corrupt, since loading the vcf without > ranges will not rely on Tabix. > > The Tabix setup is especially prone to error since compressing the VCF just > seems like an intermediate step and is often performed in the shell instead > of within R. Something that tests whether a Tabix index is really > associated with a "TabixFile" would be helpful. Hi Jeremy -- I'm not 100% sure what your 'friend' did, could you get him / her to illustrate starting with say system.file(package="VariantAnnotation", "extdata", "ex2.vcf") ? Thanks, Martin > > Thanks, > Jeremy > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
well that's odd. managed to get the same error using the most kosher workflow I could find: > vcfFile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > from <- vcfFile > to <- tempfile() > compressVcf <- bgzip(from, to) > idx <- indexTabix(compressVcf, "vcf") > tab <- TabixFile(compressVcf, idx) > rng <- GRanges(seqnames="20", ranges=IRanges(start=c(250000, 500000),end=c(300000,600000))) > myparam<-ScanVcfParam(which=rng) > vcf <- readVcf(tab, "hg19",myparam) Error in lapply(names(vcf[[1]]), function(elt) { : error in evaluating the argument 'X' in selecting a method for function 'lapply': Error in vcf[[1]] : subscript out of bounds > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics utils datasets grDevices methods [8] base other attached packages: [1] VariantAnnotation_1.6.7 Rsamtools_1.12.4 Biostrings_2.28.0 [4] GenomicRanges_1.12.5 IRanges_1.18.4 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.22.6 Biobase_2.20.1 biomaRt_2.16.0 [4] bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 [7] GenomicFeatures_1.12.4 RCurl_1.95-4.1 RSQLite_0.11.4 [10] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1 [13] XML_3.98-1.1 zlibbioc_1.6.0 [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Jeremy, The problem here is that the positions specified in 'rng' are not in the bam file. In this case readVcf() should return an empty VCF - this was fixed in devel some time ago but not in release. This is now fixed in release v 1.6.8. > vcf <- readVcf(tab, "hg19",myparam) > vcf class: CollapsedVCF dim: 0 3 ... In your previous post you asked about supplying a VCF with corresponding .tbi file as a TabixFile. This should be fine. If the .tbi file does not exist an error will be thrown, > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > tab<-TabixFile(fl, paste(fl,"bgz.tbi",sep=".")) Error: TabixFile: file(s) do not exist: '/home/vobencha/R/R-rel/R-3-0-branch/library/VariantAnnotation/extdata /ex2.vcf' '/home/vobencha/R/R-rel/R-3-0-branch/library/VariantAnnotation/extdata /ex2.vcf.bgz.tbi' Because the error was the same in both situations my guess is that your 'param' was again out of range. If this wasn't the case please provide a reproducible example of the TabixFile case and I'll look into it. Valerie On 10/01/2013 09:03 AM, Jeremy Leipzig wrote: > well that's odd. managed to get the same error using the most kosher > workflow I could find: >> vcfFile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") >> from <- vcfFile >> to <- tempfile() >> compressVcf <- bgzip(from, to) >> idx <- indexTabix(compressVcf, "vcf") >> tab <- TabixFile(compressVcf, idx) >> rng <- GRanges(seqnames="20", ranges=IRanges(start=c(250000, > 500000),end=c(300000,600000))) >> myparam<-ScanVcfParam(which=rng) >> vcf <- readVcf(tab, "hg19",myparam) > Error in lapply(names(vcf[[1]]), function(elt) { : > error in evaluating the argument 'X' in selecting a method for function > 'lapply': Error in vcf[[1]] : subscript out of bounds >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics utils datasets grDevices methods > [8] base > > other attached packages: > [1] VariantAnnotation_1.6.7 Rsamtools_1.12.4 Biostrings_2.28.0 > [4] GenomicRanges_1.12.5 IRanges_1.18.4 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.22.6 Biobase_2.20.1 biomaRt_2.16.0 > [4] bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 > [7] GenomicFeatures_1.12.4 RCurl_1.95-4.1 RSQLite_0.11.4 > [10] rtracklayer_1.20.4 stats4_3.0.1 tools_3.0.1 > [13] XML_3.98-1.1 zlibbioc_1.6.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Valerie, Thanks. You will get the same cryptic error if you use the wrong file as the TabixFile. vcfFile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") from <- vcfFile to <- tempfile() compressVcf <- bgzip(from, to) idx <- indexTabix(compressVcf, "vcf") rng <- GRanges(seqnames="20", ranges=IRanges(start=c(14000, 1230000),end=c(15000,1240000))) myparam<-ScanVcfParam(which=rng) #incorrect - the idx file does not match up with the vcf here, it matches up with the compressed one tab <- TabixFile(vcfFile,idx) vcf <- readVcf(tab, "hg19",myparam) Error in lapply(names(vcf[[1]]), function(elt) { : error in evaluating the argument 'X' in selecting a method for function 'lapply': Error in vcf[[1]] : subscript out of bounds #correct tab <- TabixFile(compressVcf,idx) vcf <- readVcf(tab, "hg19",myparam) Regards, Jeremy [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Thanks for the example. In 1.7.52 (devel) an error will now be thrown if the supplied file is not compressed. Ideally we'd confirm the file is compressed when creating the TabixFile object but there isn't a fast easy way to check that (not that I know of). The next best thing is to capture the error when scanTabix() tries to read an uncompressed file. Here is the error you will see, > vcf <- readVcf(tab, "hg19",myparam) Error: scanVcf: scanTabix: read line failed, corrupt or invalid file? path: /home/vobencha/R/library/VariantAnnotation/extdata/ex2.vcf Valerie On 10/01/2013 10:46 AM, Jeremy Leipzig wrote: > Valerie, > Thanks. You will get the same cryptic error if you use the wrong file as > the TabixFile. > > vcfFile <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > from <- vcfFile > to <- tempfile() > compressVcf <- bgzip(from, to) > idx <- indexTabix(compressVcf, "vcf") > rng <- GRanges(seqnames="20", ranges=IRanges(start=c(14000, > 1230000),end=c(15000,1240000))) > myparam<-ScanVcfParam(which=rng) > > #incorrect - the idx file does not match up with the vcf here, it > matches up with the compressed one > tab <- TabixFile(vcfFile,idx) > vcf <- readVcf(tab, "hg19",myparam) > Error in lapply(names(vcf[[1]]), function(elt) { : > error in evaluating the argument 'X' in selecting a method for > function 'lapply': Error in vcf[[1]] : subscript out of bounds > > #correct > tab <- TabixFile(compressVcf,idx) > vcf <- readVcf(tab, "hg19",myparam) > > Regards, > Jeremy >
ADD REPLY

Login before adding your answer.

Traffic: 770 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6