Bug in subsetting vcf file by subject on more than 5 samples
1
0
Entering edit mode
Robert Castelo ★ 3.3k
@rcastelo
Last seen 16 hours ago
Barcelona/Universitat Pompeu Fabra
hi, when using the recently added feature of subsetting a vcf file by subject, described initially here: https://stat.ethz.ch/pipermail/bioconductor/2013-April/052146.html i found out that it breaks when the samples to subset are more than 5. here is the code reproducing the problem: suppressPackageStartupMessages(library(VariantAnnotation)) fl <- system.file("extdata", "gl_chr1.vcf", package="VariantAnnotation") hdr <- scanVcfHeader(fl) length(samples(hdr)) [1] 85 so the above example file has 85 samples, let's try to subset on the first 6 ones: param <- ScanVcfParam(samples=samples(hdr)[1:6]) vcf <- readVcf(fl, "hg19", param) Warning message: In .vcf_map_samples(samples(hdr), samples) : samples not in file: ?...? next to this warning, the resulting 'CollapsedVCF' object has only 4 samples instead of 6: dim(vcf) [1] 9 4 a hint about what might be happening comes from the sample names in the parameter object and the '...' referred to in the warning: vcfSamples(param) [1] "NA06984" "NA06986" "..." "NA07000" "NA07037" this happens in both the release and devel versions, here's my sessionInfo() for the release version: R version 3.0.2 (2013-09-25) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C LC_TIME=en_US.UTF8 [4] LC_COLLATE=en_US.UTF8 LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8 [7] LC_PAPER=en_US.UTF8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.8.8 Rsamtools_1.14.2 Biostrings_2.30.1 [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 [7] BiocGenerics_0.8.0 vimcom_0.9-92 setwidth_1.0-3 [10] colorout_1.0-1 loaded via a namespace (and not attached): [1] AnnotationDbi_1.24.0 Biobase_2.22.0 biomaRt_2.18.0 bitops_1.0-6 [5] BSgenome_1.30.0 DBI_0.2-7 GenomicFeatures_1.14.2 RCurl_1.95-4.1 [9] RSQLite_0.11.4 rtracklayer_1.22.0 stats4_3.0.2 tools_3.0.2 [13] XML_3.98-1.1 zlibbioc_1.8.0 thanks! robert. -- Robert Castelo, PhD Associate Professor Dept. of Experimental and Health Sciences Universitat Pompeu Fabra (UPF) Barcelona Biomedical Research Park (PRBB) Dr Aiguader 88 E-08003 Barcelona, Spain telf: +34.933.160.514 fax: +34.933.160.550
• 902 views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.4 years ago
United States
Now fixed in 1.9.26. I was using vcfSamples() on a ScanVcfParam to pass the sample names down to scanVccf(). This extractor has a fancy show method that for n > 5 displays a few names and '...' to indicate more. This character '...' was being passed as a sample name and triggered the check for non-existent sample names. Thanks for catching this and for providing a reproducible example. Valerie On 12/17/2013 03:28 AM, Robert Castelo wrote: > hi, > > when using the recently added feature of subsetting a vcf file by > subject, described initially here: > > https://stat.ethz.ch/pipermail/bioconductor/2013-April/052146.html > > i found out that it breaks when the samples to subset are more than 5. > here is the code reproducing the problem: > > suppressPackageStartupMessages(library(VariantAnnotation)) > fl <- system.file("extdata", "gl_chr1.vcf", package="VariantAnnotation") > hdr <- scanVcfHeader(fl) > length(samples(hdr)) > [1] 85 > > so the above example file has 85 samples, let's try to subset on the > first 6 ones: > > param <- ScanVcfParam(samples=samples(hdr)[1:6]) > vcf <- readVcf(fl, "hg19", param) > Warning message: > In .vcf_map_samples(samples(hdr), samples) : samples not in file: ?...? > > next to this warning, the resulting 'CollapsedVCF' object has only 4 > samples instead of 6: > > dim(vcf) > [1] 9 4 > > a hint about what might be happening comes from the sample names in the > parameter object and the '...' referred to in the warning: > > vcfSamples(param) > [1] "NA06984" "NA06986" "..." "NA07000" "NA07037" > > this happens in both the release and devel versions, here's my > sessionInfo() for the release version: > > R version 3.0.2 (2013-09-25) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C LC_TIME=en_US.UTF8 > [4] LC_COLLATE=en_US.UTF8 LC_MONETARY=en_US.UTF8 > LC_MESSAGES=en_US.UTF8 > [7] LC_PAPER=en_US.UTF8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF8 > LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] VariantAnnotation_1.8.8 Rsamtools_1.14.2 Biostrings_2.30.1 > [4] GenomicRanges_1.14.4 XVector_0.2.0 IRanges_1.20.6 > [7] BiocGenerics_0.8.0 vimcom_0.9-92 setwidth_1.0-3 > [10] colorout_1.0-1 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.24.0 Biobase_2.22.0 biomaRt_2.18.0 > bitops_1.0-6 > [5] BSgenome_1.30.0 DBI_0.2-7 GenomicFeatures_1.14.2 > RCurl_1.95-4.1 > [9] RSQLite_0.11.4 rtracklayer_1.22.0 stats4_3.0.2 > tools_3.0.2 > [13] XML_3.98-1.1 zlibbioc_1.8.0 > > thanks! > robert. > -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319
ADD COMMENT

Login before adding your answer.

Traffic: 415 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6