Search
Question: Multi-sample VCF file and filterVcf
0
gravatar for Guest User
3.5 years ago by
Guest User12k
Guest User12k wrote:
Hi, I have a multi-sample VCF file where each record contains genotype information (e.g. GT, DP, AD etc. ) for a number of samples, as well as various annotation tags/values. Basically, I wonder how I can combine the VariantAnnotation and the filterVcf functionality to - Parse the VCF and a) subset/filter variants in samples that satisfy given criteria (e.g. DP >= 10) b) subset/filter variants according to annotation criteria or the FILTER column c) output the filtered variants in a sample-wise manner I could not find any examples that dealt with multi-sample VCF files from the documentation. Thanks, Sigve -- output of sessionInfo(): R version 3.1.0 (2014-04-10) Platform: x86_64-apple-darwin13.1.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] ggplot2_0.9.3.1 stringr_0.6.2 xtable_1.7-3 reshape2_1.4 lattice_0.20-29 [6] dplyr_0.1.3 VariantAnnotation_1.10.1 Rsamtools_1.16.0 Biostrings_2.32.0 XVector_0.4.0 [11] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.6 BiocGenerics_0.10.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.26.0 assertthat_0.1 BatchJobs_1.2 BBmisc_1.6 Biobase_2.24.0 [6] BiocParallel_0.6.0 biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 BSgenome_1.32.0 [11] codetools_0.2-8 colorspace_1.2-4 DBI_0.2-7 digest_0.6.4 evaluate_0.5.5 [16] fail_1.2 foreach_1.4.2 formatR_0.10 GenomicAlignments_1.0.1 GenomicFeatures_1.16.0 [21] grid_3.1.0 gtable_0.1.2 iterators_1.0.7 knitr_1.5 labeling_0.2 [26] MASS_7.3-31 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.1 [31] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.24.0 scales_0.2.4 sendmailR_1.1-2 [36] stats4_3.1.0 tools_3.1.0 XML_3.98-1.1 zlibbioc_1.10.0 -- Sent via the guest posting facility at bioconductor.org.
ADD COMMENTlink modified 3.5 years ago by Valerie Obenchain ♦♦ 6.4k • written 3.5 years ago by Guest User12k
0
gravatar for Valerie Obenchain
3.5 years ago by
Valerie Obenchain ♦♦ 6.4k
United States
Valerie Obenchain ♦♦ 6.4k wrote:
Hi, filterVcf() transfers the content of one vcf file on disk to another, filtering records as it goes. There are two types of filtering which are described in the 'Details' section of the ?filterVcf man page. One is a 'pre-filter' which is equavalent to a grep of the unparsed lines. The second is called simply 'filter' which involves reading the records into a VCF (R object) then filtering. On 05/16/2014 07:05 AM, Sigve Nakken [guest] wrote: > Hi, > > I have a multi-sample VCF file where each record contains genotype information (e.g. GT, DP, AD etc. ) for a number of samples, as well as various annotation tags/values. Basically, I wonder how I can combine the VariantAnnotation and the filterVcf functionality to > > - Parse the VCF and > a) subset/filter variants in samples that satisfy given criteria (e.g. DP >= 10) For this case you'll want to define a 'filter'. The man page has an example of filtering by an info field, you want to do the same by a geno field. Your filter will look something like this, filt <- FilterRules(list(DP_filt = function(x) geno(x)$DP >= 10)) > b) subset/filter variants according to annotation criteria or the FILTER column Because you are simply looking for a value match here (not <, >, etc.) you can use a pre-filter. Follow the man page example for 'low coverage'. pre <- FilterRules(list(FILTER = function(x) grepl("yourValueHere", x, fixed=TRUE))) > c) output the filtered variants in a sample-wise manner I'm not sure I understand. filterVcf() and readVcf() work with multiple sample files. The output of the first is a vcf on disk, the second is the VCF R object. In the VCF object samples are in the geno() arrays ordered as they are in the vcf file. To read in selective samples you can use scanVcfParam(samples = c("samp12", "samp3", "samp1")) Valerie > > I could not find any examples that dealt with multi-sample VCF files from the documentation. > > Thanks, > Sigve > > > > > > > -- output of sessionInfo(): > > R version 3.1.0 (2014-04-10) > Platform: x86_64-apple-darwin13.1.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ggplot2_0.9.3.1 stringr_0.6.2 xtable_1.7-3 reshape2_1.4 lattice_0.20-29 > [6] dplyr_0.1.3 VariantAnnotation_1.10.1 Rsamtools_1.16.0 Biostrings_2.32.0 XVector_0.4.0 > [11] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.6 BiocGenerics_0.10.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.26.0 assertthat_0.1 BatchJobs_1.2 BBmisc_1.6 Biobase_2.24.0 > [6] BiocParallel_0.6.0 biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 BSgenome_1.32.0 > [11] codetools_0.2-8 colorspace_1.2-4 DBI_0.2-7 digest_0.6.4 evaluate_0.5.5 > [16] fail_1.2 foreach_1.4.2 formatR_0.10 GenomicAlignments_1.0.1 GenomicFeatures_1.16.0 > [21] grid_3.1.0 gtable_0.1.2 iterators_1.0.7 knitr_1.5 labeling_0.2 > [26] MASS_7.3-31 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.1 > [31] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.24.0 scales_0.2.4 sendmailR_1.1-2 > [36] stats4_3.1.0 tools_3.1.0 XML_3.98-1.1 zlibbioc_1.10.0 > > -- > Sent via the guest posting facility at bioconductor.org. > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158
ADD COMMENTlink written 3.5 years ago by Valerie Obenchain ♦♦ 6.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 254 users visited in the last hour