Question

VariantAnnotation::readVcf is reordering the FORMAT field of my VCF file

0

Entering edit mode

Peter Hickey ▴ 740

@petehaitch

Last seen 4 days ago

WEHI, Melbourne, Australia

Hi Valerie, readVcf is reordering the fields in the FORMAT field of my VCF file which is causing me problems downstream. I am using VariantAnnotation to subset my VCF file and then write the subset to file using writeVcf. However, the reordering of the FORMAT fields means by new VCF does not comply with VCF spec, specifically the GT-field is not the first sub-field of the FORMAT field. A reproducible example follows. Thanks, Pete > sessionInfo() R version 2.15.2 Patched (2013-02-10 r61900) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 [5] IRanges_1.16.6 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0 bitops_1.0-4.2 [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2 parallel_2.15.2 [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2 stats4_2.15.2 [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0 #### Bug report: readVcf() reordering FORMAT column of VCF file in violation of VCF spec #### library(VariantAnnotation) ## Download my example VCF download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf.gz ', destfile = '/tmp/readVcf_problem.vcf.gz') ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac ## Read in the VCF using readVcf() problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9') ## Note the order of geno(), which stores each field in the FORMAT column as a list element, is "AD DP GQ GT PL" geno(problem_vcf) ## Write the file back to disk using writeVcf() writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE) ## Cat the newly written version of the VCF: Note that the FORMAT field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says of the FORMAT field: "The first sub-field must always be the genotype (GT) if it is present" (see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf- variant-call-format-version-41) system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac ## No such problem with the example VCF fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") good_vcf <- readVcf(fl, genome = 'hg19') ## Sub-fields of FORMAT field are correctly ordered geno(good_vcf) -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}

• 2.2k views

ADD COMMENT • link updated 11.2 years ago by Valerie Obenchain ★ 6.8k • written 11.2 years ago by Peter Hickey ▴ 740

score 0 · Answer 1 · 2013-03-04

Hi Pete, The order is taken from the header. readVcf() attempts to assign data type and order based on header information. If there is no header line then no data type is assigned and the variable is left unparsed. > hd <- scanVcfHeader('readVcf_problem.vcf.gz') > geno(hd) DataFrame with 5 rows and 3 columns Number Type <character> <character> AD . Integer DP 1 Integer GQ 1 Integer GT 1 String PL G Integer ... ... > vcf <- readVcf("readVcf_problem.vcf.gz", "hg19") > geno(vcf) List of length 5 names(5): AD DP GQ GT PL I haven't run into 'GT' being out of order before - or at least causing problems downstream. Your quick fix is to move your GT header line above 'AD'. I will look into keeping 'GT' first reguardles of header order. Thanks for pointing this out. Valerie On 03/04/13 18:32, hickey at wehi.EDU.AU wrote: > Hi Valerie, > > readVcf is reordering the fields in the FORMAT field of my VCF file > which is causing me problems downstream. I am using VariantAnnotation > to subset my VCF file and then write the subset to file using > writeVcf. However, the reordering of the FORMAT fields means by new > VCF does not comply with VCF spec, specifically the GT-field is not > the first sub-field of the FORMAT field. A reproducible example follows. > > Thanks, > Pete > > > sessionInfo() > R version 2.15.2 Patched (2013-02-10 r61900) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3 > GenomicRanges_1.10.7 > [5] IRanges_1.16.6 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0 > bitops_1.0-4.2 > [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2 > parallel_2.15.2 > [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2 > stats4_2.15.2 > [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0 > > #### Bug report: readVcf() reordering FORMAT column of VCF file in > violation of VCF spec #### > library(VariantAnnotation) > ## Download my example VCF > download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf. gz', > destfile = '/tmp/readVcf_problem.vcf.gz') > ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL > system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to > alter this command depending on system; works on my linux machine but > need 'gzcat' on my mac > ## Read in the VCF using readVcf() > problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9') > ## Note the order of geno(), which stores each field in the FORMAT > column as a list element, is "AD DP GQ GT PL" > geno(problem_vcf) > ## Write the file back to disk using writeVcf() > writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE) > ## Cat the newly written version of the VCF: Note that the FORMAT > field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says > of the FORMAT field: "The first sub-field must always be the genotype > (GT) if it is present" (see > http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format /vcf-variant-call-format-version-41) > system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter > this command depending on system; works on my linux machine but need > 'gzcat' on my mac > > ## No such problem with the example VCF > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > good_vcf <- readVcf(fl, genome = 'hg19') > ## Sub-fields of FORMAT field are correctly ordered > geno(good_vcf) > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au <mailto:hickey at="" wehi.edu.au=""> > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}