Question: VariantAnnotation::readVcf is reordering the FORMAT field of my VCF file
0
gravatar for Peter Hickey
6.7 years ago by
Peter Hickey460
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Peter Hickey460 wrote:
Hi Valerie, readVcf is reordering the fields in the FORMAT field of my VCF file which is causing me problems downstream. I am using VariantAnnotation to subset my VCF file and then write the subset to file using writeVcf. However, the reordering of the FORMAT fields means by new VCF does not comply with VCF spec, specifically the GT-field is not the first sub-field of the FORMAT field. A reproducible example follows. Thanks, Pete > sessionInfo() R version 2.15.2 Patched (2013-02-10 r61900) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 [5] IRanges_1.16.6 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0 bitops_1.0-4.2 [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2 parallel_2.15.2 [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2 stats4_2.15.2 [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0 #### Bug report: readVcf() reordering FORMAT column of VCF file in violation of VCF spec #### library(VariantAnnotation) ## Download my example VCF download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf.gz ', destfile = '/tmp/readVcf_problem.vcf.gz') ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac ## Read in the VCF using readVcf() problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9') ## Note the order of geno(), which stores each field in the FORMAT column as a list element, is "AD DP GQ GT PL" geno(problem_vcf) ## Write the file back to disk using writeVcf() writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE) ## Cat the newly written version of the VCF: Note that the FORMAT field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says of the FORMAT field: "The first sub-field must always be the genotype (GT) if it is present" (see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf- variant-call-format-version-41) system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac ## No such problem with the example VCF fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") good_vcf <- readVcf(fl, genome = 'hg19') ## Sub-fields of FORMAT field are correctly ordered geno(good_vcf) -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}
• 1.3k views
ADD COMMENTlink modified 6.7 years ago by Valerie Obenchain6.7k • written 6.7 years ago by Peter Hickey460
Answer: VariantAnnotation::readVcf is reordering the FORMAT field of my VCF file
0
gravatar for Valerie Obenchain
6.7 years ago by
United States
Valerie Obenchain6.7k wrote:
Hi Pete, The order is taken from the header. readVcf() attempts to assign data type and order based on header information. If there is no header line then no data type is assigned and the variable is left unparsed. > hd <- scanVcfHeader('readVcf_problem.vcf.gz') > geno(hd) DataFrame with 5 rows and 3 columns Number Type <character> <character> AD . Integer DP 1 Integer GQ 1 Integer GT 1 String PL G Integer ... ... > vcf <- readVcf("readVcf_problem.vcf.gz", "hg19") > geno(vcf) List of length 5 names(5): AD DP GQ GT PL I haven't run into 'GT' being out of order before - or at least causing problems downstream. Your quick fix is to move your GT header line above 'AD'. I will look into keeping 'GT' first reguardles of header order. Thanks for pointing this out. Valerie On 03/04/13 18:32, hickey at wehi.EDU.AU wrote: > Hi Valerie, > > readVcf is reordering the fields in the FORMAT field of my VCF file > which is causing me problems downstream. I am using VariantAnnotation > to subset my VCF file and then write the subset to file using > writeVcf. However, the reordering of the FORMAT fields means by new > VCF does not comply with VCF spec, specifically the GT-field is not > the first sub-field of the FORMAT field. A reproducible example follows. > > Thanks, > Pete > > > sessionInfo() > R version 2.15.2 Patched (2013-02-10 r61900) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3 > GenomicRanges_1.10.7 > [5] IRanges_1.16.6 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0 > bitops_1.0-4.2 > [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2 > parallel_2.15.2 > [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2 > stats4_2.15.2 > [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0 > > #### Bug report: readVcf() reordering FORMAT column of VCF file in > violation of VCF spec #### > library(VariantAnnotation) > ## Download my example VCF > download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf. gz', > destfile = '/tmp/readVcf_problem.vcf.gz') > ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL > system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to > alter this command depending on system; works on my linux machine but > need 'gzcat' on my mac > ## Read in the VCF using readVcf() > problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9') > ## Note the order of geno(), which stores each field in the FORMAT > column as a list element, is "AD DP GQ GT PL" > geno(problem_vcf) > ## Write the file back to disk using writeVcf() > writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE) > ## Cat the newly written version of the VCF: Note that the FORMAT > field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says > of the FORMAT field: "The first sub-field must always be the genotype > (GT) if it is present" (see > http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format /vcf-variant-call-format-version-41) > system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter > this command depending on system; works on my linux machine but need > 'gzcat' on my mac > > ## No such problem with the example VCF > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > good_vcf <- readVcf(fl, genome = 'hg19') > ## Sub-fields of FORMAT field are correctly ordered > geno(good_vcf) > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au <mailto:hickey at="" wehi.edu.au=""> > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}
ADD COMMENTlink written 6.7 years ago by Valerie Obenchain6.7k
Hi Valerie, Thank you for your quick reply and the suggested work-around. Modifying the VCF header did prove a quick fix. A few more details that may help in deciding on a permanent fix. 1. This VCF was created using GATK's UnifiedGenotyper (v2.3-3-g4706074). I also created a VCF using the HaplotypeCaller in GATK (same version) and it also output the header in the same order. It appears to be the standard ordering of FORMAT fields in the header for VCFs created by GATK (http://gatkforums.broadinstitute.org/discussion/1268/how-should-i -interpret-vcf-files-produced-by-the-gatk) 2. The downstream problem I ran into was in using the Integrative Genomics Viewer from the Broad. It refused to load my subset VCF file because the GT sub-field didn't appear first in the FORMAT field. IGV seems to be very strict in following VCF specs. 3. Annoyingly, as far as I can tell there is no requirement/guarantee that a VCF file conforming to VCF spec has the header information describing the FORMAT fields in the same order as how those fields appear the individual variant calls. Thanks again, Pete On 05/03/2013, at 3:25 PM, Valerie Obenchain wrote: > Hi Pete, > > The order is taken from the header. readVcf() attempts to assign data type and order based on header information. If there is no header line then no data type is assigned and the variable is left unparsed. > > > hd <- scanVcfHeader('readVcf_problem.vcf.gz') > > geno(hd) > DataFrame with 5 rows and 3 columns > Number Type > <character> <character> > AD . Integer > DP 1 Integer > GQ 1 Integer > GT 1 String > PL G Integer > ... > ... > > > vcf <- readVcf("readVcf_problem.vcf.gz", "hg19") > > geno(vcf) > List of length 5 > names(5): AD DP GQ GT PL > > I haven't run into 'GT' being out of order before - or at least causing problems downstream. Your quick fix is to move your GT header line above 'AD'. I will look into keeping 'GT' first reguardles of header order. > > Thanks for pointing this out. > Valerie > > > On 03/04/13 18:32, hickey@wehi.EDU.AU wrote: >> Hi Valerie, >> >> readVcf is reordering the fields in the FORMAT field of my VCF file which is causing me problems downstream. I am using VariantAnnotation to subset my VCF file and then write the subset to file using writeVcf. However, the reordering of the FORMAT fields means by new VCF does not comply with VCF spec, specifically the GT-field is not the first sub-field of the FORMAT field. A reproducible example follows. >> >> Thanks, >> Pete >> >> > sessionInfo() >> R version 2.15.2 Patched (2013-02-10 r61900) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 >> [5] IRanges_1.16.6 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0 bitops_1.0-4.2 >> [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2 parallel_2.15.2 >> [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2 stats4_2.15.2 >> [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0 >> >> #### Bug report: readVcf() reordering FORMAT column of VCF file in violation of VCF spec #### >> library(VariantAnnotation) >> ## Download my example VCF >> download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf .gz', destfile = '/tmp/readVcf_problem.vcf.gz') >> ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL >> system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac >> ## Read in the VCF using readVcf() >> problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9') >> ## Note the order of geno(), which stores each field in the FORMAT column as a list element, is "AD DP GQ GT PL" >> geno(problem_vcf) >> ## Write the file back to disk using writeVcf() >> writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE) >> ## Cat the newly written version of the VCF: Note that the FORMAT field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says of the FORMAT field: "The first sub-field must always be the genotype (GT) if it is present" (see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf- variant-call-format-version-41) >> system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac >> >> ## No such problem with the example VCF >> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") >> good_vcf <- readVcf(fl, genome = 'hg19') >> ## Sub-fields of FORMAT field are correctly ordered >> geno(good_vcf) >> >> -------------------------------- >> Peter Hickey, >> PhD Student/Research Assistant, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Ph: +613 9345 2324 >> >> hickey@wehi.edu.au <mailto:hickey@wehi.edu.au> >> http://www.wehi.edu.au >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ > -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}
ADD REPLYlink written 6.7 years ago by Peter Hickey460
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 194 users visited in the last hour