VariantAnnotation::readVcf is reordering the FORMAT field of my VCF file
1
0
Entering edit mode
Peter Hickey ▴ 740
@petehaitch
Last seen 4 days ago
WEHI, Melbourne, Australia
Hi Valerie, readVcf is reordering the fields in the FORMAT field of my VCF file which is causing me problems downstream. I am using VariantAnnotation to subset my VCF file and then write the subset to file using writeVcf. However, the reordering of the FORMAT fields means by new VCF does not comply with VCF spec, specifically the GT-field is not the first sub-field of the FORMAT field. A reproducible example follows. Thanks, Pete > sessionInfo() R version 2.15.2 Patched (2013-02-10 r61900) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 [5] IRanges_1.16.6 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0 bitops_1.0-4.2 [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2 parallel_2.15.2 [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2 stats4_2.15.2 [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0 #### Bug report: readVcf() reordering FORMAT column of VCF file in violation of VCF spec #### library(VariantAnnotation) ## Download my example VCF download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf.gz ', destfile = '/tmp/readVcf_problem.vcf.gz') ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac ## Read in the VCF using readVcf() problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9') ## Note the order of geno(), which stores each field in the FORMAT column as a list element, is "AD DP GQ GT PL" geno(problem_vcf) ## Write the file back to disk using writeVcf() writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE) ## Cat the newly written version of the VCF: Note that the FORMAT field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says of the FORMAT field: "The first sub-field must always be the genotype (GT) if it is present" (see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf- variant-call-format-version-41) system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac ## No such problem with the example VCF fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") good_vcf <- readVcf(fl, genome = 'hg19') ## Sub-fields of FORMAT field are correctly ordered geno(good_vcf) -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}
• 2.2k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.4 years ago
United States
Hi Pete, The order is taken from the header. readVcf() attempts to assign data type and order based on header information. If there is no header line then no data type is assigned and the variable is left unparsed. > hd <- scanVcfHeader('readVcf_problem.vcf.gz') > geno(hd) DataFrame with 5 rows and 3 columns Number Type <character> <character> AD . Integer DP 1 Integer GQ 1 Integer GT 1 String PL G Integer ... ... > vcf <- readVcf("readVcf_problem.vcf.gz", "hg19") > geno(vcf) List of length 5 names(5): AD DP GQ GT PL I haven't run into 'GT' being out of order before - or at least causing problems downstream. Your quick fix is to move your GT header line above 'AD'. I will look into keeping 'GT' first reguardles of header order. Thanks for pointing this out. Valerie On 03/04/13 18:32, hickey at wehi.EDU.AU wrote: > Hi Valerie, > > readVcf is reordering the fields in the FORMAT field of my VCF file > which is causing me problems downstream. I am using VariantAnnotation > to subset my VCF file and then write the subset to file using > writeVcf. However, the reordering of the FORMAT fields means by new > VCF does not comply with VCF spec, specifically the GT-field is not > the first sub-field of the FORMAT field. A reproducible example follows. > > Thanks, > Pete > > > sessionInfo() > R version 2.15.2 Patched (2013-02-10 r61900) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3 > GenomicRanges_1.10.7 > [5] IRanges_1.16.6 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0 > bitops_1.0-4.2 > [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2 > parallel_2.15.2 > [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2 > stats4_2.15.2 > [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0 > > #### Bug report: readVcf() reordering FORMAT column of VCF file in > violation of VCF spec #### > library(VariantAnnotation) > ## Download my example VCF > download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf. gz', > destfile = '/tmp/readVcf_problem.vcf.gz') > ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL > system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to > alter this command depending on system; works on my linux machine but > need 'gzcat' on my mac > ## Read in the VCF using readVcf() > problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9') > ## Note the order of geno(), which stores each field in the FORMAT > column as a list element, is "AD DP GQ GT PL" > geno(problem_vcf) > ## Write the file back to disk using writeVcf() > writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE) > ## Cat the newly written version of the VCF: Note that the FORMAT > field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says > of the FORMAT field: "The first sub-field must always be the genotype > (GT) if it is present" (see > http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format /vcf-variant-call-format-version-41) > system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter > this command depending on system; works on my linux machine but need > 'gzcat' on my mac > > ## No such problem with the example VCF > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > good_vcf <- readVcf(fl, genome = 'hg19') > ## Sub-fields of FORMAT field are correctly ordered > geno(good_vcf) > > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au <mailto:hickey at="" wehi.edu.au=""> > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:6}}
ADD COMMENT
0
Entering edit mode
Hi Valerie, Thank you for your quick reply and the suggested work-around. Modifying the VCF header did prove a quick fix. A few more details that may help in deciding on a permanent fix. 1. This VCF was created using GATK's UnifiedGenotyper (v2.3-3-g4706074). I also created a VCF using the HaplotypeCaller in GATK (same version) and it also output the header in the same order. It appears to be the standard ordering of FORMAT fields in the header for VCFs created by GATK (http://gatkforums.broadinstitute.org/discussion/1268/how-should-i -interpret-vcf-files-produced-by-the-gatk) 2. The downstream problem I ran into was in using the Integrative Genomics Viewer from the Broad. It refused to load my subset VCF file because the GT sub-field didn't appear first in the FORMAT field. IGV seems to be very strict in following VCF specs. 3. Annoyingly, as far as I can tell there is no requirement/guarantee that a VCF file conforming to VCF spec has the header information describing the FORMAT fields in the same order as how those fields appear the individual variant calls. Thanks again, Pete On 05/03/2013, at 3:25 PM, Valerie Obenchain wrote: > Hi Pete, > > The order is taken from the header. readVcf() attempts to assign data type and order based on header information. If there is no header line then no data type is assigned and the variable is left unparsed. > > > hd <- scanVcfHeader('readVcf_problem.vcf.gz') > > geno(hd) > DataFrame with 5 rows and 3 columns > Number Type > <character> <character> > AD . Integer > DP 1 Integer > GQ 1 Integer > GT 1 String > PL G Integer > ... > ... > > > vcf <- readVcf("readVcf_problem.vcf.gz", "hg19") > > geno(vcf) > List of length 5 > names(5): AD DP GQ GT PL > > I haven't run into 'GT' being out of order before - or at least causing problems downstream. Your quick fix is to move your GT header line above 'AD'. I will look into keeping 'GT' first reguardles of header order. > > Thanks for pointing this out. > Valerie > > > On 03/04/13 18:32, hickey@wehi.EDU.AU wrote: >> Hi Valerie, >> >> readVcf is reordering the fields in the FORMAT field of my VCF file which is causing me problems downstream. I am using VariantAnnotation to subset my VCF file and then write the subset to file using writeVcf. However, the reordering of the FORMAT fields means by new VCF does not comply with VCF spec, specifically the GT-field is not the first sub-field of the FORMAT field. A reproducible example follows. >> >> Thanks, >> Pete >> >> > sessionInfo() >> R version 2.15.2 Patched (2013-02-10 r61900) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7 >> [5] IRanges_1.16.6 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0 bitops_1.0-4.2 >> [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2 parallel_2.15.2 >> [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2 stats4_2.15.2 >> [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0 >> >> #### Bug report: readVcf() reordering FORMAT column of VCF file in violation of VCF spec #### >> library(VariantAnnotation) >> ## Download my example VCF >> download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf .gz', destfile = '/tmp/readVcf_problem.vcf.gz') >> ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL >> system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac >> ## Read in the VCF using readVcf() >> problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9') >> ## Note the order of geno(), which stores each field in the FORMAT column as a list element, is "AD DP GQ GT PL" >> geno(problem_vcf) >> ## Write the file back to disk using writeVcf() >> writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE) >> ## Cat the newly written version of the VCF: Note that the FORMAT field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says of the FORMAT field: "The first sub-field must always be the genotype (GT) if it is present" (see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf- variant-call-format-version-41) >> system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter this command depending on system; works on my linux machine but need 'gzcat' on my mac >> >> ## No such problem with the example VCF >> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") >> good_vcf <- readVcf(fl, genome = 'hg19') >> ## Sub-fields of FORMAT field are correctly ordered >> geno(good_vcf) >> >> -------------------------------- >> Peter Hickey, >> PhD Student/Research Assistant, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Ph: +613 9345 2324 >> >> hickey@wehi.edu.au <mailto:hickey@wehi.edu.au> >> http://www.wehi.edu.au >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ > -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}
ADD REPLY

Login before adding your answer.

Traffic: 360 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6