writeVCF outputs invalid VCF if no genotypes
1
0
Entering edit mode
@daniel-cameron-6227
Last seen 8.7 years ago
Australia

IGV is failing to read genotypeless VCFs written by writeVCF() since a tab is written after the INFO field whether it is followed by a genotype or not. The VCF header is correctly written without a trailing tab and without a FORMAT header, but each record incorrectly includes a trailing tab.

Additionally, the ##fileformat header line is required by the VCF specifications and is not written if the input file does not include it.

--- Script to reproduce ---

library(VariantAnnotation)
f <- file("test.vcf")
writeLines(c(
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "chr1\t1\tvariant1\tA\t[chr1:1[a\t1\t.\tATTR=value",
  "chr2\t2\tvariant2\tA\t[chr2:2[a\t2\t.\tATTR=value"
), f)
close(f)
vcf <- readVcf("test.vcf", "hg19")
writeVcf(vcf, "output-should-not-contain-trailing-tabs.vcf")

 

--- sessionInfo() ---

R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stringr_1.0.0            RColorBrewer_1.1-2       scales_0.2.5             ggplot2_1.0.1            VennDiagram_1.6.9        openxlsx_3.0.0           plyr_1.8.3               VariantAnnotation_1.14.6
 [9] Rsamtools_1.20.4         Biostrings_2.36.1        XVector_0.8.0            rtracklayer_1.28.6       GenomicRanges_1.20.5     GenomeInfoDb_1.4.1       IRanges_2.2.5            S4Vectors_0.6.2         
[17] BiocGenerics_0.14.0      BiocInstaller_1.18.4    

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6             futile.logger_1.4.1     GenomicFeatures_1.20.1  bitops_1.0-6            futile.options_1.0.0    tools_3.2.1             zlibbioc_1.14.0         biomaRt_2.24.0         
 [9] digest_0.6.8            RSQLite_1.0.0           gtable_0.1.2            BSgenome_1.36.2         DBI_0.3.1               proto_0.3-10            Biobase_2.28.0          AnnotationDbi_1.30.1   
[17] XML_3.98-1.3            BiocParallel_1.2.11     reshape2_1.4.1          lambda.r_1.1.7          magrittr_1.5            MASS_7.3-43             GenomicAlignments_1.4.1 colorspace_1.2-6       
[25] stringi_0.5-5           munsell_0.4.2           RCurl_1.95-4.7         

 

variantannotation • 1.4k views
ADD COMMENT
1
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States

Hi Daniel,

Thanks for the report about the extra tabs.  I don't think it should be up to writeVcf() to determine the fileformat. We could throw an error if no fileformat is in the input or add a 'fileformat' arg to writeVcf() that would let the user specify or use a reasonable default.

In addition to the issues you raise there are several things on the TODO for readVcf() / writeVcf() for files with incomplete headers. I plan to work on this next week and will post back here when it's done.

Valerie

ADD COMMENT
0
Entering edit mode

We're writing the file according to some version of the format, right? It's required by the VCF spec to have a fileformat directive, so we should output one.

ADD REPLY
0
Entering edit mode

I have not tracked format changes by version and enforced them in writeVcf(). Maybe you've added some of this. Are you thinking of supporting version compliance for all of 4.0, 4.1 and 4.2 or just the most current? 

ADD REPLY
0
Entering edit mode

I guess we are implicitly supporting a version, I'm just not sure which. It might be too much work to add strict validation (especially since there is no effort to document the differences between the versions), but we should at least write out the declaration, and simply say that user data must conform to version X, which would depend on the Bioc release. Most of the time, the changes are internal to writeVcf, and won't affect the user. Luckily, the spec seems fairly mature; the gap between 4.1 and the minor update 4.2 was 2 years.

Btw, I found on twitter that the major difference between 4.1 and 4.2 is the support for "R" to indicate that a field has a value for every allele, including ref. This could be used for the AD field. 

ADD REPLY
0
Entering edit mode

I agree, supporting the current VCF version for the current Bioc release seems reasonable. I'll see what else I can find that defines 4.2 and add this up when I add the other changes.

ADD REPLY
0
Entering edit mode

These changes were made to VariantAnnotation 1.15.24 and 1.14.9:

  • if not present, writeVcf() writes 'fileformat' header line as VCFv4.2
  • if 'fileformat' is VCFv4.2 and there is a 'DP' INFO field, 'Number' is written out as 'R'
  • remove extra tab after INFO field in case of no FORMAT data

The new versions should be available via biocLite() by tomorrow noon.

Valerie

ADD REPLY
0
Entering edit mode

DP is the total depth, why would it be per-allele? Do you mean AD?

ADD REPLY
0
Entering edit mode

Yes, sorry, AD not DP.

 

ADD REPLY

Login before adding your answer.

Traffic: 989 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6