IGV is failing to read genotypeless VCFs written by writeVCF() since a tab is written after the INFO field whether it is followed by a genotype or not. The VCF header is correctly written without a trailing tab and without a FORMAT header, but each record incorrectly includes a trailing tab.
Additionally, the ##fileformat header line is required by the VCF specifications and is not written if the input file does not include it.
--- Script to reproduce ---
library(VariantAnnotation) f <- file("test.vcf") writeLines(c( "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO", "chr1\t1\tvariant1\tA\t[chr1:1[a\t1\t.\tATTR=value", "chr2\t2\tvariant2\tA\t[chr2:2[a\t2\t.\tATTR=value" ), f) close(f) vcf <- readVcf("test.vcf", "hg19") writeVcf(vcf, "output-should-not-contain-trailing-tabs.vcf")
--- sessionInfo() ---
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0 RColorBrewer_1.1-2 scales_0.2.5 ggplot2_1.0.1 VennDiagram_1.6.9 openxlsx_3.0.0 plyr_1.8.3 VariantAnnotation_1.14.6
[9] Rsamtools_1.20.4 Biostrings_2.36.1 XVector_0.8.0 rtracklayer_1.28.6 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1 IRanges_2.2.5 S4Vectors_0.6.2
[17] BiocGenerics_0.14.0 BiocInstaller_1.18.4
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 futile.logger_1.4.1 GenomicFeatures_1.20.1 bitops_1.0-6 futile.options_1.0.0 tools_3.2.1 zlibbioc_1.14.0 biomaRt_2.24.0
[9] digest_0.6.8 RSQLite_1.0.0 gtable_0.1.2 BSgenome_1.36.2 DBI_0.3.1 proto_0.3-10 Biobase_2.28.0 AnnotationDbi_1.30.1
[17] XML_3.98-1.3 BiocParallel_1.2.11 reshape2_1.4.1 lambda.r_1.1.7 magrittr_1.5 MASS_7.3-43 GenomicAlignments_1.4.1 colorspace_1.2-6
[25] stringi_0.5-5 munsell_0.4.2 RCurl_1.95-4.7
We're writing the file according to some version of the format, right? It's required by the VCF spec to have a fileformat directive, so we should output one.
I have not tracked format changes by version and enforced them in writeVcf(). Maybe you've added some of this. Are you thinking of supporting version compliance for all of 4.0, 4.1 and 4.2 or just the most current?
I guess we are implicitly supporting a version, I'm just not sure which. It might be too much work to add strict validation (especially since there is no effort to document the differences between the versions), but we should at least write out the declaration, and simply say that user data must conform to version X, which would depend on the Bioc release. Most of the time, the changes are internal to writeVcf, and won't affect the user. Luckily, the spec seems fairly mature; the gap between 4.1 and the minor update 4.2 was 2 years.
Btw, I found on twitter that the major difference between 4.1 and 4.2 is the support for "R" to indicate that a field has a value for every allele, including ref. This could be used for the AD field.
I agree, supporting the current VCF version for the current Bioc release seems reasonable. I'll see what else I can find that defines 4.2 and add this up when I add the other changes.
These changes were made to VariantAnnotation 1.15.24 and 1.14.9:
The new versions should be available via biocLite() by tomorrow noon.
Valerie
DP is the total depth, why would it be per-allele? Do you mean AD?
Yes, sorry, AD not DP.