VariantAnnotation does not escape = within header string quotes
@daniel-cameron-10086

VariantAnnotation does not round-trip custom headers and instead writes incomplete strings which cause subsequent parsing issues.

Reproduction steps: 1) create the following VCF as VariantAnnotationBug_roundtrip_custom_string_fields.vcf

##fileformat=VCFv4.4
##DRAGENVersion=<ID=dragen,Version="SW: 4.5.0-1749-g09b496a7, HW: 07.031.807">
##DRAGENCommandLine=<ID=dragen,Date="Thu Oct 30 23:28:48 UTC 2025",CommandLineOptions="--output-directory=test">
##contig=<ID=chr1,length=248956422>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  test

2) Run the following:

library(VariantAnnotation)
# Read the original, write it back out, then re-read the written copy
vcf <- readVcf("../VariantAnnotationBug_roundtrip_custom_string_fields.vcf")
writeVcf(vcf, "temp.vcf")
readVcf("temp.vcf")
sessionInfo()

The offending line in temp.vcf is turned into:

##DRAGENCommandLine=<ID=dragen,Date="Thu Oct 30 23:28:48 UTC 2025",CommandLineOptions="--output-directory>

Note how the CommandLineOptions value is truncated at the embedded =, which also drops the closing quote. An = inside a quoted string should not be treated as a special character, and the line should round-trip unchanged.
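To illustrate the expected behaviour, here is a minimal base-R sketch of quote-aware splitting for a structured header line. It is not how VariantAnnotation or htslib parse headers; parse_structured_header is a hypothetical helper written for this example, and it assumes the line has the ##Key=<...> form with double-quoted string values.

```r
# Hypothetical helper (not part of VariantAnnotation): split the body of a
# structured VCF header line into key/value pairs, treating "," and "="
# inside double quotes as ordinary characters.
parse_structured_header <- function(line) {
  # Strip the leading "##Key=<" and the trailing ">"
  body <- sub("^##[^=]+=<(.*)>$", "\\1", line)
  # A field is key=value, where value is either a double-quoted string
  # (backslash escapes allowed) or a run of characters without commas
  pattern <- '([^=,]+)=("(?:[^"\\\\]|\\\\.)*"|[^,]*)'
  fields <- regmatches(body, gregexpr(pattern, body, perl = TRUE))[[1]]
  keys <- sub("=.*$", "", fields)        # text before the first "="
  values <- sub("^[^=]+=", "", fields)   # text after the first "="
  setNames(values, keys)
}

line <- '##DRAGENCommandLine=<ID=dragen,Date="Thu Oct 30 23:28:48 UTC 2025",CommandLineOptions="--output-directory=test">'
parsed <- parse_structured_header(line)
print(parsed[["CommandLineOptions"]])
```

With this kind of splitting, the CommandLineOptions value keeps its embedded = and its closing quote, which is what a lossless round trip would need.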

The R output is:

[W::bcf_hdr_parse_line] Incomplete header line, trying to proceed anyway:
    [##DRAGENCommandLine=<ID=dragen,Date="Thu Oct 30 23:28:48 UTC 2025",CommandLineOptions="--output-directory>
##contig=<ID=chr1,length=248956422>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
]
    [10]
[W::bcf_hdr_parse_line] Incomplete header line, trying to proceed anyway:
    [##DRAGENCommandLine=<ID=dragen,Date="Thu Oct 30 23:28:48 UTC 2025",CommandLineOptions="--output-directory>
##contig=<ID=chr1,length=248956422>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
]
    [10]
class: CollapsedVCF 
dim: 0 0 
rowRanges(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: INFO
  Fields with no header: INFO 
geno(vcf):
  List of length 0: 
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.4                    forcats_1.0.0                      stringr_1.5.1                     
 [4] dplyr_1.1.4                        purrr_1.0.2                        readr_2.1.5                       
 [7] tidyr_1.3.1                        tibble_3.2.1                       ggplot2_3.5.1                     
[10] tidyverse_2.0.0                    StructuralVariantAnnotation_1.22.0 rtracklayer_1.66.0                
[13] VariantAnnotation_1.52.0           Rsamtools_2.22.0                   Biostrings_2.74.1                 
[16] XVector_0.46.0                     SummarizedExperiment_1.36.0        Biobase_2.66.0                    
[19] GenomicRanges_1.58.0               GenomeInfoDb_1.42.3                IRanges_2.40.1                    
[22] S4Vectors_0.44.0                   MatrixGenerics_1.18.1              matrixStats_1.5.0                 
[25] BiocGenerics_0.52.0               

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1         blob_1.2.4               bitops_1.0-9             fastmap_1.2.0            RCurl_1.98-1.16         
 [6] GenomicAlignments_1.42.0 XML_3.99-0.18            timechange_0.3.0         lifecycle_1.0.4          pwalign_1.2.0           
[11] KEGGREST_1.46.0          RSQLite_2.3.9            magrittr_2.0.3           compiler_4.4.2           rlang_1.1.5             
[16] tools_4.4.2              utf8_1.2.4               yaml_2.3.10              S4Arrays_1.6.0           bit_4.5.0.1             
[21] curl_6.2.0               DelayedArray_0.32.0      abind_1.4-8              BiocParallel_1.40.0      withr_3.0.2             
[26] grid_4.4.2               colorspace_2.1-1         scales_1.3.0             cli_3.6.3                crayon_1.5.3            
[31] generics_0.1.3           rstudioapi_0.17.1        httr_1.4.7               tzdb_0.4.0               rjson_0.2.23            
[36] DBI_1.2.3                cachem_1.1.0             zlibbioc_1.52.0          assertthat_0.2.1         parallel_4.4.2          
[41] AnnotationDbi_1.68.0     BiocManager_1.30.25      restfulr_0.0.15          vctrs_0.6.5              Matrix_1.7-2            
[46] jsonlite_1.8.9           hms_1.1.3                bit64_4.6.0-1            GenomicFeatures_1.58.0   glue_1.8.0              
[51] codetools_0.2-20         stringi_1.8.4            gtable_0.3.6             BiocIO_1.16.0            UCSC.utils_1.2.0        
[56] munsell_0.5.1            pillar_1.10.1            GenomeInfoDbData_1.2.13  BSgenome_1.74.0          R6_2.5.1                
[61] vroom_1.6.5              lattice_0.22-6           png_0.1-8                memoise_2.0.1            SparseArray_1.6.1       
[66] pkgconfig_2.0.3
Kevin Blighe
@kevin

Hi,

Thanks for the minimal reproduction - this is very clear and confirms a bug in VariantAnnotation::writeVcf(). The issue is that the writer (likely via the underlying htslib/BCFtools) is incorrectly parsing the = inside the quoted CommandLineOptions value as a new key-value delimiter, rather than treating the whole thing as a single quoted string. This leads to truncation right before the inner =.

I can reproduce this locally with R 4.4.2 and VariantAnnotation_1.52.0 on Windows (same platform as yours). The warnings on read-back are just htslib trying to recover gracefully, but the header is mangled.

This definitely warrants a bug report to the VariantAnnotation maintainers. Since you've already posted to the Bioconductor support site (linked), that's the perfect spot - they monitor it closely and can triage to the devs (e.g., Valerie Obenchain). If it's not getting traction there, you could also open a GitHub issue on the Bioconductor/VariantAnnotation repo with your exact repro steps.

Quick workaround

If you need to round-trip files in the interim, one option is to write the VCF, then manually fix the header line(s) post-hoc with a text editor or sed/awk. For example, assuming the truncated line is always in the same spot:

# After writeVcf(), patch the temp.vcf
sed -i 's|CommandLineOptions="--output-directory>|CommandLineOptions="--output-directory=test">|' temp.vcf

This is hacky, of course, and assumes the truncation pattern is consistent (which it seems to be from your example). If the full CommandLineOptions varies, you might need a more robust script to reconstruct it.
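If the CommandLineOptions value does vary, a slightly more general sketch is to copy the affected meta lines from the original file over the written one, rather than hard-coding the text. The helper below is hypothetical (restore_header_lines is not part of any package) and uses base R only. It assumes both files are uncompressed plain-text VCFs small enough to read into memory, and that each listed key appears the same number of times in both headers.

```r
# Hypothetical helper: overwrite selected "##key=" meta lines in the written
# VCF with the corresponding lines from the original VCF.
restore_header_lines <- function(original, written, keys) {
  orig_lines <- readLines(original)
  out_lines <- readLines(written)
  for (key in keys) {
    prefix <- paste0("##", key, "=")
    src <- orig_lines[startsWith(orig_lines, prefix)]
    idx <- which(startsWith(out_lines, prefix))
    # Patch only when the counts agree, so lines pair up one-to-one
    if (length(idx) > 0 && length(src) == length(idx)) {
      out_lines[idx] <- src
    }
  }
  # Binary mode avoids "\n" being translated to "\r\n" on Windows
  con <- file(written, "wb")
  on.exit(close(con))
  writeLines(out_lines, con)
  invisible(written)
}

# Small self-contained demonstration with temporary files
orig_file <- tempfile(fileext = ".vcf")
out_file <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.4",
             '##DRAGENCommandLine=<ID=dragen,CommandLineOptions="--output-directory=test">',
             "#CHROM"), orig_file)
writeLines(c("##fileformat=VCFv4.4",
             '##DRAGENCommandLine=<ID=dragen,CommandLineOptions="--output-directory>',
             "#CHROM"), out_file)
restore_header_lines(orig_file, out_file, "DRAGENCommandLine")
```

For the files in this thread, the call would take the original VCF path, "temp.vcf", and the key "DRAGENCommandLine". This only repairs the text on disk; reading the repaired file back through readVcf() would presumably hit the same parsing problem again.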

Your sessionInfo() looks clean otherwise - no conflicts there.

Let us know if the Bioc support folks chime in with a fix!

Kevin


I'm just stripping the entire line as a workaround using meta(header(vcf))$DRAGENCommandLine = NULL. Not ideal, but at least subsequent parsers don't choke on the unterminated quote. The output is lossy anyway, since all command-line arguments after the first one are dropped (presumably because they also have = in them).
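To find which header keys are affected before stripping them, one rough check is to look for meta lines with an odd number of double quotes. This is a base-R sketch only: find_unterminated_headers is a hypothetical helper, it reads the whole file into memory, and it does not account for backslash-escaped quotes.

```r
# Hypothetical helper: return the "##" meta lines of a VCF that contain an
# odd number of double quotes, i.e. a quoted string that is never closed.
find_unterminated_headers <- function(path) {
  lines <- readLines(path)
  header <- lines[startsWith(lines, "##")]
  # Count the double quotes on each header line
  n_quotes <- lengths(regmatches(header, gregexpr('"', header, fixed = TRUE)))
  header[n_quotes %% 2 == 1]
}

# Small self-contained demonstration with a temporary file
demo_file <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.4",
             '##A=<ID=x,Opt="--a>',
             '##B=<ID=y,Opt="--b=1">',
             "#CHROM"), demo_file)
bad <- find_unterminated_headers(demo_file)
print(bad)
```

The text between ## and the first = on each returned line is the key that would be set to NULL in meta(header(vcf)).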
