I'm experimenting with custom VCF formats. More specifically, some colleagues and I are working on establishing standards for polyploid data in non-model organisms.
I'd like to make use of the Sample field format as described in the current VCF specification part 1.4.8. It especially seems useful for long-term data preservation, enabling us to put all sample metadata into the file.
I made the following dummy file to see if I could import data from ##SAMPLE lines.
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##META=<ID=Ploidy,Type=String,Number=.,Description="Ploidy with respect to reference genome">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sam1 Sam2
Chr01 1000 S01_1000 G A 100 PASS DP=20 GT:AD 0/0:10,0 0/1:5,5
readVcf works, but I can't find the sample metadata anywhere in the vcf object. It seems like colData would be the appropriate place to put it. If I manually add columns to colData, writeVcf gives an error.
myvcf <- readVcf("testvcf.vcf") colData(myvcf) colData(myvcf)$Ploidy <- c("2x", "4x") writeVcf(myvcf, file = "outvcf.vcf")
Is there something I'm missing? Or if not, what is the possibility of this being implemented in VariantAnnotation at some point in the future?
> sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Matrix products: default locale:  LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252  LC_MONETARY=English_United States.1252 LC_NUMERIC=C  LC_TIME=English_United States.1252 attached base packages:  stats4 parallel stats graphics grDevices utils datasets methods base other attached packages:  VariantAnnotation_1.26.1 Rsamtools_1.32.3 Biostrings_2.48.0  XVector_0.20.0 SummarizedExperiment_1.10.1 DelayedArray_0.6.6  BiocParallel_1.14.2 matrixStats_0.54.0 Biobase_2.40.0  GenomicRanges_1.32.7 GenomeInfoDb_1.16.0 IRanges_2.14.12  S4Vectors_0.18.3 BiocGenerics_0.26.0 loaded via a namespace (and not attached):  Rcpp_0.12.18 compiler_3.5.0 prettyunits_1.0.2 progress_1.2.0  GenomicFeatures_1.32.2 bitops_1.0-6 tools_3.5.0 zlibbioc_1.26.0  biomaRt_2.36.1 digest_0.6.17 bit_1.1-14 BSgenome_1.48.0  evaluate_0.11 RSQLite_2.1.1 memoise_1.1.0 lattice_0.20-35  pkgconfig_2.0.2 rlang_0.2.2 Matrix_1.2-14 rstudioapi_0.7  DBI_1.0.0 yaml_2.2.0 GenomeInfoDbData_1.1.0 rtracklayer_1.40.6  httr_1.3.1 stringr_1.3.1 knitr_1.20 hms_0.4.2  rprojroot_1.3-2 bit64_0.9-7 grid_3.5.0 R6_2.2.2  AnnotationDbi_1.42.1 XML_3.98-1.16 rmarkdown_1.10 blob_1.1.1  magrittr_1.5 GenomicAlignments_1.16.0 backports_1.1.2 htmltools_0.3.6  assertthat_0.2.0 stringi_1.1.7 RCurl_1.95-4.11 crayon_1.3.4