Search
Question: Data from ##SAMPLE lines in VCF?
0
11 weeks ago by
lvclark0
lvclark0 wrote:

I'm experimenting with custom VCF formats.  More specifically, some colleagues and I are working on establishing standards for polyploid data in non-model organisms.

I'd like to make use of the Sample field format as described in the current VCF specification part 1.4.8.  It especially seems useful for long-term data preservation, enabling us to put all sample metadata into the file.

I made the following dummy file to see if I could import data from ##SAMPLE lines.

##fileformat=VCFv4.3 ##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality"> ##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"> ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency"> ##META=<ID=Species,Type=String,Number=.,Description="Species name"> ##META=<ID=Ploidy,Type=String,Number=.,Description="Ploidy with respect to reference genome"> ##SAMPLE=<ID=Sam1,Species="Miscanthus sinensis",Ploidy="2x",Description="Sample1"> ##SAMPLE=<ID=Sam2,Species="Miscanthus sacchariflorus",Ploidy="4x",Description="Sample2"> #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sam1    Sam2 Chr01    1000    S01_1000    G    A    100    PASS    DP=20    GT:AD    0/0:10,0    0/1:5,5

readVcf works, but I can't find the sample metadata anywhere in the vcf object.  It seems like colData would be the appropriate place to put it.  If I manually add columns to colData, writeVcf gives an error.

myvcf <- readVcf("testvcf.vcf")
colData(myvcf)
colData(myvcf)$Ploidy <- c("2x", "4x") writeVcf(myvcf, file = "outvcf.vcf") Is there something I'm missing? Or if not, what is the possibility of this being implemented in VariantAnnotation at some point in the future? > sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] VariantAnnotation_1.26.1 Rsamtools_1.32.3 Biostrings_2.48.0 [4] XVector_0.20.0 SummarizedExperiment_1.10.1 DelayedArray_0.6.6 [7] BiocParallel_1.14.2 matrixStats_0.54.0 Biobase_2.40.0 [10] GenomicRanges_1.32.7 GenomeInfoDb_1.16.0 IRanges_2.14.12 [13] S4Vectors_0.18.3 BiocGenerics_0.26.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.18 compiler_3.5.0 prettyunits_1.0.2 progress_1.2.0 [5] GenomicFeatures_1.32.2 bitops_1.0-6 tools_3.5.0 zlibbioc_1.26.0 [9] biomaRt_2.36.1 digest_0.6.17 bit_1.1-14 BSgenome_1.48.0 [13] evaluate_0.11 RSQLite_2.1.1 memoise_1.1.0 lattice_0.20-35 [17] pkgconfig_2.0.2 rlang_0.2.2 Matrix_1.2-14 rstudioapi_0.7 [21] DBI_1.0.0 yaml_2.2.0 GenomeInfoDbData_1.1.0 rtracklayer_1.40.6 [25] httr_1.3.1 stringr_1.3.1 knitr_1.20 hms_0.4.2 [29] rprojroot_1.3-2 bit64_0.9-7 grid_3.5.0 R6_2.2.2 [33] AnnotationDbi_1.42.1 XML_3.98-1.16 rmarkdown_1.10 blob_1.1.1 [37] magrittr_1.5 GenomicAlignments_1.16.0 backports_1.1.2 htmltools_0.3.6 [41] assertthat_0.2.0 stringi_1.1.7 RCurl_1.95-4.11 crayon_1.3.4  ADD COMMENTlink modified 10 weeks ago by Valerie Obenchain ♦♦ 6.7k • written 11 weeks ago by lvclark0 0 11 weeks ago by lvclark0 lvclark0 wrote: Found it: meta(header(myvcf))$SAMPLE

Although, there is still an issue with writeVcf, and looking at the source code it seems to be that now I have another data frame called META which creates a conflict with the META data frame containing the VCF version number.

I can confirm this.  None of our example VCF in VariantAnnotation use the META header record type

(https://samtools.github.io/hts-specs/VCFv4.3.pdf section 1.4.8).  Additionally writeVcf-methods.Rd mentions Bioconductor 3.1 and the contemporaneous standard v4.2 of VCF spec.  It looks like there is some updating to be done in VariantAnnotation for writeVcf to work with your example.  NB -- one can't cut and paste your example VCF because tabs are not preserved.  This threw me for a bit.

Whoops, sorry about the tabs going away!  And thank you for looking into this.

Thanks for the report. The 'META' name clash is unfortunate. I'll work on a solution and should have something in they next couple of days.

Valerie

0
10 weeks ago by
Valerie Obenchain ♦♦ 6.7k
United States
Valerie Obenchain ♦♦ 6.7k wrote:

VariantAnnotation 1.27.6 and Rsamtools 1.33.6 have been updated to support VCFv4.3.

I wanted to add some background to explain why we had the "META" DataFame in the first place. The meta-information lines in a VCF header are preceded by '##' and we can think of these key-value pairs as "simple" or "non-simple".

A simple pair would have a single value, e.g.,

##fileformat=VCFv4.3

A non-simple pair has values enclosed in '<>' with subsections such as 'ID', 'Number', 'Type' etc., e.g.,

##SAMPLE=<ID=Sample1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germ>

Previously the simple pairs went into a DataFrame called "META" and the non-simple went into separate DataFrames, one per unique key. Now that meta-information lines can have a key called "META" we can end up with 2 DataFrames with the same name:

> hdr
samples(0):
meta(4): META Tassel META SAMPLE
fixed(0):
info(3): NS DP AF
geno(5): GT AD DP GQ PL

Naming the DataFrame "META" was (my) poor choice from the start since it doesn't represent all meta-information lines, just the simple ones. We could re-name this but we'll always run the risk that a new header line will have the same key name and we'll be back in the same situation.

I've decided to split the "META" DataFrame, making each row a separate DataFrame. This seems like the cleanest solution in that meta() now returns a DataFrame for each unique key in the header.

> hdr
samples(0):
meta(4): fileformat Tassel META SAMPLE
fixed(0):
info(3): NS DP AF
geno(5): GT AD DP GQ PL

Unfortunately this is not a completely backwards compatible solution. The package functions support both v4.2 or v4.3 but if users have code that extracts 'fileformat', 'fileDate' etc. from the old "META" DataFrame they will need to update their code with VariantAnnotation >= 1.27.6.

Because next release is less than 4 weeks away, I'm inclined to not port this to the current release (BioC 3.7) unless there is a strong need.

I'm perfectly happy waiting a month for this (or installing the development version in the meantime).  The project that I'm working on in which I encountered this issue is very much in the early stages.

Thank you so much for fixing this so quickly!