Question: Data from ##SAMPLE lines in VCF?
gravatar for lvclark
14 months ago by
lvclark0 wrote:

I'm experimenting with custom VCF formats.  More specifically, some colleagues and I are working on establishing standards for polyploid data in non-model organisms.

I'd like to make use of the Sample field format as described in the current VCF specification part 1.4.8.  It especially seems useful for long-term data preservation, enabling us to put all sample metadata into the file.

I made the following dummy file to see if I could import data from ##SAMPLE lines.

##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##META=<ID=Species,Type=String,Number=.,Description="Species name">
##META=<ID=Ploidy,Type=String,Number=.,Description="Ploidy with respect to reference genome">
##SAMPLE=<ID=Sam1,Species="Miscanthus sinensis",Ploidy="2x",Description="Sample1">
##SAMPLE=<ID=Sam2,Species="Miscanthus sacchariflorus",Ploidy="4x",Description="Sample2">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sam1    Sam2
Chr01    1000    S01_1000    G    A    100    PASS    DP=20    GT:AD    0/0:10,0    0/1:5,5

readVcf works, but I can't find the sample metadata anywhere in the vcf object.  It seems like colData would be the appropriate place to put it.  If I manually add columns to colData, writeVcf gives an error.

myvcf <- readVcf("testvcf.vcf")
colData(myvcf)$Ploidy <- c("2x", "4x")
writeVcf(myvcf, file = "outvcf.vcf")

Is there something I'm missing?  Or if not, what is the possibility of this being implemented in VariantAnnotation at some point in the future?

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.26.1    Rsamtools_1.32.3            Biostrings_2.48.0          
 [4] XVector_0.20.0              SummarizedExperiment_1.10.1 DelayedArray_0.6.6         
 [7] BiocParallel_1.14.2         matrixStats_0.54.0          Biobase_2.40.0             
[10] GenomicRanges_1.32.7        GenomeInfoDb_1.16.0         IRanges_2.14.12            
[13] S4Vectors_0.18.3            BiocGenerics_0.26.0        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18             compiler_3.5.0           prettyunits_1.0.2        progress_1.2.0          
 [5] GenomicFeatures_1.32.2   bitops_1.0-6             tools_3.5.0              zlibbioc_1.26.0         
 [9] biomaRt_2.36.1           digest_0.6.17            bit_1.1-14               BSgenome_1.48.0         
[13] evaluate_0.11            RSQLite_2.1.1            memoise_1.1.0            lattice_0.20-35         
[17] pkgconfig_2.0.2          rlang_0.2.2              Matrix_1.2-14            rstudioapi_0.7          
[21] DBI_1.0.0                yaml_2.2.0               GenomeInfoDbData_1.1.0   rtracklayer_1.40.6      
[25] httr_1.3.1               stringr_1.3.1            knitr_1.20               hms_0.4.2               
[29] rprojroot_1.3-2          bit64_0.9-7              grid_3.5.0               R6_2.2.2                
[33] AnnotationDbi_1.42.1     XML_3.98-1.16            rmarkdown_1.10           blob_1.1.1              
[37] magrittr_1.5             GenomicAlignments_1.16.0 backports_1.1.2          htmltools_0.3.6         
[41] assertthat_0.2.0         stringi_1.1.7            RCurl_1.95-4.11          crayon_1.3.4 
variantannotation coldata • 452 views
ADD COMMENTlink modified 14 months ago by Valerie Obenchain6.7k • written 14 months ago by lvclark0
Answer: Data from ##SAMPLE lines in VCF?
gravatar for lvclark
14 months ago by
lvclark0 wrote:

Found it:

ADD COMMENTlink written 14 months ago by lvclark0

Although, there is still an issue with writeVcf, and looking at the source code it seems to be that now I have another data frame called META which creates a conflict with the META data frame containing the VCF version number.

ADD REPLYlink written 14 months ago by lvclark0

I can confirm this.  None of our example VCF in VariantAnnotation use the META header record type

( section 1.4.8).  Additionally writeVcf-methods.Rd mentions Bioconductor 3.1 and the contemporaneous standard v4.2 of VCF spec.  It looks like there is some updating to be done in VariantAnnotation for writeVcf to work with your example.  NB -- one can't cut and paste your example VCF because tabs are not preserved.  This threw me for a bit. 

ADD REPLYlink written 14 months ago by Vincent J. Carey, Jr.6.3k

Whoops, sorry about the tabs going away!  And thank you for looking into this.

ADD REPLYlink written 14 months ago by lvclark0

Thanks for the report. The 'META' name clash is unfortunate. I'll work on a solution and should have something in they next couple of days.


ADD REPLYlink written 14 months ago by Valerie Obenchain6.7k
Answer: Data from ##SAMPLE lines in VCF?
gravatar for Valerie Obenchain
14 months ago by
United States
Valerie Obenchain6.7k wrote:

VariantAnnotation 1.27.6 and Rsamtools 1.33.6 have been updated to support VCFv4.3.

I wanted to add some background to explain why we had the "META" DataFame in the first place. The meta-information lines in a VCF header are preceded by '##' and we can think of these key-value pairs as "simple" or "non-simple".

A simple pair would have a single value, e.g.,


A non-simple pair has values enclosed in '<>' with subsections such as 'ID', 'Number', 'Type' etc., e.g.,

##SAMPLE=<ID=Sample1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germ>

Previously the simple pairs went into a DataFrame called "META" and the non-simple went into separate DataFrames, one per unique key. Now that meta-information lines can have a key called "META" we can end up with 2 DataFrames with the same name:

> hdr
class: VCFHeader
meta(4): META Tassel META SAMPLE
info(3): NS DP AF
geno(5): GT AD DP GQ PL

Naming the DataFrame "META" was (my) poor choice from the start since it doesn't represent all meta-information lines, just the simple ones. We could re-name this but we'll always run the risk that a new header line will have the same key name and we'll be back in the same situation.

I've decided to split the "META" DataFrame, making each row a separate DataFrame. This seems like the cleanest solution in that `meta()` now returns a DataFrame for each unique key in the header.

> hdr
class: VCFHeader
meta(4): fileformat Tassel META SAMPLE
info(3): NS DP AF
geno(5): GT AD DP GQ PL

Unfortunately this is not a completely backwards compatible solution. The package functions support both v4.2 or v4.3 but if users have code that extracts 'fileformat', 'fileDate' etc. from the old "META" DataFrame they will need to update their code with VariantAnnotation >= 1.27.6.

Because next release is less than 4 weeks away, I'm inclined to not port this to the current release (BioC 3.7) unless there is a strong need.

ADD COMMENTlink written 14 months ago by Valerie Obenchain6.7k

I'm perfectly happy waiting a month for this (or installing the development version in the meantime).  The project that I'm working on in which I encountered this issue is very much in the early stages.

Thank you so much for fixing this so quickly!

ADD REPLYlink written 14 months ago by lvclark0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 215 users visited in the last hour