Question

Data from ##SAMPLE lines in VCF?

0

Entering edit mode

lvclark ▴ 10

@lvclark-17516

Last seen 3.8 years ago

I'm experimenting with custom VCF formats. More specifically, some colleagues and I are working on establishing standards for polyploid data in non-model organisms.

I'd like to make use of the Sample field format as described in the current VCF specification part 1.4.8. It especially seems useful for long-term data preservation, enabling us to put all sample metadata into the file.

I made the following dummy file to see if I could import data from ##SAMPLE lines.

##fileformat=VCFv4.3 ##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality"> ##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"> ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency"> ##META=<ID=Species,Type=String,Number=.,Description="Species name"> ##META=<ID=Ploidy,Type=String,Number=.,Description="Ploidy with respect to reference genome"> ##SAMPLE=<ID=Sam1,Species="Miscanthus sinensis",Ploidy="2x",Description="Sample1"> ##SAMPLE=<ID=Sam2,Species="Miscanthus sacchariflorus",Ploidy="4x",Description="Sample2"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sam1 Sam2 Chr01 1000 S01_1000 G A 100 PASS DP=20 GT:AD 0/0:10,0 0/1:5,5

readVcf works, but I can't find the sample metadata anywhere in the vcf object. It seems like colData would be the appropriate place to put it. If I manually add columns to colData, writeVcf gives an error.

myvcf <- readVcf("testvcf.vcf")
colData(myvcf)
colData(myvcf)$Ploidy <- c("2x", "4x")
writeVcf(myvcf, file = "outvcf.vcf")

Is there something I'm missing? Or if not, what is the possibility of this being implemented in VariantAnnotation at some point in the future?

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.26.1    Rsamtools_1.32.3            Biostrings_2.48.0          
 [4] XVector_0.20.0              SummarizedExperiment_1.10.1 DelayedArray_0.6.6         
 [7] BiocParallel_1.14.2         matrixStats_0.54.0          Biobase_2.40.0             
[10] GenomicRanges_1.32.7        GenomeInfoDb_1.16.0         IRanges_2.14.12            
[13] S4Vectors_0.18.3            BiocGenerics_0.26.0        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18             compiler_3.5.0           prettyunits_1.0.2        progress_1.2.0          
 [5] GenomicFeatures_1.32.2   bitops_1.0-6             tools_3.5.0              zlibbioc_1.26.0         
 [9] biomaRt_2.36.1           digest_0.6.17            bit_1.1-14               BSgenome_1.48.0         
[13] evaluate_0.11            RSQLite_2.1.1            memoise_1.1.0            lattice_0.20-35         
[17] pkgconfig_2.0.2          rlang_0.2.2              Matrix_1.2-14            rstudioapi_0.7          
[21] DBI_1.0.0                yaml_2.2.0               GenomeInfoDbData_1.1.0   rtracklayer_1.40.6      
[25] httr_1.3.1               stringr_1.3.1            knitr_1.20               hms_0.4.2               
[29] rprojroot_1.3-2          bit64_0.9-7              grid_3.5.0               R6_2.2.2                
[33] AnnotationDbi_1.42.1     XML_3.98-1.16            rmarkdown_1.10           blob_1.1.1              
[37] magrittr_1.5             GenomicAlignments_1.16.0 backports_1.1.2          htmltools_0.3.6         
[41] assertthat_0.2.0         stringi_1.1.7            RCurl_1.95-4.11          crayon_1.3.4

variantannotation coldata • 3.3k views

ADD COMMENT • link updated 5.8 years ago by Valerie Obenchain ★ 6.8k • written 5.8 years ago by lvclark ▴ 10

score 1 · Answer 1 · 2018-09-25

1

Entering edit mode

lvclark ▴ 10

@lvclark-17516

Last seen 3.8 years ago

Found it:

meta(header(myvcf))$SAMPLE

ADD COMMENT • link 5.8 years ago lvclark ▴ 10

0

Entering edit mode

Although, there is still an issue with writeVcf, and looking at the source code it seems to be that now I have another data frame called META which creates a conflict with the META data frame containing the VCF version number.

ADD REPLY • link 5.8 years ago lvclark ▴ 10

0

Entering edit mode

I can confirm this. None of our example VCF in VariantAnnotation use the META header record type

(https://samtools.github.io/hts-specs/VCFv4.3.pdf section 1.4.8). Additionally writeVcf-methods.Rd mentions Bioconductor 3.1 and the contemporaneous standard v4.2 of VCF spec. It looks like there is some updating to be done in VariantAnnotation for writeVcf to work with your example. NB -- one can't cut and paste your example VCF because tabs are not preserved. This threw me for a bit.

ADD REPLY • link 5.8 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Whoops, sorry about the tabs going away! And thank you for looking into this.

ADD REPLY • link 5.8 years ago lvclark ▴ 10

0

Entering edit mode

Thanks for the report. The 'META' name clash is unfortunate. I'll work on a solution and should have something in they next couple of days.

Valerie

ADD REPLY • link 5.8 years ago Valerie Obenchain ★ 6.8k

score 0 · Answer 2 · 2018-10-02

VariantAnnotation 1.27.6 and Rsamtools 1.33.6 have been updated to support VCFv4.3.

I wanted to add some background to explain why we had the "META" DataFame in the first place. The meta-information lines in a VCF header are preceded by '##' and we can think of these key-value pairs as "simple" or "non-simple".

A simple pair would have a single value, e.g.,

##fileformat=VCFv4.3

A non-simple pair has values enclosed in '<>' with subsections such as 'ID', 'Number', 'Type' etc., e.g.,

##SAMPLE=<ID=Sample1,Assay=WholeGenome,Ethnicity=AFR,Disease=None,Description="Patient germ>

Previously the simple pairs went into a DataFrame called "META" and the non-simple went into separate DataFrames, one per unique key. Now that meta-information lines can have a key called "META" we can end up with 2 DataFrames with the same name:

> hdr
class: VCFHeader
samples(0):
meta(4): META Tassel META SAMPLE
fixed(0):
info(3): NS DP AF
geno(5): GT AD DP GQ PL

Naming the DataFrame "META" was (my) poor choice from the start since it doesn't represent all meta-information lines, just the simple ones. We could re-name this but we'll always run the risk that a new header line will have the same key name and we'll be back in the same situation.

I've decided to split the "META" DataFrame, making each row a separate DataFrame. This seems like the cleanest solution in that `meta()` now returns a DataFrame for each unique key in the header.

> hdr
class: VCFHeader
samples(0):
meta(4): fileformat Tassel META SAMPLE
fixed(0):
info(3): NS DP AF
geno(5): GT AD DP GQ PL

Unfortunately this is not a completely backwards compatible solution. The package functions support both v4.2 or v4.3 but if users have code that extracts 'fileformat', 'fileDate' etc. from the old "META" DataFrame they will need to update their code with VariantAnnotation >= 1.27.6.

Because next release is less than 4 weeks away, I'm inclined to not port this to the current release (BioC 3.7) unless there is a strong need.