Question

readVCFAsVRanges in VariantAnnotation broken by an FTZ format?

0

Entering edit mode

Andy Lynch ▴ 120

@andy-lynch-6934

Last seen 15 months ago

United Kingdom

I've noticed that the inclusion of a "##FORMAT=<ID=FTZ,Nu..." line in a vcf file seems to break the readVCFAsVRanges() function (although not the readVCF() function). CaVEMan, for example, includes such a line in its vcf files. Replacing all instances of FTZ with, e.g., FUZ in the vcf file avoids the problem, but is not a desirable solution.

This can be illustrated with two toy vcf files, the only difference of which is the FTZ/FUZ switch. [Variant location and nature have been changed to ensure that no patient data are presented.]

FTZtest.vcf:

##fileformat=VCFv4.1
##fileDate=20160629
##reference=/datastore/reference_files/genome.fa
##vcfProcessLog=<InputVCF=<.>,InputVCFSource=<CaVEMan>,InputVCFParam=<NORMAL_CONTAMINATION=0.1,PRIOR_MUT_RATE=6e-06,REF_BIAS=0.95,SNP_CUTOFF=0.95,MUT_CUTOFF=0.8,PRIOR_SNP_RATE=0.0001>>
##cavemanVersion=1.9.2
##contig=<ID=1,assembly=GRCh37,length=249250621,species=HUMAN>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=FTZ,Number=1,Type=Integer,Description="Reads presenting a T for this position, forward strand">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NORMAL    TUMOUR
1    16074    ID    C    A    .    PASS    DP=61    GT:FTZ    0|0:20    0|1:15

FUZtest.vcf:

##fileformat=VCFv4.1
##fileDate=20160629
##reference=/datastore/reference_files/genome.fa
##vcfProcessLog=<InputVCF=<.>,InputVCFSource=<CaVEMan>,InputVCFParam=<NORMAL_CONTAMINATION=0.1,PRIOR_MUT_RATE=6e-06,REF_BIAS=0.95,SNP_CUTOFF=0.95,MUT_CUTOFF=0.8,PRIOR_SNP_RATE=0.0001>>
##cavemanVersion=1.9.2
##contig=<ID=1,assembly=GRCh37,length=249250621,species=HUMAN>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=FUZ,Number=1,Type=Integer,Description="Reads presenting a T for this position, forward strand">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NORMAL    TUMOUR
1    16074    ID    C    A    .    PASS    DP=61    GT:FUZ    0|0:20    0|1:15

Running the command readVcfAsVRanges("FUZtest.vcf","hg19") works as expected, but readVcfAsVRanges("FTZtest.vcf","hg19") returns an error.

> vcf1<-readVcfAsVRanges("FTZtest.vcf","hg19")
Error in strsplit(x, ";", fixed = TRUE) : non-character argument
> traceback()
10: strsplit(x, ";", fixed = TRUE)
9: parseFilterStrings(as.vector(geno(from)$FT))
8: eval(expr, envir, enclos)
7: eval(quote(list(...)), env)
6: eval(quote(list(...)), env)
5: standardGeneric("cbind")
4: cbind(filter, parseFilterStrings(as.vector(geno(from)$FT)))
3: asMethod(object)
2: as(readVcf(x, genome, param = param, row.names = use.names, ...),
"VRanges")
1: readVcfAsVRanges("FTZtest.vcf",
"hg19")

Anyway. I can't work out why this should be a problem, so thought I would share. Apologies if I have missed previous references to this.

Andy Lynch (CRUK Cambridge Institute)

> sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.4 LTS

locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 [6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] VariantAnnotation_1.18.5 BiocInstaller_1.22.3 QDNAseq_1.8.0 mclust_5.2 knitr_1.13 [6] GenomicAlignments_1.8.4 Rsamtools_1.24.0 Biostrings_2.40.2 XVector_0.12.0 SummarizedExperiment_1.2.3 [11] Biobase_2.32.0 GenomicRanges_1.24.2 GenomeInfoDb_1.8.1 IRanges_2.6.1 S4Vectors_0.10.2 [16] BiocGenerics_0.18.0

loaded via a namespace (and not attached): [1] formatR_1.4 highr_0.6 R.methodsS3_1.7.1 GenomicFeatures_1.24.3 bitops_1.0-6 R.utils_2.3.0 [7] tools_3.3.1 zlibbioc_1.18.0 biomaRt_2.28.0 BSgenome_1.40.1 evaluate_0.9 RSQLite_1.0.0 [13] DBI_0.4-1 rtracklayer_1.32.1 stringr_1.0.0 CGHbase_1.32.0 impute_1.46.0 marray_1.50.0 [19] DNAcopy_1.46.0 AnnotationDbi_1.34.4 XML_3.98-1.4 CGHcall_2.34.0 BiocParallel_1.6.2 limma_3.28.14 [25] magrittr_1.5 matrixStats_0.50.2 BiocStyle_2.0.2 stringi_1.1.1 RCurl_1.95-4.8 R.oo_1.20.0

variantannotation readvcfasvranges vcf • 1.6k views

ADD COMMENT • link updated 8.9 years ago by Michael Lawrence ★ 11k • written 8.9 years ago by Andy Lynch ▴ 120

score 1 · Answer 1 · 2016-07-27

Hi Andy,

This looks like a partial matching problem to me. The code explicitly looks for the genotype filter FT field, which isn't present in your data, so it then matches to your FTZ field and falls over when this isn't in the format it's expecting.

It should be possible to prevent this by using the [['FT']] method of accessing list entries rather than $FT

If you'd like to temporarily patch this yourself I think editing lines 252 & 253 of methods-VRanges-class.R should be sufficient.

score 1 · Answer 2 · 2016-07-27

1

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.6 years ago

United States

Should be fixed in 1.18.6.

ADD COMMENT • link 8.9 years ago Michael Lawrence ★ 11k