Question: about filtering variants in a VCF file
1
2.5 years ago by
Bogdan550
Palo Alto, CA, USA
Bogdan550 wrote:

Dear Martin, please when you have a minute, may I ask you for a suggestion : shall I have a VCF file as it is described below, please could you advise on a way to select only the variants with AD > 5, and AF > 0.05 ? thanks you very much ;)

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    TUMOR    NORMAL
chr1    146139995    0    T    C    0    PASS    BaseCounts=1,5,0,38;ECNT=1;FS=4.639;GC=52.48;HCNT=2;HRun=0;LowMQ=0.0000,0.0682,44;MAX_ED=.;MIN_ED=.;NLOD=7.78;SOR=1.916;TLOD=7.51;VariantType=SNP;ANNOVAR_DATE=2016-02-01;Func.refGene=exonic;Gene.refGene=NBPF10;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;AAChange.refGene=NBPF10:NM_001039703:exon5:c.A592G:p.K198E,NBPF10:NM_001302371:exon5:c.A592G:p.K198E;cosmic70=.;SIFT_score=.;SIFT_pred=.;Polyphen2_HDIV_score=.;Polyphen2_HDIV_pred=.;Polyphen2_HVAR_score=.;Polyphen2_HVAR_pred=.;LRT_score=.;LRT_pred=.;MutationTaster_score=.;MutationTaster_pred=.;MutationAssessor_score=.;MutationAssessor_pred=.;FATHMM_score=4.51;FATHMM_pred=T;PROVEAN_score=.;PROVEAN_pred=.;VEST3_score=0.26;CADD_raw=0.353;CADD_phred=6.192;DANN_score=0.130;fathmm-MKL_coding_score=0.000;fathmm-MKL_coding_pred=N;MetaSVM_score=-0.961;MetaSVM_pred=T;MetaLR_score=0.008;MetaLR_pred=T;integrated_fitCons_score=0.693;integrated_confidence_value=0;GERP�_RS=-0.47;phyloP7way_vertebrate=-1.190;phyloP20way_mammalian=-1.003;phastCons7way_vertebrate=0.002;phastCons20way_mammalian=0.004;SiPhy_29way_logOdds=3.32;dbscSNV_ADA_SCORE=.;dbscSNV_RF_SCORE=.;HRC_AF=.;HRC_AC=.;HRC_AN=.;HRC_non1000G_AF=.;HRC_non1000G_AC=.;HRC_non1000G_AN=.;esp6500siv2_ea=.;esp6500siv2_aa=.;esp6500siv2_all=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;Kaviar_AF=.;Kaviar_AC=.;Kaviar_AN=.;avsnp144=.;nci60=.;clinvar_20150629=.;ALLELE_END    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1    0/1:31,5:0.139:2:3:0.400:904,143:18:13    0/0:26,0:0.00:0:0:.:759,0:10:16
filtervcf • 1.1k views
modified 2.5 years ago by Martin Morgan ♦♦ 23k • written 2.5 years ago by Bogdan550

Do you want to create another VCF file with only variants satisfying a particular criterion, or  would you like to read in a portion of the VCF file for further computation in R? Also, from your  sample line, AD is a vector of length 2 (one for each genotype) and is available for each sample (in contrast, AF characterizes each variant) so your criterion on AD needs to be specified more carefully. It would be informative to include the output of scanVcfHeader("path/to/file.vcf") in your QUESTION.

Dear Martin, thank you very much. I would like to make another VCF file with the filtered variants indeed.

And thank you for mentioning about AD field, as I was going to ask why the output of the command -- head(geno(vcf)$AD) -- looks not too specific (please see below). > head(geno(vcf)$AD)
NORMAL    TUMOR
chr1:108044_C/G   Integer,2 Integer,2
chr1:123100_A/ATG Integer,2 Integer,2
chr1:187017_G/C   Integer,2 Integer,2
chr1:205931_A/G   Integer,2 Integer,2
chr1:205932_A/G   Integer,2 Integer,2
chr1:262878_T/C   Integer,2 Integer,2

The output for scanVcfHeader(x.vcf) is the following (below). I have used GATK and MUTECT2 in order to produce the vcf file. Many thanks !

samples(2): NORMAL TUMOR
meta(2): META contig
fixed(1): FILTER
info(23): AC AF ... TLOD VariantType
geno(14): GT AD ... REF_F1R2 REF_F2R1

So how are you going to select variants based on AD -- above a threshold, for all alleles and for all samples? For the minor allele in both NORMAL and TUMOR ? For the minor allele in TUMOR ? ... ?

Dear Martin, thank you for your question. I aim to select the somatic variants based on AD > 5 and AF > 0.05 of the mutated allele (I believe we can call the mutated allele as minor allele) only in the TUMOR sample.

4
2.5 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

For a VCF object vcf, you'd like to select the rows with info(vcf)$AF > 0.05, which is a logical vector the same length as the number of rows in vcf. AD is harder, but you'd like to select the appropriate column geno(vcf)$AD[,"TUMOR"]. This is a list(), one element for each row of the VCF, with the element equal to the AD of the corresponding genotype. You'd like to test whether the minimum value of each list element is greater than 5. An S4Vectors way of doing this is to coerce the plain-old list() to a List, and then exploit the fact that min() on the List returns element-wise minimums, min(as(geno(vcf)$AD[,"TUMOR"], "List")) >= 5. One might want to think carefully about this, e.g., if the variant allele is actually at high frequency. A filter might be AFADfilter =function(x) { af <- info(x)$AF > 0.05
ad <- min(as(geno(x)$AD[,"TUMOR"], "List")) >= 5 af & ad } Create a FilterRules() instance (be sure to enclose the function in a list) filter <- FilterRules(list(AFADfilter = AFADfilter)) To filter a VCF file, use the filter as the argument to filterVcf(), vcf_filtered <- filterVcf(vcf_file, "hg38", "AF_and_AD_filtered.vcf", filters=filters) To filter a VCF object, use subsetByFilter(vcf, filter) ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Martin Morgan ♦♦ 23k Dear Martin, thank you for your help, and message especially so early in the morning : I am very grateful ;) the piece of code that you provided is a good example that is easy to follow and makes good biological sense . If you'd like, I would suggest to include it also in the official documentation for the package, as a lot of computational biologists will appreciate having it. It shall work on my PC, I am sure. I had installed R 3.3.2 a while ago, and apparently filterVCF does not work with R.3.3.2, so I would need to downgrade to R.3.3.1. > sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 14393) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.24.0 loaded via a namespace (and not attached): [1] tools_3.3.2 ADD REPLYlink written 2.5 years ago by Bogdan550 forgot to mention, there is an error message I am getting when I am running filterVcf() from both platforms R 3.3.2 and R 3.2.5 (that runs on the other computer), and the error message is below. Shall you have any suggestions about how to fix it, please let me know. Thank you very much ;) ! > vcf_filtered <- filterVcf(vcf, "hg38", "AF_and_AD_filtered.vcf", filters=filter) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘filterVcf’ for signature ‘"CollapsedVCF"’ > sessionInfo() R version 3.2.5 (2016-04-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu precise (12.04.5 LTS) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] VariantAnnotation_1.16.4 Rsamtools_1.22.0 [3] Biostrings_2.38.4 XVector_0.10.0 [5] SummarizedExperiment_1.0.2 Biobase_2.30.0 [7] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 [9] IRanges_2.4.8 S4Vectors_0.8.11 [11] BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] Rcpp_0.12.8 AnnotationDbi_1.32.3 GenomicAlignments_1.6.3 [4] zlibbioc_1.16.0 BiocParallel_1.4.3 BSgenome_1.38.0 [7] tools_3.2.5 DBI_0.5-1 lambda.r_1.1.9 [10] futile.logger_1.4.3 digest_0.6.11 rtracklayer_1.30.4 [13] futile.options_1.0.0 bitops_1.0-6 biomaRt_2.26.1 [16] RCurl_1.95-4.8 memoise_1.0.0 RSQLite_1.1-1 [19] GenomicFeatures_1.22.13 XML_3.98-1.5 ADD REPLYlink written 2.5 years ago by Bogdan550 You said that you wanted to filter a vcf file, so your argument 'vcf' should be the path to a vcf file. If instead you would like to instead filter a vcf object, use AFADfilter =function(x) { af <- info(x)$AF > 0.05
ad <- min(as(geno(x)$AD[,"TUMOR"], "List")) >= 5 af & ad } vcf[AFADfilter(vcf)]  ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Martin Morgan ♦♦ 23k Dear Martin, thank you, this works ! happy Sunday time ! ADD REPLYlink written 2.5 years ago by Bogdan550 Dear Martin, thank you again for your help. If you do mind having me adding another question on the filterVCF() : The implementation that you have suggested (below) works : vcf <- readVcf("AML_out.vcf.bgzip",genome="hg38") AFADfilter = function(x) { af <- geno(x)$AF[,"TUMOR"] > 0.05
ad <- sum(as(geno(x)$AD[,"TUMOR"], "List")) >= 5 af & ad } z <- vcf[AFADfilter(vcf)] writeVcf(z,"output.vcf") However, in order to have a full implementation of filterVcf(), please would you advise what shall I change, as the following piece of code does not work : filter <- FilterRules(AFADfilter = function(x) { af <- geno(x)$AF[,"TUMOR"] > 0.05
ad <- sum(as(geno(x)$AD[,"TUMOR"], "List")) >= 5 af & ad }) vcf_filtered <- filterVcf("AML_out.vcf.bgzip", "hg38", "AF_and_AD_filtered.vcf", filters=filter) The error message is : > filterVcf("AML_out.vcf.bgzip", "hg38", "AF_and_AD_filtered.vcf", filters=filter) starting filter filtering 15519 records Error in eval(filter, x) : filter rule evaluated to non-logical: AFADfilter ADD REPLYlink modified 2.5 years ago by Martin Morgan ♦♦ 23k • written 2.5 years ago by Bogdan550 The argument to FilterRules() needs to be list(AFADfilter=fucntion(x) { ...}), rather than just the named function. I corrected my original post, and it is correct in both the help page and vignette. Sorry for the confusion. ADD REPLYlink written 2.5 years ago by Martin Morgan ♦♦ 23k Dear Martin, thank you again for your prompt reply, and for precious help : yes, the piece of R code below is working ;) AFADfilter = function(x) { af <- geno(x)$AF[,"TUMOR"] > 0.05
}

vcf_filtered <- filterVcf("AML_out.vcf.bgzip", "hg38", "AF_and_AD_filtered.vcf", filters=filter)