Search
Question: about filtering variants in a VCF file
1
gravatar for Bogdan
20 months ago by
Bogdan520
Palo Alto, CA, USA
Bogdan520 wrote:

Dear Martin, please when you have a minute, may I ask you for a suggestion : shall I have a VCF file as it is described below, please could you advise on a way to select only the variants with AD > 5, and AF > 0.05 ? thanks you very much ;)

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    TUMOR    NORMAL
chr1    146139995    0    T    C    0    PASS    BaseCounts=1,5,0,38;ECNT=1;FS=4.639;GC=52.48;HCNT=2;HRun=0;LowMQ=0.0000,0.0682,44;MAX_ED=.;MIN_ED=.;NLOD=7.78;SOR=1.916;TLOD=7.51;VariantType=SNP;ANNOVAR_DATE=2016-02-01;Func.refGene=exonic;Gene.refGene=NBPF10;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;AAChange.refGene=NBPF10:NM_001039703:exon5:c.A592G:p.K198E,NBPF10:NM_001302371:exon5:c.A592G:p.K198E;cosmic70=.;SIFT_score=.;SIFT_pred=.;Polyphen2_HDIV_score=.;Polyphen2_HDIV_pred=.;Polyphen2_HVAR_score=.;Polyphen2_HVAR_pred=.;LRT_score=.;LRT_pred=.;MutationTaster_score=.;MutationTaster_pred=.;MutationAssessor_score=.;MutationAssessor_pred=.;FATHMM_score=4.51;FATHMM_pred=T;PROVEAN_score=.;PROVEAN_pred=.;VEST3_score=0.26;CADD_raw=0.353;CADD_phred=6.192;DANN_score=0.130;fathmm-MKL_coding_score=0.000;fathmm-MKL_coding_pred=N;MetaSVM_score=-0.961;MetaSVM_pred=T;MetaLR_score=0.008;MetaLR_pred=T;integrated_fitCons_score=0.693;integrated_confidence_value=0;GERP�_RS=-0.47;phyloP7way_vertebrate=-1.190;phyloP20way_mammalian=-1.003;phastCons7way_vertebrate=0.002;phastCons20way_mammalian=0.004;SiPhy_29way_logOdds=3.32;dbscSNV_ADA_SCORE=.;dbscSNV_RF_SCORE=.;HRC_AF=.;HRC_AC=.;HRC_AN=.;HRC_non1000G_AF=.;HRC_non1000G_AC=.;HRC_non1000G_AN=.;esp6500siv2_ea=.;esp6500siv2_aa=.;esp6500siv2_all=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;Kaviar_AF=.;Kaviar_AC=.;Kaviar_AN=.;avsnp144=.;nci60=.;clinvar_20150629=.;ALLELE_END    GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1    0/1:31,5:0.139:2:3:0.400:904,143:18:13    0/0:26,0:0.00:0:0:.:759,0:10:16
ADD COMMENTlink modified 20 months ago by Martin Morgan ♦♦ 22k • written 20 months ago by Bogdan520

Do you want to create another VCF file with only variants satisfying a particular criterion, or  would you like to read in a portion of the VCF file for further computation in R? Also, from your  sample line, AD is a vector of length 2 (one for each genotype) and is available for each sample (in contrast, AF characterizes each variant) so your criterion on AD needs to be specified more carefully. It would be informative to include the output of scanVcfHeader("path/to/file.vcf") in your QUESTION.

ADD REPLYlink modified 20 months ago • written 20 months ago by Martin Morgan ♦♦ 22k

Dear Martin, thank you very much. I would like to make another VCF file with the filtered variants indeed.

And thank you for mentioning about AD field, as I was going to ask why the output of the command -- head(geno(vcf)$AD) -- looks not too specific (please see below).

 

> head(geno(vcf)$AD)
                  NORMAL    TUMOR    
chr1:108044_C/G   Integer,2 Integer,2
chr1:123100_A/ATG Integer,2 Integer,2
chr1:187017_G/C   Integer,2 Integer,2
chr1:205931_A/G   Integer,2 Integer,2
chr1:205932_A/G   Integer,2 Integer,2
chr1:262878_T/C   Integer,2 Integer,2

The output for scanVcfHeader(x.vcf) is the following (below). I have used GATK and MUTECT2 in order to produce the vcf file. Many thanks !

> scanVcfHeader("./AML_out.vcf")

class: VCFHeader
samples(2): NORMAL TUMOR
meta(2): META contig
fixed(1): FILTER
info(23): AC AF ... TLOD VariantType
geno(14): GT AD ... REF_F1R2 REF_F2R1

 

 

 

ADD REPLYlink written 20 months ago by Bogdan520

So how are you going to select variants based on AD -- above a threshold, for all alleles and for all samples? For the minor allele in both NORMAL and TUMOR ? For the minor allele in TUMOR ? ... ?

ADD REPLYlink written 20 months ago by Martin Morgan ♦♦ 22k

Dear Martin, thank you for your question. I aim to select the somatic variants based on AD > 5 and AF > 0.05 of the mutated allele (I believe we can call the mutated allele as minor allele) only in the TUMOR sample.

ADD REPLYlink written 20 months ago by Bogdan520
4
gravatar for Martin Morgan
20 months ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

For a VCF object vcf, you'd like to select the rows with info(vcf)$AF > 0.05, which is a logical vector the same length as the number of rows in vcf.

AD is harder, but you'd like to select the appropriate column geno(vcf)$AD[,"TUMOR"]. This is a list(), one element for each row of the VCF, with the element equal to the AD of the corresponding genotype. You'd like to test whether the minimum value of each list element is greater than 5. An S4Vectors way of doing this is to coerce the plain-old list() to a List, and then exploit the fact that min() on the List returns element-wise minimums, min(as(geno(vcf)$AD[,"TUMOR"], "List")) >= 5. One might want to think carefully about this, e.g., if the variant allele is actually at high frequency.

A filter might be 

AFADfilter =function(x) {
    af <- info(x)$AF > 0.05
    ad <- min(as(geno(x)$AD[,"TUMOR"], "List")) >= 5
    af & ad
}

Create a FilterRules() instance (be sure to enclose the function in a list)

filter <- FilterRules(list(AFADfilter = AFADfilter))

To filter a VCF file, use the filter as the argument to filterVcf(),

vcf_filtered <- filterVcf(vcf_file, "hg38", "AF_and_AD_filtered.vcf", filters=filters)

To filter a VCF object, use

subsetByFilter(vcf, filter)
ADD COMMENTlink modified 20 months ago • written 20 months ago by Martin Morgan ♦♦ 22k

Dear Martin,

thank you for your help, and message especially so early in the morning : I am very grateful ;) the piece of code that you provided is a good example that is easy to follow and makes good biological sense . If you'd like, I would suggest to include it also in the official documentation for the package, as a lot of computational biologists will appreciate having it.

It shall work on my PC, I am sure. I had installed R 3.3.2 a while ago, and apparently filterVCF does not work with R.3.3.2, so I would need to downgrade to R.3.3.1.

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.24.0

loaded via a namespace (and not attached):
[1] tools_3.3.2  

 

 

ADD REPLYlink written 20 months ago by Bogdan520

forgot to mention, there is an error message I am getting when I am running filterVcf() from both platforms R 3.3.2 and R 3.2.5 (that runs on the other computer), and the error message is below.

Shall you have any suggestions about how to fix it, please let me know. Thank you very much ;) !

> vcf_filtered <- filterVcf(vcf, "hg38", "AF_and_AD_filtered.vcf", filters=filter)

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘filterVcf’ for signature ‘"CollapsedVCF"’

 

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VariantAnnotation_1.16.4   Rsamtools_1.22.0          
 [3] Biostrings_2.38.4          XVector_0.10.0            
 [5] SummarizedExperiment_1.0.2 Biobase_2.30.0            
 [7] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3        
 [9] IRanges_2.4.8              S4Vectors_0.8.11          
[11] BiocGenerics_0.16.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8             AnnotationDbi_1.32.3    GenomicAlignments_1.6.3
 [4] zlibbioc_1.16.0         BiocParallel_1.4.3      BSgenome_1.38.0        
 [7] tools_3.2.5             DBI_0.5-1               lambda.r_1.1.9         
[10] futile.logger_1.4.3     digest_0.6.11           rtracklayer_1.30.4     
[13] futile.options_1.0.0    bitops_1.0-6            biomaRt_2.26.1         
[16] RCurl_1.95-4.8          memoise_1.0.0           RSQLite_1.1-1          
[19] GenomicFeatures_1.22.13 XML_3.98-1.5           

 

ADD REPLYlink written 20 months ago by Bogdan520

You said that you wanted to filter a vcf file, so your argument 'vcf' should be the path to a vcf file. If instead you would like to instead filter a vcf object, use

AFADfilter =function(x) {
    af <- info(x)$AF > 0.05
    ad <- min(as(geno(x)$AD[,"TUMOR"], "List")) >= 5
    af & ad
}
vcf[AFADfilter(vcf)]
ADD REPLYlink modified 20 months ago • written 20 months ago by Martin Morgan ♦♦ 22k

Dear Martin, thank you, this works ! happy Sunday time !

ADD REPLYlink written 20 months ago by Bogdan520

Dear Martin, thank you again for your help. If you do mind having me adding another question on the filterVCF() :

The implementation that you have suggested (below) works : 

vcf <- readVcf("AML_out.vcf.bgzip",genome="hg38")

AFADfilter = function(x) {
    af <- geno(x)$AF[,"TUMOR"] > 0.05
    ad <- sum(as(geno(x)$AD[,"TUMOR"], "List")) >= 5
    af & ad
}

z <- vcf[AFADfilter(vcf)]
writeVcf(z,"output.vcf")

However, in order to have a full implementation of filterVcf(), please would you advise what shall I change, as the following piece of code does not work :

filter <- FilterRules(AFADfilter = function(x) {
    af <- geno(x)$AF[,"TUMOR"] > 0.05
    ad <- sum(as(geno(x)$AD[,"TUMOR"], "List")) >= 5
    af & ad
})

vcf_filtered <- filterVcf("AML_out.vcf.bgzip", "hg38", "AF_and_AD_filtered.vcf", filters=filter)

The error message is :

> filterVcf("AML_out.vcf.bgzip", "hg38", "AF_and_AD_filtered.vcf", filters=filter)
starting filter
filtering 15519 records
Error in eval(filter, x) : 
  filter rule evaluated to non-logical: AFADfilter
ADD REPLYlink modified 20 months ago by Martin Morgan ♦♦ 22k • written 20 months ago by Bogdan520

The argument to FilterRules() needs to be list(AFADfilter=fucntion(x) { ...}), rather than just the named function. I corrected my original post, and it is correct in both the help page and vignette. Sorry for the confusion.

ADD REPLYlink written 20 months ago by Martin Morgan ♦♦ 22k

Dear Martin,

thank you again for your prompt reply, and for precious help : yes, the piece of R code below is working ;)

AFADfilter = function(x) {
                          af <- geno(x)$AF[,"TUMOR"] > 0.05
                          ad <- sum(as(geno(x)$AD[,"TUMOR"], "List")) >= 5
                          af & ad
                         }
filter <- FilterRules(list(AFADfilter = AFADfilter))

vcf_filtered <- filterVcf("AML_out.vcf.bgzip", "hg38", "AF_and_AD_filtered.vcf", filters=filter)

 

ADD REPLYlink modified 20 months ago • written 20 months ago by Bogdan520

If I may, I have another little question regarding subsetByFilter(), although I will open a new track. thank you !

ADD REPLYlink written 20 months ago by Bogdan520
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 109 users visited in the last hour