Question: Select specific variants from vcf file
0
2.5 years ago by
Poland

Hello,

Does anyone know how to extract specific variants from vcf files?

I have several VCF files with variants from an NGS experiment. I'd like to keep only variants such as missense (stop gain, stop loss, start gain, start loss), splice site (in introns and exons), and all frameshift mutations.

I'm also looking for variants with a low minor allele frequency (MAF); I know there is a 'COMMON=0' parameter.

How can I do this filtering on Windows, or with a package in R?

All the best,

vcf • 1.0k views
modified 2.5 years ago by Martin Morgan ♦♦ 23k • written 2.5 years ago by Adam0
Answer: Select specific variants from vcf file
2
2.5 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

Use ScanVcfParam() with readVcf() to selectively import your data into R, or filterVcf() to create a new VCF file with an appropriate subset. The primary sources of documentation are the vignettes and man pages of the relevant functions, available from within R in the usual way or from the package landing page.

VCF files are of course just text files, but they are highly structured; grep is ok for some basic manipulations (filterVcf does this for the 'prefilters') but other computations involve unpacking the data more completely.
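
As a rough illustration of the grep-style text filtering described above, here is a minimal Python sketch (Python is an assumption on my part; it is not what filterVcf uses, but it runs on Windows without extra tools). It keeps header lines plus any record whose INFO column carries COMMON=0. The function names and the sample records are hypothetical.

```python
def info_has(info, key, value):
    """Return True if the semicolon-separated INFO string contains key=value."""
    for entry in info.split(";"):
        k, sep, v = entry.partition("=")
        if sep and k == key and v == value:
            return True
    return False


def prefilter_vcf_lines(lines, key="COMMON", value="0"):
    """Yield header lines plus records whose INFO field has key=value."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.split("\t")
        # In a VCF record the 8th tab-separated column (index 7) is INFO.
        if len(fields) >= 8 and info_has(fields[7], key, value):
            yield line


# Hypothetical records, for illustration only.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t50\tPASS\tDP=10;COMMON=1",
    "1\t200\trs2\tC\tT\t60\tPASS\tDP=12;COMMON=0",
    "2\t300\t.\tG\tA\t40\tPASS\tDP=8",
]
kept = list(prefilter_vcf_lines(sample))
```

Splitting INFO on ";" rather than searching the whole line for the text "COMMON=0" avoids accidental matches inside longer keys. Anything beyond this kind of flag check, such as coding consequences, needs the data unpacked properly, as noted above.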

This may be a little philosophical, but there is tremendous value in semantically 'rich' data that one loses with dplyr; a short comparison is, for instance, on slides 14-16 of these slides. This value is compounded the more you use Bioconductor: for a one-off it seems like overkill, but for daily use you find yourself spending less time worrying about data representation and more time addressing the informatic, statistical, and biological questions that motivate your research.

Answer: Select specific variants from vcf file
0
2.5 years ago by
United States
James W. MacDonald 51k wrote:

In basic terms, you want to read the VCF file(s) into R using the VariantAnnotation package. You can then use a TxDb package to get a GRanges object of transcripts, and use subsetByOverlaps to subset your VCF to the variants that overlap a known transcript. After that, you can use predictCoding with a BSgenome package to predict the coding consequences. This is all covered in the VariantAnnotation vignette, so I would direct you there for more details.
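
To make the overlap step concrete, here is a conceptual Python sketch of what subsetting variants by overlap with transcript ranges means. It is not the VariantAnnotation implementation. The function `subset_by_overlaps`, the tuple layout, and the coordinates are all hypothetical, and it assumes single-base variants and 1-based inclusive ranges.

```python
def subset_by_overlaps(variants, transcripts):
    """Keep variants (chrom, pos) that fall inside any (chrom, start, end) range."""
    # Group transcript ranges by chromosome for quick lookup.
    by_chrom = {}
    for chrom, start, end in transcripts:
        by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, pos in variants:
        ranges = by_chrom.get(chrom, [])
        # Assumption: ranges are 1-based and inclusive at both ends.
        if any(start <= pos <= end for start, end in ranges):
            kept.append((chrom, pos))
    return kept


# Hypothetical coordinates, for illustration only.
transcripts = [("chr1", 1000, 2000), ("chr2", 500, 900)]
variants = [("chr1", 1500), ("chr1", 2500), ("chr2", 500), ("chr3", 700)]
overlapping = subset_by_overlaps(variants, transcripts)
```

In R the same idea is expressed on GRanges objects, which also carry strand, genome, and sequence-level information that this simplified version ignores.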