Question

Select specific variants from vcf file

Adam • 0

@adam-10025

Last seen 2.3 years ago

Poland

Hello,

Does anyone know how to extract specific variants from vcf files?

I have several VCF files with variants from an NGS experiment. I'd like to keep only variants such as missense (stop gain, stop loss, start gain, start loss), splice site (in intron and exon) and all frameshift mutations.

What is more, I'm looking for changes with a small MAF; I know there is a 'COMMON=0' parameter.

So how can I do this filtering on Windows, or with some package in R?

All the best,

Adam.

vcf • 5.6k views

updated 8.9 years ago by Martin Morgan 25k • written 8.9 years ago by Adam • 0

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 16 hours ago

United States

In basic terms, you want to read the VCF file(s) into R using the VariantAnnotation package. You can then use a TxDb package to get a transcripts GRanges object, and use subsetByOverlaps to subset your VCF to the variants that overlap a known transcript. After that, you can use predictCoding and a BSgenome package to predict the coding consequences. This is all covered in the VariantAnnotation vignette, so I would direct you there for more details.

8.9 years ago James W. MacDonald 68k

Yes, actually I read about this package, but don't you think it's a bit complicated? I'm asking because the VCF file already has the variation type (missense, splice region, frameshift, etc.). So maybe a typical filter with dplyr and grep in R would be enough?

8.9 years ago Adam • 0

score 2 · Accepted Answer · 2017-04-04

Use ScanVcfParam() with readVcf() to selectively import your data into R, or filterVcf() to create a new VCF file with an appropriate subset. The primary sources of documentation are the vignettes and man pages of the relevant functions, available from within R in the usual way or from the package landing page.

VCF files are of course just text files, but they are highly structured; grep is ok for some basic manipulations (filterVcf does this for the 'prefilters') but other computations involve unpacking the data more completely.
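To make the "just text files" point concrete, here is a minimal sketch of a grep-style pass, written in plain Python rather than with VariantAnnotation (it is not what filterVcf does internally). It assumes the consequence annotation appears somewhere in the INFO column as text; the `EFFECT` key, the list of terms, and the function names are made up for illustration, so adjust them to whatever your annotator actually writes. It also assumes dbSNP-style `COMMON=0` tags, and it drops records that have no `COMMON` tag at all.

```python
import io

# Assumed consequence terms (hypothetical list); match them to your annotator's vocabulary.
WANTED_TERMS = ("missense", "stop_gained", "stop_lost", "start_lost",
                "splice", "frameshift")


def parse_info(info_field):
    """Split a VCF INFO string into a dict; flag keys without '=' map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def keep_record(line, wanted_terms=WANTED_TERMS):
    """Return True for a data line tagged COMMON=0 that mentions a wanted term."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:          # not a complete VCF data line
        return False
    info_field = fields[7]       # INFO is the 8th tab-separated column
    if parse_info(info_field).get("COMMON") != "0":
        return False             # also drops records without a COMMON tag
    lowered = info_field.lower()
    return any(term in lowered for term in wanted_terms)


def filter_vcf(lines):
    """Yield header lines unchanged, plus the data lines that pass keep_record."""
    for line in lines:
        if line.startswith("#") or keep_record(line):
            yield line


# Tiny made-up example; the EFFECT key is a placeholder, not a standard tag.
EXAMPLE = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t50\tPASS\tCOMMON=0;EFFECT=missense_variant",
    "1\t200\trs2\tC\tT\t50\tPASS\tCOMMON=1;EFFECT=missense_variant",
    "1\t300\trs3\tG\tA\t50\tPASS\tCOMMON=0;EFFECT=synonymous_variant",
    "1\t400\trs4\tT\tTA\t50\tPASS\tCOMMON=0;EFFECT=frameshift_variant",
]) + "\n"

kept = list(filter_vcf(io.StringIO(EXAMPLE)))
print("".join(kept), end="")
```

For a real file you would pass an open file handle instead of the `io.StringIO` object and write the yielded lines to a new file. A substring match like this cannot tell which allele or transcript an annotation belongs to, which is the kind of computation that needs the data unpacked more completely.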

Maybe a little philosophical but there is tremendous value to semantically 'rich' data that one loses with dplyr; a short compare and contrast is for instance at slides 14 - 16 of these slides. This value is compounded the more you use Bioconductor -- for a one-off it seems like overkill, but for daily use you find yourself spending less time worrying about data representation and more time addressing the informatic, statistical, and biological questions that motivate your research.