**0**wrote:

Hi guys, need some help filtering my vcf files with the VariantAnnotation package (v1.12.9) from Bioconductor. I have a vcf file (v4.2) containing 28238 SNPs for 160 individuals, and I'm running R version 3.1.3 I opened my file and ran:

hist(geno(vcf)$DP) ##check the distribution of Read Depth hist(geno(vcf)$GQ) ##check the distribution of Phred-Scaled Genotype Quality

I then tried to filter out all samples with a DP<10 and a GQ<20 and create a new vcf file:

vcf2 <- vcf[geno(vcf)$DP>10 && geno(vcf)$GQ>20]

The problem is that this code does not seem to be working, because both files have exactly the same dimensions:

```
dim(vcf) ##28238 160
dim(vcf2) ##28238 160
```

So my questions are: 1)How can I filter vcf files by DP and GQ?? 2)Is it better to eliminate SNPs that have low DP & GQ, or to re-code genotypes with low DP and GQ as missing genotypes (./.)? Any help would be much appreciated, thanks! Rodolfo

**6.7k**• written 4.6 years ago by r.jaffe •

**0**

In this statement geno(vcf)$DP>10 && geno(vcf)$GQ>20 you combine the comparisons with &&. In R this is does a scalar comparison of the first element (it seems like R should at least warn here, but it doesn't).

If it's 'TRUE', then recycling kicks in and you select ALL elements.

On the other hand a single & does a vectorized operation, and you select just the ones you're interested in!

So

23kHi Martin, thanks for your reply. If I do what you suggest I get this error:

Any further ideas?

Cheers,

Rodolfo

0The first message is equivalent to trying to subset a vector by a longer vector, which in base R has the unfortunate consequence of adding 'NA' values

but in the Genomic Ranges infrastructure generates a more sensible error

The error comes about because as Valerie pointed out the geno() values are 2- or more dimensional arrays, whereas the way that you're subsetting the VCF object is expecting a 1-dimensional vector; the length of the two-dimensional vector is the product of the number of rows and columns, so longer than the length of the VCF object.

Returning to your original question, I guess geno(vcf)$DP > 10 & geno(vcf)$GQ > 20 returns a logical matrix with 28238 SNPs (rows) by 160 samples (columns). You could remove a row (SNP) entirely, or a column (sample) entirely by imposing some condition on the values in a row, e.g., at least 80 samples satisfy the condition

`vcf[rowSums(geno(vcf)$DP > 10 & geno(vcf)$GQ > 20) > 80]`

To update the values of individual genotypes, you might follow the lead suggested by R for setting values of a matrix to, e.g., NAand do something like

23k