Search
Question: Question: How can I filter a vcf file by DP and GQ using the R package VariantAnnotation?
0
3.5 years ago by
r.jaffe0
Brazil
r.jaffe0 wrote:

Hi guys, need some help filtering my vcf files with the VariantAnnotation package (v1.12.9) from Bioconductor. I have a vcf file (v4.2) containing 28238 SNPs for 160 individuals, and I'm running R version 3.1.3 I opened my file and ran:

hist(geno(vcf)$DP) ##check the distribution of Read Depth hist(geno(vcf)$GQ) ##check the distribution of Phred-Scaled Genotype Quality

I then tried to filter out all samples with a DP<10 and a GQ<20 and create a new vcf file:

vcf2 <- vcf[geno(vcf)$DP>10 && geno(vcf)$GQ>20]

The problem is that this code does not seem to be working, because both files have exactly the same dimensions:

dim(vcf) ##28238 160
dim(vcf2) ##28238 160

So my questions are: 1)How can I filter vcf files by DP and GQ?? 2)Is it better to eliminate SNPs that have low DP & GQ, or to re-code genotypes with low DP and GQ as missing genotypes (./.)? Any help would be much appreciated, thanks! Rodolfo

modified 3.5 years ago by Valerie Obenchain ♦♦ 6.6k • written 3.5 years ago by r.jaffe0

In this statement geno(vcf)$DP>10 && geno(vcf)$GQ>20 you combine the comparisons with &&. In R this is does a scalar comparison of the first element (it seems like R should at least warn here, but it doesn't).

> x = 1:5; y = 5:1
> x < 3 && y > 2
[1] TRUE


If it's 'TRUE', then recycling kicks in and you select ALL elements.

> x[ x < 3 && y > 2 ]
[1] 1 2 3 4 5

On the other hand a single & does a vectorized operation, and you select just the ones you're interested in!

>  x < 3 & y > 2
[1]  TRUE  TRUE FALSE FALSE FALSE
> x[ x < 3 & y > 2]
[1] 1 2

So

vcf2 <- vcf[geno(vcf)$DP>10 & geno(vcf)$GQ>20]

Hi Martin, thanks for your reply. If I do what you suggest I get this error:

Error in NSBS(i = i, x = x, exact = exact, upperBoundIsStrict = upperBoundIsStrict) :
subscript is a logical vector with out-of-bounds TRUE values
In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
subscript is an array, passing it thru as.vector() first

Any further ideas?

Cheers,

Rodolfo

The first message is equivalent to trying to subset a vector by a longer vector, which in base R has the unfortunate consequence of adding 'NA' values

> integer()[TRUE]
[1] NA

but in the Genomic Ranges infrastructure generates a more sensible error

> VCF()[TRUE]
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
subscript is a logical vector with out-of-bounds TRUE values> x = 1:3

The error comes about because as Valerie pointed out the geno() values are 2- or more dimensional arrays, whereas the way that you're subsetting the VCF object is expecting a 1-dimensional vector; the length of the two-dimensional vector is the product of the number of rows and columns, so longer than the length of the VCF object.

Returning to your original question, I guess geno(vcf)$DP > 10 & geno(vcf)$GQ > 20 returns a logical matrix with 28238 SNPs (rows) by 160 samples (columns). You could remove a row (SNP) entirely, or a column (sample) entirely by imposing some condition on  the values in a row, e.g., at least 80 samples satisfy the condition vcf[rowSums(geno(vcf)$DP > 10 & geno(vcf)$GQ > 20) > 80] To update the values of individual genotypes, you might follow the lead suggested by R for setting values of a matrix to, e.g., NA

> m = matrix(rnorm(18), 6)
> m[m < 0 ] = NA
> m
[,1]       [,2]      [,3]
[1,]        NA 0.16330049        NA
[2,] 0.9612970         NA 0.3841423
[3,]        NA         NA        NA
[4,]        NA 0.09288314        NA
[5,] 0.6207293 0.01493677 1.2038016
[6,] 0.5116028 0.85283204        NA

and do something like

idx = geno(vcf)$DP > 10 & geno(vcf)$GQ > 20
m = geno(vcf)$GQ m[ !idx ] = NA geno(vcf)$GQ = m

0
3.5 years ago by
Valerie Obenchain ♦♦ 6.6k
United States
Valerie Obenchain ♦♦ 6.6k wrote:

Hi,

You may want to look at ?filterVcf. This filters the file and writes a subset to disk vs reading in all data then subsetting. There are 2 ways to filter, the first (prefilter) is a grep of the unparsed data which detects presence/absence of a value. The second (filter) parses the data and allows filtering on >= comparisions. For the DP and GQ example above you'll want the second type.

The data in geno() are matrices. If you have a single sample (single column) the simple comparison DP > 10 will give a logical vector you can subset on. If you have multiple samples you'll need to do more work. For example,

> mat = matrix(1:10, ncol=2)
> mat
[,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> mat > 3
[,1] [,2]
[1,] FALSE TRUE
[2,] FALSE TRUE
[3,] FALSE TRUE
[4,]  TRUE TRUE
[5,]  TRUE TRUE

Here the filter would need to be more specific and tailored for the number of samples. For example, do you want all samples to have a value > 3?

> rowSums(mat > 3) == 2
[1] FALSE FALSE FALSE  TRUE  TRUE

or any sample > 3?

> rowSums(mat > 3) >= 1
[1] TRUE TRUE TRUE TRUE TRUE

My feeling is that you should filter low DP & GQ and not recode as missing. Others on the list may be able to expand more on that topic.

Valerie