Question: Question: How can I filter a vcf file by DP and GQ using the R package VariantAnnotation?
gravatar for r.jaffe
3.7 years ago by
r.jaffe0 wrote:

Hi guys, need some help filtering my vcf files with the VariantAnnotation package (v1.12.9) from Bioconductor. I have a vcf file (v4.2) containing 28238 SNPs for 160 individuals, and I'm running R version 3.1.3 I opened my file and ran:

hist(geno(vcf)$DP) ##check the distribution of Read Depth
hist(geno(vcf)$GQ) ##check the distribution of Phred-Scaled Genotype Quality

I then tried to filter out all samples with a DP<10 and a GQ<20 and create a new vcf file:

vcf2 <- vcf[geno(vcf)$DP>10 && geno(vcf)$GQ>20]

The problem is that this code does not seem to be working, because both files have exactly the same dimensions:

dim(vcf) ##28238 160
dim(vcf2) ##28238 160

So my questions are: 1)How can I filter vcf files by DP and GQ?? 2)Is it better to eliminate SNPs that have low DP & GQ, or to re-code genotypes with low DP and GQ as missing genotypes (./.)? Any help would be much appreciated, thanks! Rodolfo

ADD COMMENTlink modified 3.7 years ago by Valerie Obenchain ♦♦ 6.7k • written 3.7 years ago by r.jaffe0

In this statement geno(vcf)$DP>10 && geno(vcf)$GQ>20 you combine the comparisons with &&. In R this is does a scalar comparison of the first element (it seems like R should at least warn here, but it doesn't).

> x = 1:5; y = 5:1
> x < 3 && y > 2
[1] TRUE

If it's 'TRUE', then recycling kicks in and you select ALL elements.

> x[ x < 3 && y > 2 ]
[1] 1 2 3 4 5

On the other hand a single & does a vectorized operation, and you select just the ones you're interested in!

>  x < 3 & y > 2
> x[ x < 3 & y > 2]
[1] 1 2


vcf2 <- vcf[geno(vcf)$DP>10 & geno(vcf)$GQ>20]


ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Martin Morgan ♦♦ 22k

Hi Martin, thanks for your reply. If I do what you suggest I get this error:

Error in NSBS(i = i, x = x, exact = exact, upperBoundIsStrict = upperBoundIsStrict) : 
  subscript is a logical vector with out-of-bounds TRUE values
In addition: Warning message:
In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first


Any further ideas?



ADD REPLYlink written 3.7 years ago by r.jaffe0

The first message is equivalent to trying to subset a vector by a longer vector, which in base R has the unfortunate consequence of adding 'NA' values

> integer()[TRUE]
[1] NA

but in the Genomic Ranges infrastructure generates a more sensible error

Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : 
  subscript is a logical vector with out-of-bounds TRUE values> x = 1:3

The error comes about because as Valerie pointed out the geno() values are 2- or more dimensional arrays, whereas the way that you're subsetting the VCF object is expecting a 1-dimensional vector; the length of the two-dimensional vector is the product of the number of rows and columns, so longer than the length of the VCF object.

Returning to your original question, I guess geno(vcf)$DP > 10 & geno(vcf)$GQ > 20 returns a logical matrix with 28238 SNPs (rows) by 160 samples (columns). You could remove a row (SNP) entirely, or a column (sample) entirely by imposing some condition on  the values in a row, e.g., at least 80 samples satisfy the condition vcf[rowSums(geno(vcf)$DP > 10 & geno(vcf)$GQ > 20) > 80] To update the values of individual genotypes, you might follow the lead suggested by R for setting values of a matrix to, e.g., NA

> m = matrix(rnorm(18), 6)
> m[m < 0 ] = NA
> m
          [,1]       [,2]      [,3]
[1,]        NA 0.16330049        NA
[2,] 0.9612970         NA 0.3841423
[3,]        NA         NA        NA
[4,]        NA 0.09288314        NA
[5,] 0.6207293 0.01493677 1.2038016
[6,] 0.5116028 0.85283204        NA

and do something like

idx = geno(vcf)$DP > 10 & geno(vcf)$GQ > 20
m = geno(vcf)$GQ
m[ !idx ] = NA
geno(vcf)$GQ = m


ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Martin Morgan ♦♦ 22k
gravatar for Valerie Obenchain
3.7 years ago by
Valerie Obenchain ♦♦ 6.7k
United States
Valerie Obenchain ♦♦ 6.7k wrote:


You may want to look at ?filterVcf. This filters the file and writes a subset to disk vs reading in all data then subsetting. There are 2 ways to filter, the first (prefilter) is a grep of the unparsed data which detects presence/absence of a value. The second (filter) parses the data and allows filtering on >= comparisions. For the DP and GQ example above you'll want the second type.

The data in geno() are matrices. If you have a single sample (single column) the simple comparison DP > 10 will give a logical vector you can subset on. If you have multiple samples you'll need to do more work. For example,

> mat = matrix(1:10, ncol=2)
> mat
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> mat > 3
      [,1] [,2]

Here the filter would need to be more specific and tailored for the number of samples. For example, do you want all samples to have a value > 3?

> rowSums(mat > 3) == 2

or any sample > 3?

> rowSums(mat > 3) >= 1

My feeling is that you should filter low DP & GQ and not recode as missing. Others on the list may be able to expand more on that topic.


ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Valerie Obenchain ♦♦ 6.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 272 users visited in the last hour