Search
Question: Question: How can I filter a vcf file by DP and GQ using the R package VariantAnnotation?
0
gravatar for r.jaffe
3.7 years ago by
r.jaffe0
Brazil
r.jaffe0 wrote:

Hi guys, need some help filtering my vcf files with the VariantAnnotation package (v1.12.9) from Bioconductor. I have a vcf file (v4.2) containing 28238 SNPs for 160 individuals, and I'm running R version 3.1.3 I opened my file and ran:

hist(geno(vcf)$DP) ##check the distribution of Read Depth
hist(geno(vcf)$GQ) ##check the distribution of Phred-Scaled Genotype Quality

I then tried to filter out all samples with a DP<10 and a GQ<20 and create a new vcf file:

vcf2 <- vcf[geno(vcf)$DP>10 && geno(vcf)$GQ>20]

The problem is that this code does not seem to be working, because both files have exactly the same dimensions:

dim(vcf) ##28238 160
dim(vcf2) ##28238 160

So my questions are: 1)How can I filter vcf files by DP and GQ?? 2)Is it better to eliminate SNPs that have low DP & GQ, or to re-code genotypes with low DP and GQ as missing genotypes (./.)? Any help would be much appreciated, thanks! Rodolfo

ADD COMMENTlink modified 3.7 years ago by Valerie Obenchain ♦♦ 6.7k • written 3.7 years ago by r.jaffe0

In this statement geno(vcf)$DP>10 && geno(vcf)$GQ>20 you combine the comparisons with &&. In R this is does a scalar comparison of the first element (it seems like R should at least warn here, but it doesn't).

> x = 1:5; y = 5:1
> x < 3 && y > 2
[1] TRUE

If it's 'TRUE', then recycling kicks in and you select ALL elements.

> x[ x < 3 && y > 2 ]
[1] 1 2 3 4 5

On the other hand a single & does a vectorized operation, and you select just the ones you're interested in!

>  x < 3 & y > 2
[1]  TRUE  TRUE FALSE FALSE FALSE
> x[ x < 3 & y > 2]
[1] 1 2

So

vcf2 <- vcf[geno(vcf)$DP>10 & geno(vcf)$GQ>20]

 

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Martin Morgan ♦♦ 22k

Hi Martin, thanks for your reply. If I do what you suggest I get this error:

Error in NSBS(i = i, x = x, exact = exact, upperBoundIsStrict = upperBoundIsStrict) : 
  subscript is a logical vector with out-of-bounds TRUE values
In addition: Warning message:
In NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript is an array, passing it thru as.vector() first

 

Any further ideas?

Cheers,

Rodolfo

ADD REPLYlink written 3.7 years ago by r.jaffe0

The first message is equivalent to trying to subset a vector by a longer vector, which in base R has the unfortunate consequence of adding 'NA' values

> integer()[TRUE]
[1] NA

but in the Genomic Ranges infrastructure generates a more sensible error

> VCF()[TRUE]
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : 
  subscript is a logical vector with out-of-bounds TRUE values> x = 1:3

The error comes about because as Valerie pointed out the geno() values are 2- or more dimensional arrays, whereas the way that you're subsetting the VCF object is expecting a 1-dimensional vector; the length of the two-dimensional vector is the product of the number of rows and columns, so longer than the length of the VCF object.

Returning to your original question, I guess geno(vcf)$DP > 10 & geno(vcf)$GQ > 20 returns a logical matrix with 28238 SNPs (rows) by 160 samples (columns). You could remove a row (SNP) entirely, or a column (sample) entirely by imposing some condition on  the values in a row, e.g., at least 80 samples satisfy the condition vcf[rowSums(geno(vcf)$DP > 10 & geno(vcf)$GQ > 20) > 80] To update the values of individual genotypes, you might follow the lead suggested by R for setting values of a matrix to, e.g., NA

> m = matrix(rnorm(18), 6)
> m[m < 0 ] = NA
> m
          [,1]       [,2]      [,3]
[1,]        NA 0.16330049        NA
[2,] 0.9612970         NA 0.3841423
[3,]        NA         NA        NA
[4,]        NA 0.09288314        NA
[5,] 0.6207293 0.01493677 1.2038016
[6,] 0.5116028 0.85283204        NA

and do something like

idx = geno(vcf)$DP > 10 & geno(vcf)$GQ > 20
m = geno(vcf)$GQ
m[ !idx ] = NA
geno(vcf)$GQ = m

 

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Martin Morgan ♦♦ 22k
0
gravatar for Valerie Obenchain
3.7 years ago by
Valerie Obenchain ♦♦ 6.7k
United States
Valerie Obenchain ♦♦ 6.7k wrote:

Hi,

You may want to look at ?filterVcf. This filters the file and writes a subset to disk vs reading in all data then subsetting. There are 2 ways to filter, the first (prefilter) is a grep of the unparsed data which detects presence/absence of a value. The second (filter) parses the data and allows filtering on >= comparisions. For the DP and GQ example above you'll want the second type.

The data in geno() are matrices. If you have a single sample (single column) the simple comparison DP > 10 will give a logical vector you can subset on. If you have multiple samples you'll need to do more work. For example,

> mat = matrix(1:10, ncol=2)
> mat
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> mat > 3
      [,1] [,2]
[1,] FALSE TRUE
[2,] FALSE TRUE
[3,] FALSE TRUE
[4,]  TRUE TRUE
[5,]  TRUE TRUE

Here the filter would need to be more specific and tailored for the number of samples. For example, do you want all samples to have a value > 3?

> rowSums(mat > 3) == 2
[1] FALSE FALSE FALSE  TRUE  TRUE

or any sample > 3?

> rowSums(mat > 3) >= 1
[1] TRUE TRUE TRUE TRUE TRUE

My feeling is that you should filter low DP & GQ and not recode as missing. Others on the list may be able to expand more on that topic.

Valerie

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Valerie Obenchain ♦♦ 6.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 272 users visited in the last hour