Hi guys, need some help filtering my vcf files with the VariantAnnotation package (v1.12.9) from Bioconductor. I have a vcf file (v4.2) containing 28238 SNPs for 160 individuals, and I'm running R version 3.1.3 I opened my file and ran:
hist(geno(vcf)$DP) ##check the distribution of Read Depth hist(geno(vcf)$GQ) ##check the distribution of Phred-Scaled Genotype Quality
I then tried to filter out all samples with a DP<10 and a GQ<20 and create a new vcf file:
vcf2 <- vcf[geno(vcf)$DP>10 && geno(vcf)$GQ>20]
The problem is that this code does not seem to be working, because both files have exactly the same dimensions:
dim(vcf) ##28238 160
dim(vcf2) ##28238 160
So my questions are: 1)How can I filter vcf files by DP and GQ?? 2)Is it better to eliminate SNPs that have low DP & GQ, or to re-code genotypes with low DP and GQ as missing genotypes (./.)? Any help would be much appreciated, thanks! Rodolfo
In this statement geno(vcf)$DP>10 && geno(vcf)$GQ>20 you combine the comparisons with &&. In R this is does a scalar comparison of the first element (it seems like R should at least warn here, but it doesn't).
If it's 'TRUE', then recycling kicks in and you select ALL elements.
On the other hand a single & does a vectorized operation, and you select just the ones you're interested in!
So
Hi Martin, thanks for your reply. If I do what you suggest I get this error:
Any further ideas?
Cheers,
Rodolfo
The first message is equivalent to trying to subset a vector by a longer vector, which in base R has the unfortunate consequence of adding 'NA' values
but in the Genomic Ranges infrastructure generates a more sensible error
The error comes about because as Valerie pointed out the geno() values are 2- or more dimensional arrays, whereas the way that you're subsetting the VCF object is expecting a 1-dimensional vector; the length of the two-dimensional vector is the product of the number of rows and columns, so longer than the length of the VCF object.
Returning to your original question, I guess geno(vcf)$DP > 10 & geno(vcf)$GQ > 20 returns a logical matrix with 28238 SNPs (rows) by 160 samples (columns). You could remove a row (SNP) entirely, or a column (sample) entirely by imposing some condition on the values in a row, e.g., at least 80 samples satisfy the condition
vcf[rowSums(geno(vcf)$DP > 10 & geno(vcf)$GQ > 20) > 80]
To update the values of individual genotypes, you might follow the lead suggested by R for setting values of a matrix to, e.g., NAand do something like