Quality filtering of exactSNP-generated VCF files
Entering edit mode
Last seen 6.1 years ago

I'm dealing with the results of some VCF files generated with exactSNP over BAM files made with subjunc aligner (all installed from Rsubread package) from a RNA-Seq project (made with HiSeq/TruSeq/mRNA only). Phred scores are the default (phred+33). The score scale of the FASTQ was detected as Sanger/Illumina-1.8. 

Now, I want keep only the high quality variants with at least 2 supporting reads. I've tried this two filtering expressions:

vcf <- vcf[which(vcf@fixed$QUAL > 22 & !(vcf@info$SR %in% c(NA, '1'))), ]


vcf <- vcf[which(vcf@fixed$QUAL > 22 & !(vcf@info$MM %in% c(NA, '1'))), ]

The first one simply get rid of all variants (empty CollapsedVCF as result). The second returns about 5% of the variants in the original file (which seems reasonable to me). 

Can someone explain the difference between these two results? A suggestion on how to filter VCFs generated with subjunc/exactSNP is much welcome too.

exactSNP qc variantannotation • 2.4k views
Entering edit mode
Wei Shi ★ 3.3k
Last seen 22 hours ago
Australia/Melbourne/Olivia Newton-John …

The indels and SNPs should be dealt with differently when you performed the filtering. ExactSNP uses 'DP' tag to report the read depth (or coverage) for each called SNP. For indels, ExactSNP does not report read depth. It reports the number of reads supporting each indel using the 'SR' tag.

The 'QUAL' tag is only meaningful for SNPs. For indels, ExactSNP just sets a value of 1.0 for the 'QUAL' tag for each indel.  ExactSNP does not detect indels itself. It simply extracts all the indels found in the SAM/BAM input file and then saves them to its VCF output file.

The 'QUAL' value for each called SNP is calculated as -log10(p) where p is the p value yielded from the statistical test carried out by ExactSNP. Therefore, a 'QUAL' value of 22 corresponds to a p value of 10^-22, which is extremely low. Note that when calling SNPs, ExactSNP takes into account the sequencing depth at each SNP location, ie ExactSNP uses a lower 'QUAL' cutoff for SNP calling when sequencing depth is low and a higher 'QUAL' cutoff when sequencing depth is high. Therefore I would suggest the following for your SNP filtering:

SNPs_filtered <- vcf[which(vcf@fixed$QUAL > 5 & vcf@info$DP > 5), ]

Note that this will filter out those low-coverage SNPs which are often hard to call.

For indel filtering, you may try this:

Indels_filtered <- vcf[which(vcf@info$SR > 2), ]

Let me know if this does not work for you.

Entering edit mode

My parameters were based on guidelines used by GATK pipelines. They suggest at least 2 supporting reads to call a SNPs as a good candidate. Your parameters work fine. But, as we are using these results for clinical investigation, a more stringent criteria seems appropriate. About indels, I'm treating the file generated by subjunc with this expression:

indels <- indels[which(!(indels@info$SR %in% c(NA, '1')) & (indels@fixed$QUAL / indels@info$DP > 2.0)), ]

Why is SR field coded as char instead of int? 


Entering edit mode

Thanks for pointing this out. We will change the 'SR' field to Integer type.

For your indel filtering, you should not use the 'QUAL' and 'DP' fields. See my explanation above.


Entering edit mode

For indels, I'm using the subjunc-generated indels file. The QUAL field there has other scale (I'm not sure exactly which scale, though). As we already had some independent clinical/laboratorial results, I've calibrated the parameters based on those previous findings. The expected indels were there. With SR > 2 those findings are filtered out. About SNPs, with QUAL > 20 and DP > 5 I've got pretty reasonable results. I confess I'm very satisfied with Rsubread/subread. 


Login before adding your answer.

Traffic: 251 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6