Question: Quality filtering of exactSNP-generated VCF files
gravatar for daniel.silvestre
3.4 years ago by
daniel.silvestre50 wrote:

I'm dealing with the results of some VCF files generated with exactSNP over BAM files made with subjunc aligner (all installed from Rsubread package) from a RNA-Seq project (made with HiSeq/TruSeq/mRNA only). Phred scores are the default (phred+33). The score scale of the FASTQ was detected as Sanger/Illumina-1.8. 

Now, I want keep only the high quality variants with at least 2 supporting reads. I've tried this two filtering expressions:

vcf <- vcf[which(vcf@fixed$QUAL > 22 & !(vcf@info$SR %in% c(NA, '1'))), ]


vcf <- vcf[which(vcf@fixed$QUAL > 22 & !(vcf@info$MM %in% c(NA, '1'))), ]

The first one simply get rid of all variants (empty CollapsedVCF as result). The second returns about 5% of the variants in the original file (which seems reasonable to me). 

Can someone explain the difference between these two results? A suggestion on how to filter VCFs generated with subjunc/exactSNP is much welcome too.

ADD COMMENTlink modified 3.4 years ago by Wei Shi2.7k • written 3.4 years ago by daniel.silvestre50
gravatar for Wei Shi
3.4 years ago by
Wei Shi2.7k
Wei Shi2.7k wrote:

The indels and SNPs should be dealt with differently when you performed the filtering. ExactSNP uses 'DP' tag to report the read depth (or coverage) for each called SNP. For indels, ExactSNP does not report read depth. It reports the number of reads supporting each indel using the 'SR' tag.

The 'QUAL' tag is only meaningful for SNPs. For indels, ExactSNP just sets a value of 1.0 for the 'QUAL' tag for each indel.  ExactSNP does not detect indels itself. It simply extracts all the indels found in the SAM/BAM input file and then saves them to its VCF output file.

The 'QUAL' value for each called SNP is calculated as -log10(p) where p is the p value yielded from the statistical test carried out by ExactSNP. Therefore, a 'QUAL' value of 22 corresponds to a p value of 10^-22, which is extremely low. Note that when calling SNPs, ExactSNP takes into account the sequencing depth at each SNP location, ie ExactSNP uses a lower 'QUAL' cutoff for SNP calling when sequencing depth is low and a higher 'QUAL' cutoff when sequencing depth is high. Therefore I would suggest the following for your SNP filtering:

SNPs_filtered <- vcf[which(vcf@fixed$QUAL > 5 & vcf@info$DP > 5), ]

Note that this will filter out those low-coverage SNPs which are often hard to call.

For indel filtering, you may try this:

Indels_filtered <- vcf[which(vcf@info$SR > 2), ]

Let me know if this does not work for you.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Wei Shi2.7k

My parameters were based on guidelines used by GATK pipelines. They suggest at least 2 supporting reads to call a SNPs as a good candidate. Your parameters work fine. But, as we are using these results for clinical investigation, a more stringent criteria seems appropriate. About indels, I'm treating the file generated by subjunc with this expression:

indels <- indels[which(!(indels@info$SR %in% c(NA, '1')) & (indels@fixed$QUAL / indels@info$DP > 2.0)), ]

Why is SR field coded as char instead of int? 


ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by daniel.silvestre50

Thanks for pointing this out. We will change the 'SR' field to Integer type.

For your indel filtering, you should not use the 'QUAL' and 'DP' fields. See my explanation above.


ADD REPLYlink written 3.4 years ago by Wei Shi2.7k

For indels, I'm using the subjunc-generated indels file. The QUAL field there has other scale (I'm not sure exactly which scale, though). As we already had some independent clinical/laboratorial results, I've calibrated the parameters based on those previous findings. The expected indels were there. With SR > 2 those findings are filtered out. About SNPs, with QUAL > 20 and DP > 5 I've got pretty reasonable results. I confess I'm very satisfied with Rsubread/subread. 

ADD REPLYlink written 3.4 years ago by daniel.silvestre50
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 345 users visited in the last hour