Search
Question: Quality filtering of exactSNP-generated VCF files
1
3.9 years ago by
Brazil
daniel.silvestre50 wrote:

I'm dealing with the results of some VCF files generated with exactSNP over BAM files made with subjunc aligner (all installed from Rsubread package) from a RNA-Seq project (made with HiSeq/TruSeq/mRNA only). Phred scores are the default (phred+33). The score scale of the FASTQ was detected as Sanger/Illumina-1.8.

Now, I want keep only the high quality variants with at least 2 supporting reads. I've tried this two filtering expressions:

vcf <- vcf[which(vcf@fixed$QUAL > 22 & !(vcf@info$SR %in% c(NA, '1'))), ]

And:

vcf <- vcf[which(vcf@fixed$QUAL > 22 & !(vcf@info$MM %in% c(NA, '1'))), ]

The first one simply get rid of all variants (empty CollapsedVCF as result). The second returns about 5% of the variants in the original file (which seems reasonable to me).

Can someone explain the difference between these two results? A suggestion on how to filter VCFs generated with subjunc/exactSNP is much welcome too.

modified 3.9 years ago by Wei Shi2.9k • written 3.9 years ago by daniel.silvestre50
2
3.9 years ago by
Wei Shi2.9k
Australia
Wei Shi2.9k wrote:

The indels and SNPs should be dealt with differently when you performed the filtering. ExactSNP uses 'DP' tag to report the read depth (or coverage) for each called SNP. For indels, ExactSNP does not report read depth. It reports the number of reads supporting each indel using the 'SR' tag.

The 'QUAL' tag is only meaningful for SNPs. For indels, ExactSNP just sets a value of 1.0 for the 'QUAL' tag for each indel.  ExactSNP does not detect indels itself. It simply extracts all the indels found in the SAM/BAM input file and then saves them to its VCF output file.

The 'QUAL' value for each called SNP is calculated as -log10(p) where p is the p value yielded from the statistical test carried out by ExactSNP. Therefore, a 'QUAL' value of 22 corresponds to a p value of 10^-22, which is extremely low. Note that when calling SNPs, ExactSNP takes into account the sequencing depth at each SNP location, ie ExactSNP uses a lower 'QUAL' cutoff for SNP calling when sequencing depth is low and a higher 'QUAL' cutoff when sequencing depth is high. Therefore I would suggest the following for your SNP filtering:

SNPs_filtered <- vcf[which(vcf@fixed$QUAL > 5 & vcf@info$DP > 5), ]

Note that this will filter out those low-coverage SNPs which are often hard to call.

For indel filtering, you may try this:

Indels_filtered <- vcf[which(vcf@info$SR > 2), ] Let me know if this does not work for you. ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Wei Shi2.9k My parameters were based on guidelines used by GATK pipelines. They suggest at least 2 supporting reads to call a SNPs as a good candidate. Your parameters work fine. But, as we are using these results for clinical investigation, a more stringent criteria seems appropriate. About indels, I'm treating the file generated by subjunc with this expression: indels <- indels[which(!(indels@info$SR %in% c(NA, '1')) & (indels@fixed$QUAL / indels@info$DP > 2.0)), ]

Why is SR field coded as char instead of int?

Thanks for pointing this out. We will change the 'SR' field to Integer type.

For your indel filtering, you should not use the 'QUAL' and 'DP' fields. See my explanation above.