I have a few questions on the use of PureCN and I reckon it is OK to include them in a single post. Thanks in advance for your help and insights.
1. Mutect 1 is officially recommended by PureCN; support for Mutect 2 still seems to be in beta in the latest PureCN release to date (2.2.0). Is a comparison between Mutect 1 and Mutect 2 with PureCN available somewhere? If using Mutect 2, should I provide PureCN with the Mutect 2 output from before or after FilterMutectCalls? (Or does it matter? I may have missed it, but I could not find the related instructions.)
2. If I understand correctly, it is recommended to use a process-matched pool of normals for copy number (coverage) normalization even if sample-specific matched normals are available. However, including matched normals (when available) in the Mutect calls is also recommended, and this helps with fitting purity and ploidy to the SNVs. (Of course, when matched normals are available one will probably care less about the somatic-vs-germline classification.) I would like to confirm that my understanding is correct.
3. Being relatively new to this field, I am at a loss as to how I should go about manual curation. I understand that prior biological knowledge of the samples helps to decide whether the PureCN-picked solution is "real", but such knowledge is not always available, or we are not confident in it. So where should I start during manual curation, and what should I be looking at? If one decides to reject the default solution, what should the choice among the alternatives be based on? I understand that this is a very general question and that there is no fixed algorithm, but any empirical tips are much appreciated.
1. If you follow the GATK best practices (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132) closely, you should be all good. We haven't switched to GATK4 internally yet, but I occasionally test PureCN with the latest GATK4 and it works well. So yes, run all the standard commands, including the filtering step (FilterMutectCalls). Use matched normals when available, but provide the --genotype-germline-sites flag to get the SNP allelic fractions we need.
2. If you have matched normals, simply take all of them and build a pool of normals with NormalDB.R. PureCN is pretty good at extracting all kinds of information to reduce biases. If you use our Docker image, you can conveniently import the GenomicsDB from Mutect2 for the mapping bias part (which checks whether a SNP is biased toward the reference or the alt allele). Don't provide the matched normal with --normal when you have a NormalDB.
3. That's a tricky FAQ (see for example https://github.com/lima1/PureCN/issues/238). Feel free to post examples you are unsure about; it's usually pretty obvious when something went wrong, but spotting it takes some experience.
Feel free to post the log file of an example run and I can check if everything looks good.
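For the Mutect2 part of the first answer, the two steps could be scripted roughly as follows. This is only a minimal sketch, not an official PureCN or GATK workflow: the helper name `mutect2_commands` and all file and sample names are hypothetical, and the other inputs the best practices describe (germline resource, panel of normals, contamination estimates, intervals) are left out for brevity. It only assembles the command lines; nothing is executed.

```python
import shlex


def mutect2_commands(reference, tumor_bam, out_prefix,
                     normal_bam=None, normal_sample=None):
    """Build the Mutect2 and FilterMutectCalls command lines (hypothetical helper).

    The matched normal is optional; if given, both the BAM and its sample
    name (as found in the BAM read group) must be supplied.
    """
    if (normal_bam is None) != (normal_sample is None):
        raise ValueError("provide both normal_bam and normal_sample, or neither")

    unfiltered = f"{out_prefix}.unfiltered.vcf.gz"
    filtered = f"{out_prefix}.filtered.vcf.gz"

    call = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]
    if normal_bam is not None:
        call += ["-I", normal_bam, "-normal", normal_sample]
    # Keep germline SNPs in the output so their allelic fractions are available.
    call += ["--genotype-germline-sites", "true", "-O", unfiltered]

    # The filtered VCF is the one that would be passed on to PureCN.
    filt = ["gatk", "FilterMutectCalls", "-R", reference,
            "-V", unfiltered, "-O", filtered]
    return [call, filt]


if __name__ == "__main__":
    # Placeholder paths and sample name, for illustration only.
    for cmd in mutect2_commands("ref.fa", "tumor.bam", "sample1",
                                normal_bam="normal.bam",
                                normal_sample="NORMAL_ID"):
        print(shlex.join(cmd))
```

Leaving out `normal_bam` and `normal_sample` gives the tumor-only variant of the same two commands.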
Thanks a lot Markus, this is very helpful. For #3, let me see if I can pick a "typical" run among my samples that may require manual curation. For #2, independently of what has been said about NormalDB, may I check my understanding further: using the paired-normal mode of Mutect when possible should lead to better purity-ploidy estimation (compared to tumor-only Mutect calls), since the germline SNP priors are better assigned, right?
PureCN is made for tumor-only, so it's pretty accurate even without using the normal in Mutect2. Usually you get identical purity and ploidy except for a small number of difficult samples where PureCN is unsure. The benefit is more in the variant classification step (germline vs somatic vs sub-clonal). It might also remove a few more artifacts that are missed by the pool of normals.
OK that makes sense. Thanks again!
Thanks for your previous help. I now have a particular question that I believe is related to manual curation of PureCN results, so I am reusing my previous post here.
Let me first say that, while I understand one of the major advantages of PureCN is variant classification with tumor-only data, quite often we use PureCN mainly for purity-ploidy estimation and allele-specific copy number calling even when we do have matched normal samples. In those cases we can perform the actual tumor-vs-matched-normal somatic mutation calling with another tool, e.g. Mutect 2. My question concerns such a case:
For such a sample, I called somatic mutations against the matched normal with e.g. Mutect 2 (plus additional custom filtering to remove potential germline variants as much as possible), and then checked the mutant allele frequency (MAF) distribution of all passed somatic mutations. If I see a mostly unimodal MAF distribution with its peak (and median) at ~0.5, does that indicate high tumor purity and "automatically" rule out any low-purity PureCN solution, assuming diploidy? My intuition is that low purity would noticeably shift the peak of the MAF distribution below 0.5 (again assuming diploidy). If this understanding is wrong, could you please correct me and explain what I missed?
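To make my intuition concrete, here is a minimal sketch of the arithmetic I have in mind. It assumes clonal somatic mutations, a normal copy number of 2, and no sequencing or mapping bias; the function names are my own and are not part of PureCN. For a mutation present on `multiplicity` of `tumor_cn` copies, the expected allelic fraction is purity × multiplicity / (purity × tumor_cn + (1 − purity) × normal_cn).

```python
def expected_vaf(purity, multiplicity=1, tumor_cn=2, normal_cn=2):
    """Expected allelic fraction of a clonal somatic mutation.

    Mutant copies come only from tumor cells; total copies come from
    tumor cells (tumor_cn each) and normal cells (normal_cn each).
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if not 0 < multiplicity <= tumor_cn:
        raise ValueError("multiplicity must be between 1 and tumor_cn")
    mutant = purity * multiplicity
    total = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutant / total


def implied_purity(vaf, multiplicity=1, tumor_cn=2, normal_cn=2):
    """Invert expected_vaf: the purity that would produce the observed peak.

    A result above 1 means the observed fraction is not compatible with
    the assumed copy number and multiplicity.
    """
    denom = multiplicity - vaf * (tumor_cn - normal_cn)
    if denom <= 0:
        raise ValueError("vaf is incompatible with this copy number state")
    return vaf * normal_cn / denom


if __name__ == "__main__":
    # Heterozygous mutations in diploid regions: the peak sits at purity / 2.
    for purity in (0.3, 0.6, 1.0):
        print(f"purity={purity:.1f} -> expected VAF={expected_vaf(purity):.3f}")
```

Under these assumptions a heterozygous mutation in a diploid region peaks at purity / 2, so a peak at 0.5 would correspond to a purity close to 1. The same peak could instead come from mutations on both copies (`multiplicity=2`, e.g. copy-neutral LOH), where the expected fraction equals the purity itself.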
In practice, I have a few such samples whose MAF distribution is centered around 0.5 (as described above), while their default PureCN solution has a ploidy near 2 and low purity (often without any flags or warnings). On further checking, the top few PureCN solutions all have a ploidy near 2 but varying purity estimates; in some cases a high-purity estimate appears as the second-best solution, or among the top 3-5 solutions. Judging from the segmented copy number log2 ratio plots, most of these samples indeed have mostly neutral log2 ratios, with relatively small fractions of the genome showing CNV. I wonder whether I should go for an alternative solution with higher purity, and how I can decide.
Again, thanks for your help in advance.