Why are the solutions chosen by PureCN and displayed via plotAbs(, type="overview") often not located on the red hotspots of the heatmap, and why do some hotspots have no solution at all?


In the initial purity/ploidy 2D grid search, PureCN does not yet assign integer copy numbers to segments (that is the second step). That means some "missing" solutions might look good on the grid but fit integer copy numbers poorly. There are also a couple of parameters that restrict the search space, for example the size of homozygous deletions, the percentage of segments assigned to a sub-clonal state, etc. The closest valid solution might fit poorly, resulting in a local optimum outside the red.

Also, if you used the recommended post-optimization, the final purity/ploidy depends on the fit of the allelic fractions as well, not only on the tumor vs normal coverage profile.

Top-ranking solutions should generally be close to red spots. If they are consistently misaligned, I'm happy to have a look at examples.

Here's an example where I was surprised not to see a solution near (0.2, 2) or (0.25, 3).

I also discovered that when I do manual curation, PureCN picks the precomputed solution that is closest to the purity and ploidy I put in, rather than doing a complete recalculation with those values, which surprised me. Is it probably best to trust the program and what it is doing? For manual curation, it seems impossible to know whether an alternate solution is likely to be a better one.

I'm doing a multiregional sequencing project, and I've been looking for a way to use the fact that I have multiple samples from the same tumor, which presumably share quite a bit of CNV, to choose the best solution for each sample. I was thinking of computing a correlation of the CN data across the genome between each pair of solutions (S1, S2) for tumor samples (T1, T2). Perhaps the highest-correlating pair of solutions would be the best ones. Do you think that idea is worth pursuing?
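A minimal sketch of that pairwise comparison, to make the idea concrete. It assumes the candidate solutions of each sample have already been exported as integer copy numbers on a shared set of genomic bins; the function names, the dictionary layout and the toy numbers are all hypothetical and not part of PureCN.

```python
import math
from itertools import product


def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def rank_solution_pairs(solutions_t1, solutions_t2):
    """Score every (solution of T1, solution of T2) pair, best correlation first.

    Each argument maps a solution label to a list of copy numbers,
    one value per shared genomic bin (hypothetical input layout).
    """
    scored = []
    for (name1, cn1), (name2, cn2) in product(solutions_t1.items(),
                                              solutions_t2.items()):
        r = pearson(cn1, cn2)
        if not math.isnan(r):
            scored.append((r, name1, name2))
    return sorted(scored, reverse=True)


# Toy data only: two candidate solutions per sample, six shared bins.
t1 = {"T1_S1": [2, 2, 3, 1, 2, 4], "T1_S2": [3, 3, 5, 2, 3, 6]}
t2 = {"T2_S1": [2, 2, 3, 1, 2, 4], "T2_S2": [2, 3, 2, 2, 3, 2]}

ranked = rank_solution_pairs(t1, t2)
for r, name1, name2 in ranked:
    print(f"{name1} vs {name2}: r = {r:.3f}")
```

One caveat with this approach: Pearson correlation ignores scale and offset, so a solution and a version of it with every copy number doubled would score identically against any other profile. A distance on the absolute copy numbers might separate such cases better.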

A problem I have with my data is that the samples are all low purity, about 2/3 of them below 0.25.

Older versions of PureCN had issues where a correct solution was sometimes not considered at all (mostly in low-quality, very low purity samples), but I haven't seen any such cases in over a year. Some of the plots in plotAbs accept a purity and ploidy, most importantly the histogram. So you could test how the solutions (0.2, 2) and (0.25, 3) align to the peaks. Then have a look at the balanced SNPs. In your solution, do these fall into regions of copy number 2, 4 or 6 (6 being rare)?
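To get a feel for where the histogram peaks should sit for a given purity and ploidy, here is a minimal sketch of the usual mixture formula, log2((p*C + 2*(1-p)) / (p*ploidy + 2*(1-p))) for integer copy number C. It assumes a diploid normal and clonal segments, and it is an illustration only, not PureCN's internal code; the function names are made up.

```python
import math


def expected_log_ratio(purity, ploidy, copy_number):
    """Expected tumor vs normal log2 ratio for a clonal segment.

    Assumes the normal contamination is diploid (2 copies).
    """
    segment_signal = purity * copy_number + 2 * (1 - purity)
    average_signal = purity * ploidy + 2 * (1 - purity)
    return math.log2(segment_signal / average_signal)


def expected_peaks(purity, ploidy, max_copy_number=6):
    """Expected log2 ratio for each integer copy number 0..max_copy_number."""
    return {c: expected_log_ratio(purity, ploidy, c)
            for c in range(max_copy_number + 1)}


# The two candidate solutions discussed in this thread.
for purity, ploidy in [(0.2, 2), (0.25, 3)]:
    peaks = expected_peaks(purity, ploidy)
    formatted = ", ".join(f"C={c}: {lr:+.3f}" for c, lr in peaks.items())
    print(f"purity={purity}, ploidy={ploidy} -> {formatted}")
```

At these purities the expected peaks are close together, which is part of why low purity samples are hard to call.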

Below 20% purity, it is tricky to get the ploidy right. This requires pretty high-quality data.

And yes, if you have multiple samples, some hopefully of higher purity, this should help with the curation. In high purity samples, manual curation is usually easy. Real high-ploidy solutions usually have a fairly uniform log-ratio distribution, whereas wrong ones have lots of gaps, for example few regions with copy numbers 2 and 4, but many at 3 and 5. If you look at enough samples, it becomes clear what to look for.
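One quick way to look for such gaps outside the plots is to tally how much of the genome each integer copy number covers in a candidate solution. This is a minimal sketch with a hypothetical input layout (segment length, copy number) and made-up toy values; it applies no threshold and is only meant to make the pattern visible.

```python
from collections import defaultdict


def genome_fraction_by_copy_number(segments):
    """Fraction of the segmented genome at each integer copy number.

    `segments` is a list of (length, copy_number) tuples (hypothetical layout).
    """
    totals = defaultdict(float)
    for length, copy_number in segments:
        totals[copy_number] += length
    genome_length = sum(totals.values())
    return {cn: totals[cn] / genome_length for cn in sorted(totals)}


# Toy solution, lengths in Mb: lots of 3 and 5, very little 2 and 4.
toy_segments = [(50, 3), (40, 5), (5, 2), (5, 4)]
fractions = genome_fraction_by_copy_number(toy_segments)
for cn, fraction in fractions.items():
    print(f"copy number {cn}: {fraction:.0%} of genome")
```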


