I found a comment stating that the software does not yet officially support VCF files containing indels, and that support for VCFs generated by MuTect 2 that include both single nucleotide variants (SNVs) and indels is planned for Bioconductor 3.5.

Bioconductor is now at version 3.5.

So can I classify indel variants as germline vs. somatic with PureCN?

Thanks for your interest in PureCN. GATK4 beta will be released in the next few weeks, and you can expect a fairly well-tested PureCN version soon after. GATK4 alpha was only available under an academic license, so I couldn't add this frequently requested feature in time for 3.5. Shoot me an email if you want to be notified when the new PureCN version is available.



I also found a comment stating that samples with tumor purities below 20% usually cannot be analyzed with this algorithm.

Can this be improved in a new PureCN version?


Good question. Germline vs. somatic classification below 30-35% purity is easy, and you can expect 99+% accuracy below 20%; you wouldn't even need PureCN for that.
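To illustrate why low purity makes this classification easy, here is a minimal sketch in Python. It assumes a simplified textbook model of expected allelic fractions (diploid normal cells, no sequencing noise or mapping bias) and is not PureCN's actual implementation; the function name and parameters are hypothetical.

```python
def expected_allelic_fraction(purity, tumor_copies, multiplicity, somatic):
    """Expected variant allelic fraction under a simplified mixture model.

    purity       -- fraction of tumor cells in the sample (0..1)
    tumor_copies -- total copy number of the locus in tumor cells
    multiplicity -- number of tumor copies carrying the variant
    somatic      -- True: variant absent from normal cells;
                    False: heterozygous germline (1 of 2 normal copies)
    """
    normal_alt = 0 if somatic else 1
    # Total number of alleles contributed by tumor and diploid normal cells.
    total = purity * tumor_copies + 2 * (1 - purity)
    return (purity * multiplicity + (1 - purity) * normal_alt) / total


if __name__ == "__main__":
    for purity in (0.1, 0.2, 0.35, 0.6):
        som = expected_allelic_fraction(purity, 2, 1, somatic=True)
        germ = expected_allelic_fraction(purity, 2, 1, somatic=False)
        print(f"purity={purity:.2f}  somatic={som:.3f}  germline={germ:.3f}")
```

Under these assumptions, a heterozygous germline SNP stays close to 0.5 at low purity, while a single-copy somatic mutation in a diploid region sits near purity/2, so the two groups are well separated. The gap narrows as purity increases and copy number changes pull germline fractions away from 0.5.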

The 20% number refers to purity/ploidy inference. The actual lower limit depends on the coverage and quality of the data. In high coverage, high quality data from highly optimized assays with a sufficiently large pool of normal samples (for which the tool was designed, see the vignette for details), this can be as low as 15%. Poor quality FFPE data can be so noisy that even 35% purity is challenging. Dramatic amplifications are usually detectable in high coverage samples at around 15% purity. In clean data, high-level amplifications spanning 3-4 exons are usually detectable; in noisy data, even amplifications spanning 6-10 exons can be challenging.
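As a rough illustration of why only dramatic amplifications stand out at low purity, the following Python sketch computes the expected tumor vs. normal log2 coverage ratio. It assumes a simplified mixture model with diploid normal cells and noise-free coverage; the function name and the default tumor ploidy of 2 are assumptions for illustration, not PureCN's API.

```python
import math


def expected_log2_ratio(purity, tumor_copies, tumor_ploidy=2.0):
    """Expected log2 coverage ratio of a segment under a simple mixture model.

    purity       -- fraction of tumor cells in the sample (0..1)
    tumor_copies -- copy number of the segment in tumor cells
    tumor_ploidy -- average tumor copy number (assumed 2 by default)
    """
    # Average copies per cell at this segment in the mixed sample.
    segment = purity * tumor_copies + 2 * (1 - purity)
    # Average copies per cell genome-wide, which coverage is normalized to.
    baseline = purity * tumor_ploidy + 2 * (1 - purity)
    return math.log2(segment / baseline)


if __name__ == "__main__":
    for copies in (1, 3, 10, 20):
        ratio = expected_log2_ratio(0.15, copies)
        print(f"purity=0.15  copies={copies:2d}  log2 ratio={ratio:+.3f}")
```

Under these assumptions, a single-copy gain at 15% purity shifts the log2 ratio by only about 0.1, which is easily lost in noisy data, whereas a 10-copy amplification shifts it by roughly 0.68.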

If the purity is very low, say below 2-3%, the algorithm might start fitting mainly noise because there simply is not a lot of signal. The returned purity can be pretty random in those cases. These cases are usually obvious during manual curation.

PureCN is designed for hybrid capture data of mostly exons and can therefore only use coverage (i.e., it cannot use split reads etc.). So there is not a lot we can do algorithmically. Most of the recent efforts are related to cleaning up the data optimally and using all data as efficiently as possible, mostly by using a pool of normal samples.


