Question: How can I improve somatic/germline status prediction without matched normal (PureCN)?
gravatar for skkujin
10 weeks ago by
skkujin0 wrote:

Dear Markus,

Hi, I'm using PureCN for classifying somatic SNPs from germline ones without matched normal.

I tested it for validation with 42 gastric cancer bam files without matched normal.

They are target sequenced and its covered bed reaches 6.1Mb in all.

I set muTect paired-called "SOMATIC" tagged-SNPs as golden standard and compared it with PureCN results with pooled normal.

I prepared NormalDB with 42 normal bams and vcfs according to provided manual. (each 42 normal is individually matched with each tumor sample)

And I ran PureCN and took 42 results. Their variants.csv files have the somatic prediction for each SNP, but I want to improve sensitivity and fidelity of prediction.

I counted somatic-predicted call as ML.SOMATIC as "TRUE", FLAGGED as "FALSE", prior.somatic > 0.1

I attached the aggregate of curation files that some columns are added to default ones. 

"Normal-matched Somatic Call" is counts of matched-normal muTect called SNPs.

"Predicted Somatic Call" is counts of calls predicted by PureCN.

"True Somatic Call" counts of intersect between above two counts.

"True/Predicted" is the value which is "True Somatic Call" divided by "Predicted Somatic Call"

How can I go forward?

KJ Kim

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by skkujin0

Hi KJ,

Can you add the output of the log-file (in case you used the package directly, not the PureCN.R script, you need to specify a log.file in runAbsoluteCN)? Do you use a mappability file in IntervalFile.R?

What cutoffs are you using for predicted somatic? Are you including known germline sites (prior.somatic < 0.1)?


ADD REPLYlink written 10 weeks ago by markus.riester90

This is one of log files. Other logs have same options.

Yes, I used wgEncodeCrgMapabilityAlign100mer.bigWig and wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig for making interval file like you mentioned in quick start guide.

Below is cutoff I used for somatic-predicted call.

ML.SOMATIC as "TRUE", FLAGGED as "FALSE", prior.somatic > 0.1

I didn't use any comsic DB information. Can it be helpful for more accurate prediction?

ADD REPLYlink written 9 weeks ago by skkujin0

This looks good, with the exception that your VCF contains a huge number of artifacts. Only 13% are annotated as being in dbSNP and after variant filtering still less than 50% are in dbSNP. I get >85% even in MSI-H samples with Mutect 1.1.7.

Is this Mutect 1.1.7 or Mutect2 in GATK4? If M1, is this sample an outlier? Any idea why you get that many artifacts?

I used ML.SOMATIC = TRUE for high-level things like mutation burden, with the idea that uncertain calls will roughly cancel each other out (you mis-classify equally in both directions). If you want more accurate results, you can introduce a grey zone. For example POSTERIOR.SOMATIC > 0.8 somatic, < 0.2 germline.

If you just use dbSNP without any filtering, then known somatic mutations get a low somatic prior when they are in dbSNP. Adding Cosmic avoids this. But since hotspots are rare, that shouldn't change it dramatically.

ADD REPLYlink written 9 weeks ago by markus.riester90

Yes, You're right. It was outlier sample.

It has extremely many SNPs compared to others.

Others have 40~70 % germline likely variants.

Its low prediction may be due to artifacts.

I don't have any sense about artifacts now, I need to look into it.

(plus I'm using Mutect 1.1.7)


In fact, I'm trying to calculate mutation burden.

But my results don't look like follow your assumption. 

Somatic but not somatic-predicted is even more than germline but somatic-predicted.

So I think prediction have to be more aggressive despite of more false positive.

Any Idea you suggest?

Thank you.


ADD REPLYlink written 9 weeks ago by skkujin0

Ok. What exactly are the paired tumor/normal somatic counts? Is this straight out of Mutect without any filtering? In other words, how many of the true somatic counts are present in the predictSomatic output and how many of those are predicted to be germline?

PureCN filters by coverage, allelic fraction, supporting reads and base quality. If you provide a SNP blacklist like hg19_simpleRepeats.bed as shown in the tutorial, somatic calls in those would also be missing. 

Did you run GATK CallableLoci on those tumor BAM files? I would recommend --minDepth 50 in your case. Then only count the mutations that fall within those callable regions. There is also a FilterCallableLoci script that will reduce the callable regions further to callable coding regions. 

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by markus.riester90

Yes, right. I set "paired tumor/normal somatic counts" as raw output of Mutect without any filtering.

And, I agree that filtering pair-called Mutect results by coverage, allelic fraction, supporting reads and base quality is needed like PureCN does. This filtered data should be the golden set, not raw result. That would increase prediction sensitivity. Thank you.

GATK tool also would be good for me.

I'll try your advice and let you know the result.

Thanks, Markus

KJ Kim

ADD REPLYlink written 9 weeks ago by skkujin0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 123 users visited in the last hour