Question: Unable to find solution: low coverage
Asked 13 months ago by twtoal:

When I ran PureCN using matched tumor/normal pairs, most of my samples were flagged as having a noisy log-ratio.  However, all samples ran through PureCN successfully and none were flagged as Failed.

I then ran PureCN with --normal removed, so it would create a representative normal sample itself.  You had indicated this tended to produce better results.  Now, I've had several samples fail with a fatal error saying that no solution could be found.  These samples all seem to be at the lower end of our coverage.

Below is a screen shot of a plot showing mean coverage of each sample.  The samples that initially failed were the JES-61 and IBG-2-CG-178 samples.  I removed those from the analysis and tried again, and then sample PC-60T3 failed, so I removed it.  Now, sample CH-59T1 has failed.  Just what causes this failure, when it worked with matched normals?  Is this solely a problem with low coverage, or could there be another problem here?

[Screenshot: plot of mean coverage per sample]

— written 13 months ago by twtoal

Did you use the same Coverage.R parameters (ideally the defaults) for tumor and normals, and was the normal database built from the correct data?

This error message usually only happens in catastrophic QC failures. Basically the 2D purity/ploidy search couldn't find a combination that fits the data.  Why is the coverage low? High AT/GC dropout, extreme duplication?

If you are sure everything is correct, can you share the tumor coverage and normal database RDS file?

— written 13 months ago by markus.riester

For CH59-T1, PureCN indicates a coverage of around 50X, which matches my own data, and PureCN estimates duplication around 20%, which doesn't seem too high.  The coverage is lower than most of our other samples, and we are planning to get more sequencing data, but I wouldn't call that coverage extremely low.  Would you?

The purity is quite low, around 20%.

Yes, it was the same Coverage.R parameters for tumor and normals.  As far as I know, the database used the correct data.  Also, the results for the samples that completed do look better than the results with --normal.  For example, a lot fewer samples are flagged with noisy log ratio.

I've put the normal database RDS and the CH59T1 coverage files and VCF file into a zipped folder on dropbox at:

https://www.dropbox.com/s/45u5ycujeimr8f6/CH59T1_PureCN_Coverage.zip?dl=0

— modified 13 months ago • written 13 months ago by twtoal

I'll have a look tonight or over the weekend. Possible that the coverage in off-target regions is too low and thus noisy. Up/down weighting off-target vs on-target based on sample-specific noise is something I want to look at for the next release. 

Glad to hear it generally improved.

— written 13 months ago by markus.riester

Did you change --offtargetwidth to a very small value? If so, combined with the low coverage, you would end up with just a small number of reads per off-target bin, so the off-target reads just add noise.

Have a bit more trust in PureCN's defaults; they are optimized over many thousands of samples.

Update: If you haven't changed --offtargetwidth, there is probably an issue with your mappability file in that it restricts the genome to very small patches.

The first off-target regions should be something like this:

1:10501-207117
1:258167-297419
1:586489-706352
1:706353-826217
1:826718-826890
1:827391-925401

NormalDB generates a target weight PNG file. You only need to change --offtargetwidth if you see a huge difference in noise between on-target and off-target regions.
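A quick way to sanity-check whether the off-target bins came out far smaller than requested is to look at the width distribution of the generated intervals. The sketch below is an illustration in plain Python (not PureCN code), assuming the intervals are available one per line in "chr:start-end" form, as shown elsewhere in this thread (e.g. chr1:803685-850143); the function names are hypothetical.

```python
# Sketch: check whether off-target bins are fragmented far below the
# requested --offtargetwidth. Assumes intervals come one per line in
# "chr:start-end" form (e.g. "chr1:803685-850143"); this is an
# illustration, not PureCN's own format handling.
import re

def interval_width(line):
    """Parse 'chr1:803685-850143' and return its width in bp."""
    m = re.match(r"^(\S+):(\d+)-(\d+)$", line.strip())
    if m is None:
        raise ValueError("unexpected interval format: %r" % line)
    start, end = int(m.group(2)), int(m.group(3))
    return end - start + 1

def summarize_widths(lines, requested=200000):
    """Count intervals and flag those far below the requested width."""
    widths = sorted(interval_width(l) for l in lines if l.strip())
    far_below = sum(1 for w in widths if w < requested / 10)
    return {"n": len(widths),
            "median": widths[len(widths) // 2],
            "far_below_requested": far_below}

# Toy example: many bins much smaller than the 200 kb requested would
# point at an over-restrictive mappability file.
example = ["chr1:803685-850143", "chr1:10501-207117"]
print(summarize_widths(example))
```

If most off-target bins are an order of magnitude smaller than --offtargetwidth, that is consistent with the over-restrictive mappability file discussed below in this thread.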


— modified 13 months ago • written 13 months ago by markus.riester

No, I used --offtargetwidth=200000, which is the default value documented for --offtargetwidth.

I did generate a new mappability file using a new version of GEM sent to me by Paolo Ribeca, which he said fixed problems with the version you used.  I ran it with the same parameters you used.  I see that the generated .bigwig file is 1.5 GB for your version vs. 624 MB for the new version; I don't know why there is such a difference.

I uploaded the .wig file to dropbox, at:

https://www.dropbox.com/s/6yor69cyhkdutqs/Homo_sapiens_assembly38_LCCpanel.gemV2_100.mappability.fixed.wig.gz?dl=0

I don't understand where you are getting the off-target regions you list above.  The very first region in our target region BED file (from our panel manufacturer), which I provide as an argument to IntervalFile.R, is:

chr1    1752747 1753157 NADK

With offtargetwidth=200000, I would expect that the first interval would start at chr1:1552747.

When I look at the output interval file from IntervalFile.R, I see that the first interval is:

chr1:803685-850143

which doesn't match my expectation or what you say above.

Also, I don't understand why offtargetwidth would be so large.  We are seeing practically 0 reads more than about 200 bp from our target regions.


— modified 12 months ago • written 12 months ago by twtoal

Thanks, I'll have a look at your mappability file in the next 1-2 days.

Have a look at the CNVkit paper ( http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004873 ), which explains it nicely. 

The idea is to tile the genome into off-target bins and then just count the random reads in those, as you would for your targets.

chr1:1-10000 is an inaccessible telomere (and should be 0 or missing in your mappability file!). We ignore a 500 bp margin around it, so the first bin should start at 10501.

The larger your off-target bin size, the more reads you get. More reads obviously means less sampling variance (i.e. noise). The target weight PNG file should give you an idea of how small you can go and still get noise similar to what you get for targeted exons.

The main reason to include off-target reads in small panels is that it helps to get the ploidy much more accurate. Ploidy is the average copy number, but if you only sequence < 0.1% of the genome, that's tricky. Even low-resolution (200 kb) off-target tiling is usually good enough to approximate the real ploidy. The main benefit of going lower than 200 kb is the higher resolution around focally amplified genes, especially small ones with fewer than 5 exons. I wouldn't recommend going below 100 kb off-target unless your coverage is high, your assay is fairly inefficient, and you have high off-target rates (>25% of uniquely mapped reads).
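The tiling described above can be sketched in a few lines. This is an illustrative plain-Python sketch of the logic (not PureCN's actual implementation): accessible off-target sequence is cut into bins of at most the requested width, and a 500 bp margin next to inaccessible regions is skipped.

```python
# Illustrative sketch (not PureCN's code) of the off-target tiling
# described above: take accessible sequence between targets, skip a
# 500 bp margin next to inaccessible regions, and cut the remainder
# into bins of at most --offtargetwidth.

MARGIN = 500  # bp ignored around inaccessible regions

def tile_offtarget(accessible_start, accessible_end, width=200000):
    """Tile [accessible_start, accessible_end] (1-based, inclusive)
    into consecutive bins of at most `width` bp."""
    bins = []
    pos = accessible_start
    while pos <= accessible_end:
        end = min(pos + width - 1, accessible_end)
        bins.append((pos, end))
        pos = end + 1
    return bins

# chr1:1-10000 is an inaccessible telomere; with a 500 bp margin the
# first accessible position is 10000 + 500 + 1 = 10501, matching the
# first region listed earlier (1:10501-207117 ends short of 200 kb
# because another inaccessible region follows).
print(10000 + MARGIN + 1)                # 10501
print(tile_offtarget(10501, 500000)[:2])
```

With a toy accessible stretch of chr1:10501-500000 this yields bins (10501, 210500) and (210501, 410500), i.e. full 200 kb tiles until the accessible sequence runs out.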

— modified 12 months ago • written 12 months ago by markus.riester

Your mappability file has a lot of regions with 0 mappability that are in fact properly sequenced. When you generated the mappability file, did you maybe use a FASTA file in which repeats were masked?

The problem is that you end up with only small regions between these masked regions, way smaller than 200 kb. Let me know if this fixes your problem and I'll add a warning to the FAQ.

I'm also working to get a supported hg38 mappability file hosted somewhere.

— written 12 months ago by markus.riester

No, I did not use such a FASTA.  Just checked it and repeats aren't masked.  Can you give me an exact position where you see such 0 mappability that shouldn't be 0?

I wonder if this has something to do with the new version of the GEM mappability program that I used.  When I examine the old and new mappability bigwig files, made with the original GEM and the newer version received from its developer, I do see that the new file has a lot of small regions of 0s that the old file does not have.  I've emailed him to ask about this.

Maybe I should switch back to the old file, it might actually be better.
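One way to compare the two files is to look for loci scored 0 in the new track but nonzero in the old one. The sketch below shows the comparison logic on toy in-memory interval lists; reading the real bigWig files would need a library such as pyBigWig or rtracklayer (an assumption, not shown here), and the function name is hypothetical.

```python
# Sketch of the old-vs-new comparison: find regions where the new
# mappability track is 0 but the old one is not. Toy in-memory data;
# parsing actual bigWig files is out of scope here.

def zero_only_in_new(old, new):
    """old/new: lists of (start, end, score) tuples, 1-based inclusive,
    on the same chromosome. Return spans scored 0 in `new` that overlap
    a nonzero score in `old` -- candidate spurious zeros."""
    suspicious = []
    for n_start, n_end, n_score in new:
        if n_score != 0:
            continue
        for o_start, o_end, o_score in old:
            if o_score != 0 and o_start <= n_end and n_start <= o_end:
                suspicious.append((n_start, n_end))
                break
    return suspicious

# Toy example modeled on this thread: chr1:10625-10704 scores ~0.03 in
# the old file but 0 in the new one.
old = [(10625, 10704, 0.0303)]
new = [(10625, 10704, 0.0)]
print(zero_only_in_new(old, new))   # [(10625, 10704)]
```

If such regions are common across the genome, that would explain the genome being restricted to small patches between them.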


— modified 12 months ago • written 12 months ago by twtoal

One of the first is chr1:10625-10704. Only inaccessible regions (Ns in the FASTA) like centromeres and telomeres should be 0.

In my file, generated by the old GEM, the region looks like this below. My guess is the new GEM requires a minimal mappability. The smallest score I see in your file is 0.166667.

> subsetByOverlaps(oldGem, GRanges("chr1:10625-10704"))
GRanges object with 11 ranges and 1 metadata column:
       seqnames         ranges strand |              score
          <Rle>      <IRanges>  <Rle> |          <numeric>
   [1]     chr1 [10625, 10627]      * |  0.030303031206131
   [2]     chr1 [10628, 10629]      * | 0.0384615398943424
   [3]     chr1 [10630, 10648]      * |  0.030303031206131
   [4]     chr1 [10649, 10653]      * | 0.0384615398943424
   [5]     chr1 [10654, 10656]      * |  0.030303031206131
   [6]     chr1 [10657, 10658]      * | 0.0384615398943424
   [7]     chr1 [10659, 10677]      * |  0.030303031206131
   [8]     chr1 [10678, 10682]      * | 0.0384615398943424
   [9]     chr1 [10683, 10685]      * |  0.030303031206131
  [10]     chr1 [10686, 10687]      * | 0.0384615398943424
  [11]     chr1 [10688, 10704]      * |  0.030303031206131
  -------
  seqinfo: 25 sequences from an unspecified genome


— written 12 months ago by markus.riester
Powered by Biostar version 16.09