Score column for DiffBind::dba.peakset
2
0
Entering edit mode
teckpor • 0
@teckpor-7598
Last seen 6.5 years ago
Singapore
I am puzzled by the documentation of dba.peakset(), where the detailed description of peak.format argument in the help shows:
“narrow”: narrowPeaks file; scoreCol=8

https://genome.ucsc.edu/FAQ/FAQformat.html#format12
shows that column 8 of NarrowPeaks file is the p-value.

Would it be better to use the signalValue column (column 7 instead of 8)? I wonder what are the pros and cons of using (unadjusted) p-value vs signalValue?


The detailed description of scoreCol argument in the help shows:
peak column to normalize to 0...1 scale when adding a peakset; 0 indicates no normalization

Does that mean peak column is doing two things: the first is to override the column choice given above, so that I can for instance force DiffBind to use signalValue column; the second is to indicate normalization is TRUE/FALSE.

Thank you
Teck Por Lim
National University of Singapore

DiffBind chipseq • 1.7k views
0
Entering edit mode

From the statistical point of view, it is often frowned upon to compare p-values, but if signalValue/fold change is unstable, as mentioned by Gord Brown in the answer below, then is there any way out of this dilemma?

1
Entering edit mode

The p-values don't really mean very much anyway.  If you're using them for much of anything beyond ranking peaks, you're already some way into the realm of fiction... different peak callers will give different p-values; different parameters will give different p-values, etc... About all you can say with confidence is that, within a peak set, smaller p-values (larger -log10(p)) are reasonably correlated with stronger peaks.  Rory, would you agree?

0
Entering edit mode

Thank you, Rory Stark and Gord Brown, your input has been food for thought.

2
Entering edit mode
Rory Stark ★ 4.1k
@rory-stark-5741
Last seen 11 days ago
CRUK, Cambridge, UK

Hi Teckpor-

The default scoring for each peakset type is as you describe. Using the p-value means that the highest confidence peaks (taking into account eg background) will have the highest scores. It is worth noting that for the narrowPeak and MACS xls formats, the p-values are reported as -log10(p), so higher scores indicate higher confidence peaks.  As you suggest, other values can be used as scores, and as you figured out, you can override the default column using the scoreCol parameter in dba.peakset (or a column called ScoreCol when using a samplesheet with dba()).

Looking at this now, I see that there is actually an error in the documentation for the scoreCol parameter for dba.peakset(). Setting this to zero will not prevent the scores from being normalized, rather it will prevent any scores from being extracted from the peak file (in which case all scores will be 1 for samples where the peak is called). I have fixed this in the man page moving forward.

Cheers-

Rory

2
Entering edit mode
Gord Brown ▴ 640
@gord-brown-5664
Last seen 9 months ago
United Kingdom

A minor comment on "signalValue" versus -log10(p-value): at least in MACS, signalValue is fold change, which can be quite unstable, particularly for small peaks (1 more or fewer background reads can change the apparent fold change dramatically if there are only a few reads).  -log10(p-value) is probably more consistent.  My \$0.02, anyway... :)

0
Entering edit mode

Yep, that's why we chose p-value over signalValue as our default score.