Score column for DiffBind::dba.peakset
teckpor • 0
@teckpor-7598
Last seen 9.6 years ago
Singapore
I am puzzled by the documentation of dba.peakset(), where the detailed description of the peak.format argument in the help shows:
“narrow”: narrowPeaks file; scoreCol=8

https://genome.ucsc.edu/FAQ/FAQformat.html#format12
shows that column 8 of a narrowPeak file is the p-value.

Would it be better to use the signalValue column (column 7) instead of column 8? What are the pros and cons of using the (unadjusted) p-value versus signalValue?

The detailed description of the scoreCol argument in the help shows:
peak column to normalize to 0...1 scale when adding a peakset; 0 indicates no normalization

Does that mean scoreCol is doing two things: first, overriding the column choice given above, so that I can, for instance, force DiffBind to use the signalValue column; and second, indicating whether normalization is TRUE or FALSE?

Thank you
Teck Por Lim
National University of Singapore
DiffBind chipseq • 3.3k views

From the statistical point of view, it is often frowned upon to compare p-values, but if signalValue/fold change is unstable, as mentioned by Gord Brown in the answer below, then is there any way out of this dilemma?


The p-values don't really mean very much anyway.  If you're using them for much of anything beyond ranking peaks, you're already some way into the realm of fiction... different peak callers will give different p-values; different parameters will give different p-values, etc... About all you can say with confidence is that, within a peak set, smaller p-values (larger -log10(p)) are reasonably correlated with stronger peaks.  Rory, would you agree?


Thank you, Rory Stark and Gord Brown, your input has been food for thought.

Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 5 weeks ago
Cambridge, UK

Hi Teckpor-

The default scoring for each peakset type is as you describe. Using the p-value means that the highest confidence peaks (taking into account, e.g., background) will have the highest scores. It is worth noting that for the narrowPeak and MACS xls formats, the p-values are reported as -log10(p), so higher scores indicate higher confidence peaks. As you suggest, other values can be used as scores, and as you figured out, you can override the default column using the scoreCol parameter in dba.peakset() (or a column called ScoreCol when using a samplesheet with dba()).

Looking at this now, I see that there is actually an error in the documentation for the scoreCol parameter for dba.peakset(). Setting this to zero will not prevent the scores from being normalized; rather, it will prevent any scores from being extracted from the peak file (in which case all scores will be 1 for samples where the peak is called). I have fixed this in the man page moving forward.

Cheers-

Rory
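
The column choice discussed in this answer can be illustrated with a minimal Python sketch. It is not DiffBind code: the function `extract_score` and the example peak line are hypothetical, and the sketch only mimics the behaviour described in this thread. The column index is 1-based, 8 selects the -log10 p-value of a narrowPeak line, 7 selects signalValue, and 0 extracts nothing, so every called peak scores 1. Normalization to the 0...1 scale is not modelled.

```python
# narrowPeak columns in file order, as listed in the UCSC format FAQ.
NARROWPEAK_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score",
    "strand", "signalValue", "pValue", "qValue", "peak",
]


def extract_score(line, score_col=8):
    """Return the score for one tab-separated narrowPeak line.

    score_col is 1-based. A value of 0 means "do not read a score",
    in which case every called peak gets a score of 1. This is a toy
    model of the behaviour described in the thread, not DiffBind's
    actual implementation.
    """
    if score_col == 0:
        return 1.0
    fields = line.rstrip("\n").split("\t")
    return float(fields[score_col - 1])


# Hypothetical peak: signalValue 5.2, -log10(p) 12.5, -log10(q) 10.1.
example = "chr1\t1000\t1500\tpeak_1\t125\t.\t5.2\t12.5\t10.1\t250"

p_score = extract_score(example)          # default: column 8, -log10(p)
signal_score = extract_score(example, 7)  # override: column 7, signalValue
no_score = extract_score(example, 0)      # nothing read from the file

print(NARROWPEAK_COLUMNS[8 - 1], p_score)
print(NARROWPEAK_COLUMNS[7 - 1], signal_score)
print("no column", no_score)
```

In DiffBind itself the equivalent choice is made with the scoreCol argument of dba.peakset(), or the ScoreCol samplesheet column, as described above.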

Gord Brown ▴ 670
@gord-brown-5664
Last seen 3.9 years ago
United Kingdom

A minor comment on "signalValue" versus -log10(p-value): at least in MACS, signalValue is fold change, which can be quite unstable, particularly for small peaks (1 more or fewer background reads can change the apparent fold change dramatically if there are only a few reads).  -log10(p-value) is probably more consistent.  My $0.02, anyway... :)
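
The instability of fold change for small peaks can be seen with a little arithmetic, sketched here in Python. The read counts are made up, and the plain treatment/background ratio is a simplifying assumption: MACS derives its fold enrichment from a local background model rather than from this bare ratio. The function names are hypothetical.

```python
def fold_change(treatment_reads, background_reads):
    """Plain ratio of treatment reads to background reads (no pseudocount)."""
    return treatment_reads / background_reads


def fold_change_range(treatment_reads, background_reads):
    """Fold change with one more and with one fewer background read."""
    low = fold_change(treatment_reads, background_reads + 1)
    high = fold_change(treatment_reads, background_reads - 1)
    return low, high


# Both peaks have a nominal fold change of 5.
small = fold_change_range(10, 2)       # small peak: 10 reads vs 2 background
large = fold_change_range(1000, 200)   # large peak: 1000 reads vs 200 background

print("small peak: %.2f to %.2f" % small)
print("large peak: %.2f to %.2f" % large)
```

With these counts, a single background read moves the small peak's ratio between roughly 3.3 and 10, while the large peak stays within about 0.03 of 5.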


Yep, that's why we chose p-value over signalValue as our default score.
