I am puzzled by the documentation of dba.peakset(), where the detailed description of the peak.format argument in the help says:
“narrow”: narrowPeaks file; scoreCol=8
https://genome.ucsc.edu/FAQ/FAQformat.html#format12
shows that column 8 of a narrowPeak file is the p-value.
Would it be better to use the signalValue column (column 7 instead of column 8)? I wonder what the pros and cons are of using the (unadjusted) p-value versus the signalValue.
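To make the column numbering concrete, here is a minimal Python sketch that parses a single narrowPeak line. The field values are invented for illustration, but the column positions follow the UCSC format specification linked above:

```python
# A hypothetical narrowPeak line (tab-separated; values invented for illustration)
line = "chr1\t9356548\t9356648\tpeak_1\t182\t.\t6.40208\t4.31045\t1.98590\t50"
fields = line.split("\t")

signal_value = float(fields[6])  # column 7: signalValue (overall enrichment)
neg_log10_p  = float(fields[7])  # column 8: pValue, reported as -log10(p)
neg_log10_q  = float(fields[8])  # column 9: qValue, reported as -log10(q)

print(signal_value, neg_log10_p, neg_log10_q)
```

Note that the p-value and q-value columns hold -log10-transformed values, so larger numbers mean more significant peaks.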
The detailed description of scoreCol argument in the help shows:
peak column to normalize to 0...1 scale when adding a peakset; 0 indicates no normalization
Does that mean the scoreCol argument does two things: first, it overrides the default column choice described above, so that I can, for instance, force DiffBind to use the signalValue column; and second, it acts as a TRUE/FALSE switch for normalization?
Thank you
Teck Por Lim
National University of Singapore
From the statistical point of view, it is often frowned upon to compare p-values, but if signalValue/fold change is unstable, as mentioned by Gord Brown in the answer below, then is there any way out of this dilemma?
The p-values don't really mean very much anyway. If you're using them for much of anything beyond ranking peaks, you're already some way into the realm of fiction... different peak callers will give different p-values; different parameters will give different p-values, etc... About all you can say with confidence is that, within a peak set, smaller p-values (larger -log10(p)) are reasonably correlated with stronger peaks. Rory, would you agree?
The default scoring for each peakset type is as you describe. Using the p-value means that the highest-confidence peaks (taking into account, e.g., background) will have the highest scores. It is worth noting that for the narrowPeak and MACS xls formats, the p-values are reported as -log10(p), so higher scores indicate higher-confidence peaks. As you suggest, other values can be used as scores, and as you figured out, you can override the default column using the scoreCol parameter in dba.peakset() (or a column called ScoreCol when using a samplesheet with dba()).
Looking at this now, I see that there is actually an error in the documentation of the scoreCol parameter for dba.peakset(). Setting it to zero does not prevent the scores from being normalized; rather, it prevents any scores from being extracted from the peak file (in which case all scores will be 1 for samples where the peak is called). I have fixed this in the man page moving forward.
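For readers skimming the thread, the corrected semantics can be sketched in a few lines of Python. This is an illustration of the documented behavior, not DiffBind's actual implementation, and the max-scaling used for the 0...1 normalization here is an assumption:

```python
def peakset_scores(rows, score_col):
    """Sketch of the documented scoreCol semantics (not DiffBind's code).

    rows      -- list of peak records, each a list of string fields
    score_col -- 1-based peak-file column to use as the score;
                 0 means no score is extracted at all
    """
    if score_col == 0:
        # No scores extracted: every called peak simply gets a score of 1.
        return [1.0 for _ in rows]
    raw = [float(r[score_col - 1]) for r in rows]
    top = max(raw)
    # Rescale to the 0...1 range (assumed here to be division by the maximum).
    return [s / top for s in raw]

peaks = [
    ["chr1", "100", "200", "p1", "0", ".", "2.0", "3.0"],
    ["chr1", "300", "400", "p2", "0", ".", "4.0", "6.0"],
]
print(peakset_scores(peaks, 7))  # [0.5, 1.0]: column 7, scaled to 0..1
print(peakset_scores(peaks, 0))  # [1.0, 1.0]: scoreCol=0, no extraction
```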
A minor comment on signalValue versus -log10(p-value): at least in MACS, signalValue is a fold change, which can be quite unstable, particularly for small peaks (one background read more or fewer can change the apparent fold change dramatically if there are only a few reads). -log10(p-value) is probably more consistent. My $0.02, anyway... :)
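The instability point can be seen with simple arithmetic; a toy Python example with invented read counts:

```python
def fold_change(peak_reads, background_reads):
    """Apparent fold change of a peak over its local background."""
    return peak_reads / background_reads

# Small peak: a single extra background read halves the apparent fold change.
print(fold_change(6, 1))    # 6.0
print(fold_change(6, 2))    # 3.0

# Large peak: the same one-read difference barely moves it.
print(fold_change(600, 100))  # 6.0
print(fold_change(600, 101))  # ~5.94
```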
Thank you, Rory Stark and Gord Brown, your input has been food for thought.