I am puzzled by the documentation of dba.peakset(), where the detailed description of the peak.format argument in the help says:
“narrow”: narrowPeaks file; scoreCol=8
https://genome.ucsc.edu/FAQ/FAQformat.html#format12
shows that column 8 of a narrowPeak file is the p-value.
Would it be better to use the signalValue column (column 7 instead of column 8)? I wonder what the pros and cons are of using the (unadjusted) p-value versus the signalValue.
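To make the column numbering concrete, here is a minimal Python sketch that parses a single narrowPeak line. The field values are invented for illustration, but the column positions follow the UCSC format specification linked above:

```python
# A hypothetical narrowPeak line (tab-separated; values invented for illustration)
line = "chr1\t9356548\t9356648\tpeak_1\t182\t.\t6.40208\t4.31045\t1.98590\t50"
fields = line.split("\t")

signal_value = float(fields[6])  # column 7: signalValue (overall enrichment)
neg_log10_p  = float(fields[7])  # column 8: pValue, reported as -log10(p)
neg_log10_q  = float(fields[8])  # column 9: qValue, reported as -log10(q)

print(signal_value, neg_log10_p, neg_log10_q)
```

Note that the p-value and q-value columns hold -log10-transformed values, so larger numbers mean more significant peaks.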
The detailed description of scoreCol argument in the help shows:
peak column to normalize to 0...1 scale when adding a peakset; 0 indicates no normalization
Does that mean the scoreCol argument does two things: first, it overrides the default column choice described above, so that I can, for instance, force DiffBind to use the signalValue column; and second, it acts as a TRUE/FALSE switch for normalization?
Thank you
Teck Por Lim
National University of Singapore
From the statistical point of view, it is often frowned upon to compare p-values, but if signalValue/fold change is unstable, as mentioned by Gord Brown in the answer below, then is there any way out of this dilemma?
The p-values don't really mean very much anyway. If you're using them for much of anything beyond ranking peaks, you're already some way into the realm of fiction... different peak callers will give different p-values; different parameters will give different p-values, etc... About all you can say with confidence is that, within a peak set, smaller p-values (larger -log10(p)) are reasonably correlated with stronger peaks. Rory, would you agree?
The default scoring for each peakset type is as you describe. Using the p-value means that the highest-confidence peaks (taking into account, e.g., background) will have the highest scores. It is worth noting that for the narrowPeak and MACS xls formats, the p-values are reported as -log10(p), so higher scores indicate higher-confidence peaks. As you suggest, other values can be used as scores, and as you figured out, you can override the default column using the scoreCol parameter in dba.peakset() (or a column called ScoreCol when using a samplesheet with dba()).
Looking at this now, I see that there is actually an error in the documentation of the scoreCol parameter for dba.peakset(). Setting it to zero does not prevent the scores from being normalized; rather, it prevents any scores from being extracted from the peak file (in which case all scores will be 1 for samples where the peak is called). I have fixed this in the man page moving forward.
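For readers skimming the thread, the corrected semantics can be sketched in a few lines of Python. This is an illustration of the documented behavior, not DiffBind's actual implementation, and the max-scaling used for the 0...1 normalization here is an assumption:

```python
def peakset_scores(rows, score_col):
    """Sketch of the documented scoreCol semantics (not DiffBind's code).

    rows      -- list of peak records, each a list of string fields
    score_col -- 1-based peak-file column to use as the score;
                 0 means no score is extracted at all
    """
    if score_col == 0:
        # No scores extracted: every called peak simply gets a score of 1.
        return [1.0 for _ in rows]
    raw = [float(r[score_col - 1]) for r in rows]
    top = max(raw)
    # Rescale to the 0...1 range (assumed here to be division by the maximum).
    return [s / top for s in raw]

peaks = [
    ["chr1", "100", "200", "p1", "0", ".", "2.0", "3.0"],
    ["chr1", "300", "400", "p2", "0", ".", "4.0", "6.0"],
]
print(peakset_scores(peaks, 7))  # [0.5, 1.0]: column 7, scaled to 0..1
print(peakset_scores(peaks, 0))  # [1.0, 1.0]: scoreCol=0, no extraction
```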
A minor comment on signalValue versus -log10(p-value): at least in MACS, signalValue is a fold change, which can be quite unstable, particularly for small peaks (one background read more or fewer can change the apparent fold change dramatically if there are only a few reads). -log10(p-value) is probably more consistent. My $0.02, anyway... :)
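The instability point can be seen with simple arithmetic; a toy Python example with invented read counts:

```python
def fold_change(peak_reads, background_reads):
    """Apparent fold change of a peak over its local background."""
    return peak_reads / background_reads

# Small peak: a single extra background read halves the apparent fold change.
print(fold_change(6, 1))    # 6.0
print(fold_change(6, 2))    # 3.0

# Large peak: the same one-read difference barely moves it.
print(fold_change(600, 100))  # 6.0
print(fold_change(600, 101))  # ~5.94
```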
Thank you, Rory Stark and Gord Brown, your input has been food for thought.