May I know how DiffBind analyzes peak presence/absence data for a PCA? Traditional PCA needs continuous, normally distributed values. Does DiffBind use a non-parametric method, or does it use a score (e.g. the FDR value) for each peak to perform the PCA? Many thanks.
The values DiffBind uses for computing principal components depend on whether it has peak scores or read counts (before or after calling
In the cases of peak scores, the score for each peak is normalized to a 0..1 scale. When a merged set of peaks is computed to form the binding matrix, the (maximum) peak score is used for each sample. If the peak was not identified in a given sample (missing), it is assigned a value of -1. So we have values for all peaks and all samples.
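The peak-score case can be pictured with a small sketch. This is a toy illustration in plain Python, not DiffBind code: the sample names, peak ids, and raw scores are made up, and scaling each sample by its top score is only an assumption about how the 0..1 normalization might be done.

```python
# Toy illustration only; this is not DiffBind code.
# Hypothetical raw scores: sample -> list of (merged peak id, raw score).
# A sample can have several called peaks inside one merged peak ("p2" in s1).
samples = {
    "s1": [("p1", 50.0), ("p2", 10.0), ("p2", 20.0)],
    "s2": [("p1", 8.0), ("p3", 4.0)],
}

MISSING = -1.0  # value given to peaks that were not called in a sample


def binding_matrix(samples):
    """Return (merged peak ids, {sample: scores in merged-peak order})."""
    merged = sorted({pid for peaks in samples.values() for pid, _ in peaks})
    matrix = {}
    for name, peaks in samples.items():
        # Assumption: scale to 0..1 by dividing by the sample's top score.
        top = max(score for _, score in peaks)
        best = {}
        for pid, score in peaks:
            # Keep the maximum normalized score per merged peak.
            best[pid] = max(best.get(pid, 0.0), score / top)
        matrix[name] = [best.get(pid, MISSING) for pid in merged]
    return merged, matrix


merged, matrix = binding_matrix(samples)
for name, row in matrix.items():
    print(name, row)
```

Here s1 ends up with a -1 for p3 and s2 with a -1 for p2, so every sample has a value for every merged peak.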
In the case of read counts, reads are counted for every peak in every sample, whether or not the peak was identified for that sample. So we have a read count for all peaks and all samples. Various scores are calculated from these read counts (as described in the man page for dba.count). For PCAs based on analyzed contrasts, the normalized read counts are used.
DiffBind uses the princomp function to compute principal components.
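To give a feel for what a principal component computation does with such a matrix, here is a rough sketch in plain Python. It is not what princomp runs internally: R's princomp does an eigen decomposition of the covariance (or correlation) matrix, whereas this sketch swaps in simple power iteration to find only the first component. The scores are hypothetical, and treating samples as the variables (columns) is an assumption about the matrix orientation.

```python
import math

# Hypothetical scores: rows are merged peaks, columns are three samples.
# -1.0 marks a peak that was not called in that sample.
scores = [
    [1.0, 0.9, -1.0],
    [0.4, 0.5, -1.0],
    [-1.0, -1.0, 0.8],
    [0.7, 0.6, 0.6],
]


def first_component(rows, iterations=200):
    """Return (unit loading vector, fraction of variance) for the first PC."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Covariance between samples, using divisor n.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(k)]
           for a in range(k)]
    v = [1.0 / math.sqrt(k)] * k  # arbitrary start vector (sketch, not robust)
    for _ in range(iterations):
        # Power iteration: multiply by the covariance matrix and renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * cov[a][b] * v[b]
                     for a in range(k) for b in range(k))
    total = sum(cov[j][j] for j in range(k))
    return v, eigenvalue / total


loadings, fraction = first_component(scores)
print([round(x, 3) for x in loadings], round(fraction, 3))
```

In this toy data the first two samples share their called peaks and the third mostly does not, so the first two get loadings of the same sign and the third gets the opposite sign.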
In the sample sheet, if you set the PeakCaller values to "narrow", DiffBind will by default take the peak score from the 8th column of each peak file. You can control which column is used by adding a column called ScoreCol to the sample sheet.
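As a rough example of what such a sample sheet could look like, the sketch below writes a small CSV with a ScoreCol column. The sample names, conditions, and file paths are made up, and picking column 7 (signalValue in the ENCODE narrowPeak layout, where column 8 is the p-value) is just one possible choice for illustration.

```python
import csv
import io

# Hypothetical sample sheet rows; sample names and file paths are made up.
rows = [
    {"SampleID": "s1", "Condition": "treated", "Replicate": 1,
     "bamReads": "reads/s1.bam", "Peaks": "peaks/s1.narrowPeak",
     "PeakCaller": "narrow", "ScoreCol": 7},
    {"SampleID": "s2", "Condition": "control", "Replicate": 1,
     "bamReads": "reads/s2.bam", "Peaks": "peaks/s2.narrowPeak",
     "PeakCaller": "narrow", "ScoreCol": 7},
]

# Write the sheet to an in-memory buffer; use open(...) to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
sheet = buffer.getvalue()
print(sheet)
```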
The impact of missing peaks (in the form of higher variance) that results from using a -1 score for them is intentional. The idea is that when correlating peak calls, the binary aspect of whether a binding site is identified (occupancy) is more important than the specific score assigned to a called peak. This is especially true when using the p-value based score that MACS generates, which is an indication of the confidence that a called peak is real, not a value indicating the "strength" of the peak.
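A quick numeric check of the variance effect, using made-up scores for one peak that was called in two samples and missing in two others. Coding the missing samples as 0 is included only as a hypothetical alternative for comparison.

```python
from statistics import pvariance

# Hypothetical normalized scores for one peak in the samples where it was called.
called = [0.3, 0.4]
n_missing = 2  # number of samples where the peak was not called


def peak_variance(missing_value):
    """Variance of the peak across samples for a given missing-peak code."""
    return pvariance(called + [missing_value] * n_missing)


var_minus_one = peak_variance(-1.0)  # missing coded as -1
var_zero = peak_variance(0.0)        # hypothetical alternative coding
print(var_minus_one > var_zero)
```

With -1, the gap between called and missing samples dominates the variance, while the difference between the two called scores contributes comparatively little.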
In general, using peak scores to do clustering (whether by correlation heatmap or PCA) is of far less interest than using scores derived from read counts. Peak calling is very noisy and somewhat arbitrary as it requires a binary decision (peak/not a peak). Read scores on the other hand are a much more detailed reflection of the underlying data, allowing quantitative analysis, and are preferable to peak scores for most purposes.