Question: How does DiffBind run PCA based on peak data?
Gary wrote (12 days ago):

May I ask how DiffBind analyzes peak present/absent data for a PCA? Traditional PCA needs continuous values with a normal distribution. Does DiffBind use a non-parametric method? Or does DiffBind use a score (e.g. the FDR value) for each peak to perform the PCA? Many thanks.

Modified 11 days ago by Rory Stark; written 12 days ago by Gary.
Rory Stark (CRUK, Cambridge, UK) wrote (12 days ago):

The scores DiffBind uses for computing principal components depend on whether it has peak scores or read counts (that is, before or after calling dba.count).

In the case of peak scores, the score for each peak is normalized to a 0..1 scale. When a merged set of peaks is computed to form the binding matrix, the (maximum) peak score is used for each sample. If the peak was not identified in a given sample (missing), it is assigned a value of -1. So we have values for all peaks and all samples.
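To make the peak-score case concrete, here is a minimal Python sketch of building such a binding matrix. It is not DiffBind's code: the sample data, the function names, and the choice to normalize by dividing by each sample's maximum score are all assumptions made for illustration, and the peaks are assumed to be already mapped onto merged peak IDs.

```python
# Hypothetical per-sample peak scores keyed by merged-peak ID (made-up data).
samples = {
    "s1": {"peakA": 50.0, "peakB": 10.0},
    "s2": {"peakA": 20.0, "peakC": 40.0},
}

MISSING = -1.0  # value assigned when a peak was not called in a sample


def normalize(scores):
    """Scale one sample's scores to 0..1 by dividing by its maximum (assumed rule)."""
    top = max(scores.values())
    return {peak: value / top for peak, value in scores.items()}


def binding_matrix(samples):
    """Return (peak_ids, sample_ids, rows), with one row per merged peak."""
    sample_ids = sorted(samples)
    normalized = {sid: normalize(samples[sid]) for sid in sample_ids}
    peak_ids = sorted({p for scores in samples.values() for p in scores})
    rows = [
        [normalized[sid].get(peak, MISSING) for sid in sample_ids]
        for peak in peak_ids
    ]
    return peak_ids, sample_ids, rows


peak_ids, sample_ids, rows = binding_matrix(samples)
for peak, row in zip(peak_ids, rows):
    print(peak, row)
```

Every peak gets a value in every sample: a normalized score where it was called and -1 where it was not.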

In the case of read counts, reads are counted for every peak in every sample, whether or not the peak was identified for that sample. So we have a read count for all peaks and all samples. Various scores are calculated from these read counts (as described in the man page for dba.count). For PCAs based on analyzed contrasts, the normalized read counts are used.

Internally, DiffBind uses the princomp function to compute principal components.
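For readers who want to see the idea without R: princomp works from an eigen-decomposition of the covariance (or correlation) matrix. The Python sketch below is only an illustration of that idea, not what DiffBind runs. It finds just the leading component by power iteration, treats samples as the variables (an assumption), and uses made-up scores.

```python
import math


def covariance(columns):
    """Covariance matrix of variables given as equal-length lists (divisor n)."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[x - m for x in col] for col, m in zip(columns, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in centered]
        for ci in centered
    ]


def leading_component(cov, iterations=200):
    """Power iteration: (largest eigenvalue, unit eigenvector) of a covariance matrix."""
    vec = [1.0] + [0.0] * (len(cov) - 1)  # simple starting vector
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    value = sum(
        v * sum(c * w for c, w in zip(row, vec)) for row, v in zip(cov, vec)
    )
    return value, vec


# Hypothetical scores for four peaks in three samples (-1 = peak not called).
s1 = [1.0, 0.2, -1.0, 0.8]
s2 = [0.9, 0.3, -1.0, 0.7]
s3 = [-1.0, -1.0, 1.0, -1.0]

cov = covariance([s1, s2, s3])
value, loadings = leading_component(cov)
explained = value / sum(cov[i][i] for i in range(len(cov)))
print("loadings:", loadings)
print("share of variance on PC1:", explained)
```

In this toy data s1 and s2 get loadings of the same sign and s3 the opposite sign, which is the kind of separation a PCA plot would show.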

Cheers-

Rory


Thanks a lot. May I double check with you? I use macs2 to do peak calling and obtain peaks.narrowPeak (BED6+4 format) files. In this case, which column does DiffBind use as the peak score for computing principal components: the 5th column (integer score for display), the 8th column (-log10 p-value), or the 9th column (-log10 q-value)? In addition, a value of -1 is assigned to samples without the peak. Is it fair to say that the first and second principal components are largely driven by these missing peaks, because the -1 values create a large variance? Thanks again.
Best,
Gary

Reply written 11 days ago by Gary
Rory Stark (CRUK, Cambridge, UK) wrote (11 days ago):

In the sample sheet, if you set the PeakCaller values to "narrow", it will by default use the 8th column. You can control which column is used by adding a column called ScoreCol to the sample sheet.
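As a rough illustration of what choosing a score column means for a narrowPeak file, here is a small Python sketch. The helper `peak_score` and the example line are hypothetical and are not part of DiffBind; the `score_col` argument only mimics the role of ScoreCol, with 8 as the default mentioned above.

```python
# narrowPeak (BED6+4) columns, 1-based:
#  1 chrom   2 start   3 end   4 name   5 score (integer, for display)
#  6 strand  7 signalValue     8 -log10 p-value
#  9 -log10 q-value            10 summit offset from start


def peak_score(line, score_col=8):
    """Return the value in the 1-based column `score_col` of a narrowPeak line."""
    fields = line.rstrip("\n").split("\t")
    return float(fields[score_col - 1])


# Made-up peak line for demonstration only.
example = "chr1\t1000\t1500\tpeak_1\t250\t.\t12.5\t30.2\t27.9\t240"

print(peak_score(example))               # column 8: -log10 p-value
print(peak_score(example, score_col=9))  # column 9: -log10 q-value
print(peak_score(example, score_col=5))  # column 5: display score
```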

The impact of missing peaks (in the form of higher variance) that results from using a -1 score for them is intentional. The idea is that when correlating peak calls, the binary aspect of whether a binding site is identified (occupancy) is more important than the specific score assigned to a called peak. This is especially true when using the p-value based score that MACS generates, which is an indication of the confidence that a called peak is real, not a value indicating the "strength" of the peak.
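A quick way to see the variance effect is to compare the same peak with missing calls filled in as 0 versus -1. This Python sketch uses made-up normalized scores and a hypothetical `fill` helper; it only illustrates the arithmetic, not DiffBind's internals.

```python
from statistics import pvariance

# Hypothetical normalized scores for one peak in four samples;
# None means the peak was not called in that sample.
called = [0.9, 0.8, None, None]


def fill(scores, missing):
    """Replace uncalled entries (None) with the given placeholder value."""
    return [missing if s is None else s for s in scores]


var_zero = pvariance(fill(called, 0.0))
var_minus_one = pvariance(fill(called, -1.0))

print("variance with missing = 0 :", var_zero)
print("variance with missing = -1:", var_minus_one)
```

With -1 the gap between called and uncalled samples is much larger than the gap between two different called scores, so occupancy dominates the variance, as intended.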

In general, using peak scores to do clustering (whether by correlation heatmap or PCA) is of far less interest than using scores derived from read counts. Peak calling is very noisy and somewhat arbitrary as it requires a binary decision (peak/not a peak). Read scores on the other hand are a much more detailed reflection of the underlying data, allowing quantitative analysis, and are preferable to peak scores for most purposes. 

-Rory


I really appreciate your detailed and clear explanation.
Gary

Reply written 11 days ago by Gary