How does DiffBind run PCA on peak data?
Gary ▴ 20
@gary-7967
Last seen 2.8 years ago

May I know how DiffBind analyzes peak present/absent data for a PCA analysis? Traditional PCA needs continuous, approximately normally distributed values. Does DiffBind use a non-parametric method? Or does DiffBind use a score (e.g. an FDR value) for each peak to perform the PCA? Many thanks.

Rory Stark ★ 4.2k
@rory-stark-5741
Last seen 8 weeks ago
CRUK, Cambridge, UK

The scores DiffBind uses for computing principal components depend on whether it has peak scores or read counts (that is, before or after calling dba.count).

In the case of peak scores, the score for each peak is normalized to a 0..1 scale. When a merged set of peaks is computed to form the binding matrix, the (maximum) peak score is used for each sample. If the peak was not identified in a given sample (missing), it is assigned a value of -1. So we have values for all peaks in all samples.

In the case of read counts, reads are counted for every peak in every sample, whether or not the peak was identified for that sample, so we have a read count for all peaks in all samples. Various scores are calculated from these read counts (as described in the man page for dba.count). For PCA plots based on analyzed contrasts, the normalized read counts are used.

Internally, DiffBind uses the princomp function to compute principal components.
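The scoring scheme described above can be sketched with a small base-R toy example. This is not DiffBind's actual code, and the matrix values are made up for illustration; it only shows the idea of 0..1-normalized peak scores, -1 for uncalled peaks, and princomp on the resulting matrix:

```r
# Hypothetical merged peak set: 5 peaks x 4 samples of raw peak scores,
# with NA marking peaks not called in a sample.
raw <- matrix(c(120,  80,  NA, 200,
                 NA,  60,  90,  NA,
                300, 310, 280, 290,
                 50,  NA,  NA,  40,
                210, 190, 220, 230),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("peak", 1:5),
                              paste0("sample", 1:4)))

# Normalize each sample's called-peak scores to the 0..1 range
norm01 <- apply(raw, 2, function(x) {
  called <- !is.na(x)
  x[called] <- (x[called] - min(x[called])) / diff(range(x[called]))
  x
})

# Missing (uncalled) peaks get -1, so occupancy dominates the variance
norm01[is.na(norm01)] <- -1

# Principal components of the scored binding matrix
pc <- princomp(norm01)
summary(pc)
```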

Cheers-

Rory


Thanks a lot. May I double-check with you? I use macs2 for peak calling and obtain peaks.narrowPeak (BED6+4 format) files. In this case, which peak score does DiffBind adopt for computing principal components: the 5th column (integer score for display), the 8th column (-log10 p-value), or the 9th column (-log10 q-value)? In addition, a value of -1 is assigned for samples without a peak. May I say that the first and second principal components are largely driven by these missing peaks, because -1 creates a large variance? Thanks again.
Best,
Gary

Rory Stark ★ 4.2k

In the sample sheet, if you set the PeakCaller values to "narrow", DiffBind will use the 8th column by default. You can control which column is used by adding a column called ScoreCol to the sample sheet.
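For example, a minimal sample sheet fragment might look like the following sketch (the sample IDs, file names, and the choice of ScoreCol = 9 are hypothetical; 9 would select the -log10 q-value column of a narrowPeak file):

```r
# Build a minimal sample sheet in R and write it out for dba()
samples <- data.frame(
  SampleID   = c("A1", "A2"),
  Peaks      = c("A1_peaks.narrowPeak", "A2_peaks.narrowPeak"),
  PeakCaller = "narrow",  # narrowPeak: defaults to column 8 (-log10 p-value)
  ScoreCol   = 9          # override: use the -log10 q-value column instead
)
write.csv(samples, "samples.csv", row.names = FALSE)
# myDBA <- dba(sampleSheet = "samples.csv")  # requires the peak files to exist
```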

The impact of missing peaks (in the form of higher variance) that results from using a -1 score is intentional. The idea is that when correlating peak calls, the binary question of whether a binding site is identified at all (occupancy) matters more than the specific score assigned to a called peak. This is especially true when using the p-value based score that MACS generates, which indicates the confidence that a called peak is real, not the "strength" of the peak.

In general, using peak scores to do clustering (whether by correlation heatmap or PCA) is of far less interest than using scores derived from read counts. Peak calling is noisy and somewhat arbitrary, as it requires a binary decision (peak/not a peak). Read counts, on the other hand, are a much more detailed reflection of the underlying data, allowing quantitative analysis, and are preferable to peak scores for most purposes.

-Rory


I really appreciate your detailed and clear explanation.
Gary


Hi Rory,

Just in connection with this old thread, can I please get your confirmation that if I call dba.count and then do dba.plotPCA, I would be using the normalized read counts? Can you tell me which exact value of the score argument it uses? Thanks!

For example, I know that by specifying score=DBA_SCORE_READS, I can force the usage of raw counts. What I want to know is that if I don't specify this, then what is the default score?

Thanks!


If you run dba.plotPCA with no value for the contrast parameter, the score specified in dba.count() will be used. The default score in the current release version is DBA_SCORE_TMM_MINUS_FULL: TMM-normalized read counts, with any control reads subtracted (minimum read count of 1), using the total number of sequenced reads in each library as the library size.
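For instance, the flow might look like this sketch (it assumes a DBA object named myDBA already built from a sample sheet with BAM files, so it will not run without real data):

```r
library(DiffBind)

# Count reads over the consensus peak set; the score set here is what
# dba.plotPCA() will use when no contrast is supplied.
myDBA <- dba.count(myDBA, score = DBA_SCORE_TMM_MINUS_FULL)

# PCA on the stored (normalized) read counts -- no contrast given
dba.plotPCA(myDBA)
```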

You can easily change the score without re-counting reads by calling dba.count(myDBA, peaks=NULL, score=[whatever score you want]).

Note that in the upcoming release, due out on Wednesday (October 28), the normalization and scoring have been substantially re-worked. The new default in this case is read counts normalized only by full library sizes (the total number of reads in each bam file, divided by the mean library size). Control reads are only subtracted if no greylist is applied, and the minimum read count is 0.


Thank you very much for that answer! Appreciate it very much.

If I don't specify a contrast, but then specify score=DBA_SCORE_READS while calling dba.plotPCA, then the raw read counts will be used, is this correct?


So long as you have read counts (by calling dba.count()), that is correct.
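Putting the two pieces together, a minimal sketch of that usage (again assuming an existing DBA object named myDBA on which dba.count() has already been run) would be:

```r
# Force raw, un-normalized read counts as the score for this PCA plot;
# no contrast is supplied, so the per-peak counts are used directly.
dba.plotPCA(myDBA, score = DBA_SCORE_READS)
```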