Question

How DiffBind run PCA based on peak data?

0

Entering edit mode

Gary ▴ 20

@gary-7967

Last seen 5.2 years ago

May I know how DiffBind analyze peak present/absent data for a PCA analysis? The traditional PCA needs continuous values with the normal distribution. Does DiffBind use a non-parametric method? Or DiffBind use scores (e.g. FDR value) for each peak to perform PCA? Many thanks.

DiffBind • 2.3k views

ADD COMMENT • link updated 3.5 years ago by Rory Stark ★ 5.2k • written 6.7 years ago by Gary ▴ 20

score 2 · Answer 1 · 2017-09-11

The scores DiffBind uses for computing principal components depend on whether is has peak scores or read counts (before or after calling dba.count).

In the cases of peak scores, the score for each peak is normalized to a 0..1 scale. When a merged set of peaks is computed to form the binding matrix, the (maximum) peak score is used for each sample. If the peak was not identified in a given sample (missing), it is assigned a value of -1. So we have values for all peaks and all samples.

In the case os read counts, reads are counted for every peak in every sample whether or not the peak was identified for that sample. So we have a read count for all peaks and all samples. Various scores are calculated from these read scores (as described in the man page for dba.count). For PCAs based on analyzed contrasts, the normalized read counts are use.

Internally, DiffBind uses the princomp function to compute principal components.

Cheers-

Rory

score 1 · Answer 2 · 2017-09-12

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 8 days ago

Cambridge, UK

In the sample sheet, if you set the PeakCaller values to "narrow", it will by default use the 8th column. You can control which column is used by adding a column called ScoreCol to the sample sheet.

The impact of missing peaks (in the form of higher variance) that result from using a -1 score for missing peaks is intentional. The idea is that when correlating peak calls, the binary aspect of whether a binding site is identified (occupancy) is more important than the specific score assigned to a called peak. This is especially true when using the p-value based score that MACS generates, which is an indication of the confidence that a called peak is real, not a value indicating the "strength" of the peak.

In general, using peak scores to do clustering (whether by correlation heatmap or PCA) is of far less interest than using scores derived from read counts. Peak calling is very noisy and somewhat arbitrary as it requires a binary decision (peak/not a peak). Read scores on the other hand are a much more detailed reflection of the underlying data, allowing quantitative analysis, and are preferable to peak scores for most purposes.

-Rory

ADD COMMENT • link 6.7 years ago Rory Stark ★ 5.2k

0

Entering edit mode

I really appreciate your detailed and clear explanation.
Gary

ADD REPLY • link 6.7 years ago Gary ▴ 20

0

Entering edit mode

Hi Rory,

Just in connection with this old thread, can I then please get your confirmation that if I call dba.count and then do dba.plotPCA, then I would be using the normalized read counts? Can you tell me which exact value of the score argument does it use? Thanks!

For example, I know that by specifying score=DBA_SCORE_READS, I can force the usage of raw counts. What I want to know is that if I don't specify this, then what is the default score?

Thanks!

ADD REPLY • link 3.5 years ago CodeAway ▴ 70

0

Entering edit mode

If you run dba.plotPCA with no value for the contrast parameter, the score specified in dba.count() will be used. The default score in the current release version is DBA_SCORE_TMM_MINUS_FULL, which are TMM-normalized read counts, with any control reds subtracted (minimum read count is 1), with the total number of sequenced reads in each library used.

You can easily change the score without re-counting reads by calling dba.count(myDBA, peak=NULL, score=[whatever score you want]).

Note that in the upcoming release due to be released on Wednesday (October 28), the normalization and scoring has been substantially re-worked. The new default in this case are read counts normalized only by full library sizes (total number of reads in each bam file, divided by the mean library size). Control reads are only subtracted if no greylist is applies, and the minimum read count is 0).

ADD REPLY • link 3.5 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thank you very much for that answer! Appreciate it very much.

If I don't specify a contrast, but then specify score=DBA_SCORE_READS while calling bna.plotPCA, then the raw read counts will be used, is this correct?

ADD REPLY • link 3.5 years ago CodeAway ▴ 70

0

Entering edit mode

So long as you have read counts (by calling dba.count()), that is correct.

ADD REPLY • link 3.5 years ago Rory Stark ★ 5.2k