Confusion about DBA_SCORE_TMM_READS_EFFECTIVE_CPM in DiffBind
Hello, Dr. Stark. I am confused about the parameter DBA_SCORE_TMM_READS_EFFECTIVE_CPM.

It says that:

- DBA_SCORE_TMM_READS_EFFECTIVE -> TMM normalized (using edgeR), using ChIP read counts and Effective Library size
- DBA_SCORE_TMM_READS_EFFECTIVE_CPM -> same as DBA_SCORE_TMM_READS_EFFECTIVE, but reported in counts-per-million.

I know the meaning of TMM and CPM, but I am confused about DBA_SCORE_TMM_READS_EFFECTIVE_CPM. At first I thought it computes the TMM-normalized values first and then converts those into CPM values. But I find that the column sums are not 1M, which confuses me:

> colSums(CPM_merge)
E5_0h_R1  E5_0h_R2  E5_3D_R1  E5_3D_R2  E5_3D_R3
1076920.7 1050878.6 1154048.9 1100915.8 1023065.8
G3_R1     G3_R2     G3_R3   G3E1_R1   G3E1_R2
1013695.3  984226.9 1116362.4  941578.9  924370.8
G3E3_R1   G3E3_R2   G3E7_R1   G3E7_R2
955514.5  917167.8  915660.3  873877.5


My full code is:

dba_meta <- dba(minOverlap = 1, sampleSheet = sample_info)
dba_count <- dba.count(dba_meta,minOverlap = 1,score = DBA_SCORE_TMM_READS_EFFECTIVE_CPM)
peak_CPM_list <- dba_count$peaks
names(peak_CPM_list) <- dba_count$samples$SampleID
scores <- lapply(peak_CPM_list, function(x) {x$Score})
CPM_merge <- do.call(cbind, scores)


Best wishes,
Guandong Shang

I also have another DiffBind question, "Question: The coordinate system problem about DiffBind output". If you can help with that as well, I would be grateful :)

Rory Stark
@rory-stark-5741
CRUK, Cambridge, UK

Basically, in the CPM versions, DiffBind TMM scores assume each library was sequenced to a depth of 1M reads.

Details:

First, edgeR is used to calculate the `$lib.size` and `$norm.factors` for each sample. These are multiplied to derive a scaling factor.

Next, the raw read counts are divided by this scaling factor.

Finally, the adjusted read counts are expanded back into useful values by multiplying by a single representative library size. In the CPM case, this is set to a constant, 1E06. In the non-CPM case, this is taken as the mean `$lib.size`.

This is all somewhat arbitrary. These scores are only used for plotting non-analyzed data (heatmaps and PCAs), and these values are useful for that. In no case are DiffBind "scores" used directly in an analysis, only in certain clustering plots.
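
The three steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not DiffBind's actual code: the counts and normalization factors are made up (in practice the factors come from edgeR), and the effective library size is assumed to be the number of reads falling in peaks, i.e. the column sums of the count matrix.

```r
# Toy count matrix: rows are peaks, columns are samples (made-up values).
counts <- matrix(c(24, 15,
                   60, 45,
                   36, 30),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("Peak_", 1:3), c("Sample1", "Sample2")))

# Assumption: effective library size = reads in peaks for each sample.
lib.size <- colSums(counts)                       # 120 and 90

# Hypothetical TMM normalization factors (edgeR would estimate these).
norm.factors <- c(Sample1 = 1.25, Sample2 = 0.8)

# Step 1: scaling factor = library size * normalization factor.
scaling <- lib.size * norm.factors                # 150 and 72

# Step 2: divide each sample's raw counts by its scaling factor.
adjusted <- sweep(counts, 2, scaling, "/")

# Step 3: multiply by one representative library size.
scores_cpm   <- adjusted * 1e6                    # CPM variant: constant 1E06
scores_plain <- adjusted * mean(lib.size)         # non-CPM variant: mean lib.size

# Column sums equal 1e6 / norm.factors: about 800000 and 1250000.
colSums(scores_cpm)
```

Under these assumptions each column sums to 1E06 divided by that sample's normalization factor rather than to exactly 1E06, which would be consistent with column sums scattered around one million like those reported in the question.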

I am not sure I understand correctly. Here is just an example:

- Raw count

|        | Sample 1 | Sample 2 |
| ------ | -------- | -------- |
| Peak_1 | 24       | 15       |

- Scaling factor

  - Sample 1: **1.2**
  - Sample 2: **0.8**

- Divided by scaling factor

|        | Sample 1 | Sample 2 |
| ------ | -------- | -------- |
| Peak_1 | 20       | 18.75    |

- Multiplying by a single representative library size. In the CPM case, 1E06

|        | Sample 1        | Sample 2           |
| ------ | --------------- | ------------------ |
| Peak_1 | 20 * 1E06 (???) | 18.75 * 1E06 (???) |
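
Reading the answer above literally, the scaling factor would be `$lib.size` times `$norm.factors`, not the normalization factor alone, so the last step would not give values like 20 * 1E06. A minimal sketch with the Peak_1 numbers from this example, assuming hypothetical library sizes of 2,000,000 and 1,500,000 reads (these are not given anywhere in the thread):

```r
# Raw Peak_1 counts and the factors from the example above.
raw          <- c(Sample1 = 24, Sample2 = 15)
norm.factors <- c(Sample1 = 1.2, Sample2 = 0.8)

# Hypothetical library sizes, chosen only for illustration.
lib.size <- c(Sample1 = 2e6, Sample2 = 1.5e6)

# Scaling factor combines library size and normalization factor.
scaling <- lib.size * norm.factors    # 2.4e6 and 1.2e6

# CPM variant: divide by the scaling factor, then multiply by 1E06.
score_cpm <- raw / scaling * 1e6      # roughly 10 and 12.5
score_cpm
```

With these made-up library sizes the scores come out on an ordinary counts-per-million scale, because the division by the library size and the multiplication by 1E06 largely offset each other.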