Correct use of TMM / normalization factors based on large bins
ATpoint

The csaw package suggests using TMM normalization based on large bins (e.g. 10kb) across the genome if ChIP-seq samples are expected to show rather global differences, i.e. if composition bias is expected. As I want to use the resulting normalization factors to scale non-standard data (i.e. not DGEList objects, but rather bedGraph/bigWig files with raw counts for every base of the genome), I am asking for clarification whether my understanding is correct:

One creates a count matrix for the 10kb bins across the genome, then feeds this into calcNormFactors() and obtains normalization factors. Based on the calculateCPM() and cpm() source code, I think one now uses these factors to correct the library size for each sample, therefore library.size/norm.factor, and this multiplied (edit: divided, as Aaron explains below) by 1e+06 to get a per-million scaling factor.
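For concreteness, a minimal R sketch of that step, assuming a character vector bam.files of BAM paths and the csaw/edgeR packages (the readParam() settings are placeholders); the per-million factor at the end already uses the corrected effective library size from Aaron's answer below:

    library(csaw)
    library(edgeR)

    ## Count reads in 10kb bins across the genome for each sample.
    param  <- readParam(minq = 10)   # assumption: simple mapping-quality filter
    binned <- windowCounts(bam.files, bin = TRUE, width = 10000, param = param)

    ## TMM normalization factors from the bin counts.
    y <- calcNormFactors(asDGEList(binned))

    ## Per-million scaling factor per sample:
    ## effective library size = library size * normalization factor (see answer).
    per.million <- y$samples$lib.size * y$samples$norm.factors / 1e6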

Eventually one would now divide the "raw" counts by this per-million factor. In my case these would be the bigWig/bedGraph files mentioned above; bedGraph is simply a four-column format with chr-start-end and $4 being the raw count for every base of the genome for a given sample, so the scaling would be $4 / per.million.factor.
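As an illustration, the same scaling applied to a per-base bedGraph in R, assuming a hypothetical file sample1.bedGraph and the per-million factor pm for that sample from the step above:

    ## Read the four-column bedGraph (chr, start, end, raw count).
    bg <- read.table("sample1.bedGraph",
                     col.names = c("chr", "start", "end", "count"))

    ## Divide the raw per-base counts by the per-million factor.
    bg$count <- bg$count / pm

    write.table(bg, "sample1.norm.bedGraph", quote = FALSE, sep = "\t",
                row.names = FALSE, col.names = FALSE)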

Is that correct?

csaw TMM
Aaron Lun

So close. The effective library size is library.size * norm.factor. This means that you should divide your count/coverage/etc. value by library.size * norm.factor / 1e6 to get normalized per-million equivalents.
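A small worked example of that formula in R (numbers invented purely for illustration):

    lib.size    <- 2e7    # total reads in the sample
    norm.factor <- 0.85   # TMM normalization factor from calcNormFactors()
    count       <- 12     # raw coverage value at some position

    ## divide by (library.size * norm.factor / 1e6) to get per-million values
    count / (lib.size * norm.factor / 1e6)   # ~0.71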