Correct use of TMM / normalization factors based on large bins
ATpoint ▴ 650
@atpoint-13662
Germany

The csaw package suggests using TMM normalization based on large (e.g. 10 kb) bins across the genome if ChIP-seq samples are expected to show global differences or composition bias. As I want to use the resulting normalization factors to scale non-standard files (not DGEList objects, but e.g. bedGraph/bigWig files with raw counts for every base of the genome), I would like to check whether my understanding is correct:

One creates a count matrix for the 10 kb bins across the genome, feeds it into calcNormFactors(), and obtains normalization factors. Based on the calculateCPM() and cpm() source code, I think one then uses these factors to correct the library size of each sample, i.e. library.size/norm.factor, and divides this by 1e+06 (edit: divides rather than multiplies, as Aaron explains) to get a per-million scaling factor.

Finally, one would divide the "raw" counts by this per-million factor. In my case these are the bigWig/bedGraph files, which are simply a four-column format with chr-start-end and $4 being the raw count for every base in the genome of a given sample, therefore $4 / per.million.factor.

Is that correct?

csaw TMM • 457 views
Aaron Lun ★ 26k
@alun
The city by the bay

So close. The effective library size is library.size * norm.factor. This means that you should divide your count/coverage/etc. value by library.size * norm.factor / 1e6 to get normalized per-million equivalents.
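
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes the library size and the TMM normalization factor have already been obtained elsewhere (e.g. from calcNormFactors() in R), and the numbers, the example bedGraph line, and the helper names per_million_factor and scale_bedgraph_line are hypothetical, not taken from csaw or edgeR.

```python
def per_million_factor(library_size, norm_factor):
    """Effective library size (library.size * norm.factor), expressed in millions."""
    return library_size * norm_factor / 1e6


def scale_bedgraph_line(line, factor):
    """Divide the 4th column (raw count) of one tab-separated bedGraph line by factor."""
    chrom, start, end, count = line.rstrip("\n").split("\t")[:4]
    return f"{chrom}\t{start}\t{end}\t{float(count) / factor:.6g}"


# Hypothetical values for a single sample (not from the original post).
library_size = 20_000_000  # library size of the sample
norm_factor = 0.8          # TMM normalization factor for the sample

factor = per_million_factor(library_size, norm_factor)
scaled = scale_bedgraph_line("chr1\t1000\t1001\t32", factor)
print(factor)
print(scaled)
```

Each sample gets its own factor, so the same two steps would be repeated per file with that sample's library size and normalization factor.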
