Question

DESeq normalisation strategy

Cittaro Davide ▴ 240

@cittaro-davide-5375

Last seen 11.1 years ago

Dear list,

I've been reading about DESeq normalization strategy and, as far as I understand, it works on a sample basis: counts for each sample are normalized according to a factor calculated using the geometric mean of the counts. Three questions:

- is this strategy robust when comparing samples with extremely different library sizes?
- If I wanted to calculate cpm on normalized counts, should I rescale the library size according to the sizeFactor?
- counts are calculated on genomic intervals, would the same approach make sense if I use counts on single nucleotides?

Thanks
d

Normalization DESeq • 2.2k views

updated 12.4 years ago by Simon Anders ★ 3.8k • written 12.4 years ago by Cittaro Davide ▴ 240

score 0 · Answer 1 · 2013-05-29

Hi Davide

On 29/05/13 10:58, Davide Cittaro wrote:
> I've been reading about DESeq normalization strategy and, as far as I understand, it works on a sample basis: counts for each sample are normalized according to a factor calculated using the geometric mean of the counts.
> Three questions:
> - is this strategy robust when comparing samples with extremely different library sizes?

Sure, why shouldn't it be?

> - If I wanted to calculate cpm on normalized counts, should I rescale the library size according to the sizeFactor?

Actually, no. I assume that by "cpm", you mean "Counts per million", which is a terse phrase meaning "number of reads mapped to the feature per one million of aligned reads". As such, "cpm" is _defined_ to mean the quantity that you get by dividing the counts for your feature by the number of aligned reads and multiplying by one million. The notion of "calculating cpm on normalized counts" is hence a contradiction in terms.

The whole point of DESeq's library size normalization is, of course, that simply dividing by the number of aligned reads is not a good strategy to get numbers which can be compared across samples, and that hence cpm, RPKM, FPKM or any of the other variations on the "per million" scheme are not useful quantities for differential analyses.

> - counts are calculated on genomic intervals, would the same approach make sense if I use counts on single nucleotides?

In principle, yes. The problem is that once your features are very small, very many of the counts may be zero, and the geometric mean of any set of numbers containing at least one zero is zero. Hence, you can only use features with sufficiently high counts to get a stable estimate, and you may not have enough of these.

Simon
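To make the points above concrete, here is a minimal sketch of the median-of-ratios idea behind DESeq's size factors, written in plain Python rather than R. It is an illustration under stated assumptions, not DESeq's own code: the count matrix is made-up toy data, and the function names `size_factors` and `cpm` are hypothetical helpers. For each sample, the size factor is taken as the median, over features, of the count divided by that feature's geometric mean across samples; features containing a zero are skipped because their geometric mean is zero.

```python
import math
from statistics import median

# Hypothetical toy data: rows are features, columns are two samples.
# Sample 2 is sequenced twice as deeply, and the last feature is strongly
# up-regulated in it, which inflates that sample's total read count.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],        # contains a zero, so its geometric mean is zero
    [1000, 8000],
]


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column)."""
    n_samples = len(counts[0])
    # Features with any zero are skipped: their geometric mean is zero,
    # so the ratio count / geometric mean is undefined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no feature has a non-zero count in every sample")
    # Log of the geometric mean of each usable feature across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the feature's geometric mean.
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def cpm(counts):
    """Counts per million: count divided by the column total, times 1e6."""
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] / totals[j] * 1e6 for j in range(n_samples)] for row in counts]


sf = size_factors(counts)
normalized = [[row[j] / sf[j] for j in range(len(sf))] for row in counts]
cpm_values = cpm(counts)

print("size factors:", [round(s, 3) for s in sf])
print("feature 0, normalized:", [round(v, 2) for v in normalized[0]])
print("feature 0, cpm:", [round(v, 1) for v in cpm_values[0]])
```

With this toy matrix the size factors differ by a factor of two, so the first feature gets the same normalized value in both samples. Its cpm values, by contrast, differ more than threefold, because the column totals (1160 versus 8323) are dominated by the single up-regulated feature. The zero-containing row also shows the single-nucleotide concern: with very small features, most rows would be dropped like this one, leaving few ratios for the median.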