DESeq normalisation strategy
@cittaro-davide-5375
Last seen 10.3 years ago
Dear list,

I've been reading about the DESeq normalization strategy and, as far as I understand, it works on a per-sample basis: the counts for each sample are normalized by a factor calculated using the geometric mean of the counts.

Three questions:
- Is this strategy robust when comparing samples with extremely different library sizes?
- If I wanted to calculate cpm on normalized counts, should I rescale the library size according to the sizeFactor?
- Counts are calculated on genomic intervals; would the same approach make sense if I use counts on single nucleotides?

Thanks
d
Normalization DESeq • 2.1k views
Simon Anders ★ 3.8k
@simon-anders-3855
Last seen 4.4 years ago
Zentrum für Molekularbiologie, Universi…
Hi Davide

On 29/05/13 10:58, Davide Cittaro wrote:
> I've been reading about DESeq normalization strategy and, as far as I understand, it works on a sample basis: counts for each samples are normalized according to a factor calculated using the geometric mean of the counts.
> Three questions:
> - is this strategy robust when comparing samples with extremely different library sizes?

Sure, why shouldn't it be?

> - If I wanted to calculate cpm on normalized counts, should I rescale the library size according to the sizeFactor?

Actually, no. I assume that by "cpm" you mean "counts per million", which is a terse phrase meaning "number of reads mapped to the feature per one million aligned reads". As such, "cpm" is _defined_ to mean the quantity that you get by dividing the counts for your feature by the number of aligned reads and multiplying by one million.

The notion of "calculating cpm on normalized counts" is hence a contradiction in terms.

The whole point of DESeq's library size normalization is, of course, that simply dividing by the number of aligned reads is not a good strategy for getting numbers that can be compared across samples, and that hence cpm, RPKM, FPKM or any of the other variations on the "per million" scheme are not useful quantities for differential analyses.

> - counts are calculated on genomic intervals, would the same approach make sense if I use counts on single nucleotides?

In principle, yes. The problem is that once your features are very small, very many of the counts may be zero, and the geometric mean of any set of numbers containing at least one zero is zero. Hence, you can only use features with sufficiently high counts to get a stable estimate, and you may not have enough of these.

Simon
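The contrast between cpm and size-factor normalization can be made concrete with a small sketch. It is written in plain Python rather than R and is not DESeq's actual code: `size_factors` is a hypothetical helper that takes, for each sample, the median over features of the ratio of the count to the feature's geometric mean across samples, skipping features that contain a zero. The toy counts are invented, and the column sum is assumed to stand in for the number of aligned reads.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch, not DESeq's code).

    counts: list of features, each a list with one count per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # geometric mean would be zero, so the feature is skipped
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def cpm(counts):
    """Counts per million: count / total counts in the sample * 1e6."""
    totals = [sum(col) for col in zip(*counts)]  # assumed proxy for aligned reads
    return [[c / t * 1e6 for c, t in zip(row, totals)] for row in counts]


# Invented toy data: two samples that agree on every feature except the last,
# which is much higher in sample B and inflates B's total.
counts = [
    [100, 100],
    [200, 200],
    [50, 50],
    [400, 400],
    [250, 10250],
]

sf = size_factors(counts)
normalized = [[c / s for c, s in zip(row, sf)] for row in counts]
per_million = cpm(counts)

print("size factors:", sf)
print("feature 1, size-factor normalized:", normalized[0])
print("feature 1, cpm:", per_million[0])
```

In this toy case the four unchanged features get identical size-factor-normalized values in both samples, while their cpm values differ by a factor of 11 because a single feature dominates the total of sample B.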
Hi Simon,

On May 29, 2013, at 11:46 AM, Simon Anders <anders at embl.de> wrote:
> Hi Davide
>
> On 29/05/13 10:58, Davide Cittaro wrote:
>> I've been reading about DESeq normalization strategy and, as far as I understand, it works on a sample basis: counts for each samples are normalized according to a factor calculated using the geometric mean of the counts.
>> Three questions:
>> - is this strategy robust when comparing samples with extremely different library sizes?
>
> Sure, why shouldn't it be?

You know, just a check :-) In a small dataset I've artificially reduced the counts for one sample by different factors and checked the ratios between the counts of that sample and an invariant one. Indeed they are different, but the RMS is really small.

> The notion of "calculating cpm on normalized counts" is hence a
> contradiction in terms.

I somewhat agree with you; I'm just a bit puzzled that I've seen this in other packages (such as edgeR, but that may be another story).

>> - counts are calculated on genomic intervals, would the same approach make sense if I use counts on single nucleotides?
>
> In principle, yes. The problem is that once your feature are very small,
> very many of the counts may be zero, and the geometric mean of any set
> of numbers containing at least one zero is zero. Hence, you can only use
> feature with sufficiently high counts to get a stable estimate, and you
> may not have enough of these.

Well, that also happens with intervals, especially if you deal with some kinds of ChIP-seq experiments. Your way of calculating the factors goes through log(counts), and you exclude intervals with at least one zero count. I tried estimating the size factors from a random 1/10 sample of my dataset, and the factor estimates are quite robust.

My problem, if that was not clear, is that I would like to have a normalization strategy for signals across the genome. Typically these are at the small-interval level (less than 200 bp).

Thanks
d
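The subsampling check mentioned above can be sketched roughly as follows. This is plain Python on simulated data, not the dataset discussed here and not DESeq's code: the `size_factors` helper, the three sequencing depths, the log-normal noise model and the random seed are all assumptions chosen for illustration only.

```python
import math
import random
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch, not DESeq's code)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # intervals with at least one zero count are excluded
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


rng = random.Random(0)      # fixed seed so the run is repeatable
depths = [1.0, 0.2, 5.0]    # assumed relative sequencing depths of three samples

# Simulate 5000 small intervals: a per-interval signal level, scaled by depth,
# with multiplicative noise, rounded to integer counts (zeros will occur).
counts = []
for _ in range(5000):
    mu = rng.lognormvariate(3.0, 1.5)
    counts.append(
        [int(round(mu * d * rng.lognormvariate(0.0, 0.1))) for d in depths]
    )

usable = sum(1 for row in counts if min(row) > 0)
subset = rng.sample(counts, len(counts) // 10)  # random 1/10 of the intervals

sf_full = size_factors(counts)
sf_sub = size_factors(subset)

print("intervals without zeros:", usable, "of", len(counts))
print("size factors, all intervals:", sf_full)
print("size factors, 1/10 sample:  ", sf_sub)
```

If the estimates from the subsample stay close to those from the full set, as described above for the real data, enough zero-free intervals remain for a stable median. The simulation says nothing about how real ChIP-seq signal behaves.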