Dear list,
I've been reading about the DESeq normalization strategy and, as far as I
understand, it works on a per-sample basis: the counts for each sample are
normalized by a factor calculated using the geometric mean
of the counts.
Three questions:
- is this strategy robust when comparing samples with extremely
different library sizes?
- if I wanted to calculate cpm on normalized counts, should I rescale
the library size according to the sizeFactor?
- counts are calculated on genomic intervals; would the same approach
make sense if I used counts on single nucleotides?
Thanks
d
Hi Davide
On 29/05/13 10:58, Davide Cittaro wrote:
> I've been reading about the DESeq normalization strategy and, as far as
> I understand, it works on a per-sample basis: the counts for each sample are
> normalized by a factor calculated using the geometric mean
> of the counts.
> Three questions:
> - is this strategy robust when comparing samples with extremely
> different library sizes?
Sure, why shouldn't it be?
> - If I wanted to calculate cpm on normalized counts, should I
> rescale the library size according to the sizeFactor?
Actually, no. I assume that by "cpm" you mean "counts per million",
which is a terse phrase meaning "number of reads mapped to the feature
per one million aligned reads". As such, "cpm" is _defined_ to mean
the quantity that you get by dividing the counts for your feature by
the number of aligned reads and multiplying by one million.
The notion of "calculating cpm on normalized counts" is hence a
contradiction in terms.
The whole point of DESeq's library size normalization is, of course,
that simply dividing by the number of aligned reads is not a good
strategy to get numbers which can be compared across samples, and that
hence cpm, RPKM, FPKM or any of the other variations on the "per
million" scheme are not useful quantities for differential analyses.
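
To make the distinction concrete, here is a small Python sketch (not DESeq
code). The function names, the toy counts, the number of aligned reads and
the size factor are all made up for illustration; the size factor is simply
assumed to be given. It computes cpm exactly as defined above and,
separately, counts divided by a size factor, which is a different quantity.

```python
def cpm(counts, n_aligned_reads):
    """Counts per million: counts divided by aligned reads, times one million."""
    return [c * 1e6 / n_aligned_reads for c in counts]


def size_factor_normalized(counts, size_factor):
    """Counts on a common scale: raw counts divided by the sample's size factor."""
    return [c / size_factor for c in counts]


# Toy sample with three features (hypothetical numbers).
counts = [10, 200, 0]
cpm_values = cpm(counts, n_aligned_reads=2_000_000)
normalized = size_factor_normalized(counts, size_factor=0.5)
```

The two results only agree up to a constant when the size factors happen to
be proportional to the numbers of aligned reads.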
> - counts are calculated on genomic intervals, would the same
> approach make sense if I use counts on single nucleotides?
In principle, yes. The problem is that once your features are very
small, very many of the counts may be zero, and the geometric mean of
any set of numbers containing at least one zero is zero. Hence, you can
only use features with sufficiently high counts to get a stable
estimate, and you may not have enough of these.
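
The zero problem can be seen in a minimal Python sketch of a
median-of-ratios size factor estimate. This is an illustration under stated
assumptions, not the package's code: the function name and the toy matrix
are hypothetical, and rows are taken to be features and columns samples.
Any feature with a zero in at least one sample has a geometric mean of zero
and is skipped, so only the remaining features inform the estimate.

```python
import math
from statistics import median


def estimate_size_factors(count_matrix):
    """Median-of-ratios sketch; rows are features, columns are samples."""
    n_samples = len(count_matrix[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in count_matrix:
        if min(row) == 0:
            continue  # geometric mean would be zero, so the feature is unusable
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no feature has non-zero counts in all samples")
    # Per sample: median ratio of its counts to the per-feature geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: two samples, five features, two of which contain a zero.
toy_counts = [[10, 20], [100, 200], [0, 5], [50, 100], [3, 0]]
size_factors = estimate_size_factors(toy_counts)
```

Here only three of the five features contribute; with very small features
the usable fraction can shrink until the median is no longer stable.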
Simon
Hi Simon,
On May 29, 2013, at 11:46 AM, Simon Anders <anders at embl.de> wrote:
> Hi Davide
>
> On 29/05/13 10:58, Davide Cittaro wrote:
>> I've been reading about the DESeq normalization strategy and, as far as
>> I understand, it works on a per-sample basis: the counts for each sample are
>> normalized by a factor calculated using the geometric mean
>> of the counts.
>> Three questions:
>> - is this strategy robust when comparing samples with extremely
>> different library sizes?
>
> Sure, why shouldn't it be?
>
You know, just a check :-)
In a small dataset I artificially reduced the counts for one sample
by different factors and checked the ratios between the counts of that
sample and an invariant one. They do differ, but the RMS is
really small.
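
One way such a check could look is sketched below in Python. Everything
here is an assumption made for illustration: the counts, the reduction
factor of 0.1, the use of plain rounding to reduce the counts, and the
choice of the RMS of log2 ratios as the summary. The size factor function
is a bare median-of-ratios sketch, not the DESeq implementation.

```python
import math
from statistics import median


def estimate_size_factors(count_matrix):
    """Median-of-ratios sketch; rows are features, columns are samples."""
    usable = [row for row in count_matrix if min(row) > 0]
    n_samples = len(count_matrix[0])
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    return [
        math.exp(median(math.log(row[j]) - g for row, g in zip(usable, log_geo_means)))
        for j in range(n_samples)
    ]


# Hypothetical invariant sample and an artificially reduced copy of it.
reference = [104, 212, 398, 803, 1590, 57]
factor = 0.1
reduced = [round(c * factor) for c in reference]

matrix = [[r, d] for r, d in zip(reference, reduced)]
sf_ref, sf_red = estimate_size_factors(matrix)

# Compare normalized counts feature by feature on the log2 scale.
log2_ratios = [math.log2((d / sf_red) / (r / sf_ref)) for r, d in matrix]
rms = math.sqrt(sum(x * x for x in log2_ratios) / len(log2_ratios))
```

The ratio `sf_red / sf_ref` should come out close to the reduction factor,
and `rms` measures how far the normalized counts of the two samples
disagree because of rounding.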
>
> The notion of "calculating cpm on normalized counts" is hence a
> contradiction in terms.
I somewhat agree with you, but I'm a bit puzzled because I've seen
this done in other packages (such as edgeR, but that may be another story).
>
>> - counts are calculated on genomic intervals, would the same
>> approach make sense if I use counts on single nucleotides?
>
> In principle, yes. The problem is that once your features are very
> small, very many of the counts may be zero, and the geometric mean of
> any set of numbers containing at least one zero is zero. Hence, you can
> only use features with sufficiently high counts to get a stable
> estimate, and you may not have enough of these.
Well, that also happens with intervals, especially if you deal with
some kinds of ChIP-seq experiments. The way you calculate the size
factors goes through log(counts), and you exclude intervals with at
least one zero count. I tried estimating the size factors from a
subsample (1/10) of my dataset, and the estimates are quite
robust.
My problem, in case that was not clear, is that I would like to have a
normalization strategy for signals across the genome. Typically these
are at the small-interval level (less than 200 bp).
Thanks
d