Dear DEseq developers,
I have a few questions related to the normalization step in
DEseq.
It is stated that it will normalize the raw counts by library
size,
but how the mathmatical idea is? would you mind giving a more detailed
explanation?
Now I have two groups of metatranscriptome data, one group
contain
H.pylori, the other not. For sure, I have some transcripts in the
first
group that are from H.pylori but not is in group two.
I am wondering if I want to do differential expression
analysis for
these two groups, should I filter out the group specific transcripts
before
putting into DEseq? Will this affect the normalization step?
Thanks!
best,
Laurie
--
Rui Luo
Lab phone : 310 794-7537
Geschwind Lab
Human Genetics Department
UCLA
[[alternative HTML version deleted]]

Dear Laurie
Normalisation: Briefly, the normalisation works as follows: if k_ij is
the count of the i-th gene (or in your case, I guess, taxon) in the
j-th
sample, then we compute f_i as the geometric mean of these values
across
samples. The normalised count is k_ij / f_i.
In more detail, it is described in the paper "Differential expression
analysis for sequence count data", a preprint is available at Nature
Precedings, (4282), 2010, the full publication will come out in Genome
Biology.
Zero counts: The statistical model of DESeq includes situations in
which
the counts are zero in one group and non-zero in others, so I would
recommend leaving these taxa in the data, because you will benefit
from
getting proper statistical inference for these cases, too.
(Normalisation should, afaIcs, not significantly be affected, unless
there is some really odd asymmetry in your data.)
Best wishes
Wolfgang
Il Oct/19/10 6:56 AM, Rui Luo ha scritto:
> Dear DEseq developers,
> I have a few questions related to the normalization step in
DEseq.
> It is stated that it will normalize the raw counts by
library size,
> but how the mathmatical idea is? would you mind giving a more
detailed
> explanation?
> Now I have two groups of metatranscriptome data, one group
contain
> H.pylori, the other not. For sure, I have some transcripts in the
first
> group that are from H.pylori but not is in group two.
> I am wondering if I want to do differential expression
analysis for
> these two groups, should I filter out the group specific transcripts
before
> putting into DEseq? Will this affect the normalization step?
> Thanks!
> best,
> Laurie
>
>

Hi
On 10/19/2010 03:10 PM, Wolfgang Huber wrote:
> Normalisation: Briefly, the normalisation works as follows: if k_ij
is
> the count of the i-th gene (or in your case, I guess, taxon) in the
j-th
> sample, then we compute f_i as the geometric mean of these values
across
> samples. The normalised count is k_ij / f_i.
Wolfgang forgot to mention the median, so let me try again:
First consider two replicate samples, indexed with j=1 and j=2. As
they
are replicates, we expect the counts for a given gene i to always have
the same ratio k_i1 / k_i2. Of course, it will not be exactly the same
ratio, but if you plot a histogram of these ratios, you will see that
there is a clear, narrow peak, and its median (or its mode) should be
a
good estimate for the actual sequencing depth ratio of the two
samples.
Even if they are not replicates, the median should still be a
reasonable
estimator as long as not more than half of the genes are
differentially
expressed.
In order to deal with more than two samples, we define a fictive
"reference sample" against which to compare everything. This reference
sample has, as value for gene i, the geometric mean f_i over all
samples
j of the counts k_ij.
Then, to get a size factor for sample j, we divide the counts for the
genes, k_ij, by the respective reference values f_j, and use the
median
of all those ratios k_ij/f_i as the size factor:
s_j = median_j k_ij / f_i
>> I am wondering if I want to do differential expression analysis for
>> these two groups, should I filter out the group specific
transcripts
>> before
>> putting into DEseq? Will this affect the normalization step?
Normally not, unless the H. pylori transcripts (which, I understand,
you
expect to turn up only in the patients, not in the healthy control)
make
up more than around half of the transcripts.
A useful diagnostic to double-check the size factors is a histogram of
the ratios. You can do this as follows:
library(DESeq)
# get an example count data set -- or use your data:
cds <- makeExampleCountDataSet()
# estimate the size factors:
cds <- estimateSizeFactors( cds )
# calculate the gene-wise geometric means
geomeans <- exp( rowMeans( log( counts(cds) ) ) )
# choose a sample whose size factor estimate we want to check
j <- 1
# just for clarity: the size factor was estimated as described
above,
# so the following two lines give the same result
print( sizeFactors(cds)[j] )
print( median( ( counts(cds)[,j]/geomeans )[ geomeans>0 ] ) )
# Plot a histogram of the ratios of which we have just taken
# the median:
hist( log2( counts(cds)[,j] / geomeans ), breaks=100 )
# This histogram should be unimodal, with a clear peak at the value
# of the size factor. Mark it in red:
abline( v=log2( sizeFactors(cds)[ j ] ), col="red" )
If the histogram looks fine, the normalization has succeeded. If the
size factor does not hit the median, you can either filter down to
genes
which are present in control and patient, as you suggested, or you
could
also try to replace the median with another mode estimator, say, the
shorth (defined in the genefilter package). (I had once a data set
where
the shorth seemed to work better than the median, but in general, I'd
stick to the median, unless the histogram seems weird.)
Best regards
Simon

The mathematical calculation for the normalized reads in DEseq is k_ij/f_i, where Kij is the raw read count for gene i in sample j and the geometric mean f_i over all samples j of the counts k_ij. The size factor s_j will be needed for differential expression analysis but not for the normalized read calculation, right? The size factor s_j is calculated as s_j = median_j *k_ij / f_i =median_j * normalized reads.