Question

Normalization by DEseq

0

Entering edit mode

Rui Luo ▴ 20

@rui-luo-4306

Last seen 9.6 years ago

Dear DEseq developers, I have a few questions related to the normalization step in DEseq. It is stated that it will normalize the raw counts by library size, but how the mathmatical idea is? would you mind giving a more detailed explanation? Now I have two groups of metatranscriptome data, one group contain H.pylori, the other not. For sure, I have some transcripts in the first group that are from H.pylori but not is in group two. I am wondering if I want to do differential expression analysis for these two groups, should I filter out the group specific transcripts before putting into DEseq? Will this affect the normalization step? Thanks! best, Laurie -- Rui Luo Lab phone : 310 794-7537 Geschwind Lab Human Genetics Department UCLA [[alternative HTML version deleted]]

Genetics Normalization DESeq Genetics Normalization DESeq • 13k views

ADD COMMENT • link updated 6.2 years ago by diannedai • 0 • written 13.5 years ago by Rui Luo ▴ 20

score 0 · Answer 1 · 2010-10-19

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 14 days ago

EMBL European Molecular Biology Laborat…

Dear Laurie Normalisation: Briefly, the normalisation works as follows: if k_ij is the count of the i-th gene (or in your case, I guess, taxon) in the j-th sample, then we compute f_i as the geometric mean of these values across samples. The normalised count is k_ij / f_i. In more detail, it is described in the paper "Differential expression analysis for sequence count data", a preprint is available at Nature Precedings, (4282), 2010, the full publication will come out in Genome Biology. Zero counts: The statistical model of DESeq includes situations in which the counts are zero in one group and non-zero in others, so I would recommend leaving these taxa in the data, because you will benefit from getting proper statistical inference for these cases, too. (Normalisation should, afaIcs, not significantly be affected, unless there is some really odd asymmetry in your data.) Best wishes Wolfgang Il Oct/19/10 6:56 AM, Rui Luo ha scritto: > Dear DEseq developers, > I have a few questions related to the normalization step in DEseq. > It is stated that it will normalize the raw counts by library size, > but how the mathmatical idea is? would you mind giving a more detailed > explanation? > Now I have two groups of metatranscriptome data, one group contain > H.pylori, the other not. For sure, I have some transcripts in the first > group that are from H.pylori but not is in group two. > I am wondering if I want to do differential expression analysis for > these two groups, should I filter out the group specific transcripts before > putting into DEseq? Will this affect the normalization step? > Thanks! > best, > Laurie > >

ADD COMMENT • link 13.5 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Hi On 10/19/2010 03:10 PM, Wolfgang Huber wrote: > Normalisation: Briefly, the normalisation works as follows: if k_ij is > the count of the i-th gene (or in your case, I guess, taxon) in the j-th > sample, then we compute f_i as the geometric mean of these values across > samples. The normalised count is k_ij / f_i. Wolfgang forgot to mention the median, so let me try again: First consider two replicate samples, indexed with j=1 and j=2. As they are replicates, we expect the counts for a given gene i to always have the same ratio k_i1 / k_i2. Of course, it will not be exactly the same ratio, but if you plot a histogram of these ratios, you will see that there is a clear, narrow peak, and its median (or its mode) should be a good estimate for the actual sequencing depth ratio of the two samples. Even if they are not replicates, the median should still be a reasonable estimator as long as not more than half of the genes are differentially expressed. In order to deal with more than two samples, we define a fictive "reference sample" against which to compare everything. This reference sample has, as value for gene i, the geometric mean f_i over all samples j of the counts k_ij. Then, to get a size factor for sample j, we divide the counts for the genes, k_ij, by the respective reference values f_j, and use the median of all those ratios k_ij/f_i as the size factor: s_j = median_j k_ij / f_i >> I am wondering if I want to do differential expression analysis for >> these two groups, should I filter out the group specific transcripts >> before >> putting into DEseq? Will this affect the normalization step? Normally not, unless the H. pylori transcripts (which, I understand, you expect to turn up only in the patients, not in the healthy control) make up more than around half of the transcripts. A useful diagnostic to double-check the size factors is a histogram of the ratios. You can do this as follows: library(DESeq) # get an example count data set -- or use your data: cds <- makeExampleCountDataSet() # estimate the size factors: cds <- estimateSizeFactors( cds ) # calculate the gene-wise geometric means geomeans <- exp( rowMeans( log( counts(cds) ) ) ) # choose a sample whose size factor estimate we want to check j <- 1 # just for clarity: the size factor was estimated as described above, # so the following two lines give the same result print( sizeFactors(cds)[j] ) print( median( ( counts(cds)[,j]/geomeans )[ geomeans>0 ] ) ) # Plot a histogram of the ratios of which we have just taken # the median: hist( log2( counts(cds)[,j] / geomeans ), breaks=100 ) # This histogram should be unimodal, with a clear peak at the value # of the size factor. Mark it in red: abline( v=log2( sizeFactors(cds)[ j ] ), col="red" ) If the histogram looks fine, the normalization has succeeded. If the size factor does not hit the median, you can either filter down to genes which are present in control and patient, as you suggested, or you could also try to replace the median with another mode estimator, say, the shorth (defined in the genefilter package). (I had once a data set where the shorth seemed to work better than the median, but in general, I'd stick to the median, unless the histogram seems weird.) Best regards Simon

ADD REPLY • link 13.5 years ago Simon Anders ★ 3.7k

score 0 · Answer 2 · 2018-02-28

0

Entering edit mode

diannedai • 0

@diannedai-15120

Last seen 6.2 years ago

Dear Simon:

I am confused about the read normalization.

The mathematical calculation for the normalized reads in DEseq is k_ij/f_i, where Kij is the raw read count for gene i in sample j and the geometric mean f_i over all samples j of the counts k_ij. The size factor s_j will be needed for differential expression analysis but not for the normalized read calculation, right? The size factor s_j is calculated as s_j = median_j *k_ij / f_i =median_j * normalized reads.

Thank you very much

Jun Dai

Des Moines University

ADD COMMENT • link 6.2 years ago diannedai • 0

0

Entering edit mode

No, the normalized counts are calculated as raw counts divided by size factor.

ADD REPLY • link 6.2 years ago Simon Anders ★ 3.7k