Question

Normalized counts from DESeq2 results in similar but not equal total read count?

ricardo3889 • 0

University of Pennsylvania

Hi, I think this question has been asked before in different ways, but maybe someone can help me understand it a bit better. My question stems from the need to visually represent the expression levels of a gene in my two groups. I know plotCounts() uses the normalized counts from counts(dds, normalized=T), but when I looked at the total read counts per library, I realized that the DESeq2 normalization didn't result in exactly equal library sizes across my samples. Is this expected, or is there a parameter I am missing that would give equal read totals across libraries?

Along the same lines, if I want to visually represent the expression changes of a gene, should I use the normalized counts from counts(x, normalized=T), as plotCounts() does, or the values from rlog() or vst()? Thank you for your help. My counts are below.

colSums(counts(dds))
       WT_rep3    WT_rep4   WT_rep13   WT_rep14  Null_rep1  Null_rep2  Null_rep3  Null_rep4 
      25372528   25524255   35306510   34688537   28857148   29386607   28380245   24795934 
    Null_rep11 Null_rep12 
      66139067   34391514
colSums(counts(dds, normalized=T))
   WT_rep3    WT_rep4   WT_rep13   WT_rep14  Null_rep1  Null_rep2  Null_rep3  Null_rep4 
  32400476   31980209   30906366   31123613   32129757   32307761   32902001   31931771 
Null_rep11 Null_rep12 
  31123321   3126603

deseq2 normalization • 599 views

updated 5.2 years ago by James W. MacDonald • written 5.2 years ago by ricardo3889

score 0 · Answer 1 · 2019-02-12

The only time you would expect the normalized counts to sum to the same exact value across libraries would be if you expect that there are no differentially expressed genes, in which case any differences in library size are due only to technical differences (starting amount of mRNA, variability in library prep, etc).

But if there are some genes that are differentially expressed (and particularly if some of those genes are highly differentially expressed), then you would probably want to exclude them when computing the size factors that you will use to normalize, because the point of the normalization is to account for technical differences while still retaining biological differences. If you included the genes that are likely to be changing expression, then you run the risk of erasing some of the biological signal you want.

There are lots of different ways to pick the genes used for normalization, with the aim of (hopefully) using only genes whose differences between samples are technical rather than biological. If you care to know more, there are papers you can read (see for example the citation in ?estimateSizeFactors).
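
To make the point above concrete, here is a minimal base-R sketch of the median-of-ratios idea that estimateSizeFactors() uses by default. The toy count matrix, the gene and sample names, and the variable names are all made up for illustration; this is not the poster's data and not DESeq2's actual code. Sample B is sequenced twice as deeply as sample A, and gene g4 is additionally strongly up in B.

```r
# Hypothetical toy counts: 4 genes x 2 samples (illustrative numbers only).
# g1-g3 differ between A and B only by sequencing depth (2x);
# g4 is also biologically up in B.
counts <- matrix(
  c(100,  200,
    200,  400,
    300,  600,
    400, 8000),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("g", 1:4), c("A", "B"))
)

# Per-gene reference: log of the geometric mean across samples.
log_geo_means <- rowMeans(log(counts))

# Size factor per sample: median over genes of (count / reference).
# The median is what keeps a few strongly changing genes from driving the factor.
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_means))
})

# Normalized counts: divide each column by its size factor.
norm_counts <- sweep(counts, 2, size_factors, "/")

print(size_factors)
print(norm_counts)
print(colSums(norm_counts))
```

In this toy case the size factors differ by a factor of two, matching the depth difference, so g1 to g3 end up with identical normalized counts in both samples. The column sums still differ (about 1414 for A versus about 6505 for B) because the g4 difference is kept as biological signal. Scaling both samples to the same total would instead shrink g4's apparent change and make g1 to g3 look down-regulated in B.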