Search
Question: DESeq2 plotCounts() function-what is the unit on y axis?
0
10 months ago by
Alan0
Alan0 wrote:

Hi Michael and community,

Thanks for the work and efforts on DESeq2! I've had this question for some time, I went through quite a number of posts, the manual, and the vignette, but can't really find the answer. What is the unit of the "normalized count" of the y axis of the "plotCounts" plot?

plotCounts(dds, gene=which.min(res\$padj), intgroup="condition")

in the vignette.

Question1: If I type ?plotCounts in R, then it says "Normalized counts plus a pseudocount of 0.5 are shown by default." and also there's this "normalized = TRUE", and  "transform = TRUE" arguments. So is the y axis just raw read counts divided by DESeq2-estimated size factor? or has it been logged? (loge? log2? log10?) Also, why is the psudocount 0.5 here? (it's just to avoid taking log of zero, right? but why don't you set it smaller, say, 0.001?) What happens if we set normalized = FALSE? Likewise,  what happens if we set transform = FALSE?

Question2: This one is indirectly related to the first question. So with the way DESeq2 normalizes raw read counts without the effect of gene size, can we compare the expression of two (or multiple) different genes within the same sample? If not, is there other recommended way of normalization for this purpose?

Alan

ps. R version 3.4.3 (2017-11-30), DESeq2_1.18.1

modified 10 months ago by Michael Love20k • written 10 months ago by Alan0
1
10 months ago by
Michael Love20k
United States
Michael Love20k wrote:

With defaults, normalized=TRUE and transform=TRUE, plotted on the y-axis is the normalized count + 0.5. The y-axis is log scale, but the tick marks are on the count scale. So it's not log counts on the y-axis.

Try this:

coldata <- data.frame(x=factor(c(1,1,2,2)))
cts <- matrix(1:16,ncol=4)
dds <- DESeqDataSetFromMatrix(cts, coldata, ~x)
sizeFactors(dds) <- rep(1,4)
plotCounts(dds, 1, "x")
abline(h=counts(dds, normalized=TRUE)[1,] + .5)

The normalized count is given by counts(dds, normalized=TRUE) and this is K_ij / s_ij in the language of the DESeq2 paper and the formula in the vignette.

If I were to make the pseudocount smaller, if a gene has a 0, it would inflate the vertical space between 0 and 1, where there is little useful information, and squash the part of the plot between 1 and max(count), where there is more useful information. Actually, this fact is related to why we suggest to use variance stabilizing transformations, and why log(x + small pseudocount) is a bad decision for data exploration of counts. Compare these plots:

plot(0:3 + .5, log="y")
plot(0:3 + .001, log="y")

You can't compare DESeq2 normalized counts across genes. I would recommend to compare TPM, for example, estimated by a transcript quantifier and aggregated to the gene level using a package like tximport.