Question

DESeq2 plotCounts() function-what is the unit on y axis?

0

Entering edit mode

Alan ▴ 10

@alan-15011

Last seen 5.3 years ago

Hi Michael and community,

Thanks for the work and efforts on DESeq2! I've had this question for some time, I went through quite a number of posts, the manual, and the vignette, but can't really find the answer. What is the unit of the "normalized count" of the y axis of the "plotCounts" plot?

plotCounts(dds, gene=which.min(res$padj), intgroup="condition")

in the vignette.

Question1: If I type ?plotCounts in R, then it says "Normalized counts plus a pseudocount of 0.5 are shown by default." and also there's this "normalized = TRUE", and "transform = TRUE" arguments. So is the y axis just raw read counts divided by DESeq2-estimated size factor? or has it been logged? (loge? log2? log10?) Also, why is the psudocount 0.5 here? (it's just to avoid taking log of zero, right? but why don't you set it smaller, say, 0.001?) What happens if we set normalized = FALSE? Likewise, what happens if we set transform = FALSE?

Question2: This one is indirectly related to the first question. So with the way DESeq2 normalizes raw read counts without the effect of gene size, can we compare the expression of two (or multiple) different genes within the same sample? If not, is there other recommended way of normalization for this purpose?

Thank you in advance!

Alan

ps. R version 3.4.3 (2017-11-30), DESeq2_1.18.1

deseq2 • 2.8k views

ADD COMMENT • link updated 6.2 years ago by Michael Love 41k • written 6.2 years ago by Alan ▴ 10

score 1 · Answer 1 · 2018-02-13

With defaults, normalized=TRUE and transform=TRUE, plotted on the y-axis is the normalized count + 0.5. The y-axis is log scale, but the tick marks are on the count scale. So it's not log counts on the y-axis.

Try this:

coldata <- data.frame(x=factor(c(1,1,2,2)))
cts <- matrix(1:16,ncol=4)
dds <- DESeqDataSetFromMatrix(cts, coldata, ~x)
sizeFactors(dds) <- rep(1,4)
plotCounts(dds, 1, "x")
abline(h=counts(dds, normalized=TRUE)[1,] + .5)

The normalized count is given by counts(dds, normalized=TRUE) and this is K_ij / s_ij in the language of the DESeq2 paper and the formula in the vignette.

If I were to make the pseudocount smaller, if a gene has a 0, it would inflate the vertical space between 0 and 1, where there is little useful information, and squash the part of the plot between 1 and max(count), where there is more useful information. Actually, this fact is related to why we suggest to use variance stabilizing transformations, and why log(x + small pseudocount) is a bad decision for data exploration of counts. Compare these plots:

plot(0:3 + .5, log="y")
plot(0:3 + .001, log="y")

You can't compare DESeq2 normalized counts across genes. I would recommend to compare TPM, for example, estimated by a transcript quantifier and aggregated to the gene level using a package like tximport.