Question: The calculation of regularized logarithm
0
Yuya Liang10 wrote:

I was wondering (1) how does rlog( )calculate values from raw counts, (2) the relationship between rlog counts and the effect estimator (βir), and (3) what information rlog counts tell.

In vignette:

•                     log2(qij) = Xj βi

Q1: Is log2(qij) here represent the values from rlog (object)
Q2: My understanding from DESeq2 paper is that log2(qij) are the values calculated from Xj βi. Is that correct?
Q3: If Q2 is yes. What's "log qij" values for log qij=∑r Xjr βir? My understanding is: log (raw counts/size factor) instead of log2. Is that correct?

I saw many research made a heatmap of rlog counts. But I'm not sure what kind of information can rlog counts can tell me.
•
• Counts comparisons from my analysis are:
•  Control_1 Control_2 Control_3 Trt_1 Trt_2 Trt_3 Raw count 135 258 190 0 0 0 count(DESeq, normalized = T) 151.73 249.95 198.23 0 0 0 rlog(object, blind=F) 6.50 6.87 6.69 5.33 5.34 5.35 rlog(object, blind=T) 6.50 6.89 6.70 5.20 5.21 5.22 Size factor 0.89 1.03 0.96 1.17 1.02 0.98
• And the log2fold change are:
•  result(object) lfcShrink(res, type = "normal") lfcShrink(res, type = "apeglm") -10.16 -2.68 -11.24

It seems either count(DESeq, normalized = T) or log2foldchange can illustrate the result better. Why we use rlog count for count heatmap?

Would you recommend to present the result by using rlog count heatmap? or log2foldchange?

Thank you!
Yuya

Answer: The calculation of regularized logarithm
2
Michael Love24k wrote:

Yes, rlog is log2(q_ij). I think this formula is clear:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#regularized-log-transformation

The formula you find in the above link is the same as in the Regularized logarithm section in the Methods of the DESeq2 paper.

rlog does not use the design matrix with covariates directly. rlog only uses the dispersion trend, and the dispersion estimates that inform the dispersion trend may (blind=FALSE) or may not (blind=TRUE) have used the design matrix.

It's perfectly fine to instead show (log or not log) normalized counts plus a pseudocount for an individual gene or for a heatmap. Normalized counts is what DESeq2 does for plotting data for individual genes for example (see plotCounts). If you pick a big enough pseudocount, log normalized counts are also good for distance calculation, our point in introducing new transformations is that the optimal pseudocount in terms of stabilizing the systematic variance trends will be different for different datasets.

What we've shown is that for calculating distances between samples over all genes the vst and rlog show some advantages over log plus a pseudocount of 1, for example, which is a common choice. So it makes sense for PCA, distance calculation, clustering. Due to the regularization, effects across samples will by definition be reduced, as they will when you add a pseudocount, but there is nevertheless an advantage for distance calculation from reducing the systematic trend in variance seen at low counts.

Note that the difference in the vst or rlog transformed counts is not the best estimator for the LFC, and this is why we do not recommend to use vst or rlog for statistical analysis of LFCs per gene in the vignette or workflow.

I ran vst and rlog on some simulated data in DESeq2:

> dds <- makeExampleDESeqDataSet(m=6)
> counts(dds)[1,] <- rep(c(0L,200L),each=3)
> counts(dds)[1,]
sample1 sample2 sample3 sample4 sample5 sample6
0       0       0     200     200     200
> rld <- rlog(dds)
> assay(rld)[1,]
sample1  sample2  sample3  sample4  sample5  sample6
3.200444 3.215431 3.205335 6.909778 6.927411 6.884917
> 6.9-3.2
 3.7
> vsd <- varianceStabilizingTransformation(dds)
> assay(vsd)[1,]
sample1  sample2  sample3  sample4  sample5  sample6
4.156427 4.156427 4.156427 7.821539 7.841081 7.794033
> 7.8 - 4.1
 3.7

Thank you for replying! It helps a lot.

I have two more questions:

1. This is the formula I cropped from DESeq2 paper: The log_qij for solving βir is the value from rlog(), is that correct? (I'm not sure is log2 or log10.)

2.   I also found a formula in the above link: y = log2 ( n + 1). Is this y same as log_qij?

I don't understand why sample with 0 raw counts will have the value greater than 0. Just like the example, you showed
above.

Yes, that is the log q_ij we are after, except the standard X is not used but a different one (see the paper Methods and vignette -- the link I gave above -- for full details).

We write log in the Methods, but convert to log2 whenever we give output to the user. So you can imagine it is log2 q_ij. DESeq2 provides estimates of log2 q_ij as the rlog output.

No, you can ignore that y=log2(n+1), that y is not relevant here when talking about rlog. That section is talking about the simple transformation f(x)=log(x+1).

Got it!

I found a previous reply A: How does DESeq2 handle zero counts in one condition?. Is this the reason why sample with 0 raw counts will have the value greater than 0?

Not exactly, but it’s a somewhat related point. That post is about estimated LFCs and your question is about the rlog.

Can you give me some information for why rlog makes raw counts from zero to non-zero?

I expected to see huge different in this kind of extremely opposite (i.e. zero vs. some counts) but not. Even the rlog counts were not that different, they still got large LFC and significant pdaj. I'm just curious.

It shrinks the differences between samples: where you have some samples with 0s and some with high counts, they all move toward the middle. Again it's doing something very different than log(x + pseudocount), and the transformation are useful for computing distances across many genes. For visualizing the data for a given gene, we use normalized counts in the package. For a heatmap you can use log normalized counts or the transformed data, this is up to you.

Thank you so much!!
So I can just simply say that this difference between raw count and rlog count due to "counts in all samples were moved toward the center(middle)" and this will help to calculate distances, correct?  (also help to calculate coefficient as well?)

Not useful for estimating the coefficient, I said this in my first reply above. “Note that the difference in the vst or rlog transformed counts is not the best estimator for the LFC...”

Thank you for bringing this out again!

I didn't express this clearly. I guess "Not recommend to use vst or rlog for statistical analysis of LFCs per gene" meaning not use the rlog counts for comparing control and treatment.

If rlog is not helpful for solving βir, why formula (2) we put rlog instead of others? DESeq() doesn't need to compute distances across genes, right? I don't think we're getting anywhere here, but instead going in circles.

I think that vignette, the workflow, and the paper are the best places to understand the difference between the transformations and the GLM for testing coefficients. All the formula and the motivation for the methods are there.