Question: What does DESeq2 rlog() return exactly?
lanhuong20 wrote:

In the documentation of the rlog() function we can find that

rlog(K_ij) = log2(q_ij) = beta_i0 + beta_ij,

This means the function would return the log2-transformed data after normalizing by a size factor, estimating the dispersions, and shrinking the dispersions and then the beta parameters.

According to the paper accompanying the DESeq2 package, the model for q_ij seems to be:

q_ij  =  exp(x^T * beta)

where x is the vector of covariates and beta is the vector of coefficients in the negative binomial GLM.

It seems that if we have only one factor covariate with two possible levels, then x is in {0,1} and there are only two possible values for beta_1j (depending on whether x_j = 1 or 0).

However, when I run rlog on the raw count data, the transformed counts still differ (although they are similar) from column to column, even for columns belonging to the same class (with the same covariate value).

It would be great if one of the developers could answer this question. I would greatly appreciate it.

Best,

Lan

Answer: What does DESeq2 rlog() return exactly?
Michael Love25k wrote:
Can you read over the description of rlog in the DESeq2 paper and come back with more questions if that part is not clear?
Answer: What does DESeq2 rlog() return exactly?
lanhuong20 wrote:

Hi,

The part on rlog in the DESeq2 paper is a bit short, so I want to make sure I understand all the steps correctly. Is this the sequence of operations done by rlog?

1. A matrix of initial LFC estimates is computed as M_ij = log_2[ (K_ij/s_j + 1/2) / mean_j(K_ij/s_j + 1/2) ] for all i and j.

2. The prior variance is found for each row of M_ij by matching quantiles to those of a zero-centered normal distribution.

3. The negative binomial GLM is fit to every row of M using only an intercept term to obtain row-wise dispersion estimates.

4. A trend is fit to the dispersion estimates to get alpha_tr(mu_bar), which captures the dependence of the variance on the mean of the normalized counts.

5. Using a design matrix M x (N+1) with a column of all ones and the indicator columns corresponding to every sample, and priors from step 2, rlog fits a GLM negative binomial model with dispersion parameters fixed at estimates from the trend alpha_tr(mu_bar) to each row of the LFC matrix M.

Is this the correct understanding of the procedure? Are there any steps that are missing in the above?

Thank you!

1) Yes.

2) We calculate one prior variance for the whole matrix: "The prior variance is found by matching the 97.5% quantile of a zero-centered normal distribution to the 95% quantile of the absolute values in the LFC matrix."

3-4) Yes, if blind=TRUE, otherwise we use the dispersion trend already calculated using the experimental design (see vignette discussion of blind=TRUE or FALSE)

5) Yes.
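
To make steps 1 and 2 concrete, here is a rough sketch in plain Python of the LFC matrix and the single matrix-wide prior variance obtained by quantile matching. This is only an illustration of the arithmetic described above, not the code DESeq2 runs: the function names, the toy counts and size factors, and the choice of a linearly interpolated empirical quantile are all assumptions made for the example.

```python
import math
from statistics import NormalDist


def lfc_matrix(counts, size_factors):
    """Step 1: log2 of (normalized count + 1/2) relative to the row mean of the same quantity."""
    rows = []
    for row in counts:
        shifted = [k / s + 0.5 for k, s in zip(row, size_factors)]
        row_mean = sum(shifted) / len(shifted)
        rows.append([math.log2(v / row_mean) for v in shifted])
    return rows


def quantile(values, p):
    """Empirical quantile with linear interpolation (an assumed convention)."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def prior_variance(lfc, upper_q=0.95, normal_q=0.975):
    """Step 2: one prior variance for the whole matrix.

    Choose sigma so that the normal_q quantile of N(0, sigma^2)
    equals the upper_q quantile of |M_ij| over all entries.
    """
    abs_vals = [abs(v) for row in lfc for v in row]
    sigma = quantile(abs_vals, upper_q) / NormalDist().inv_cdf(normal_q)
    return sigma ** 2


# Toy data (hypothetical): 3 genes x 4 samples.
counts = [[10, 12, 30, 28], [0, 1, 0, 2], [500, 520, 480, 510]]
size_factors = [1.0, 1.1, 0.9, 1.0]

M = lfc_matrix(counts, size_factors)
sigma2 = prior_variance(M)
```

Note that the low-count gene contributes large absolute LFCs to M even though they carry little information, which is part of why the later GLM fit uses the dispersions and not just this matrix.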

The idea is to shrink sample-to-sample differences when there is little information (low counts) and to preserve these differences when there is information (high counts).
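
The design matrix from step 5 also explains why samples from the same class still get different rlog values: every sample has its own indicator column, so every sample gets its own term beta_ij on top of the gene's intercept beta_i0. The sketch below builds such a matrix in plain Python and evaluates beta_i0 + beta_ij for one gene. The helper names and the coefficient values are made up for illustration; in rlog the coefficients come from the shrunken GLM fit. With more columns than samples, it is the prior on the sample terms that makes the fit well defined.

```python
def rlog_design_matrix(n_samples):
    """n_samples x (n_samples + 1): a column of ones, then one indicator column per sample."""
    return [
        [1] + [1 if j == k else 0 for k in range(n_samples)]
        for j in range(n_samples)
    ]


def fitted_log2(design_row, beta):
    """x_j^T beta for one sample, i.e. beta_i0 + beta_ij on the log2 scale."""
    return sum(x * b for x, b in zip(design_row, beta))


X = rlog_design_matrix(4)

# Hypothetical coefficients for one gene: beta_i0 followed by beta_i1..beta_i4.
beta = [5.0, 0.2, -0.1, 0.05, -0.15]

# One value per sample, regardless of which class the sample belongs to.
rlog_values = [fitted_log2(row, beta) for row in X]
```

Stronger shrinkage pulls the beta_ij toward zero, so the per-sample values move toward the shared intercept, which matches the behavior described for low counts.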