Limma: posterior distribution of error variance
@gabrielhoffman-8391
Last seen 6 days ago
United States

Hi Gordon Smyth,

I am using limma to perform a standard analysis on an RNA-seq dataset. A colleague asked whether I could test for differential variance between cases and controls: is the estimated error variance for a gene larger in one subset of the data than in another? I could use var.test() built into R, but I figured I could use limma to account for covariates and do the test on the residual variance. Moreover, I'm thinking I can take advantage of the empirical Bayes step, which estimates the posterior mean of the residual variance for each gene. From this I could compare the full posterior distributions of the residual variances from the two subsets of the data, and compute the probability that a draw from one posterior is greater than a draw from the other.

First, is this a reasonable idea? Have you seen this before? Do you have other suggestions?

Second is a more specific technical question that I'm stuck on. In your 2004 paper, you use a scaled inverse chi-squared prior on the residual variance. Since it is a conjugate prior, the posterior has the same form. You state that the posterior mean of the error variance for gene g is $\tilde{s}^2_g = \frac{d_0 s^2_0 + d_g s^2_g}{d_0 + d_g}$, but I'm not able to work out the parameterization of the posterior from this. Is that doable?
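To make this concrete, here is roughly what I had in mind (only a sketch: it fits cases and controls separately, with y, group, design.case and design.control as placeholders, and it guesses that the posterior of the error variance is scaled inverse chi-squared with d_0 + d_g degrees of freedom and scale equal to the posterior mean above, which is exactly the parameterization I can't confirm):

library(limma)

# fit each subset separately, adjusting for the same covariates
fit1 <- eBayes(lmFit(y[, group == "case"], design.case))
fit2 <- eBayes(lmFit(y[, group == "control"], design.control))

# guess: sigma^2_g | data ~ (d0 + dg) * s2.post[g] / chisq(d0 + dg)
drawPosterior <- function(fit, g, ndraws = 1e4) {
  df <- fit$df.prior + fit$df.residual[g]
  df * fit$s2.post[g] / rchisq(ndraws, df = df)
}

g <- 1  # a single gene
mean(drawPosterior(fit1, g) > drawPosterior(fit2, g))  # Pr(case variance > control variance)?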

Cheers,

Gabriel

limma • 906 views

I need to ask whether you are using limma-voom or limma-trend, because the definition of residual variance differs between the two.


Limma-voom

@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

If you expect variances to be larger in the cases than in the controls for most genes, then I recommend estimating empirical sample weights as a function of disease status by

fit <- voomLmFit(y, design, var.group=group)

or

v <- voomWithQualityWeights(y, design, var.group=group)

where group is a factor separating cases from controls. The estimated sample weights will be stored in fit$samples. Although the empirical weights don't provide a p-value, they will nevertheless convincingly demonstrate whether the cases are more variable than the controls.
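One hands-on way to see the pattern, as a sketch (this uses arrayWeights directly rather than the wrappers above, and assumes y is a DGEList and group is the factor just described):

v <- voom(y, design)
aw <- arrayWeights(v, design, var.group = group)  # one weight per sample, modelled as a function of group
tapply(aw, group, mean)                           # a smaller average weight indicates a more variable group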

If you want to conduct differential variability tests for each gene, you might be able to use the DiffVar method from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0465-4.

The posterior distribution of the residual variances is an F-distribution, as shown in Section 4 of my 2004 paper: http://www.statsci.org/smyth/pubs/ebayes.pdf. I suppose you could use the posterior distribution to construct a differential variability test, but I have not tried to do so. The most straightforward method would be to treat log-F as approximately normal with known mean and variance as a function of the degrees of freedom.
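One way to read that last suggestion, as a sketch only (it compares the observed residual variances from two separate fits, with s1 and s2 the residual standard deviations, e.g. fit$sigma, and d1 and d2 the residual degrees of freedom; under equal variances s1^2/s2^2 is F(d1, d2) distributed):

logVR <- 2 * (log(s1) - log(s2))                      # log variance ratio, ~ log F(d1, d2) under the null
mlogF <- digamma(d1/2) - digamma(d2/2) + log(d2/d1)   # mean of log F(d1, d2)
vlogF <- trigamma(d1/2) + trigamma(d2/2)              # variance of log F(d1, d2)
z <- (logVR - mlogF) / sqrt(vlogF)
p <- 2 * pnorm(-abs(z))                               # two-sided p-value from the normal approximation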


I thought I would chime in as the author of DiffVar and say that the varFit function in the missMethyl package can take an edgeR object, run voom and then test for differential variability between groups. The method is inspired by Levene's test and uses limma under the hood. You could try it and see if the results are sensible. There is an RNA-seq example in the vignette: https://bioconductor.org/packages/release/bioc/vignettes/missMethyl/inst/doc/missMethyl.html#rna-seq-expression-data
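For RNA-seq data the call is along these lines (a rough sketch adapted from the vignette, assuming a count matrix counts and a case/control factor group):

library(edgeR)
library(missMethyl)

dge <- DGEList(counts = counts, group = group)
design <- model.matrix(~ group)
keep <- filterByExpr(dge, design)
dge <- calcNormFactors(dge[keep, , keep.lib.sizes = FALSE])

fitvar <- varFit(dge, design = design, coef = c(1, 2))  # runs voom internally, then tests variability
topVar(fitvar, coef = 2)                                # genes ranked by differential variability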


Belinda, Gordon,

Thanks for your replies. I wasn't familiar with Levene's test, but your work makes it really easy to perform a test of differential variance using limma. That seems easier than the direct comparison of posteriors I was considering.

Your paper and code are really clear, but I wanted to get a better understanding of how you weight the residuals by the leverage.

Letting H be the hat matrix, diag(H) gives the hat values. I am familiar with standardized residuals r_i / sqrt(1 - h_i).

In your code, you weight residuals by

n.inv <- diag(design %*% V.inv %*% t(design))   # leverage for each observation
n <- 1/n.inv                                    # reciprocal of the leverage
lvg <- sqrt(n/(n-1))                            # scaling factor applied to the residuals

where n.inv is the leverage for each data point. Do you have a derivation or intuition about that? Why is that preferable to standardized residuals?
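For what it's worth, this is the small check I used to convince myself that n.inv is the leverage, assuming the unweighted case where V.inv = solve(t(design) %*% design):

design <- model.matrix(~ gl(2, 5))                        # toy design: two groups of five samples
V.inv <- solve(t(design) %*% design)
n.inv <- diag(design %*% V.inv %*% t(design))
all.equal(unname(n.inv), hat(design, intercept = FALSE))  # TRUE: n.inv matches the hat values h_i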

Best,

Gabriel
