Question

95% confidence intervals for PCA in DESEQ2

0

Entering edit mode

tylerdodgeball2019 • 0

@add481ab

Last seen 3.0 years ago

United States

Dear Help,

We have used your package DESEQ2 (including the vst transform) on some RNA-seq data, in order to perform PCA analysis. We were hoping to add Monte-Carlo noise to the data in order to estimate 95% confidence intervals surrounding each cluster in PCA space.

To do so, currently, we are using a multivariate normal distribution. But this produces artifacts in the shapes of the Monte Carlo clouds. We have confirmed that these artifacts are not due to sign flips on the eigenvectors.

Therefore, our suspicion is that we are using the wrong distribution shape for resampling, which is producing the artifacts. Do you know or have a suggestion for which distribution I should use for resampling?

Best regards, Dodgeball

DESEQ2 DESeq2 • 1.5k views

ADD COMMENT • link 3.0 years ago tylerdodgeball2019 • 0

score 0 · Answer 1 · 2021-11-15

0

Entering edit mode

Kevin Blighe ★ 4.0k

@kevin

Last seen 24 days ago

Republic of Ireland

As an aside, you could take a look at the ellipse functionality of PCAtools (my package): 5.2.3 Stat ellipses

The parameter that controls the ellipse distribution is ellipseType:

@param ellipseType [paraphrased from https://ggplot2.tidyverse.org/reference/stat_ellipse.html] The type of ellipse. "t" assumes a multivariate t-distribution, while "norm" assumes a multivariate normal distribution. "euclid" draws a circle with the radius equal to level, representing the euclidean distance from the center. This ellipse probably won't appear circular unless coord_fixed() is applied.

There is a Quick Start for data deriving from DESeq2: 3 Quick start: DESeq2

ADD COMMENT • link 3.0 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Agree with Kevin on using ellipses to show uncertainty via PCAtools.

A note about the vst(). I would recommend to "fix" the VST so it isn't adding variation in the estimation of the dispersion.

The VST in DESeq2 is given by the following R code (in getVarianceStabilizedData())

coefs <- attr(dispersionFunction(object), "coefficients")
vst.fn <- function(q) {
log((1 + coefs["extraPois"] + 2 * coefs["asymptDisp"] * q + 
  2 * sqrt(coefs["asymptDisp"] * q * (1 + coefs["extraPois"] + 
  coefs["asymptDisp"] * q)))/(4 * coefs["asymptDisp"]))/log(2)
}

This formula is derived in vst.pdf in inst/script. You can estimate the dispersion function on the original dataset (estimateSizeFactors followed by estimateDispersionsGeneEst and estimateDispersionsFit). Then you can extract coefs and define vst.fn as above. You can then apply a fixed VST to the MC datasets.

Finally, a question: when you say you are applying "noise to the data", to which data are you adding noise?

ADD REPLY • link 3.0 years ago Michael Love 43k

0

Entering edit mode

Thank you both! I am actually applying noise to the high-dimensional data prior to the entire PCA process. The idea is the following: You add noise to this high-dimensional data based on the mean and covariance of that data. So, I'm just assuming a multivariate normal, then resampling many times from a multivariate normal with those exact mean and covariance values. Then I put the entire cloud of data points through PCA. Then I throw away all but two of the dimensions. Then I fit the ellipse. That is why I'm asking about the proper distribution to sample from.

-Dodgeball

ADD REPLY • link 3.0 years ago tylerdodgeball2019 • 0

0

Entering edit mode

Hmm, I'm not sure I have any constructive thoughts in this direction. We often look at Gibbs posterior distributions and/or bootstrap distributions to model uncertainty for count data, but I haven't taken this approach above.