Dear Help,
We have used your package DESeq2 (including the vst transform) on some RNA-seq data in order to perform PCA. We were hoping to add Monte Carlo noise to the data in order to estimate 95% confidence regions around each cluster in PCA space.
To do so, we are currently sampling from a multivariate normal distribution, but this produces artifacts in the shapes of the Monte Carlo clouds. We have confirmed that these artifacts are not due to sign flips of the eigenvectors.
Our suspicion is therefore that we are resampling from the wrong distribution, and that this is what produces the artifacts. Do you know, or can you suggest, which distribution we should use for resampling?
Best regards, Dodgeball
Agree with Kevin on using ellipses to show uncertainty via PCAtools.
A note about vst(): I would recommend "fixing" the VST, so that re-estimating the dispersion on each dataset doesn't add variation.
The VST in DESeq2 is implemented in R code in getVarianceStabilizedData(). The formula is derived in vst.pdf in inst/script. You can estimate the dispersion function on the original dataset (estimateSizeFactors(), followed by estimateDispersionsGeneEst() and estimateDispersionsFit()). Then you can extract coefs and define vst.fn as in that function. You can then apply this fixed VST to the MC datasets.
Finally, a question: when you say you are applying "noise to the data", to which data are you adding noise?
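As a rough sketch of the fixed-VST idea described above: the dispersion trend is fitted once on the original data, and the resulting closed-form function is then reused unchanged. The body of vst.fn is written from memory of the parametric-fit branch of getVarianceStabilizedData(), so please check it against the DESeq2 source. The simulated dataset from makeExampleDESeqDataSet() and the Poisson-resampled mc_counts matrix are placeholders for illustration only, not a recommendation for how to generate the MC data.

```r
library(DESeq2)

set.seed(1)
# Placeholder for the original dataset (simulated counts).
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Fit the dispersion trend once, on the original data only.
dds <- estimateSizeFactors(dds)
dds <- estimateDispersionsGeneEst(dds)
dds <- estimateDispersionsFit(dds)  # default fitType = "parametric"

# The closed form below only applies to the parametric fit
# (DESeq2 may fall back to a local fit if the parametric fit fails).
stopifnot(attr(dispersionFunction(dds), "fitType") == "parametric")
coefs <- attr(dispersionFunction(dds), "coefficients")

# Fixed VST on normalized counts q; assumed to match the DESeq2 source.
vst.fn <- function(q) {
  log((1 + coefs["extraPois"] + 2 * coefs["asymptDisp"] * q +
         2 * sqrt(coefs["asymptDisp"] * q *
                    (1 + coefs["extraPois"] + coefs["asymptDisp"] * q))) /
        (4 * coefs["asymptDisp"])) / log(2)
}

# Sanity check: applied to the original normalized counts.
vsd_orig <- vst.fn(counts(dds, normalized = TRUE))

# Hypothetical MC dataset (illustrative Poisson resampling of the counts).
cts <- counts(dds)
mc_counts <- matrix(rpois(length(cts), lambda = cts),
                    nrow = nrow(cts), dimnames = dimnames(cts))

# Normalize the MC counts, then apply the same fixed VST (no refitting).
sf_mc <- estimateSizeFactorsForMatrix(mc_counts)
vsd_mc <- vst.fn(sweep(mc_counts, 2, sf_mc, "/"))
```

Because coefs is estimated only once, differences between vsd_mc and vsd_orig reflect the resampled counts (and their size factors) rather than a re-estimated dispersion trend.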
Thank you both! I am actually applying noise to the high-dimensional data, before the entire PCA process. The idea is the following: I compute the mean and covariance of the high-dimensional data, assume a multivariate normal, and resample many times from a multivariate normal with exactly that mean and covariance. Then I put the entire cloud of data points through PCA, throw away all but two of the dimensions, and fit the ellipse. That is why I'm asking about the proper distribution to sample from.
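To make the procedure above concrete, here is a minimal sketch of how I understand it. The data matrix x, the group labels, the number of draws n_mc and the helper ellipse95() are all made up for illustration. It also assumes the mean and covariance are computed per cluster, and that the 95% ellipse is taken from the chi-squared quantile with 2 degrees of freedom.

```r
library(MASS)  # for mvrnorm()

set.seed(1)

# Hypothetical stand-in for the transformed data: 12 samples x 50 features.
n_per_group <- 6
p <- 50
group <- rep(c("A", "B"), each = n_per_group)
x <- matrix(rnorm(2 * n_per_group * p), ncol = p)
x[group == "B", 1:5] <- x[group == "B", 1:5] + 3  # shift group B in 5 features

# Resample each group from a multivariate normal with that group's
# sample mean and covariance (the covariance is rank-deficient here).
n_mc <- 500
mc <- lapply(split(seq_len(nrow(x)), group), function(idx) {
  xi <- x[idx, , drop = FALSE]
  mvrnorm(n_mc, mu = colMeans(xi), Sigma = cov(xi))
})
mc_group <- rep(names(mc), each = n_mc)
cloud <- do.call(rbind, mc)

# Put the whole cloud through PCA and keep the first two components.
pca <- prcomp(cloud, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# Outline of the 95% normal-theory ellipse for a 2-column score matrix.
ellipse95 <- function(s, level = 0.95, n = 100) {
  theta <- seq(0, 2 * pi, length.out = n)
  circle <- rbind(cos(theta), sin(theta))
  r <- sqrt(qchisq(level, df = 2))
  t(colMeans(s) + r * t(chol(cov(s))) %*% circle)
}

ellipses <- lapply(split(seq_len(nrow(scores)), mc_group),
                   function(idx) ellipse95(scores[idx, ]))
```

Each element of ellipses is a 100 x 2 matrix of outline points that can be drawn over the PCA plot with lines().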
-Dodgeball
Hmm, I'm not sure I have any constructive thoughts in this direction. We often look at Gibbs posterior distributions and/or bootstrap distributions to model uncertainty for count data, but I haven't tried the approach you describe above.
No worries, thank you for taking a look at it! -Dodgeball