Question

How do you know if your DESeq2 data has normal distribution?


ecg1g15 ▴ 20


Hi,

I am working with a set of genes across 20 samples, so I have been using the DESeq2 package. When plotting a PCA (after normalising using vat etc...), I would like to draw ellipses at a high confidence level to justify the clustering. At the moment this works very well with the t-distribution and normal-distribution ellipses. However, I am unsure how my data are distributed (normal, t-distribution?). How can I find out?

Another option is to use Euclidean distance and decrease the confidence level, but is this recommended?

This is more of a conceptual question than a code question, so I am not expecting answers with code, but here is what I have used:

library(ggplot2)

# PCA scatter plot with a dashed black Euclidean ellipse
ggplot(pcaData, aes(x = PC1, y = PC2, color = A, shape = B)) +
  geom_point(size = 3) +
  scale_color_gradientn(colours = rainbow(10)) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed() +
  stat_ellipse(type = "euclid", lty = 2, col = 1)

R pca ellipse normaldistribution

updated 4.1 years ago by Kevin Blighe ★ 3.9k • written 4.1 years ago by ecg1g15 ▴ 20

score 0 · Answer 1 · 2020-03-25

after normalising using vat etc...

I presume that you mean vst()? Just to be clear, vst() applies a variance-stabilising transformation to your normalised count data. In a typical workflow, raw counts are normalised via DESeq() (which estimates the size factors), and a transformation for downstream analyses is then performed via rlog() or vst().
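A minimal sketch of that workflow, for illustration only. It uses simulated counts from makeExampleDESeqDataSet() in place of a real experiment, and the sample grouping column `condition` comes from that simulated data set; both are assumptions and not the original poster's data. The names `pcaData` and `percentVar` are chosen to match the plotting code in the question.

```r
library(DESeq2)

set.seed(1)
# Simulated counts stand in for a real experiment (assumption)
dds <- makeExampleDESeqDataSet(n = 5000, m = 20)

# Size-factor normalisation, dispersion estimation and model fitting
dds <- DESeq(dds)

# Variance-stabilising transformation for exploratory analyses
vsd <- vst(dds, blind = TRUE)

# PCA coordinates per sample, plus percentage of variance for PC1 and PC2
pcaData <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
```

With a real data set, `dds` would instead be built from the count matrix and sample table, and `intgroup` would name the relevant column of the sample table.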

You can check the distribution of your transformed expression data via hist(). Generally, though, using the variance-stabilised expression levels is fine for most downstream analyses, including PCA and clustering via Euclidean distance. Just check for outlier samples, of course, and ensure that you have removed, for example, genes that have high missingness.
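As a rough sketch of such a check, the following uses simulated counts again (an assumption, not the original data) and looks at the transformed values with hist() and boxplot(). It then looks at the PC1 scores with a normal Q-Q plot, since stat_ellipse() with type = "norm" or type = "t" assumes a multivariate normal or t-distribution for the plotted points. The Shapiro-Wilk test is included only as one possible formal check; with about 20 samples it has limited power, so the plots are likely more informative.

```r
library(DESeq2)

set.seed(1)
# Simulated counts stand in for a real experiment (assumption)
dds <- makeExampleDESeqDataSet(n = 5000, m = 20)
vsd <- vst(dds, blind = TRUE)  # estimates size factors itself if none are set
mat <- assay(vsd)              # genes x samples matrix of transformed values

# Overall distribution of the transformed expression values
hist(mat, breaks = 50, main = "VST expression", xlab = "Transformed value")

# Per-sample distributions, useful for spotting outlier samples
boxplot(mat, las = 2, main = "VST expression per sample")

# PCA on samples; inspect the scores that the ellipses would be drawn around
pca <- prcomp(t(mat))
scores <- pca$x[, 1:2]

# Normal Q-Q plot of the PC1 scores
qqnorm(scores[, 1], main = "Normal Q-Q plot of PC1 scores")
qqline(scores[, 1])

# Optional formal test of normality for PC1 (low power at this sample size)
sw <- shapiro.test(scores[, 1])
sw$p.value
```

If the samples fall into distinct groups, the pooled PC scores will not look normal even when each group does, so it may make more sense to inspect the scores within each group that gets its own ellipse.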

Kevin