Question

PCA and heatmaps: most variable genes selection

0


Clara • 0

@deut2016-16915

Last seen 2.3 years ago

Germany

Hi,

questions from a non-statistical expert...

The PCA function within DESeq2 (plotPCA) selects the ntop most variable genes before calculating the PCA. This appears to make differences across samples in the 2D plot clearer. What is the statistical reason for doing this pre-selection? And why use absolute variance rather than, for example, the coefficient of variation (CV)?

And if I want to make a heatmap of the "top variable" genes, is the choice between absolute variance and CV only a matter of which genes we want to focus on, or is one statistically preferable to the other? I would be inclined to use CV, to make the selection independent of the expression level.

Thanks!

Clara

DESeq2 ExpressionData heatmaps pca • 1.7k views

ADD COMMENT • link 2.8 years ago Clara • 0

score 0 · Answer 1 · 2021-06-16

0


Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

We focus on the top genes by variance (in transformed space) because this focuses the plot on the genes that differ most across samples. This is kind of a tautology... For example, we are not interested in the tens of thousands of genes that are not detected, or are barely detected.
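To make the selection step concrete, here is a minimal base R sketch of picking the top genes by variance and then running PCA on them. It is an illustration on simulated data, not the plotPCA source: the matrix `mat` is a made-up stand-in for a variance-stabilized matrix (in a real analysis it would be something like `assay(vst(dds))`), and the gene count, the shift of 2, and `ntop <- 100` are arbitrary assumptions.

```r
# Simulated stand-in for a variance-stabilized matrix (genes x samples).
# Assumption: in a real analysis this would come from the VST output.
set.seed(1)
n_genes <- 1000
n_samples <- 6
mat <- matrix(rnorm(n_genes * n_samples, mean = 8, sd = 0.2),
              nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("sample", seq_len(n_samples))))

# Give the first 50 genes a real difference between samples 1-3 and 4-6.
mat[1:50, 4:6] <- mat[1:50, 4:6] + 2

# Rank genes by their variance across samples and keep the top ntop.
ntop <- 100
rv <- apply(mat, 1, var)
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]

# PCA on the selected genes only (samples as rows, genes as columns).
pca <- prcomp(t(mat[select, ]))
percent_var <- pca$sdev^2 / sum(pca$sdev^2)
```

In this toy setup the 50 genes with a group difference should all land in `select`, and `pca$x[, 1]` should separate samples 1-3 from samples 4-6. The other 900+ genes contribute only noise, which is why leaving them out tends to sharpen the plot.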

We do not need the coefficient of variation because we have already removed the systematic dependence of the variance on the mean when we run plotPCA on VST data, as recommended in the DESeq2 workflow. Dividing by the mean would actually re-introduce a systematic bias (favoring genes with a low mean), which would be undesirable.
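A small simulation can illustrate the bias from dividing by the mean. This is a sketch under an idealized assumption: every gene gets the same standard deviation regardless of its mean, which is roughly what a variance-stabilizing transformation aims for but does not achieve exactly on real data. The mean range (4 to 14), the sd of 0.3, and the cutoff of 200 genes are all arbitrary choices for illustration.

```r
# Idealized "variance-stabilized" data: the sd does not depend on the mean.
set.seed(1)
n_genes <- 2000
n_samples <- 8
gene_mean <- runif(n_genes, min = 4, max = 14)

# rnorm recycles gene_mean down each column, so row i has mean gene_mean[i].
mat <- matrix(rnorm(n_genes * n_samples, mean = gene_mean, sd = 0.3),
              nrow = n_genes)

# Two ways of ranking genes: plain variance vs. coefficient of variation.
rv <- apply(mat, 1, var)
cv <- sqrt(rv) / rowMeans(mat)

top_var <- order(rv, decreasing = TRUE)[1:200]
top_cv  <- order(cv, decreasing = TRUE)[1:200]

# Average true mean of the selected genes under each ranking.
mean_of_top_var <- mean(gene_mean[top_var])
mean_of_top_cv  <- mean(gene_mean[top_cv])

# Rank correlation of each statistic with the true mean.
cor_var_mean <- cor(rv, gene_mean, method = "spearman")
cor_cv_mean  <- cor(cv, gene_mean, method = "spearman")
```

Here `cor_var_mean` should be close to zero, while `cor_cv_mean` should be clearly negative and `mean_of_top_cv` lower than `mean_of_top_var`. In other words, on data where the variance no longer depends on the mean, ranking by CV preferentially picks low-mean genes even though no gene is truly more variable than another.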