PCA and heatmaps: most variable genes selection
Clara • 0
@deut2016-16915
Last seen 2.3 years ago
Germany

Hi,

A couple of questions from a non-statistician.

The plotPCA function in DESeq2 selects the ntop most variable genes before computing the PCA. This appears to make the differences across samples clearer in the 2D plot. What is the statistical reason for this pre-selection? And why use absolute variance rather than, for example, the coefficient of variation (CV)?

And if I want to make a heatmap of the "top variable" genes, is the choice between absolute variance and CV only a matter of which genes we want to focus on, or is one statistically preferable to the other? I would be inclined to use CV, to make the selection independent of the expression level.

Thanks!

Clara

DESeq2 ExpressionData heatmaps pca • 1.7k views
@mikelove
Last seen 1 hour ago
United States

We select the top genes by variance (in the transformed space) because this focuses the plot on the genes that differ most across samples. This is kind of a tautology, but the point is that we are not interested in, for example, the tens of thousands of genes that are not detected or are barely detected.
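To make the selection step concrete, here is a minimal sketch in plain Python (not the DESeq2 source). It assumes a matrix of already-transformed values with genes as rows and samples as columns. The function name `top_variable_genes` and the toy numbers are hypothetical, chosen only for illustration.

```python
from statistics import variance

def top_variable_genes(matrix, ntop=500):
    """Return the row indices of the ntop rows with the largest sample variance."""
    row_vars = [variance(row) for row in matrix]
    # Sort gene indices from highest to lowest variance.
    order = sorted(range(len(matrix)), key=lambda i: row_vars[i], reverse=True)
    return order[:min(ntop, len(order))]

# Hypothetical transformed values: rows are genes, columns are samples.
toy = [
    [0.1, 0.1, 0.1, 0.1],  # barely detected, flat across samples
    [8.0, 8.1, 7.9, 8.0],  # expressed, but nearly constant
    [5.0, 9.0, 5.2, 9.1],  # differs between groups of samples
    [6.0, 6.5, 9.5, 9.0],  # differs between groups of samples
]

selected = top_variable_genes(toy, ntop=2)
print(selected)
```

Only the two rows that actually change across samples are kept. The flat rows, whether lowly or highly expressed, would contribute little to separating samples in a PCA or heatmap.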

We do not need the coefficient of variation, because the systematic dependence of the variance on the mean has already been removed when plotPCA is run on VST data, as recommended in the DESeq2 workflow. Dividing by the mean would actually re-introduce a systematic dependence on the mean, which would be undesirable.
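A small numeric sketch of that last point, again in plain Python with hypothetical toy values. It assumes the transformation has done its job, so the three made-up genes have exactly the same spread around different mean levels.

```python
from statistics import mean, stdev

def coefficient_of_variation(row):
    """Standard deviation divided by the mean."""
    return stdev(row) / mean(row)

# Same spread around three different mean levels (hypothetical values),
# mimicking data where the variance no longer depends on the mean.
offsets = [-0.5, 0.5, -0.5, 0.5]
genes = {m: [m + d for d in offsets] for m in (2.0, 6.0, 12.0)}

sds = {m: stdev(values) for m, values in genes.items()}
cvs = {m: coefficient_of_variation(values) for m, values in genes.items()}

# Rank the genes (keyed by their mean level) from highest to lowest CV.
ranked_by_cv = sorted(genes, key=cvs.get, reverse=True)

for m in genes:
    print(f"mean={m:5.1f}  sd={sds[m]:.3f}  cv={cvs[m]:.3f}")
print(ranked_by_cv)
```

The standard deviations are identical, so ranking by variance treats the three genes equally. Ranking by CV puts the lowest-mean gene first purely because of its mean, which is the kind of systematic dependence on expression level the transformation was meant to remove.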


Thank you!
