Top variable features used by 'runPCA' in scater
jws
@jws-18804

According to the scater vignette (https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-dataviz.html#generating-pca-plots), runPCA by default performs PCA on the log-counts using the 500 features with the most variable expression across all cells.

How is "most variable expression" determined, and how can I extract the names of the selected features (genes)? Thanks!

scater pca features
Aaron Lun
@alun

It's pretty literal. The top 500 genes with the largest variance of the log-counts are used - and that's it. You can get them by doing:

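library(SingleCellExperiment)  # provides logcounts()
vars <- DelayedMatrixStats::rowVars(logcounts(sce))  # per-gene variance of the log-counts
top <- head(order(vars, decreasing=TRUE), 500)  # row indices of the 500 most variable genes
rownames(sce)[top]  # the corresponding gene names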
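To make the selection concrete, here is a minimal base R sketch on a made-up matrix. The values, the gene names, and the use of prcomp as a stand-in for the PCA step are all assumptions for illustration; this is not scater's internal code, just the same "rank by variance, keep the top n" idea on a toy example.

```r
# Hypothetical log-expression matrix: 5 genes (rows) x 4 cells (columns).
logexpr <- rbind(
  geneA = c(1, 1, 1, 1),
  geneB = c(0, 2, 0, 2),
  geneC = c(0, 4, 0, 4),
  geneD = c(1, 2, 3, 4),
  geneE = c(0, 0, 0, 8)
)

# Per-gene variance across cells.
vars <- apply(logexpr, 1, var)

# Keep the top_n genes with the largest variance (500 in the runPCA default).
top_n <- 3
top_idx <- head(order(vars, decreasing = TRUE), top_n)
top_genes <- rownames(logexpr)[top_idx]
print(top_genes)

# PCA on the selected genes only; prcomp expects observations (cells) in rows.
pca <- prcomp(t(logexpr[top_idx, ]))
```

Swapping in logcounts(sce) for the toy matrix and 500 for top_n should give the same set of genes as the snippet above, assuming the SingleCellExperiment has row names.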

There's no consideration of the mean-variance trend or of technical components of variance or anything like that. If you want something more sophisticated, check out trendVar and decomposeVar (or possibly technicalCV2 and improvedCV2) in scran.
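
For intuition about what accounting for the mean-variance trend means, here is a rough base R sketch on simulated data. It is not how the scran functions work internally: the loess fit, the simulated values, and every variable name are assumptions chosen for illustration. The idea is only to fit a smooth trend of variance against mean and rank genes by how far they sit above it, rather than by raw variance.

```r
set.seed(1)
n_genes <- 200
n_cells <- 20

# Simulated log-expression values (hypothetical): each gene has its own mean,
# and the noise grows with the mean so that a mean-variance trend exists.
gene_means <- runif(n_genes, 0, 10)
logexpr <- matrix(
  rnorm(n_genes * n_cells, mean = gene_means, sd = 0.5 + 0.1 * gene_means),
  nrow = n_genes
)
rownames(logexpr) <- paste0("gene", seq_len(n_genes))

# Per-gene mean and variance across cells.
means <- rowMeans(logexpr)
vars <- apply(logexpr, 1, var)

# Fit a smooth trend of variance against mean (stand-in for a fitted trend).
fit <- loess(vars ~ means)

# Variance in excess of the trend; large values are genes more variable
# than expected for their abundance.
excess <- vars - fitted(fit)

# Rank by excess variance instead of raw variance.
top_trend <- rownames(logexpr)[head(order(excess, decreasing = TRUE), 10)]
top_raw <- rownames(logexpr)[head(order(vars, decreasing = TRUE), 10)]
```

Comparing top_trend with top_raw shows how much the ranking depends on abundance when the trend is ignored.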