2.2 years ago by
I think there are some clear use cases for t-SNE, for example as a step within a clustering algorithm, but from my own testing and that of others, I think it can lead you astray, so I recommend a PCA plot for general-purpose EDA (exploratory data analysis) of bulk RNA-seq. I'm interested to see what methods are developed for factor analysis of scRNA-seq, particularly ZINB-WaVE (available on Bioconductor).
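A minimal sketch of what a PCA-based check on bulk data could look like, in Python with NumPy. Everything here is an assumption made for illustration: the data are simulated (12 samples, 200 genes, two groups, a shift of 3 in the first 20 genes), PCA is done by a plain SVD of the centered matrix rather than by any particular RNA-seq package, and the names `scores`, `loadings`, and `ratio` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log-scale expression: 12 samples x 200 genes, two groups of 6.
# All sizes and the shift below are illustrative assumptions, not real data.
n_per_group, n_genes = 6, 200
groups = np.array([0] * n_per_group + [1] * n_per_group)
X = rng.normal(size=(2 * n_per_group, n_genes))
X[groups == 1, :20] += 3.0  # group 1 differs in the first 20 genes only

# PCA via SVD of the column-centered (gene-centered) matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                      # sample coordinates on each PC
loadings = Vt                       # row k holds the gene weights for PC k+1
var_explained = s**2 / np.sum(s**2)

# Compare inter-cluster to intra-cluster distance in the PC1/PC2 plane.
pcs = scores[:, :2]
centroids = np.array([pcs[groups == g].mean(axis=0) for g in (0, 1)])
inter = np.linalg.norm(centroids[0] - centroids[1])
intra = np.mean([
    np.linalg.norm(pcs[groups == g] - centroids[g], axis=1).mean()
    for g in (0, 1)
])
ratio = inter / intra

# Genes with the largest absolute PC1 loadings drive the arrangement on PC1.
top_genes = np.argsort(np.abs(loadings[0]))[::-1][:20]

print(f"PC1/PC2 variance explained: {var_explained[:2].round(3)}")
print(f"inter/intra distance ratio: {ratio:.2f}")
print(f"top PC1 genes (indices): {sorted(top_genes.tolist())}")
```

Because PCA is a linear projection, the ratio computed here relates directly to distances in the original gene space, and the top loadings can be traced back to specific genes.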
A little more on why I prefer PCA for bulk RNA-seq EDA. In some simulations, I have seen t-SNE generate artificial structure (though this may have been due to a since-fixed bug in one of the R packages). t-SNE can also "snap" groups farther apart than the data-generating mechanism warrants, which I know because I simulated the data. t-SNE experts have told me that, for both issues, the parameters can be tuned so that the artifacts or snapping are minimized, or that PCA should be applied first and the right number of top dimensions passed to t-SNE. I've also been told that the "snapping" I observe is a known consequence of the method.

Critically, I don't think that the biologists or investigators with whom we like to share our dimension reduction plots are aware of the caveats needed to interpret t-SNE plots, in particular that the large-scale structure and the distances separating clusters should not be interpreted as representing anything meaningful in the data.

The nice thing about PCA (or MDS) is its simplicity: if the groups are overlapping or separated, this is clear from the plot. With PCA, you can compare the inter-cluster distances to the intra-cluster distances and get a sense of their ratio. If the groups are in a particular arrangement, you can look up the loadings to understand why this is the case. The first link is really informative about these qualities with respect to t-SNE, see sections 2 and 3: