When PCA doesn't produce clear clusters, should I fall back on t-SNE? How many data points (rows × columns) are needed at a minimum for t-SNE to work? Is t-SNE better suited for single-cell RNA-seq, and PCA better suited for whole-tissue/bulk RNA-seq? In GWAS we usually use PCA, but I guess genotypes have a more linear distribution compared to gene expression. Might it be better, for that reason, to always use t-SNE with RNA-seq?
I think there are some clear use cases for t-SNE, for example within a clustering algorithm, but from my testing and that of others, I think it can potentially lead you astray a bit, and so I recommend a PCA plot for general-purpose bulk RNA-seq EDA (exploratory data analysis). I'm interested in what methods are being developed for factor analysis of scRNA-seq, particularly ZINB-WaVE (Bioc).
A little more on why I prefer PCA for bulk RNA-seq EDA: with some simulations, I have seen t-SNE generate artificial structure (though this may be due to a since-fixed bug in one of the R pkgs), and also t-SNE can "snap" groups apart farther than what represents the data generating mechanism (which I know because I simulated the data). I've been told by t-SNE experts that for both issues, parameters can be optimized such that the artifacts or snapping are minimized, or that PCA should be first applied and the right number of top dimensions passed to t-SNE. Also, I've been told that the "snapping" I observe is a known consequence of the method. Critically I don't think that biologists or investigators (who we like to share our dimension reduction plots with) are aware of what caveats are needed to interpret the t-SNE plots, and that the large scale structure or cluster separation distances should not be interpreted as representative of anything meaningful in the data. The nice thing about PCA (or MDS) is its simplicity, and that, if the groups are overlapping or separated, this is clear from the plot. With PCA, you can compare the inter-cluster distances relative to the intra-cluster distances, and get a sense of their ratio. If the groups are in a particular arrangement, you can look up the loadings to understand why this is the case. The first link is really informative about these qualities with respect to t-SNE, see sections 2 and 3:
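To make the two PCA points above concrete (comparing inter-cluster to intra-cluster distances, and reading the loadings), here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the simulated data (two groups that differ only in "gene 0"), the helper names, and the use of power iteration in place of a real PCA routine. For actual analyses you would use an established PCA implementation.

```python
import math
import random


def center_columns(rows):
    """Subtract each column (gene) mean so PCA describes variation around the mean."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_loadings(x, k=2, iters=300, seed=0):
    """Approximate the top-k PCA loading vectors of centered data x by
    power iteration on X^T X, deflating components already found.
    (Later components can converge slowly if eigenvalues are close.)"""
    rng = random.Random(seed)
    p = len(x[0])
    comps = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(p)]
        for _ in range(iters):
            for c in comps:  # stay orthogonal to earlier components
                d = dot(v, c)
                v = [a - d * b for a, b in zip(v, c)]
            xv = [dot(row, v) for row in x]                          # X v
            w = [dot([row[j] for row in x], xv) for j in range(p)]   # X^T (X v)
            norm = math.sqrt(dot(w, w))
            v = [a / norm for a in w]
        comps.append(v)
    return comps


def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]


def separation_ratio(scores, labels):
    """Distance between the two group centroids divided by the mean
    distance of samples to their own centroid, both measured in PC space."""
    groups = sorted(set(labels))
    cents = {g: centroid([s for s, l in zip(scores, labels) if l == g])
             for g in groups}
    within = sum(math.dist(s, cents[l]) for s, l in zip(scores, labels)) / len(scores)
    between = math.dist(cents[groups[0]], cents[groups[1]])
    return between / within


# Hypothetical simulated data: 6 samples per group, 5 genes.
rng = random.Random(1)
n_per_group, n_genes = 6, 5
labels = ["A"] * n_per_group + ["B"] * n_per_group
data = []
for lab in labels:
    row = [rng.gauss(0, 0.3) for _ in range(n_genes)]
    if lab == "B":
        row[0] += 4.0  # group B differs from group A only in gene 0
    data.append(row)

x = center_columns(data)
loadings = top_loadings(x, k=2)
scores = [[dot(row, c) for c in loadings] for row in x]  # sample coordinates on PC1, PC2
ratio = separation_ratio(scores, labels)
top_gene = max(range(n_genes), key=lambda j: abs(loadings[0][j]))

print(f"between/within distance ratio on PC1-PC2: {ratio:.1f}")
print(f"gene with largest |PC1 loading|: {top_gene}")
```

Because PCA is a linear projection, the ratio printed here reflects distances in the original data, and the PC1 loading points back at the gene that drives the separation. Neither reading carries over to t-SNE coordinates.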
Since Mike mentioned the ZINB-WaVE approach, I will add a few more things to my comment on twitter.
The advantage of PCA or ZINB-WaVE is that these are factor analysis models, hence there is a clear interpretation of the reduced space in which you do the clustering (in the case of ZINB-Wave, the matrix W). Briefly, you assume that the "true" signal is intrinsically low-dimensional, and your factor analysis model estimates that signal in a principled way (if you believe the assumptions of the model, of course). It is much harder to interpret t-SNE coordinates, especially when looking at long-range distances / relations between clusters, which cannot be very easily interpreted.
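As a rough sketch of what "intrinsically low-dimensional" means here, a generic factor model approximates the n × J data matrix (n samples, J genes) by a product of two much smaller matrices. The symbols Y, α, n, J and K are notation assumed for illustration; only W is named in the post above.

```latex
Y \approx W \alpha,
\qquad Y \in \mathbb{R}^{n \times J},\quad
W \in \mathbb{R}^{n \times K},\quad
\alpha \in \mathbb{R}^{K \times J},\quad
K \ll J
```

Each row of W gives one sample's coordinates in the K-dimensional reduced space, and each row of α gives the gene loadings of one factor. In PCA this decomposition is applied to the centered data directly; as I understand it, ZINB-WaVE places a term of this form inside the parameters of a zero-inflated negative binomial model rather than on the counts themselves, so the equation above is a simplification and not the full model.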
As I said, we use t-SNE in our papers, but just as a way to better visualize the clustering / gene expression of particular genes (mostly because our biology collaborators find them "nice"), but we don't use them to infer the clustering.
As for the original question, I don't think that single-cell vs bulk RNA-seq should be treated differently with respect to t-SNE, except that the advantage of t-SNE's visualization is when you have many points (samples) since it avoids overlapping close points. So, I guess that if you have a large-scale bulk RNA-seq experiment, t-SNE could be a good visualization tool.
Davide this is really helpful, thanks. Re: the last sentence, my opinion is, I'd rather settle for overplotting, to avoid the problems with cluster spread and inter-cluster distance and arrangement. I had a data set recently where it was important to see that one cluster was tight (with overplotting) relative to the others.
Ah, so many thanks for these excellent answers and thoughts on this subject. So, now I understand a bit better the reason why people use t-SNE for single-cell RNA-seq: it's just to "better visualize the clustering / gene expression of particular gene" which makes perfect sense if you want certain groups of cells to really 'cluster' together visually to get the point across in that paper. But for exploratory analyses the consensus is PCA! Clear. Thanks!
Mike and Davide have already covered the issue really well. I would just like to stress a couple of points in follow-up. I have found t-SNE to be extremely sensitive to the dimensionality of the reduction, the perplexity, and the number of iterations run. Mike already posted this, where the authors (Wattenberg et al.) present a very nice explanation of the issues with t-SNE and how best to control for them. I have seen many people use the default settings in the t-SNE package and report the results, but I would recommend trying multiple values for the input arguments.
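For readers unsure what the perplexity setting controls: in the original t-SNE formulation, each point gets a Gaussian kernel over its neighbours, and the kernel width is tuned until the perplexity (2 raised to the entropy of the resulting neighbour probabilities) matches the requested value. It can be read as an effective number of neighbours. The sketch below only illustrates that definition for a single point with made-up distances; the function names are hypothetical and it is not taken from any t-SNE package.

```python
import math


def conditional_probs(sq_dists, sigma):
    """Neighbour probabilities p_{j|i} for one point i, given squared
    distances from i to every other point and a Gaussian width sigma."""
    weights = [math.exp(-d / (2 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy


# Five equidistant neighbours: all are weighted equally.
perp_equal = perplexity(conditional_probs([1.0] * 5, sigma=1.0))

# Three near neighbours and two far ones (made-up squared distances).
sq_dists = [1.0, 1.0, 1.0, 25.0, 25.0]
perp_narrow = perplexity(conditional_probs(sq_dists, sigma=1.0))   # narrow kernel
perp_wide = perplexity(conditional_probs(sq_dists, sigma=10.0))    # wide kernel

print(f"equidistant: {perp_equal:.2f}")
print(f"narrow kernel: {perp_narrow:.2f}, wide kernel: {perp_wide:.2f}")
```

With the narrow kernel essentially only the three near neighbours count, while the wide kernel treats all five almost equally. Asking t-SNE for a different perplexity therefore changes which points are considered neighbours at all, which is one reason the resulting plots can differ so much between settings.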
Even with this adjustment, my take from running simulations on different types of data is that the first two PCs/MDS dimensions capture the global structure in the data better than t-SNE (on 2 dims), while local patterns are better preserved by the latter. This is sort of intuitive if one thinks about how the two algorithms work. Also, the clustering from PCA is more believable to me purely because of the statistical understanding of the eigenspaces behind it.
Lastly, to talk about a personal experience: for a paper of mine, I ran t-SNE on bulk RNA-seq data from GTEx tissues (first figure here), and my first reaction was awe at how well t-SNE captured the clusters. But on closer look, I did see a few discrepancies. For instance, the liver samples formed two groups, and on follow-up analysis we did not find this grouping to be meaningful.
There are some useful comments from quantitative folks who know more about t-SNE than me here.