Question

What to use: PCA or tSNE dimension reduction in DESeq2 analysis?

1

s.w.vanderlaan ▴ 30

@swvanderlaan-12768

Hi,

I am curious why some papers and workflows (http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) using DESeq2 use PCA to cluster data to identify outliers in RNA-seq analyses. I've noticed that many single-cell RNA-seq experiments use tSNE to reduce data and visualise it. Also, the clustering seems a bit better - or maybe that's just "coincidence" (https://www.nature.com/ng/journal/v48/n10/fig_tab/ng.3646_SF7.html).

When clear clusters aren't formed with PCA, should I switch to tSNE? How many data points (rows × columns) are needed at a minimum for tSNE to work? Is tSNE better suited to single-cell RNA-seq, and PCA better suited to whole-tissue/bulk RNA-seq? In GWAS we usually use PCA, but I guess genotypes have a more linear distribution than gene expression. Might it be better, for that reason, to always use tSNE with RNA-seq?

Thanks and best,

Sander

deseq2 rnaseq pca tsne • 13k views

ADD COMMENT • link updated 8.4 years ago by Kushal K Dey ▴ 10 • written 8.4 years ago by s.w.vanderlaan ▴ 30

score 5 · Answer 1 · 2017-06-29

5

Michael Love 43k

@mikelove

United States

I think there are some clear use cases for t-SNE, for example within a clustering algorithm, but from my testing and that of others, I think it can lead you astray a bit, and so I recommend a PCA plot for general-purpose bulk RNA-seq EDA (exploratory data analysis). I'm interested in what methods are developed for factor analysis of scRNA-seq, particularly ZINB-WaVE (Bioc).

A little more on why I prefer PCA for bulk RNA-seq EDA: with some simulations, I have seen t-SNE generate artificial structure (though this may be due to a since-fixed bug in one of the R pkgs), and t-SNE can also "snap" groups farther apart than the data generating mechanism warrants (which I know because I simulated the data). I've been told by t-SNE experts that, for both issues, parameters can be optimized such that the artifacts or snapping are minimized, or that PCA should be applied first and the right number of top dimensions passed to t-SNE. I've also been told that the "snapping" I observe is a known consequence of the method. Critically, I don't think that biologists or investigators (with whom we like to share our dimension reduction plots) are aware of the caveats needed to interpret t-SNE plots, in particular that the large-scale structure or cluster separation distances should not be interpreted as representative of anything meaningful in the data. The nice thing about PCA (or MDS) is its simplicity: if the groups are overlapping or separated, this is clear from the plot. With PCA, you can compare the inter-cluster distances relative to the intra-cluster distances, and get a sense of their ratio. If the groups are in a particular arrangement, you can look up the loadings to understand why this is the case. The first link is really informative about these qualities with respect to t-SNE, see sections 2 and 3:

http://distill.pub/2016/misread-tsne/
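
To make the point about inter- versus intra-cluster distances and loadings concrete, here is a minimal sketch in Python with NumPy (not the DESeq2 `plotPCA` code). The data are simulated and purely hypothetical: 3 groups of 10 samples, 200 genes, with only the first 20 genes differing between groups. In a real analysis the matrix would be variance-stabilised counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for variance-stabilised expression values:
# 3 groups x 10 samples, 200 genes; only the first 20 genes differ between groups.
n_per_group, n_genes = 10, 200
groups = np.repeat(np.arange(3), n_per_group)
shift = np.zeros((3, n_genes))
shift[1, :20] = 2.0
shift[2, :20] = -2.0
X = rng.normal(size=(groups.size, n_genes)) + shift[groups]

# PCA via SVD of the column-centred samples-by-genes matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]      # sample coordinates on PC1/PC2
loadings = Vt[:2]              # gene weights for PC1/PC2
var_explained = s**2 / np.sum(s**2)

# Inter-cluster vs intra-cluster distance in the PC1/PC2 plane.
centroids = np.array([scores[groups == g].mean(axis=0) for g in range(3)])
intra = np.mean([
    np.linalg.norm(scores[groups == g] - centroids[g], axis=1).mean()
    for g in range(3)
])
inter = np.mean([
    np.linalg.norm(centroids[a] - centroids[b])
    for a in range(3) for b in range(a + 1, 3)
])
ratio = inter / intra

# Genes with the largest absolute loading on PC1 drive the group arrangement.
top_genes = np.argsort(-np.abs(loadings[0]))[:10]

print("PC1/PC2 variance explained:", var_explained[:2].round(3))
print(f"inter/intra distance ratio: {ratio:.2f}")
print("top PC1 genes (column indices):", sorted(top_genes.tolist()))
```

Because PCA is a linear projection, the ratio computed here has a direct reading in terms of the original data, and the top-loading genes should be the ones that were shifted in the simulation. Neither reading carries over to t-SNE coordinates.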

ADD COMMENT • link 8.4 years ago Michael Love 43k

2

Since Mike mentioned the ZINB-WaVE approach, I will add a few more things to my comment on twitter.

The advantage of PCA or ZINB-WaVE is that these are factor analysis models, hence there is a clear interpretation of the reduced space in which you do the clustering (in the case of ZINB-WaVE, the matrix W). Briefly, you assume that the "true" signal is intrinsically low-dimensional, and your factor analysis model estimates that signal in a principled way (if you believe the assumptions of the model, of course). It is much harder to interpret t-SNE coordinates, especially long-range distances / relations between clusters.
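
As a rough illustration of the "intrinsically low-dimensional signal" idea, the sketch below simulates a toy Gaussian factor model and recovers the factor space with PCA. This is an assumption-laden stand-in, not ZINB-WaVE (which models zero-inflated counts); the dimensions, noise level, and names `W` and `A` are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy factor model: X = W @ A + noise, with k = 2 latent factors.
n_samples, n_genes, k = 50, 300, 2
W = rng.normal(size=(n_samples, k))   # per-sample factors (the "true" signal)
A = rng.normal(size=(k, n_genes))     # per-gene loadings
X = W @ A + 0.5 * rng.normal(size=(n_samples, n_genes))

# PCA estimate of the low-dimensional signal.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_explained = s**2 / np.sum(s**2)
W_hat = U[:, :k] * s[:k]

# W is only identifiable up to rotation/scaling, so compare subspaces:
# regress the centred true factors on the estimated ones and report R^2.
Wc = W - W.mean(axis=0)
coef, *_ = np.linalg.lstsq(W_hat, Wc, rcond=None)
resid = Wc - W_hat @ coef
r2 = 1.0 - np.sum(resid**2) / np.sum(Wc**2)

print(f"variance explained by first {k} PCs: {var_explained[:k].sum():.3f}")
print(f"R^2 of true factors given estimated factors: {r2:.3f}")
```

When the model assumptions hold, the estimated coordinates are a linear function of the true factors, which is what gives the reduced space its interpretation.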

As I said, we use t-SNE in our papers, but only as a way to better visualize the clustering / gene expression of particular genes (mostly because our biology collaborators find the plots "nice"); we don't use it to infer the clustering.

As for the original question, I don't think that single-cell vs bulk RNA-seq should be treated differently with respect to t-SNE, except that the advantage of t-SNE's visualization shows when you have many points (samples), since it avoids overplotting nearby points. So, I guess that if you have a large-scale bulk RNA-seq experiment, t-SNE could be a good visualization tool.

ADD REPLY • link 8.4 years ago davide risso ▴ 980

0

Davide this is really helpful, thanks. Re: the last sentence, my opinion is, I'd rather settle for overplotting, to avoid the problems with cluster spread and inter-cluster distance and arrangement. I had a data set recently where it was important to see that one cluster was tight (with overplotting) relative to the others.

ADD REPLY • link 8.4 years ago Michael Love 43k

0

Ah, so many thanks for these excellent answers and thoughts on this subject. So, now I understand a bit better the reason why people use t-SNE for single-cell RNA-seq: it's just to "better visualize the clustering / gene expression of particular gene" which makes perfect sense if you want certain groups of cells to really 'cluster' together visually to get the point across in that paper. But for exploratory analyses the consensus is PCA! Clear. Thanks!

ADD REPLY • link 8.3 years ago s.w.vanderlaan ▴ 30

0

There are some useful comments from quantitative folks who know more about t-SNE than me here.

ADD REPLY • link 8.4 years ago Michael Love 43k

score 1 · Answer 2 · 2017-06-30

Mike and Davide have already covered the issue really well. I would just like to stress a couple of points in follow-up. I have found t-SNE to be extremely sensitive to the dimensionality of the reduction, the perplexity and the number of iterations run. Mike already posted this, where the authors (Wattenberg et al) present a very nice explanation of the issues with t-SNE and how best to control for them. I have seen many people use the default settings in the t-SNE package and report the results, but I would recommend trying multiple options for the input arguments.
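
A minimal sketch of such a sweep, assuming Python with NumPy and scikit-learn (`sklearn.manifold.TSNE`) rather than the R packages discussed in this thread. The input is hypothetical pure-noise data with no group structure, first reduced to its top 10 PCs, so any apparent clusters in a given embedding are artifacts of the settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)

# Hypothetical data: 60 samples x 100 features with NO group structure at all.
X = rng.normal(size=(60, 100))

# Reduce to the top 10 PCs first (a common pre-step), via SVD of the centred matrix.
Xc = X - X.mean(axis=0)
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :10] * s[:10]

# Sweep perplexity and random seed instead of trusting a single default run.
embeddings = {}
for perplexity in (5, 15, 30):
    for seed in (0, 1):
        tsne = TSNE(n_components=2, perplexity=perplexity,
                    init="random", random_state=seed)
        embeddings[(perplexity, seed)] = tsne.fit_transform(pcs)

for key, emb in embeddings.items():
    print(key, emb.shape, "spread:", emb.std(axis=0).round(1))
```

Plotting the six embeddings side by side is the useful step: structure that appears under only one perplexity or one seed should not be reported as a finding.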

Even with this adjustment, my take from running on different types of data simulations is that the first two PCs/MDS dimensions capture the global structure in the data better than t-SNE (on 2 dims), while the local patterns are better preserved in the latter. This is sort of intuitive if one thinks about how the algorithms work in the two cases. Also, the clustering from PCA is more believable to me purely because of the statistical understanding of the eigen-spaces behind it.

Lastly, to talk about a personal experience: for a paper of mine, I ran t-SNE on bulk RNA-seq data from GTEx tissues (first figure here), and my first reaction was awe at how well t-SNE captures the clusters. But on closer look, I did see a few discrepancies, for instance Liver samples forming two groups; on follow-up analysis, we did not find this grouping to be meaningful.