Question: What to use: PCA or tSNE dimension reduction in DESeq2 analysis?
24 months ago by
s.w.vanderlaan20 wrote:

Hi,

I am curious why some papers and workflows (http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) using DESeq2 use PCA to cluster data to identify outliers in RNA-seq analyses. I've noticed that many single-cell RNA-seq experiments use tSNE to reduce data and visualise it. Also, the clustering seems a bit better - or maybe that's just "coincidence" (https://www.nature.com/ng/journal/v48/n10/fig_tab/ng.3646_SF7.html)

When clear clusters aren't formed with PCA, should I switch to tSNE? How many data points (rows × columns) are needed at a minimum for tSNE to work? Is tSNE better suited to single-cell RNA-seq, and PCA to whole-tissue/bulk RNA-seq? In GWAS we usually use PCA, but I guess genotypes have a more linear distribution than gene expression. For that reason, might it be better to always use tSNE with RNA-seq?

Thanks and best,

Sander

rnaseq deseq2 pca tsne • 5.1k views
Answer: What to use: PCA or tSNE dimension reduction in DESeq2 analysis?
24 months ago by
Michael Love24k
United States
Michael Love24k wrote:

I think there are some clear use cases for t-SNE, for example within a clustering algorithm, but from my testing and that of others, I think it can lead you astray a bit, and so I recommend a PCA plot for general-purpose bulk RNA-seq EDA (exploratory data analysis). I'm interested in what methods are developed for factor analysis of scRNA-seq, particularly ZINB-WaVE (Bioc).

A little more on why I prefer PCA for bulk RNA-seq EDA: with some simulations, I have seen t-SNE generate artificial structure (though this may be due to a since-fixed bug in one of the R pkgs), and also t-SNE can "snap" groups apart farther than what represents the data generating mechanism (which I know because I simulated the data). I've been told by t-SNE experts that for both issues, parameters can be optimized such that the artifacts or snapping are minimized, or that PCA should be first applied and the right number of top dimensions passed to t-SNE. Also, I've been told that the "snapping" I observe is a known consequence of the method. Critically I don't think that biologists or investigators (who we like to share our dimension reduction plots with) are aware of what caveats are needed to interpret the t-SNE plots, and that the large scale structure or cluster separation distances should not be interpreted as representative of anything meaningful in the data. The nice thing about PCA (or MDS) is its simplicity, and that, if the groups are overlapping or separated, this is clear from the plot. With PCA, you can compare the inter-cluster distances relative to the intra-cluster distances, and get a sense of their ratio. If the groups are in a particular arrangement, you can look up the loadings to understand why this is the case. The first link is really informative about these qualities with respect to t-SNE, see sections 2 and 3:

http://distill.pub/2016/misread-tsne/
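
The two PCA properties mentioned above (comparing inter-cluster to intra-cluster distances, and looking up loadings) can be illustrated with a minimal sketch. This is not taken from the thread: the data are simulated, the group sizes, gene counts and shift of 4 are arbitrary assumptions, and PCA is done with a plain SVD in numpy, not with DESeq2's own plotting functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log-scale expression (hypothetical): 30 samples x 100 genes,
# three groups of 10; groups 1 and 2 each have 10 shifted genes.
labels = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(labels.size, 100))
X[labels == 1, :10] += 4.0
X[labels == 2, 10:20] += 4.0

# PCA via SVD of the gene-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]            # sample coordinates on PC1 and PC2
loadings = Vt[:2].T                  # gene loadings, shape (genes, 2)
var_explained = s**2 / np.sum(s**2)  # fraction of variance per component

# Intra-cluster spread: mean distance of samples to their own centroid.
centroids = np.array([scores[labels == g].mean(axis=0) for g in range(3)])
intra = np.mean(np.linalg.norm(scores - centroids[labels], axis=1))

# Inter-cluster separation: mean distance between pairs of centroids.
pairs = [(0, 1), (0, 2), (1, 2)]
inter = np.mean([np.linalg.norm(centroids[a] - centroids[b]) for a, b in pairs])
ratio = inter / intra

# Genes with the largest absolute loading on PC1 drive the arrangement.
top_genes = np.argsort(-np.abs(loadings[:, 0]))[:10]

print(f"PC1/PC2 variance explained: {var_explained[:2].round(3)}")
print(f"inter/intra distance ratio: {ratio:.2f}")
print(f"top PC1 genes: {sorted(top_genes.tolist())}")
```

Because PCA is a linear projection, the ratio is in the same units as the input data, and the top-loading genes here should be the ones that were shifted in the simulation. Neither reading is available from t-SNE coordinates.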


Since Mike mentioned the ZINB-WaVE approach, I will add a few more things to my comment on twitter.

The advantage of PCA or ZINB-WaVE is that these are factor analysis models, hence there is a clear interpretation of the reduced space in which you do the clustering (in the case of ZINB-WaVE, the matrix W). Briefly, you assume that the "true" signal is intrinsically low-dimensional, and your factor analysis model estimates that signal in a principled way (if you believe the assumptions of the model, of course). It is much harder to interpret t-SNE coordinates, especially long-range distances and relations between clusters.
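
As a rough illustration of the "intrinsically low-dimensional signal" assumption, here is a sketch that uses a truncated SVD (PCA-style) on simulated Gaussian data. It is not ZINB-WaVE, which models zero-inflated counts. The names `W_true`, `A_true`, the rank `k = 2` and the noise level are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model: observed = low-rank signal + unit-variance noise.
n_samples, n_genes, k = 40, 300, 2
W_true = rng.normal(size=(n_samples, k))   # low-dimensional sample factors
A_true = rng.normal(size=(k, n_genes))     # gene-level loadings
signal = 3.0 * W_true @ A_true
X = signal + rng.normal(size=signal.shape)

# Estimate the factors with a rank-k truncated SVD of the centered matrix.
mu = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
W_hat = U[:, :k] * s[:k]                   # estimated sample factors
X_hat = W_hat @ Vt[:k] + mu                # rank-k estimate of the signal

# Compare how far the raw data and the low-rank estimate are from the truth.
err_raw = np.linalg.norm(X - signal)
err_lowrank = np.linalg.norm(X_hat - signal)
print(f"error of raw data:      {err_raw:.1f}")
print(f"error of rank-{k} model:  {err_lowrank:.1f}")
```

When the low-rank assumption holds, the rank-k estimate should sit closer to the simulated signal than the raw matrix does, and `W_hat` is the reduced space in which clustering would be done.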

As I said, we use t-SNE in our papers, but just as a way to better visualize the clustering / gene expression of particular genes (mostly because our biology collaborators find them "nice"), but we don't use them to infer the clustering.

As for the original question, I don't think that single-cell vs bulk RNA-seq should be treated differently with respect to t-SNE, except that the advantage of t-SNE's visualization is when you have many points (samples) since it avoids overlapping close points. So, I guess that if you have a large-scale bulk RNA-seq experiment, t-SNE could be a good visualization tool.

 

written 24 months ago by davide risso830

Davide this is really helpful, thanks. Re: the last sentence, my opinion is, I'd rather settle for overplotting, to avoid the problems with cluster spread and inter-cluster distance and arrangement. I had a data set recently where it was important to see that one cluster was tight (with overplotting) relative to the others.

 

written 24 months ago by Michael Love24k

Ah, many thanks for these excellent answers and thoughts on this subject. So, now I understand a bit better why people use t-SNE for single-cell RNA-seq: it's just to "better visualize the clustering / gene expression of particular genes", which makes perfect sense if you want certain groups of cells to really 'cluster' together visually to get the point across in that paper. But for exploratory analyses the consensus is PCA! Clear. Thanks!

written 23 months ago by s.w.vanderlaan20

There are some useful comments from quantitative folks who know more about t-SNE than me here.

written 24 months ago by Michael Love24k
Answer: What to use: PCA or tSNE dimension reduction in DESeq2 analysis?
24 months ago by
Kushal K Dey10
Chicago, IL
Kushal K Dey10 wrote:

Mike and Davide have already covered the issue really well. I would just like to stress a couple of points in follow-up. I have found t-SNE to be extremely sensitive to the dimensionality of the reduction, the perplexity and the number of iterations run. Mike already posted the Distill article (http://distill.pub/2016/misread-tsne/), where the authors (Wattenberg et al.) present a very nice explanation of the issues with t-SNE and how best to control for them. I have seen many people use the default settings in the t-SNE package and report the results, but I would recommend trying multiple values for the input arguments.

Even with this adjustment, my take from running on different types of data simulations is that the first two PCs/MDS dimensions capture the global structure in the data better than t-SNE (on 2 dims), while the local patterns are better preserved in the latter. This is sort of intuitive if one thinks about how the algorithms work for the two cases. Also, the clustering from PCA is more believable to me purely because of the statistical understanding of the eigen-spaces behind it.
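
One rough way to put a number on "global structure" is to correlate pairwise sample distances before and after the reduction. The sketch below does this for the first two PCs on simulated data. The helper `global_agreement` is a hypothetical name, the simulation settings are arbitrary assumptions, and Pearson correlation of distances is only one of several possible measures. The same helper could be given a t-SNE embedding of the same samples to compare the two.

```python
import numpy as np

def pairwise_distances(A):
    """Euclidean distances between all pairs of rows, as a flat vector."""
    diff = A[:, None, :] - A[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(A.shape[0], k=1)
    return d[i, j]

def global_agreement(X, embedding):
    """Pearson correlation between original and embedded pairwise distances."""
    return np.corrcoef(pairwise_distances(X), pairwise_distances(embedding))[0, 1]

rng = np.random.default_rng(2)

# Hypothetical data: three groups at unequal spacing along the same 20 genes,
# so "which groups are near which" is a global property of the data.
labels = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(labels.size, 100))
X[labels == 1, :20] += 3.0
X[labels == 2, :20] += 9.0

# First two principal components via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
pca_2d = U[:, :2] * s[:2]

score = global_agreement(X, pca_2d)
print(f"correlation of pairwise distances, full space vs PC1-2: {score:.2f}")
```

A value near 1 means the relative spacing of samples and groups in the plot reflects the spacing in the full data. This says nothing about how well local neighbourhoods are preserved, which would need a separate check.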

Lastly, to talk about a personal experience: for a paper of mine, I ran t-SNE on bulk RNA-seq data from GTEx tissues (first figure here) and my first reaction was awe at how well t-SNE captures the clusters. But on closer look, I did see a few discrepancies, for instance, Liver samples forming two groups; on follow-up analysis, we did not find this grouping to be meaningful.

 

 


Thanks Kushal for the post, useful to know.

written 24 months ago by Michael Love24k