Question: Large differences in sample clustering using cummeRbund (csDendro) and rlog-distance (DESeq2)
2.8 years ago, Jon Bråte wrote:

We have RNA-Seq data from different body parts: oral and aboral parts of small, medium and large specimens. We mapped the reads with TopHat2 and ran Cuffdiff and DESeq2 (with counts from HTSeq). With the csDendro function in cummeRbund the samples cluster largely by oral/aboral part, but when we cluster a distance matrix of rlog values from DESeq2 the samples group mainly by size.


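# Heatmap of sample-to-sample distances
library("gplots")  # provides heatmap.2()
# rld is the rlog-transformed DESeq2 object; hmcol is a colour palette defined earlier (not shown)
distsRL <- dist(t(assay(rld)))  # Euclidean distances between samples
mat <- as.matrix(distsRL)
heatmap.2(mat, trace = "none", col = rev(hmcol), margin = c(13, 13), main = "Sample-to-sample distances (rlog)")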

I guess the main difference between the two approaches is that csDendro is based on FPKM values, while the DESeq2 approach uses rlog-transformed raw counts? I could not find out whether csDendro includes all genes. For the DESeq2 approach we excluded genes with zero counts in all samples; otherwise all genes should be included.

When we create PCA plots using both packages they are roughly similar. Has anyone experienced similar results? What could explain these differences?

2.8 years ago, Wolfgang Huber (EMBL European Molecular Biology Laboratory) wrote:


You say that the PCA plots look similar but the dendrograms don't. So the problem would not lie with the transformation (upstream of both) but with the dendrogram.

Note that the linear ordering of the leaves in the plot is not fully determined by the dendrogram, so that interpretation of the former is often problematic.
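To illustrate the point about leaf ordering, here is a minimal base R sketch. The data are simulated and the object names are hypothetical; it is not taken from the thread. It reverses the leaf order of a dendrogram and checks that the tree itself (the cophenetic distances) is unchanged.

```r
set.seed(1)
# Hypothetical toy data: 50 genes (rows) x 6 samples (columns)
mat <- matrix(rnorm(300), nrow = 50,
              dimnames = list(NULL, paste0("sample", 1:6)))

# Cluster samples on Euclidean distances
dend <- as.dendrogram(hclust(dist(t(mat))))

# Reverse the left-to-right order of the leaves; no merge is changed
dend_rev <- rev(dend)

labels(dend)      # one leaf order
labels(dend_rev)  # the mirrored leaf order

# The cophenetic distances agree once rows/columns are aligned by label,
# so both plots describe exactly the same clustering
coph1 <- as.matrix(cophenetic(dend))
coph2 <- as.matrix(cophenetic(dend_rev))
same_tree <- isTRUE(all.equal(coph1, coph2[rownames(coph1), colnames(coph1)]))
same_tree
```

Two samples that sit next to each other in one drawing can therefore end up far apart in another drawing of the same tree, so adjacency of leaves on its own says little.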

Irrespective of that, it tends to be useful to remove features (genes) with a low signal-to-noise ratio (since, if anything, these tend to pick up batch effects etc. rather than biology), and to focus e.g. on the top few thousand genes with the highest overall (e.g. mean, median, 1st quartile, your choice) signal.
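A minimal sketch of such a filter in base R, assuming a genes-by-samples matrix of transformed values. Here `expr` is a simulated stand-in for something like `assay(rld)`, and the cutoff `top_n` is an arbitrary illustration, not a recommended value.

```r
set.seed(42)
# Hypothetical stand-in for a matrix of transformed expression values:
# 5000 genes (rows) x 6 samples (columns)
expr <- matrix(rnorm(5000 * 6, mean = 8, sd = 2), nrow = 5000,
               dimnames = list(paste0("gene", 1:5000), paste0("sample", 1:6)))

top_n <- 2000  # assumption: "top few thousand" genes; adjust as needed

# Rank genes by overall signal (row mean here; median or a quantile also work)
keep <- order(rowMeans(expr), decreasing = TRUE)[seq_len(top_n)]
expr_top <- expr[keep, ]

# Cluster samples using only the retained genes
hc_top <- hclust(dist(t(expr_top)))
hc_top$labels[hc_top$order]  # leaf order of the resulting dendrogram
```

Calling `plot(hc_top)` draws the dendrogram; rerunning with a few different values of `top_n` shows how sensitive the clustering is to the filter.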


2.8 years ago, Michael Love (United States) wrote:

First, I would definitely expect differences, given that the methods are very different. In DESeq2, the clustering is based on Euclidean distances between vectors of log-transformed counts, where the noisiness of low count values is minimized, and the PCA plot shows a 2D projection in which distances approximate these. I would expect distances on the log FPKM scale to be different.

There is no canonical 2D representation or clustering based on inter-sample distances; the different transformations, filtering and distance choices all give different weight to different genes.

Our workflow and vignette cover alternatives: I also suggest trying the PoissonDistance (see workflow) and the VST. I can't say which is the "right" distance to use or the "right" ordination, but you may choose the one which best helps you explore the data.
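A rough sketch of the two alternatives, assuming the DESeq2 and PoiClaClu packages are installed. The dataset is simulated with `makeExampleDESeqDataSet()` purely so the example is self-contained; with real data you would use your own `dds` object instead.

```r
library("DESeq2")
library("PoiClaClu")

set.seed(11)
# Simulated placeholder data (assumption): 1000 genes, 6 samples
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Variance stabilizing transformation, then Euclidean sample distances
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
dists_vst <- dist(t(assay(vsd)))

# Poisson distance works on the raw counts; it expects samples in rows
poisd <- PoissonDistance(t(counts(dds)))
dists_pois <- poisd$dd

# Compare the resulting sample clusterings
hc_vst  <- hclust(dists_vst)
hc_pois <- hclust(dists_pois)
```

Plotting `hc_vst` and `hc_pois` side by side, together with the rlog-based clustering, gives a sense of how much the grouping depends on the choice of distance.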

In addition, it's important to note that the 2d PCA plot only shows the first 2 components. So if component 3 is close to component 2 in terms of maximizing the variance of the projected samples, then the choice of which component is 2 or 3 is subject to small variations in the values (which are estimated, normalized and transformed differently). Likewise, the 2nd component must be orthogonal to the 1st, the 3rd to the 1st and 2nd, and so on. So again, small variations in the values can give you a very different 2d representation.
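One way to check how much is hidden beyond the first two components is to look at the variance explained by each one. The sketch below uses base R `prcomp()` on simulated data; `expr` is a hypothetical genes-by-samples matrix of transformed values.

```r
set.seed(7)
# Hypothetical transformed expression matrix: 1000 genes x 6 samples
expr <- matrix(rnorm(1000 * 6), nrow = 1000,
               dimnames = list(NULL, paste0("sample", 1:6)))

# PCA with samples as observations
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# If components 2 and 3 explain similar amounts, their order is unstable
# and a PC1 vs PC3 plot is worth inspecting as well
pc_scores <- pca$x[, 1:3]

# The loading vectors are mutually orthogonal (close to the identity matrix)
ortho <- crossprod(pca$rotation[, 1:3])
round(ortho, 6)
```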

(edit: I misread, you said the PCA plots are similar, Wolfgang's answer is more useful then)

2.7 years ago, Jon Bråte wrote:

Thank you for the replies! I quickly tried a few ways of filtering the rlog values (the 2000 most highly expressed genes, and variance filtering) and it seems that the clustering becomes more like the csDendro plot.
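For reference, a variance filter of this kind might look like the following in base R. This is a sketch on simulated data, not the code actually used here, and the cutoff of 500 genes is an arbitrary assumption.

```r
set.seed(3)
# Hypothetical stand-in for the rlog matrix: 3000 genes x 6 samples
expr <- matrix(rnorm(3000 * 6, mean = 8, sd = 2), nrow = 3000,
               dimnames = list(paste0("gene", 1:3000), paste0("sample", 1:6)))

n_keep <- 500  # assumption: number of most variable genes to keep

# Per-gene variance across samples
gene_var <- apply(expr, 1, var)

# Keep the genes with the highest variance
keep <- order(gene_var, decreasing = TRUE)[seq_len(n_keep)]
expr_var <- expr[keep, ]

# Cluster samples on the filtered matrix
hc_var <- hclust(dist(t(expr_var)))
hc_var$labels[hc_var$order]
```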

I guess there is no yes-or-no answer to this, but would you say that PCA plots are more useful (or the most useful) for examining similarities and differences between samples and for detecting outliers?


2.7 years ago, Michael Love replied:

Yes. I find PCA plots more useful than dendrograms. In the PCA plot, there is another dimension for separating samples, and there is also the problem Wolfgang mentions about the horizontal ordering of leaves in the dendrogram.