Question

Large differences in sample clustering using cummeRbund (csDendro) and rlog-distance (DESeq2)

Jon Bråte ▴ 270

@jon-brate-6263

Norway

We have RNA-Seq data from different body parts: oral and aboral parts of small, medium and large specimens. We mapped the reads with TopHat2 and ran both cuffdiff and DESeq2 (on htseq-count output). With the csDendro function in cummeRbund the samples cluster largely by oral/aboral part, but when we cluster a distance matrix of rlog values from DESeq2 the samples group mainly by size.

DESeq2-commands:

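# Heatmap of sample-to-sample distances
# (rld is assumed to come from rlog(dds); hmcol is a colour palette defined earlier,
#  e.g. colorRampPalette(brewer.pal(9, "GnBu"))(100) with RColorBrewer; that definition is an assumption)
library(gplots)  # provides heatmap.2
distsRL <- dist(t(assay(rld)))  # Euclidean distances between samples
mat <- as.matrix(distsRL)
heatmap.2(mat, trace = "none", col = rev(hmcol), margins = c(13, 13), main = "Sample-to-sample distances (rlog)")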

I guess the main difference between the two approaches is that csDendro is based on FPKM values, while the DESeq2 approach uses rlog-transformed raw counts. However, I could not find out whether csDendro includes all genes. For the DESeq2 approach we excluded genes with zero counts in all samples; otherwise all genes should be included.

When we create PCA plots using both packages they are roughly similar. Has anyone experienced similar results? What could explain these differences?

deseq2 cummerbund rlog fpkm clustering • 2.6k views

score 1 · Answer 1 · 2016-01-19

Jon,

You say that the PCA plots look similar but the dendrograms don't. So the problem would lie not with the transformation (which is upstream of both) but with the dendrogram.

Note that the linear ordering of the leaves in the plot is not fully determined by the dendrogram, so that interpretation of the former is often problematic.
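To illustrate the point about leaf ordering, here is a minimal base-R sketch on simulated data (the matrix `x` and the sample names `S1` to `S6` are made up for illustration). Flipping the branches changes the left-to-right order of the leaves, but the tree itself, as measured by cophenetic distances, stays the same.

```r
set.seed(1)
# Simulated data: 6 samples (rows) x 50 features; purely illustrative
x <- matrix(rnorm(6 * 50), nrow = 6,
            dimnames = list(paste0("S", 1:6), NULL))

hc <- hclust(dist(x))          # complete linkage on Euclidean distances
d1 <- as.dendrogram(hc)
d2 <- rev(d1)                  # flip every branch: same tree, mirrored plot

labels(d1)                     # leaf order as drawn for d1
labels(d2)                     # reversed leaf order

# Cophenetic distances depend only on the tree, not on the leaf order
c1 <- as.matrix(cophenetic(d1))[rownames(x), rownames(x)]
c2 <- as.matrix(cophenetic(d2))[rownames(x), rownames(x)]
same_tree <- isTRUE(all.equal(c1, c2))
same_tree
```

Two samples drawn next to each other are therefore not necessarily similar; only the height at which their branches merge carries information.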

Irrespective of that, it tends to be useful to remove features (genes) with a low signal-to-noise ratio (since, if anything, these tend to pick up batch effects etc. rather than biology) and to focus, e.g., on the top few thousand genes with the highest overall signal (e.g. mean, median or 1st quartile; your choice).
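A minimal sketch of such a filter, using base R on a simulated matrix. Here `mat_rlog` is a stand-in for `assay(rld)`, and the cutoff `n_top` and the use of the row mean are assumptions to adapt to your data.

```r
set.seed(42)
# Stand-in for assay(rld): 5000 genes (rows) x 6 samples (columns); simulated
mat_rlog <- matrix(rnorm(5000 * 6, mean = 6, sd = 2), nrow = 5000,
                   dimnames = list(paste0("gene", 1:5000), paste0("S", 1:6)))

n_top <- 2000                                   # assumed cutoff
gene_signal <- rowMeans(mat_rlog)               # or apply(mat_rlog, 1, median)
keep <- order(gene_signal, decreasing = TRUE)[seq_len(min(n_top, nrow(mat_rlog)))]
mat_top <- mat_rlog[keep, , drop = FALSE]

# Cluster samples using only the retained genes
hc_top <- hclust(dist(t(mat_top)))
```

`plot(hc_top)` then draws the dendrogram for the filtered set, which can be compared with the one obtained from all genes.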

Wolfgang

score 1 · Answer 2 · 2016-01-19

First, I would definitely expect differences, given that the methods are very different. In DESeq2, the clustering is based on Euclidean distances between vectors of log-transformed counts, where the noisiness of low count values is minimized. I would expect distances on the log FPKM scale to be different.
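As a rough illustration of why the scale matters, the sketch below uses simulated counts and made-up gene lengths (all names and numbers are hypothetical). It uses a plain log2(x + 1) as a simple stand-in, not DESeq2's rlog or cuffdiff's FPKM estimation, and compares sample distances on a log-count scale with those on a log-FPKM-like scale.

```r
set.seed(7)
n_genes <- 2000
n_samples <- 6
# Simulated counts with gene-specific means (recycled so row i uses mu[i])
mu <- rexp(n_genes, rate = 1 / 200)
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 5),
                 nrow = n_genes,
                 dimnames = list(NULL, paste0("S", 1:n_samples)))
gene_len_kb <- runif(n_genes, 0.5, 10)     # hypothetical gene lengths in kb

# FPKM-like values: counts per kb of gene per million counts in the sample
lib_millions <- colSums(counts) / 1e6
fpkm <- sweep(counts / gene_len_kb, 2, lib_millions, "/")

d_counts <- dist(t(log2(counts + 1)))      # sample distances, log-count scale
d_fpkm   <- dist(t(log2(fpkm + 1)))        # sample distances, log-FPKM-like scale

# Correlation between the two sets of pairwise sample distances
cor(as.vector(d_counts), as.vector(d_fpkm))
```

Length normalization changes how much each gene contributes to the Euclidean distance, so the two distance matrices need not rank sample pairs in the same way.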

There is no canonical clustering based on inter-sample distances; the different transformations, filtering and distance choices all give different weight to different genes, etc.

Our workflow and vignette also describe alternatives that I suggest trying: the PoissonDistance (see the workflow) and the VST. I can't say which is the "right" distance to use or the "right" ordination, but you may choose the one that best helps you explore the data.

In addition, it's important to note that the 2d PCA plot only shows the first 2 components. So if component 3 is close to component 2 in terms of maximizing the variance of the projected samples, then the choice of which component is 2 or 3 is subject to small variations in the values (which are estimated, normalized and transformed differently). Likewise, the 2nd component must be orthogonal to the 1st, the 3rd to the 1st and 2nd, and so on. So again, small variations in the values can give you a very different 2d representation.
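One way to check how much weight the 2d plot deserves is to look at the proportion of variance each component explains. A minimal sketch with base R `prcomp` on a simulated matrix (`expr` is a hypothetical stand-in for a genes x samples matrix such as `assay(rld)`):

```r
set.seed(3)
# Stand-in for a transformed expression matrix: 1000 genes x 8 samples; simulated
expr <- matrix(rnorm(1000 * 8), nrow = 1000,
               dimnames = list(NULL, paste0("S", 1:8)))

pca <- prcomp(t(expr))                      # samples as rows; centered by default
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# If PC2 and PC3 explain similar amounts, inspect PC3 too, not only PC1 vs PC2
head(pca$x[, 1:3])

# Components are mutually orthogonal: cross-product of the loadings is the identity
orth <- crossprod(pca$rotation[, 1:3])
round(orth, 6)
```

When the second and third entries of `var_explained` are close, a plot of PC1 against PC3 is worth a look alongside the usual PC1 against PC2.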

(edit: I misread, you said the PCA plots are similar, Wolfgang's answer is more useful then)

score 0 · Answer 3 · 2016-01-19

Jon Bråte ▴ 270

Thank you for the replies! I quickly tried a few filterings of the rlog counts (2000 most highly expressed, and variance filtering) and it seems that the clustering becomes more like the csDendro plot.

I guess there is no yes-or-no answer to this, but would you say that PCA plots are more useful (or the most useful) for examining similarities/differences between samples and for detecting outliers?

Yes. I find PCA plots more useful than dendrograms. In the PCA plot, there is another dimension for separating samples, and there is also the problem Wolfgang mentions about the horizontal ordering of leaves in the dendrogram.

Michael Love