Question

Supervised clustering by sample

rmgreer • 0

Hello all.
I'm relatively new to the world of R and Bioconductor, and just as a disclaimer I am not trained as a biostatistician or programmer.

So my problem is this. I have a set of microarray data, and overall I have 27 pairs of control and LPS-treated samples. But it's actually more complex than that. I have three distinct cell populations, and each cell population was treated with or without LPS at three distinct time points, and each of those was done in triplicate. One of the analyses I did shows that one of the time points is very different from the other two for LPS treatment. When I do an unsupervised hierarchical clustering analysis in each cell population, one of the control samples no longer clusters according to the time of treatment, while the LPS samples all do. My PI is insistent that I correct the heatmaps so that both control and LPS samples are shown grouped according to their time of treatment. Does anyone know a relatively simple way to accomplish this?

Thank you,

Rachel

hclust heatmap • 2.4k views

score 1 · Answer 1 · 2015-02-25

Setting aside the idea that data are supposed to be pummeled with sufficient vigor to conform to the expectations of your PI, it would be helpful to know the purpose of the heatmap. Is the goal to show that certain samples are similar, or is it to show groups of genes that appear to be acting similarly across groups of samples? The former is probably the most common use case, in my experience.

If you are trying to show the general grouping structure of your samples, you might instead consider using principal components analysis (PCA) plots, which tend to be a bit more informative. But I am betting this will not suffice.
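As a rough illustration of the PCA suggestion, here is a minimal sketch using base R's prcomp(). The matrix expr and the time and treat factors are simulated stand-ins (assumptions, not the poster's data); with real data you would substitute your own genes-by-samples expression matrix and sample annotation.

```r
# Simulated stand-in for an expression matrix: 200 genes (rows) x 18 samples (columns).
set.seed(1)
time  <- factor(rep(c("t1", "t2", "t3"), each = 6))
treat <- factor(rep(rep(c("ctrl", "LPS"), each = 3), times = 3),
                levels = c("ctrl", "LPS"))
expr  <- matrix(rnorm(200 * 18), nrow = 200,
                dimnames = list(paste0("gene", 1:200), paste0("s", 1:18)))

# Add an artificial time effect to the first 50 genes so there is structure to find.
expr[1:50, ] <- sweep(expr[1:50, ], 2, rep(c(0, 2, 4), each = 6), "+")

# prcomp() expects samples in rows, so transpose the genes-by-samples matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Percent of total variance carried by each component, for the axis labels.
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# Colour by time point, open/filled symbol by treatment (ctrl = open, LPS = filled).
if (interactive()) {
  plot(pca$x[, 1], pca$x[, 2],
       col = as.integer(time), pch = c(1, 19)[as.integer(treat)],
       xlab = paste0("PC1 (", pct[1], "%)"),
       ylab = paste0("PC2 (", pct[2], "%)"))
  legend("topright", legend = levels(time),
         col = seq_along(levels(time)), pch = 19)
}
```

Unlike a dendrogram, the scatter plot shows how far apart the samples are on the main axes of variation, so a single control sample sitting away from its time group is visible as such and is not forced into a branch.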

There is no simple way to 'correct' the heatmap, primarily because it isn't incorrect (your PI's opinion notwithstanding). It just is. In other words, clustering isn't an inferential process where you can say that 'this cluster is different than what I would expect to see by chance'. You are just grouping samples (and/or genes), based on how 'close' they are to each other, where the measure of closeness can be defined many different ways.

How the samples group is dependent on what genes you are using. So adding or subtracting genes will have an effect on the heatmap. In addition, how you define the distance between samples will affect the results as well. If you look at ?dist, you can see all the available choices. An additional choice is 1 - correlation. For example, if you do

distfun <- function(x) as.dist(1-cor(t(x)))

and then in your call to heatmap.2(), you say

heatmap.2(<some args>, distfun = distfun)

then you will use 1-correlation as your distance measure.
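To see what the choice of distance does without drawing a heatmap, the same distfun can be handed to hclust() directly. This is only a sketch on simulated data (the matrix expr is an assumption); it uses base R only, and mirrors the fact that heatmap.2() applies distfun to the transposed matrix when it clusters the columns.

```r
set.seed(2)
# Simulated stand-in: 100 genes (rows) x 6 samples (columns).
expr <- matrix(rnorm(100 * 6), nrow = 100,
               dimnames = list(NULL, paste0("s", 1:6)))

# Same definition as in the text: 1 - correlation between the rows of x.
distfun <- function(x) as.dist(1 - cor(t(x)))

# Column (sample) clustering works on the transposed matrix,
# so transpose here to get sample-to-sample distances.
d_cor <- distfun(t(expr))
d_euc <- dist(t(expr))   # default Euclidean distance, for comparison

hc_cor <- hclust(d_cor)
hc_euc <- hclust(d_euc)

# Left-to-right sample order each dendrogram would produce.
hc_cor$labels[hc_cor$order]
hc_euc$labels[hc_euc$order]
```

The two orderings need not agree, which is the point: the dendrogram reflects the distance you chose as much as it reflects the data.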

You can also play around with the agglomeration methods (used to define the position of the groups). In other words, when you start out, you just have 27 samples, and their distances from each other. The first step involves finding the two closest samples, and saying they are a group. But now instead of having a single point for all 27 samples, you have 25 samples and a group of two samples, so you now need to define where that group of two samples is. There are any number of ways to do that, so see the agglomeration methods under ?hclust for your choices.
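The effect of the agglomeration method can be checked in the same way. The sketch below uses 27 simulated samples (an assumption, chosen only to match the number in the example above) and four of the methods listed under ?hclust.

```r
set.seed(3)
# 27 simulated samples in rows, 50 features each.
x <- matrix(rnorm(27 * 50), nrow = 27,
            dimnames = list(paste0("s", 1:27), NULL))
d <- dist(x)

# Same distances, different rules for where a merged group "is".
methods <- c("single", "complete", "average", "ward.D2")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Row 1 of $merge is the first merge; negative entries are single samples.
# Barring ties, these four methods all start by joining the two closest samples.
first_pairs <- sapply(trees, function(h) paste(sort(-h$merge[1, ]), collapse = "-"))

# Later merges depend on the method, so group membership can differ.
groups <- sapply(trees, cutree, k = 3)
table(groups[, "complete"], groups[, "average"])
```

The cross-tabulation at the end shows how many samples change group when only the agglomeration method is switched.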

score 0 · Answer 2 · 2015-02-25

The purpose of the heatmap was to look at one cell population (really, one gestational age of lung mesenchymal cells) and survey the effects of time of treatment with LPS. So I am starting out with 18 samples and using the gene list of differentially expressed genes with a fold change greater than 2 and p < 0.01.
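For reference, a filter of this kind might look like the sketch below. The results table res is made up, and its column names (logFC on the log2 scale, P.Value) are an assumption modelled on limma-style output; the post does not say which software produced the gene list or whether the fold-change cutoff was applied in both directions.

```r
# Hypothetical differential expression results (values invented for illustration).
res <- data.frame(
  gene    = paste0("gene", 1:6),
  logFC   = c(2.3, -1.4, 0.4, 1.1, -0.2, -3.0),   # assumed to be log2 fold changes
  P.Value = c(0.001, 0.004, 0.0005, 0.03, 0.5, 0.009),
  stringsAsFactors = FALSE
)

# A fold change greater than 2 in either direction is |log2 fold change| > 1.
keep <- abs(res$logFC) > 1 & res$P.Value < 0.01
selected <- res$gene[keep]
selected
```

Only the rows of the expression matrix named in selected would then go into the clustering, which is one reason the heatmap changes whenever the cutoffs do.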

I've done a PCA on the dataset as a whole, and instead of grouping by treatment or time, all of the samples group by the gestational age of the cells used in the experiment. I haven't tried to run a PCA on just one subset of the data, however. Maybe I should give that a try too.