Maxim,
Certainly, the sorted heatmap example is only useful for visualization
purposes, while the clustering example is included to show how to
perform clustering on your data in R in general, not to provide a final
solution for your research problem.
If your intention is to cluster the final clustering results of your
factors, then you could compute a similarity/distance metric among
their clustering results using a set similarity measure such as the
Jaccard partitioning index. This way you can cluster entire clustering
data sets. For hierarchical clustering results you would first need to
generate discrete clusters from the dendrograms. R's cutree function is
very useful for this. Some examples and links to related R libraries
for clustering partitioning results are available here:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#clustering_jaccard
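
To make the idea of comparing whole clustering results more concrete,
here is a minimal sketch. It is not taken from the manual linked above:
the toy matrices f1-f3, the helper names cluster_sites and
jaccard_partition, and the choice of k=4 are all assumptions made for
illustration only. The Jaccard index is computed on pairs of sites
(pairs co-clustered in both results divided by pairs co-clustered in at
least one).

```r
## Toy data (assumption): the same 30 sites measured for 3 factors
set.seed(1)
sites <- paste0("site", 1:30)
f1 <- matrix(rnorm(30 * 8), nrow = 30, dimnames = list(sites, NULL))
f2 <- f1 + matrix(rnorm(30 * 8, sd = 0.3), nrow = 30)  # noisy copy of f1
f3 <- matrix(rnorm(30 * 8), nrow = 30, dimnames = list(sites, NULL))

## Hierarchical clustering per factor, then discrete clusters via cutree()
cluster_sites <- function(x, k = 4) {
    hr <- hclust(as.dist(1 - cor(t(x))), method = "complete")
    cutree(hr, k = k)
}
cls <- lapply(list(factor1 = f1, factor2 = f2, factor3 = f3), cluster_sites)

## Jaccard index between two partitions of the same objects
jaccard_partition <- function(a, b) {
    stopifnot(length(a) == length(b))
    same_a <- outer(a, a, "==")   # TRUE where two sites share a cluster in a
    same_b <- outer(b, b, "==")   # TRUE where two sites share a cluster in b
    up <- upper.tri(same_a)       # count each pair of sites once
    n_both <- sum(same_a[up] & same_b[up])
    n_any <- sum(same_a[up] | same_b[up])
    if (n_any == 0) return(1)     # both partitions consist of singletons only
    n_both / n_any
}

## Pairwise similarity matrix among the clustering results of all factors
jm <- sapply(cls, function(a) sapply(cls, function(b) jaccard_partition(a, b)))

## Cluster the clustering results themselves
hc_factors <- hclust(as.dist(1 - jm), method = "average")
```

With real data you would replace f1-f3 by your factor matrices and pick
the distance measure and k that suit them.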
Similarly, you could compute membership similarities among the clusters
obtained for all your factors using again the Jaccard partitioning
index or the variation of information criterion. This could be used to
cluster all the groupings obtained from your clustering results
("clustering of clusters").
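
A rough sketch of such a "clustering of clusters" could look like the
following. The cluster assignments in cl are made up for illustration
(in practice they would come from cutree), and jaccard_set is a
hypothetical helper name. Only the Jaccard variant is shown, not the
variation of information criterion.

```r
## Hypothetical cluster assignments for the same 12 sites in 3 factors
sites <- paste0("site", 1:12)
cl <- list(
    factor1 = setNames(rep(1:3, each = 4), sites),
    factor2 = setNames(rep(c(1, 2, 3), times = c(4, 5, 3)), sites),
    factor3 = setNames(rep(1:2, each = 6), sites)
)

## One member set per cluster, named factor.cluster (e.g. "factor1.2")
groups <- unlist(lapply(cl, function(x) split(names(x), x)),
                 recursive = FALSE)

## Jaccard similarity between two member sets
jaccard_set <- function(a, b) length(intersect(a, b)) / length(union(a, b))
sim <- sapply(groups, function(a) sapply(groups, function(b) jaccard_set(a, b)))

## Cluster the clusters and list them in dendrogram order
hc <- hclust(as.dist(1 - sim), method = "average")
hc$labels[hc$order]
```

Clusters from different factors that share most of their sites end up
next to each other in the resulting tree.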
Since I do not know enough about your data sets or the genomics
technology you are using in this case, I am not sure whether any of
this makes much sense. But perhaps it is useful for exploring your
options...
Thomas
On Sun, Dec 13, 2009 at 10:28:11PM +0100, Maxim wrote:
> This works nicely.
>
> It means I cluster the data for one factor and sort the other factors
> according to the clustering result. But what next?
>
> What I will have to do is to find (cluster together) patterns that are
> similar *between* different factors and not within the data of one
> factor only. As I mentioned, the attached data was already clustered
> in such a manner, and obviously the patterns of factors 1-3 are
> somewhat similar. Therefore clustering does not make much sense with
> this data.
>
> I'm not sure whether I am just missing some aspect you are trying to
> explain, but as far as I understand it, some essential step is still
> missing in order to accomplish the clustering analysis. When you look
> at the attached heatmap you'll find clusters where factors 1 and 2
> show similar patterns, others for 2 and 3, again others for 1, 2 and
> 3, and so forth. To sort out these differences some additional trick
> is needed.
>
> Apart from this, my R programming skills are getting better and
> better, and I think I will soon replace lots of my Perl code with R
> code, especially for the data container manipulation tools.
>
> Maxim
>
> 2009/12/13 Thomas Girke <thomas.girke at ucr.edu>
>
> > Great. - The remaining parts of your analysis seem to be solvable
> > with simple subsetting and sorting routines of data objects in R.
> >
> > Here are some more suggestions that might help you here:
> >
> > ## Accessing data components of hclust objects:
> > names(hr)
> > hr$labels; hr$order
> >
> > ## How to return object labels in the order of a hierarchical
> > ## clustering result:
> > hr$labels[hr$order]
> >
> > ## Sorting vector to join rows of same type in a data matrix
> > ## (e.g. your factorX data sets)
> > sortv <- as.vector(matrix(1:500, nrow=5, ncol=100, byrow=TRUE))
> > somematrix[sortv, ]
> >
> > Thomas
> >
> >
> > On Sun, Dec 13, 2009 at 11:38:54AM +0100, Maxim wrote:
> > > Thank you,
> > >
> > > that works great. I was not aware of the "image" function. Scaling
> > > definitely helped to improve the quality of the heatmaps. In
> > > addition I learned a lot about handling/importing data.
> > >
> > > I have to think about the clustering itself. Your suggestions work
> > > fine for clustering 1 dataset. Unfortunately clustering has to be
> > > performed comparing all 5 datasets with each other. That means a
> > > given row in dataset 1 has to be aligned with the same rows in
> > > datasets 2-5. When I perform clustering of the individual
> > > datasets, each set will be clustered differently.
> > >
> > > Apart from this, you definitely provided a good starting point to
> > > better understand how to exploit R for my project.
> > >
> > > Maxim
> > >
> > > 2009/12/13 Thomas Girke <thomas.girke at ucr.edu>
> > >
> > > > Below are some suggestions for importing your data into R,
> > > > plotting heatmaps and clustering them using hierarchical
> > > > clustering. For the clustering, you have to choose a proper
> > > > distance measure for your data type and of course an efficient
> > > > partitioning algorithm. The task view site I sent previously
> > > > provides a good overview of what is available. Clustering of
> > > > 7000 objects in R is not a problem for most clustering methods.
> > > > Many of them, including hclust, are implemented in C++/Fortran
> > > > so that they run reasonably time and memory efficiently.
> > > >
> > > > I hope this will help to get you started.
> > > >
> > > > Thomas
> > > >
> > > >
> > > > ## Import of data sets (appended in one data frame)
> > > > ## Note: a row of NA values is inserted to visually separate
> > > > ## data sets in heatmap
> > > > ## Requires data files "factorX" in working dir; pattern is a
> > > > ## regular expression, so anchor it at the start of the name
> > > > filenames <- list.files(pattern="^factor")
> > > > impDF <- data.frame(NULL)
> > > > for(i in filenames) {
> > > >     tmp <- read.delim(i, header=FALSE)
> > > >     impDF <- rbind(impDF, tmp, rep(NA, dim(tmp)[2]))
> > > > }
> > > > myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3],
> > > >                     sep="_")
> > > > rownames(impDF) <- myrownames
> > > > impDF <- impDF[,4:44]
> > > > imp <- as.matrix(impDF)
> > > >
> > > > ## Alternative import into list container (not used in
> > > > ## following example)
> > > > filenames <- list.files(pattern="^factor")
> > > > datalist <- lapply(filenames,
> > > >                    function(x) read.delim(x, header=FALSE))
> > > > names(datalist) <- filenames
> > > >
> > > > ## Plot heatmap with heatmap.2
> > > > library(gplots)
> > > > heatmap.2(imp, dendrogram="none", Rowv=F, Colv=F,
> > > >           col=redgreen(75), scale="row", trace="none", key=F)
> > > >
> > > > ## Plot a separate heatmap for each data set using R's image()
> > > > ## function
> > > > index <- matrix(1:505, nrow=5, ncol=101, byrow=T)[,-101]
> > > > par(mfrow=c(1,5))
> > > > for(i in 1:5) {
> > > >     image(scale(t(imp[rev(index[i,]), ])), col=redgreen(75),
> > > >           xaxt="n", yaxt="n", main=i)
> > > > }
> > > >
> > > > ## Example for hierarchical clustering of first data set using
> > > > ## hclust
> > > > y <- imp[1:100, ] # Selects first 100 rows (=1st data set).
> > > > ## Creates a distance matrix using Pearson correlations; here
> > > > ## you want to choose a distance measure that is best for your
> > > > ## data.
> > > > d <- as.dist(1-cor(t(y)))
> > > > ## Performs hierarchical clustering
> > > > hr <- hclust(d, method="complete")
> > > > heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F,
> > > >           col=redgreen(75), scale="row", trace="none")
> > > >
> > > >
> > > >
> > > > On Sat, Dec 12, 2009 at 11:25:05PM +0100, Maxim wrote:
> > > > > Hi,
> > > > >
> > > > > At first: thanks for taking the time!!
> > > > >
> > > > > I could send some of the data and a jpeg that illustrates
> > > > > where I would like to go. Unfortunately the data is large. It
> > > > > is genomics data measuring binding of several factors to
> > > > > specific genomic regions. I'd like to identify clusters where
> > > > > different factors show similar binding behaviour.
> > > > >
> > > > > My current dataset has data for 10 factors. I'm looking at
> > > > > 7000 "sites", each represented by 40 datapoints (that is in
> > > > > 100bp steps from position -2000 to +2000 relative to the
> > > > > sites). Each site represents a certain genomic location and I
> > > > > look at the same sites for every factor.
> > > > >
> > > > > I wonder if this can be done in a straightforward way in R?
> > > > >
> > > > > I attached an example of the data. It is the first 100 "sites"
> > > > > for 5 factors and the corresponding heatmap (for the complete
> > > > > set) after external clustering.
> > > > >
> > > > > Concerning doing the clustering in R: I have no clue how to do
> > > > > clustering with such "multi-dimensional" data within R. I'd be
> > > > > glad if you could point me in the right direction for
> > > > > approaching such a task. The approaches in the literature
> > > > > appear to be quite complex, need lots of CPU power and are
> > > > > scripted in C, I guess for speed reasons. I wonder whether R
> > > > > is fast enough to accomplish such a task in a reasonable time.
> > > > >
> > > > > To explain the data:
> > > > > It is 5 small tab-delimited files, each with data from 100
> > > > > "sites" for 5 factors. The example data corresponds to the
> > > > > bottom of the attached heatmap, so it has clearly positive
> > > > > signals (blue color) at the center position for factor 1 and
> > > > > factor 2.
> > > > >
> > > > > I hope this might help to illustrate my project a little
> > > > > better.
> > > > >
> > > > > Maxim
> > > > >
> > > > >
> > > > >
> > > > > 2009/12/11 Thomas Girke <thomas.girke at ucr.edu>
> > > > >
> > > > > > The example I sent before works, but there was a typo in the
> > > > > > heatmap.2 command where mysort needs to be replaced by
> > > > > > mydata. Like this:
> > > > > >
> > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F,
> > > > > >           trace="none", key=T, cellnote=mysamples)
> > > > > >
> > > > > > In heatmap.2 you have the option to include a color bar on
> > > > > > the left side (the RowSideColors argument) that can be used
> > > > > > to highlight clusters. See the help documentation for more
> > > > > > details.
> > > > > >
> > > > > > The dimensions of heatmap.2 plots can be controlled like for
> > > > > > any other plot in R, using the height and width arguments of
> > > > > > the graphics device, e.g. x11(height=6, width=2) or
> > > > > > pdf(...).
> > > > > >
> > > > > > To provide more specific help, you may want to send a simple
> > > > > > sample data set that illustrates what exactly you are trying
> > > > > > to do. Without this it is really hard to understand your
> > > > > > problem.
> > > > > >
> > > > > > If I were you, I would perform the entire clustering
> > > > > > procedure in R. Is there any good reason not to use R for
> > > > > > this? For hierarchical clustering you can use the hclust
> > > > > > function. A relatively complete list of clustering
> > > > > > algorithms available in R can be found on the cluster task
> > > > > > view page:
> > > > > > http://cran.at.r-project.org/web/views/Cluster.html
> > > > > >
> > > > > > Thomas
> > > > > >
> > > > > > On Fri, Dec 11, 2009 at 10:17:22PM +0100, Maxim wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > what you suggested sounds interesting, but actually I do
> > > > > > > not understand where it is going. Actually I get an error
> > > > > > > doing it like this, as mysort in heatmap.2 is not defined
> > > > > > > (yet).
> > > > > > >
> > > > > > > What did work for me in the meantime is simply to plot the
> > > > > > > complete heatmap at once. What is missing in this approach
> > > > > > > is a label on the left or right side of the heatmap. I'd
> > > > > > > love to have a color-coded block that allows me to see
> > > > > > > where different IDs were plotted (different IDs are
> > > > > > > actually clusters coming from hierarchical clustering not
> > > > > > > performed in R).
> > > > > > >
> > > > > > > Second, I can do it by generating individual heatmaps for
> > > > > > > each ID loaded from individual files. Unfortunately for
> > > > > > > some IDs there are a thousand rows of data, for others
> > > > > > > only 50, but R's heatmap always produces similarly sized
> > > > > > > maps. I'd prefer the height of the individual heatmaps to
> > > > > > > follow the corresponding number of rows rather than
> > > > > > > automatic scaling.
> > > > > > >
> > > > > > > Is there a way to do this in R? I found an old mail in the
> > > > > > > mailing list discussing this point; the result was to use
> > > > > > > TreeView/Cluster, but I cannot get this to work without
> > > > > > > doing clustering (the data is clustered already).
> > > > > > > Additionally, I do not know how to do batch processing in
> > > > > > > TreeView.
> > > > > > >
> > > > > > > Maxim
> > > > > > > 2009/12/10 Thomas Girke <thomas.girke at ucr.edu>
> > > > > > >
> > > > > > > > I am not sure if I understand every part of your problem
> > > > > > > > correctly, but here is an example of how something like
> > > > > > > > this could be done in R. Its main idea is to keep the
> > > > > > > > entire data set in one matrix and use the cell note
> > > > > > > > feature of heatmap.2 for sample tracking.
> > > > > > > >
> > > > > > > > ## Sample matrix for demo purpose. If your
> > > > > > > > y <- matrix(rnorm(50), 10, 5,
> > > > > > > >             dimnames=list(paste("g", 1:10, sep=""),
> > > > > > > >                           paste("t", 1:5, sep="")))
> > > > > > > >
> > > > > > > > ## Sort each row by its values
> > > > > > > > mydata <- t(apply(y, 1, sort))
> > > > > > > >
> > > > > > > > ## Obtain sample labels (column titles) for sorted rows
> > > > > > > > mysamples <- t(apply(y, 1, function(x) names(sort(x))))
> > > > > > > >
> > > > > > > > ## Plot heatmap where the sample labels are given as cell
> > > > > > > > ## notes for tracking purposes
> > > > > > > > library(gplots)
> > > > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F,
> > > > > > > >           col=redgreen(75), scale="row", trace="none",
> > > > > > > >           key=T, cellnote=mysamples)
> > > > > > > >
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Dec 10, 2009 at 03:13:31PM +0100, Maxim wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm stuck with parsing data into R for heatmap
> > > > > > > > > representation.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The data looks like:
> > > > > > > > >
> > > > > > > > > 1 id1 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 2 id1 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 3 id1 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 4 id1 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > .........
> > > > > > > > >
> > > > > > > > > 348 id2 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 349 id2 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 350 id2 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 351 id2 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > .........
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The data is sorted by ID (id1, id2 ... id40) and I
> > > > > > > > > would like to produce 40 heatmaps from it, 1 heatmap
> > > > > > > > > per single ID. The data that has to be plotted is 20
> > > > > > > > > values (x1 to x20). There are different amounts of
> > > > > > > > > data for the respective IDs. In the end I'd like to
> > > > > > > > > have the 40 heatmaps stacked on top of each other,
> > > > > > > > > sorted by ID, with heatmap heights according to the
> > > > > > > > > amount (number of rows) of data. Unfortunately the
> > > > > > > > > individual data lines have to be sorted with respect
> > > > > > > > > to the maximum of the values x1 to x20 in individual
> > > > > > > > > rows. Actually this is not that important, as I guess
> > > > > > > > > this might be easier to realize in the upstream Perl
> > > > > > > > > scripts producing the data.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The data is available as data per ID in individual
> > > > > > > > > files or as a sorted file with the complete dataset
> > > > > > > > > (as shown above).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Is it possible in R to break a file as above into
> > > > > > > > > distinct blocks (depending on ID) and then to process
> > > > > > > > > it (sorting according to maximum, heatmap)?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Which commands do I have to issue for the manipulation
> > > > > > > > > of the data.frame? I tried the
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'd be glad if someone could help me find the right
> > > > > > > > > direction to solve my problem!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best regards
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Maxim
> > > > > > > > >
> > > > > > > > > [[alternative HTML version deleted]]
> > > > > > > > >
> > > > > > > > > _______________________________________________
> > > > > > > > > Bioconductor mailing list
> > > > > > > > > Bioconductor at stat.math.ethz.ch
> > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > > > > Search the archives:
> > > > > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor