problem with data processing in R

Maxim ▴ 170

@maxim-3843

Last seen 11.1 years ago

Hi,

I'm stuck with parsing data into R for heatmap representation. The data looks like this:

1 id1 x1 x2 x3 .... x20
2 id1 x1 x2 x3 .... x20
3 id1 x1 x2 x3 .... x20
4 id1 x1 x2 x3 .... x20
.........
348 id2 x1 x2 x3 .... x20
349 id2 x1 x2 x3 .... x20
350 id2 x1 x2 x3 .... x20
351 id2 x1 x2 x3 .... x20
.........

The data is sorted by ID (id1, id2, ..., id40), and I would like to produce 40 heatmaps from it, one heatmap per ID. The data to be plotted is 20 values (x1 to x20) per row. The number of rows differs between IDs. In the end I'd like to have the 40 heatmaps stacked on top of each other, sorted by ID, with heatmap heights that follow the number of rows of data.

Unfortunately, the individual rows also have to be sorted by the maximum of the values x1 to x20 in each row. Actually this is not that important, as I guess it might be easier to do in the upstream Perl scripts that produce the data.

The data is available either as one file per ID or as a single sorted file with the complete dataset (as shown above).

Is it possible in R to break a file like the one above into distinct blocks (depending on ID) and then to process each block (sorting by maximum, heatmap)?

Which commands do I have to issue to manipulate the data.frame? I tried the

I'd be glad if someone could point me in the right direction to solve my problem!

Best regards
Maxim

PROcess GLAD • 2.5k views

Updated 15.8 years ago by Thomas Girke ★ 1.7k • written 15.8 years ago by Maxim ▴ 170

Thomas Girke ★ 1.7k

@thomas-girke-993

Last seen 18 months ago

United States

I am not sure I understand every part of your problem correctly, but here is an example of how something like this could be done in R. The main idea is to keep the entire data set in one matrix and use the cell note feature of heatmap.2 for sample tracking.

## Sample matrix for demo purpose. If your
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))

## Sort each row by its values
mydata <- t(apply(y, 1, sort))

## Obtain sample labels (column titles) for sorted rows
mysamples <- t(apply(y, 1, function(x) names(sort(x))))

## Plot heatmap where the sample labels are given as cell notes for tracking purposes
library(gplots)
heatmap.2(mysort, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75), scale="row", trace="none", key=T, cellnote=mysamples)

Thomas
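The splitting and sorting asked about in the question could be sketched in base R roughly as follows. This is only an illustration on made-up data: the names df, id and x1..x20 are hypothetical, and the real file would first have to be read with something like read.delim().

```r
## Toy data standing in for the real file: an ID column plus 20 value columns.
## All names here (df, id, x1..x20) are hypothetical.
set.seed(1)
df <- data.frame(id = rep(c("id1", "id2", "id3"), times = c(6, 3, 4)),
                 matrix(rnorm(13 * 20), nrow = 13,
                        dimnames = list(NULL, paste0("x", 1:20))))

## Break the data frame into one block per ID (order of first appearance kept)
blocks <- split(df, factor(df$id, levels = unique(df$id)))

## Within each block, sort the rows by the maximum of x1..x20 (largest first)
sorted_blocks <- lapply(blocks, function(b) {
    rowmax <- apply(b[, -1], 1, max)
    b[order(rowmax, decreasing = TRUE), ]
})

## Number of rows per ID, e.g. for sizing the heatmaps later
block_sizes <- sapply(sorted_blocks, nrow)
```

split() returns a named list, so each block can be passed to a plotting function in a loop, and block_sizes gives the row counts needed to scale the panels.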

15.8 years ago • Thomas Girke ★ 1.7k

Hi,

what you suggested sounds interesting, but I do not understand where it is going. I also get an error running it as posted, because mysort in the heatmap.2 call is not defined (yet).

What did work for me in the meantime is simply to plot the complete heatmap at once. What is missing in this approach is a label on the left or right side of the heatmap: I'd love to have a color-coded block that lets me see where the different IDs were plotted (the IDs are actually clusters coming from hierarchical clustering not performed in R).

Second, I can generate individual heatmaps for each ID loaded from individual files. Unfortunately, for some IDs there are a thousand rows of data, for others only 50, but R's heatmap always produces similarly sized maps. I'd prefer the height of each heatmap to follow its number of rows rather than automatic scaling.

Is there a way to do this in R? I found an old mail on the mailing list discussing this point; the conclusion was to use TreeView/Cluster, but I cannot get this to work without doing clustering (the data is clustered already), and I do not know how to do batch processing in TreeView.

Maxim
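One possible way to get panel heights that follow the row counts is base R's layout() with relative heights, drawing each block with image(). The sketch below uses made-up blocks (the names sizes and blocks and the output file name are hypothetical) and writes to a PDF in a temporary directory; it is not the approach from the thread, just an illustration.

```r
## Toy blocks with very different row counts (hypothetical data)
set.seed(2)
sizes  <- c(id1 = 50, id2 = 200, id3 = 100)
blocks <- lapply(sizes, function(n) matrix(rnorm(n * 20), nrow = n))

## One panel per ID, stacked vertically; relative heights follow the row counts
heights <- sapply(blocks, nrow)

pdf(file.path(tempdir(), "stacked_heatmaps.pdf"), width = 3, height = 8)
layout(matrix(seq_along(blocks), ncol = 1), heights = heights)
par(mar = c(0.2, 4, 0.2, 1))   # thin margins so the panels sit close together
for (id in names(blocks)) {
    m <- blocks[[id]]
    ## image() puts the rows of its input along the x axis, so transpose and
    ## reverse to show the first data row at the top of the panel
    image(t(m[nrow(m):1, ]), col = heat.colors(75), axes = FALSE)
    mtext(id, side = 2, line = 1, las = 1)
}
dev.off()
```

Because the heights are relative, a block with 200 rows gets four times the vertical space of a block with 50 rows, whatever the page size.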

15.8 years ago • Maxim ▴ 170

Thomas Girke ★ 1.7k


The example I sent before works, but there was a typo in the heatmap.2 command: mysort needs to be replaced by mydata. Like this:

heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, trace="none", key=T, cellnote=mysamples)

In heatmap.2 you have the option to include a color bar on the left side that can be used to highlight clusters. See the help documentation for more details.

The dimensions of heatmap.2 plots can be controlled as for any other plot in R, using the height and width arguments of the graphics device, e.g. x11(height=6, width=2) or pdf(...).

To provide more specific help, you may want to send a simple sample data set that illustrates exactly what you are trying to do. Without this it is really hard to understand your problem.

If I were you, I would perform the entire clustering procedure in R. Is there any good reason not to use R for this? For hierarchical clustering you can use the hclust function. A relatively complete list of clustering algorithms available in R can be found on the cluster task view page:
http://cran.at.r-project.org/web/views/Cluster.html

Thomas
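The color bar and the device size mentioned above could be combined along these lines. To keep the sketch free of extra packages it uses base R's heatmap(), which takes a RowSideColors argument; heatmap.2 in gplots accepts an argument of the same name. The data, the cluster labels and the colors are all made up for illustration.

```r
## Toy matrix with three pre-assigned clusters of different sizes (hypothetical)
set.seed(3)
cluster_id <- rep(c("id1", "id2", "id3"), times = c(10, 25, 15))
m <- matrix(rnorm(length(cluster_id) * 20), ncol = 20)

## One colour per cluster, expanded to one colour per row
palette_ids <- c(id1 = "steelblue", id2 = "orange", id3 = "darkgreen")
side_cols   <- unname(palette_ids[cluster_id])

## Rowv = NA / Colv = NA keep the existing row and column order (no reclustering);
## the device height and width set the overall size of the plot
pdf(file.path(tempdir(), "sidebar_heatmap.pdf"), width = 4, height = 8)
heatmap(m, Rowv = NA, Colv = NA, scale = "row",
        RowSideColors = side_cols, labRow = NA)
dev.off()
```

The side bar then shows one colored block per ID next to the rows that belong to it, which is the kind of label asked for in the previous message.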

15.8 years ago • Thomas Girke ★ 1.7k

Hi,

First of all: thanks for taking the time!!

I could send some of the data and a JPEG that illustrates where I would like to go. Unfortunately the data is large. It is genomics data measuring binding of several factors to specific genomic regions. I'd like to identify clusters where different factors show similar binding behaviour.

My current dataset has data for 10 factors. I'm looking at 7000 "sites", each represented by 40 datapoints (in 100bp steps from position -2000 to +2000 relative to the site). Each site represents a certain genomic location, and I look at the same sites for every factor.

I wonder if this can be done straightforwardly in R?

I attached an example of the data. It is the first 100 "sites" for 5 factors and the corresponding heatmap (for the complete set) after external clustering.

Concerning doing the clustering in R: I have no clue how to cluster such "multi-dimensional" data within R. I'd be glad if you could point me in the right direction for approaching such a task. The approaches in the literature appear to be quite complex, need lots of CPU power and are written in C, I guess for speed reasons. I wonder whether R is fast enough to accomplish such a task in a reasonable time.

To explain the data: it is 5 small tab-delimited files, each with data from 100 "sites" for 5 factors. The example data corresponds to the bottom of the attached heatmap, so it has clearly positive signals (blue color) at the center position for factor 1 and factor 2.

I hope this helps to illustrate my project a little better.

Maxim

15.8 years ago • Maxim ▴ 170

Thomas Girke ★ 1.7k


Below are some suggestions for importing your data into R, plotting heatmaps and clustering them using hierarchical clustering. For the clustering, you have to choose a proper distance measure for your data type and of course an efficient partitioning algorithm. The task view site I sent previously provides a good overview of what is available. Clustering 7000 objects in R is not a problem for most clustering methods. Many of them, including hclust, are implemented in C++/Fortran to run reasonably efficiently in time and memory.

I hope this will help to get you started.

Thomas

## Import of data sets (appended in one data frame)
## Note: a row of NA values is inserted to visually separate data sets in heatmap
filenames <- list.files(pattern="factor*") # Requires data files "factorX" in working dir
impDF <- data.frame(NULL)
for(i in filenames) {
    tmp <- read.delim(i, header=FALSE)
    impDF <- rbind(impDF, tmp, rep(NA, dim(tmp)[2]))
}
myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3], sep="_")
rownames(impDF) <- myrownames
impDF <- impDF[,4:44]
imp <- as.matrix(impDF)

## Alternative import into list container (not used in following example)
filenames <- list.files(pattern="factor*")
datalist <- lapply(filenames, function(x) read.delim(x, header=F))
names(datalist) <- filenames

## Plot heatmap with heatmap.2
library(gplots)
heatmap.2(imp, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75), scale="row", trace="none", key=F)

## Plot a separate heatmap for each data set using R's image() function
index <- matrix(1:505, nrow=5, ncol=101, byrow=T)[,-101]
par(mfrow=c(1,5))
for(i in 1:5) {
    image(scale(t(imp[rev(index[i,]), ])), col=redgreen(75), xaxt="n", yaxt="n", main=i)
}

## Example for hierarchical clustering of first data set using hclust
y <- imp[1:100, ] # Selects first 100 rows (=1st data set).
d <- as.dist(1-cor(t(y))) # Creates a distance matrix using Pearson correlations; here you
                          # want to choose a distance measure that is best for your data.
hr <- hclust(d, method="complete") # Performs hierarchical clustering
heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F, col=redgreen(75), scale="row", trace="none")

ADD COMMENT • link 15.8 years ago Thomas Girke ★ 1.7k


Thank you, that works great. I was not aware of the image() function. Scaling definitely helped to improve the quality of the heatmaps. In addition, I learned a lot about handling and importing data.

I have to think about the clustering itself. Your suggestion works fine for clustering one dataset. Unfortunately, the clustering has to be performed by comparing all 5 datasets with each other: a given row in dataset 1 has to stay aligned with the same row in datasets 2-5. When I cluster the datasets individually, each set is clustered differently.

Despite this, you definitely provided a good starting point for understanding how to exploit R for my project.

Maxim
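The alignment requirement described in this reply (the same site has to sit in the same row position in every dataset) can be sketched as follows. This is only a minimal illustration under stated assumptions: `datalist` is a simulated stand-in for the five "factorX" files, and the choice to cluster on the first dataset is arbitrary, not something the thread prescribes.

```r
## Hypothetical stand-in for the five "factorX" data sets:
## 5 matrices with the same 100 sites (rows) and 40 positions (columns).
set.seed(1)
sites <- paste("site", 1:100, sep="")
datalist <- lapply(1:5, function(i) {
    matrix(rnorm(100 * 40), nrow=100, ncol=40, dimnames=list(sites, NULL))
})
names(datalist) <- paste("factor", 1:5, sep="")

## Cluster the rows of the first data set only (Pearson correlation distance)
d <- as.dist(1 - cor(t(datalist[[1]])))
hr <- hclust(d, method="complete")

## Apply the same row order to every data set so that sites stay aligned
aligned <- lapply(datalist, function(m) m[hr$order, , drop=FALSE])

## Check that all reordered data sets list the sites in the same order
same_order <- all(sapply(aligned, function(m) {
    identical(rownames(m), rownames(aligned[[1]]))
}))
```

With this ordering, row i of every element of `aligned` refers to the same site, so the per-dataset heatmaps can be compared side by side.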

ADD REPLY • link 15.8 years ago Maxim ▴ 170


Thomas Girke ★ 1.7k


Great. The remaining parts of your analysis seem to be solvable with simple subsetting and sorting routines for data objects in R. Here are some more suggestions that might help:

    ## Accessing data components of hclust objects
    names(hr)
    hr$labels; hr$order

    ## Return object labels in the order of a hierarchical clustering result
    hr$labels[hr$order]

    ## Sorting vector to join rows of the same type in a data matrix
    ## (e.g. your factorX data sets)
    sortv <- as.vector(matrix(1:500, nrow=5, ncol=100, byrow=TRUE))
    somematrix[sortv, ]

Thomas
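To see what the sorting vector in this answer does, it may help to run the same construction on a much smaller case. The sketch below assumes 3 data sets of 4 sites each, stacked row-wise; the matrix `stacked` and its row names are invented for illustration only.

```r
## Toy example: 3 data sets with 4 sites each, stacked row-wise
## (rows 1-4 = set A, rows 5-8 = set B, rows 9-12 = set C).
nsets <- 3
nsites <- 4
rn <- paste(rep(c("A", "B", "C"), each=nsites), rep(1:nsites, times=nsets), sep="_")
stacked <- matrix(seq_len(nsets * nsites * 2), nrow=nsets * nsites, ncol=2,
                  dimnames=list(rn, NULL))

## Same construction as sortv in the post, with smaller dimensions:
## fill by row, then read the matrix back column by column
sortv <- as.vector(matrix(seq_len(nsets * nsites), nrow=nsets, ncol=nsites, byrow=TRUE))
sortv
## 1 5 9 2 6 10 3 7 11 4 8 12

## Rows belonging to the same site are now adjacent
interleaved <- stacked[sortv, ]
rownames(interleaved)
## "A_1" "B_1" "C_1" "A_2" "B_2" "C_2" ...
```

The same indexing applies to the 5 x 100 case in the post, provided the matrix holds exactly the 500 data rows in stacked order.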

ADD COMMENT • link 15.8 years ago Thomas Girke ★ 1.7k


You may want to check out coXpress, which has some functionality to compare different clustered datasets:

http://www.biomedcentral.com/1471-2105/7/509

On Sun, Dec 13, 2009 at 11:38:54AM +0100, Maxim wrote:
> Thank you,
>
> I have to think about the clustering itself. Your suggestions works fine to cluster 1 dataset. Unfortunately clustering has to be performed comparing all 5 datasets with each other. That means a given row in dataset 1 has to be aligned with the same rows in datasets 2-5. When I perform clustering of the individual datasets each set will be clustered differently.

ADD REPLY • link 15.8 years ago michael watson IAH-C ★ 3.4k


coXpress indeed appears to have some functionality useful for my task. Thanks for pointing me there!

Maxim

ADD REPLY • link 15.8 years ago Maxim ▴ 170


This works nicely. It means I cluster the data for one factor and sort the other factors according to that clustering result. But what next?

What I have to do is find (cluster together) patterns that are similar *between* different factors, not only within the data of one factor. As I mentioned, the attached data was already clustered in this manner, and the patterns of factors 1-3 are obviously somewhat similar. Therefore, clustering does not make much sense with this data.

I'm not sure whether I am missing some aspect of what you are trying to explain, but as far as I understand it, something essential is still missing to accomplish the clustering analysis. When you look at the attached heatmap, you'll find clusters where factors 1 and 2 show similar patterns, others for 2 and 3, others again for 1, 2 and 3, and so forth. To sort out these differences, some additional trick is needed.

Despite this, my R programming skills are getting better and better, and I think I will soon replace lots of my Perl code with R code, especially the data container manipulation tools.

Maxim
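One possible way to cluster on patterns across factors, offered here only as a sketch and not as the approach used in this thread, is to concatenate the per-factor profiles of each site into one long vector and cluster the sites on that joint profile. All data below are simulated; the number of factors, the number of clusters `k=4`, and the object names are assumptions made for illustration.

```r
## Hypothetical input: one matrix per factor, same sites (rows) in the same order
set.seed(42)
nsites <- 60
npos <- 40
sites <- paste("site", seq_len(nsites), sep="")
factors <- paste("factor", 1:3, sep="")
datalist <- lapply(factors, function(f) {
    matrix(rnorm(nsites * npos), nrow=nsites, ncol=npos,
           dimnames=list(sites, paste(f, seq_len(npos), sep="_")))
})
names(datalist) <- factors

## Concatenate the profiles column-wise: each site becomes one long vector
## holding its signal for all factors
joint <- do.call(cbind, datalist)

## Cluster the sites on the joint profile (Pearson correlation distance)
d <- as.dist(1 - cor(t(joint)))
hr <- hclust(d, method="complete")

## Cut the tree into a fixed number of clusters (k is an arbitrary choice here)
clusters <- cutree(hr, k=4)

## Mean signal per cluster (rows) and factor (columns), to see which
## factors contribute to which cluster
cluster_means <- sapply(datalist, function(m) tapply(rowMeans(m), clusters, mean))
```

Because every site is represented by its signal for all factors at once, sites end up in the same cluster only when their combined pattern is similar. If the factors differ strongly in signal range, scaling each block before `cbind` is probably worth considering, and the distance measure should be chosen to fit the data.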
> > > > > > > > > > > > Maxim > > > > > > 2009/12/10 Thomas Girke <thomas.girke@ucr.edu> > > > > > > > > > > > > > I am not sure if I understand every part of your problem > correctly, > > > > > > > but here is an example how something like this could be done in > R. > > > > > > > Its main idea is to keep the entire data set in one matrix and > use > > > > > > > the cell note feature of heatmap.2 for sample tracking. > > > > > > > > > > > > > > ## Sample matrix for demo purpose. If your > > > > > > > y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, > > > sep=""), > > > > > > > paste("t", 1:5, sep=""))) > > > > > > > > > > > > > > ## Sort each row by its values > > > > > > > mydata <- t(apply(y, 1, sort)) > > > > > > > > > > > > > > ## Obtain sample labels (column titles) for sorted rows > > > > > > > mysamples <- t(apply(y, 1, function(x) names(sort(x)))) > > > > > > > > > > > > > > ## Plot heatmap where the sample labels are given as cell notes > for > > > > > > > tracking purposes > > > > > > > library(gplots) > > > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, > > > col=redgreen(75), > > > > > > > scale="row", trace="none", key=T, cellnote=mysamples) > > > > > > > > > > > > > > Thomas > > > > > > > > > > > > > > > > > > > > > On Thu, Dec 10, 2009 at 03:13:31PM +0100, Maxim wrote: > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > I'm stuck with parsing data into R for heatmap > representation. > > > > > > > > > > > > > > > > > > > > > > > > The data looks like: > > > > > > > > > > > > > > > > 1 id1 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > 2 id1 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > 3 id1 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > 4 id1 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > ......... > > > > > > > > > > > > > > > > 348 id2 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > 349 id2 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > 350 id2 x1 x2 x3 .... 
x20 > > > > > > > > > > > > > > > > 351 id2 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > ......... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The data is sorted for the IDs (id1,id2 .....id40) and I like > to > > > > > produce > > > > > > > 40 > > > > > > > > heatmaps thereof, 1 heatmap per data corresponding to a > single > > > ID. > > > > > The > > > > > > > data > > > > > > > > that has to be plotted is 20 values (x1 to x20). There is > > > different > > > > > > > amounts > > > > > > > > of data for respective IDs. In the end I'd like to have the > 40 > > > > > heatmaps > > > > > > > > stacked on top of each other sorted by ID and heatmap heights > > > > > according > > > > > > > to > > > > > > > > the amount (number of rows) of data. Unfortunately the > individual > > > > > data > > > > > > > lines > > > > > > > > have to be sorted with respect to the maximum of the values > X1 to > > > x20 > > > > > in > > > > > > > > individual rows. Actually this not that important as I guess > this > > > > > might > > > > > > > be > > > > > > > > easier to realize in upstream Perl scripts producing the > data. > > > > > > > > > > > > > > > > > > > > > > > > The data is available as data per ID in individual files or > as a > > > > > sorted > > > > > > > file > > > > > > > > with the complete dataset (as shown above). > > > > > > > > > > > > > > > > > > > > > > > > Is it possible in R to break a file as above into distinct > blocks > > > > > > > (depending > > > > > > > > on ID) and then to process it (sorting according to maximum, > > > > > heatmap)? > > > > > > > > > > > > > > > > > > > > > > > > Which commands do I have to issue for the manipulation of the > > > > > data.frame? > > > > > > > I > > > > > > > > tried the > > > > > > > > > > > > > > > > > > > > > > > > I'd be glad if someone could help me finding the correct > > > direction > > > > > to > > > > > > > solve > > > > > > > > my problem! 
> > > > > > > > > > > > > > > > > > > > > > > > Best regards > > > > > > > > > > > > > > > > > > > > > > > > Maxim > > > > > > > > > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > Bioconductor mailing list > > > > > i> > > Bioconductor@stat.math.ethz.ch > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > > > > > Search the archives: > > > > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [[alternative HTML version deleted]]
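For the original question in this thread (break one sorted table into per-ID blocks, order the rows of each block by their maximum, and stack the heatmaps with heights that follow the row counts), a minimal base-R sketch might look like the one below. The toy data frame `df`, its column layout and the helper `order_by_rowmax` are made up for illustration, and `layout()` with `image()` is used here instead of `heatmap.2`, which sets up its own plot layout and so is awkward to stack.

```r
## Toy data standing in for the real file: an ID column plus 20 value columns.
## (Assumption: the real data would come from read.delim() and hold 40 IDs.)
set.seed(1)
ids <- rep(c("id1", "id2", "id3"), times = c(8, 20, 5))
df <- data.frame(id = ids, matrix(rnorm(length(ids) * 20), ncol = 20))

## Hypothetical helper: order the rows of a numeric matrix by their row maximum
order_by_rowmax <- function(m) {
    m[order(apply(m, 1, max), decreasing = TRUE), , drop = FALSE]
}

## Break the data frame into one numeric matrix per ID, then sort each block
blocks <- lapply(split(df[, -1], df$id), as.matrix)
blocks <- lapply(blocks, order_by_rowmax)

## One panel per ID, stacked vertically; panel heights follow the row counts
n <- sapply(blocks, nrow)
layout(matrix(seq_along(blocks), ncol = 1), heights = n)
par(mar = c(0.5, 4, 0.5, 0.5))
for (id in names(blocks)) {
    b <- blocks[[id]]
    ## image() draws matrix rows along the x axis, so transpose and flip
    ## to get the first row of the block at the top of the panel
    image(t(b[nrow(b):1, , drop = FALSE]), col = heat.colors(75),
          xaxt = "n", yaxt = "n", ylab = id)
}
```

With many IDs of very different sizes the smallest panels can become too thin for their margins, so a taller device (for example `pdf(..., height=...)`) or smaller `mar` values may be needed.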

ADD REPLY • link 15.8 years ago Maxim ▴ 170

Thomas Girke ★ 1.7k

@thomas-girke-993

Last seen 18 months ago

United States

Maxim,

Certainly, the sorted heatmap example is only useful for visualization purposes, while the clustering example is included to show how to perform clustering on your data in R in general, not to provide a final solution for your research problem.

If your intention is to cluster the final clustering results of your factors, then you could compute a similarity/distance metric among their clustering results using a set similarity measure such as the Jaccard partitioning index. This way you can cluster entire clustering data sets. For hierarchical clustering results you would first need to generate discrete clusters from the dendrograms. R's cutree function is very useful for this. Some examples and links to related R libraries for comparing clustering (partitioning) results are available here:

http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#clustering_jaccard

Similarly, you could compute membership similarities among the clusters obtained for all your factors, again using the Jaccard partitioning index or the variation of information criterion. This could be used to cluster all the groupings obtained from your clustering results ("clustering of clusters").

Since I do not know enough about your data sets or the genomics technology you are using in this case, I am not sure if any of this makes much sense. But perhaps it is useful for exploring your options...

Thomas

On Sun, Dec 13, 2009 at 10:28:11PM +0100, Maxim wrote:
> This works nicely.
>
> It means I cluster the data for one factor and sort the other factors according to the clustering result. But what next?
>
> What I will have to do is to find (cluster together) patterns that are similar *between* different factors and not within the data of one factor only. As I mentioned the attached data was already clustered in such a manner and obviously the patterns of factor 1-3 are somewhat similar. Therefore clustering makes not much sense with this data.
>
> I'm not sure whether I might just miss some aspect you try to explain but as far as I got it, there is some essential thing still missing in order to accomplish the clustering analysis. When you look at the attached heatmap you'll find clusters, where factor 1 and 2 are showing similar patterns, others for 2 and 3, again others for 1 and 2 and 3 and so forth. To sort out this differences some additional trick has to be done.
>
> Despite of this my R programming skills get better and better and I think I will soon substitute lots of my Perl-Code with R-Code, especially that for the data container manipulation tools.
>
> Maxim
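The cutree-plus-Jaccard idea from the reply above could be sketched roughly as follows. This is only an illustration under assumptions: the three toy matrices stand in for per-factor data measured at the same sites, `k = 4` is an arbitrary cut, and `jaccard_partition` and `cluster_sites` are hypothetical helpers written here in base R (a simple pair-counting form of the Jaccard index), not functions from any package mentioned in the thread.

```r
## Hypothetical helper: pair-counting Jaccard index between two partitions
## of the same objects (1 = identical groupings, 0 = no co-clustered pair shared)
jaccard_partition <- function(cl1, cl2) {
    same1 <- outer(cl1, cl1, "==")   # TRUE where a pair shares a cluster in cl1
    same2 <- outer(cl2, cl2, "==")   # TRUE where a pair shares a cluster in cl2
    up <- upper.tri(same1)           # count every pair of objects once
    both   <- sum(same1[up] & same2[up])
    either <- sum(same1[up] | same2[up])
    both / either                    # NaN if no pair is co-clustered at all
}

## Toy "factors": same sites in rows, 40 positions in columns (made-up data)
set.seed(1)
nsites <- 60
f1 <- matrix(rnorm(nsites * 40), nrow = nsites)
f2 <- f1 + matrix(rnorm(nsites * 40, sd = 0.3), nrow = nsites)  # built from f1
f3 <- matrix(rnorm(nsites * 40), nrow = nsites)                 # independent of f1

## Hypothetical helper: cluster the sites of one factor and cut the tree
cluster_sites <- function(m, k = 4) {
    d <- as.dist(1 - cor(t(m)))      # correlation distance, as used in the thread
    cutree(hclust(d, method = "complete"), k = k)
}
cl <- lapply(list(f1 = f1, f2 = f2, f3 = f3), cluster_sites)

## Pairwise Jaccard similarities between the per-factor clusterings
jac <- sapply(cl, function(a) sapply(cl, function(b) jaccard_partition(a, b)))

## Turn similarities into distances and cluster the clusterings themselves
hc <- hclust(as.dist(1 - jac))
```

The matrix `jac` can be inspected directly, or `plot(hc)` shows which factors produce similar groupings of the sites. The choice of `k` and of the distance measure would have to be adapted to the real data.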

ADD COMMENT • link 15.8 years ago Thomas Girke ★ 1.7k

Hi,

after having addressed most of my questions I have a minor problem with the indexing of my data.

After I visualized all of my data in heatmaps I'd like to plot average profiles of each cluster for each factor. I added a new (1st) column to my data indicating the cluster ID.

Right now I do this like:

    myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3], sep="_")
    rownames(impDF) <- myrownames
    colnames(impDF, do.NULL = TRUE, prefix = "col")
    total<-nrow(impDF)
    clusters <- subset (impDF , V1 == 1)
    clusters <- clusters[,5:45]
    clust <- as.matrix(clusters)
    a<-nrow(clust)
    sizefactor<- a/total
    col=(a/count)
    index <- matrix(1:a, nrow=count, ncol=col, byrow=T)[,-101]

Now I'd like to bind the individual factors' data together in a way that I get a continuous profile for factors 1-5 (meanwhile 6). I do this like

    g<-clust[index[1,],]
    h<-clust[index[2,],]
    i<-clust[index[3,],]
    j<-clust[index[4,],]
    k<-clust[index[5,],]
    l<-clust[index[6,],]
    m<-cbind(g,h,i,j,k,l)
    walk <- seq(1, ncol(m), by=10)
    pos<-c()
    sig<-c()
    for (k in 1:ncol(m)) {
        pos<-append(pos, rep(walk[k], nrow(m)))
        sig<-append(sig,as.vector(m[,k]))
    }
    par(mar = c(5, 4, 4, 2) + 0.3)
    avg <- apply(m, 2, mean)
    plot( avg, col=0,lty=2, lwd =1,)
    lines(avg, col=8, lwd=2, type = "l")

Question 1: the cbind portion looks a bit complicated, how can this be substituted by simpler code?

Question 2: each time I do the plot, I get circles plotted, irrespective of what I choose for lty (therefore I plot with col=0 and then do lines(...)). What is the explanation for this strange behavior?

Maxim

2009/12/13 Thomas Girke <thomas.girke@ucr.edu>

> Maxim,
>
> Certainly, the sorted heatmap example is only useful for visualization purposes, while the clustering example is included to show how to perform clustering on your data in R in general not to provide a final solution for your research problem.
>
> If your intention is to cluster the final clustering results of your factors then you could compute a similarity/distance metric among their clustering results using a set similarity measure such as the Jaccard partitioning index. This way you can cluster entire clustering data sets. For hierarchical clustering results you would first need to generate discrete clusters from the dendrogams. R's cutree function is very useful for this. Some examples and links to related R libraries for clustering partitioning results are available here:
>
> http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#clustering_jaccard
>
> Similarly, you could compute membership similarities among the clusters obtained for all your factors using again the Jaccard partitioning index or the variation of information criterion. This could be used to cluster all the groupings obtained from your clustering results ("clustering of clusters").
>
> Since I do not know enough about your data sets or the genomics technology you are using this case, I am not sure if any of this makes much sense. But perhaps it is useful for exploring your options...
>
> Thomas
>
> On Sun, Dec 13, 2009 at 10:28:11PM +0100, Maxim wrote:
> > This works nicely.
> >
> > It means I cluster the data for one factor and sort the other factors according to the clustering result. But what next?
> >
> > What I will have to do is to find (cluster together) patterns that are similar *between* different factors and not within the data of one factor only. As I mentioned the attached data was already clustered in such a manner and obviously the patterns of factor 1-3 are somewhat similar. Therefore clustering makes not much sense with this data.
> >
> > I'm not sure whether I might just miss some aspect you try to explain but as far as I got it, there is some essential thing still missing in order to accomplish the clustering analysis. When you look at the attached heatmap you'll find clusters, where factor 1 and 2 are showing similar patterns, others for 2 and 3, again others for 1 and 2 and 3 and so forth. To sort out this differences some additional trick has to be done.
> >
> > Despite of this my R programming skills get better and better and I think I will soon substitute lots of my Perl-Code with R-Code, especially that for the data container manipulation tools.
> >
> > Maxim
> >
> > 2009/12/13 Thomas Girke <thomas.girke@ucr.edu>
> >
> > > Great. - The remaining parts of your analysis seem to be solvable with simple subsetting and sorting routines of data objects in R.
> > >
> > > Here are some more suggestions that might help you here:
> > >
> > > ## Accessing data components of hclust objects:
> > > names(hr)
> > > hr$labels; hr$order
> > >
> > > ## How to return object labels in the order of a hierarchical clustering result:
> > > hr$labels[hr$order]
> > >
> > > ## Sorting vector to join rows of same type in a data matrix (e.g. your factorX data sets)
> > > sortv <- as.vector(matrix(1:500, nrow=5, ncol=100, byrow=T))
> > > somematrix[sortv, ]
> > >
> > > Thomas
> > >
> > > On Sun, Dec 13, 2009 at 11:38:54AM +0100, Maxim wrote:
> > > > Thank you,
> > > >
> > > > that works great. I was not aware of the "image" function. To do scaling helped definitely o improve the quality of the heatmaps. In addition I learned a lot about handling/importing data.
> > > >
> > > > I have to think about the clustering itself. Your suggestions works fine to cluster 1 dataset. Unfortunately clustering has to be performed comparing all 5 datasets with each other. That means a given row in dataset 1 has to be aligned with the same rows in datasets 2-5. When I perform clustering of the individual datasets each set will be clustered differently.
> > > >
> > > > Despite of this you definitely provided a good starting point to better understand how to exploit R for my project.
> > > >
> > > > Maxim
> > > >
> > > > 2009/12/13 Thomas Girke <thomas.girke@ucr.edu>
> > > >
> > > > > Below are some suggestions for importing your data into R, plotting heatmaps and clustering them using hierarchical clustering. For the clustering, you have to choose a proper distance measure for your data type and of course an efficient partitioning algorithm. The task view site I sent previously provides a good overview as to what is available. Clustering of 7000 objects in R is not a problem for most clustering methods. Many of them, including hclust, are implemented in C++/Fortran to run decently time and memory efficient.
> > > > >
> > > > > I hope this will help to get you started.
> > > > >
> > > > > Thomas
> > > > >
> > > > > ## Import of data sets (appended in one data frame)
> > > > > ## Note: a row of NA values is inserted to visually separate data sets in heatmap
> > > > > filenames <- list.files(pattern="factor*") # Requires data files "factorX" in working dir
> > > > > impDF <- data.frame(NULL)
> > > > > for(i in filenames) {
> > > > >     tmp <- read.delim(i, header=FALSE)
> > > > >     impDF <- rbind(impDF, tmp, rep(NA, dim(tmp)[2]))
> > > > > }
> > > > > myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3], sep="_")
> > > > > rownames(impDF) <- myrownames
> > > > > impDF <- impDF[,4:44]
> > > > > imp <- as.matrix(impDF)
> > > > >
> > > > > ## Alternative import into list container (not used in following example)
> > > > > filenames <- list.files(pattern="factor*")
> > > > > datalist <- lapply(filenames, function(x) read.delim(x, header=F))
> > > > > names(datalist) <- filenames
> > > > >
> > > > > ## Plot heatmap with heatmap.2
> > > > > library(gplots)
> > > > > heatmap.2(imp, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75), scale="row", trace="none", key=F)
> > > > >
> > > > > ## Plot a separate heatmap for each data set using R's image() function
> > > > > index <- matrix(1:505, nrow=5, ncol=101, byrow=T)[,-101]
> > > > > par(mfrow=c(1,5))
> > > > > for(i in 1:5) {
> > > > >     image(scale(t(imp[rev(index[i,]), ])), col=redgreen(75), xaxt="n", yaxt="n", main=i)
> > > > > }
> > > > >
> > > > > ## Example for hierarchical clustering of first data set using hclust
> > > > > y <- imp[1:100, ] # Selects first 100 rows (=1st data set).
> > > > > d <- as.dist(1-cor(t(y))) # Creates a distance matrix using Pearson correlations; here you
> > > > >                           # want to choose a distance measure that is best for your data.
> > > > > hr <- hclust(d, method="complete") # Performs hierarchical clustering
> > > > > heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F, col=redgreen(75), scale="row", trace="none")
> > > > >
> > > > > On Sat, Dec 12, 2009 at 11:25:05PM +0100, Maxim wrote:
> > > > > > Hi,
> > > > > >
> > > > > > At first: thanks for taking the time!!
> > > > > >
> > > > > > I could send some of the data and a jpeg that illustates where I would like to go to. Unfortunately the data is large. It is genomics data measuring binding of several factors to specific genomic regions. I'd like to identify clusters where different factors show similar binding behaviour.
> > > > > >
> > > > > > My current dataset has data for 10 factors. I'm looking at 7000 "sites", each represented by 40 datapoints (that is in 100bp steps from position -2000 to +2000 relative to the sites). Each site represents a certain genomic location and I look at the same sites for every factor.
> > > > > >
> > > > > > I wonder if this can be done straightforward in R?
> > > > > >
> > > > > > I attached an example of the data. It is the first 100 "sites" for 5 factors and the corresponding heatmap (for the complete set) after external clustering.
> > > > > >
> > > > > > Concerning doing the clustering in R: I have no clue how to do clustering with such "mutli-dimensional" data within R. I'd be glad in case you could point me at the right direction how to approach such a task. The approaches in the literature appear to be quite complex and need lots of CPU power and are scripted in C, I guess for speed reasons. I wonder whether R is fast enough to accomplish such a task in a reasonable time.
> > > > > >
> > > > > > To explain the data:
> > > > > > It is 5 small tab-delimited files, each with data from 100 "sites" for 5 factors. The example data corresponds to the bottom of the attached heatmap, so it is having clearly positive signals (blue color) at the center position for factor 1 and factor 2.
> > > > > >
> > > > > > I hope this might help to illustrate my project a little better.
> > > > > >
> > > > > > Maxim
> > > > > >
> > > > > > 2009/12/11 Thomas Girke <thomas.girke@ucr.edu>
> > > > > >
> > > > > > > The example I send before works but there was a typo in the heatmap.2 command where mysort needs to replaced by mydata. Like this:
> > > > > > >
> > > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, trace="none", key=T, cellnote=mysamples)
> > > > > > >
> > > > > > > In heatmap.2 you have the option to include a color bar on the left side that can be used to highlight clusters.
See the help documentation > for > > > more > > > > > > > details. > > > > > > > > > > > > > > The dimensions of heatmap.2 plots can be controlled like for > any > > > other > > > > > plot > > > > > > > in > > > > > > > R, using the hight and width arguments, e.g. x11(height=6, > width=2) > > > or > > > > > > > pdf(...). > > > > > > > > > > > > > > To provide more specific help, you may want to send a simple > sample > > > > > data > > > > > > > set that > > > > > > > illustrates what you are trying to do exactly. Without this it > is > > > > > really > > > > > > > hard > > > > > > > to understand your problem. > > > > > > > > > > > > > > If I were you then I would perform the entire clustering > procedure > > > in > > > > > R. Is > > > > > > > there any > > > > > > > good reason not use R for this? For hierarchical clustering you > can > > > use > > > > > the > > > > > > > hclust function. > > > > > > > A relatively complete list of clustering algorithms available > in R > > > can > > > > > be > > > > > > > found on > > > > > > > the cluster task view page: > > > > > > > http://cran.at.r-project.org/web/views/Cluster.html > > > > > > > > > > > > > > Thomas > > > > > > > > > > > > > > On Fri, Dec 11, 2009 at 10:17:22PM +0100, Maxim wrote: > > > > > > > > Hi, > > > > > > > > > > > > > > > > what you suggested sounds interesting but actually I do not > > > > > understand > > > > > > > where > > > > > > > > it is going to. Actually I'll get an error doing it like this > as > > > > > mysort > > > > > > > in > > > > > > > > heatmap.2 is not defined (yet). > > > > > > > > > > > > > > > > What did work for me in meantime, is simply to plot the > complete > > > > > heatmap > > > > > > > at > > > > > > > > once. 
What is missing in this approach is a label on the left > or > > > > > right > > > > > > > side > > > > > > > > of the heatmap, I'd love to have a colorcoded block that > allows > > > me to > > > > > > > see, > > > > > > > > where different IDs were plotted (different IDs are actually > > > clusters > > > > > > > coming > > > > > > > > from hierarchical clustering not performed in R). > > > > > > > > > > > > > > > > Second I can do it by generating individual heatmaps for each > ID > > > > > loaded > > > > > > > from > > > > > > > > individual files, unfortunately for some IDs there are > thousand > > > rows > > > > > of > > > > > > > > data, for others only 50. But R's heatmap always produces > > > similarly > > > > > sized > > > > > > > > maps. I'd prefer to have the height of the individual > heatmaps > > > > > according > > > > > > > to > > > > > > > > the corresponding number of rows rather than automatic > scaling. > > > > > > > > > > > > > > > > Is there a way to do this in R? I found an old mail in the > > > mailing > > > > > list > > > > > > > > discussing this point, the result was to use > TreeView/Cluster, > > > but I > > > > > > > cannot > > > > > > > > get this to work without doing clustering (the data is > clustered > > > > > > > already), > > > > > > > > additionally I do not know how to do batch processing in > > > TreeView. > > > > > > > > > > > > > > > > Maxim > > > > > > > > 2009/12/10 Thomas Girke <thomas.girke@ucr.edu> > > > > > > > > > > > > > > > > > I am not sure if I understand every part of your problem > > > correctly, > > > > > > > > > but here is an example how something like this could be > done in > > > R. > > > > > > > > > Its main idea is to keep the entire data set in one matrix > and > > > use > > > > > > > > > the cell note feature of heatmap.2 for sample tracking. > > > > > > > > > > > > > > > > > > ## Sample matrix for demo purpose. 
If your > > > > > > > > > y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", > 1:10, > > > > > sep=""), > > > > > > > > > paste("t", 1:5, sep=""))) > > > > > > > > > > > > > > > > > > ## Sort each row by its values > > > > > > > > > mydata <- t(apply(y, 1, sort)) > > > > > > > > > > > > > > > > > > ## Obtain sample labels (column titles) for sorted rows > > > > > > > > > mysamples <- t(apply(y, 1, function(x) names(sort(x)))) > > > > > > > > > > > > > > > > > > ## Plot heatmap where the sample labels are given as cell > notes > > > for > > > > > > > > > tracking purposes > > > > > > > > > library(gplots) > > > > > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, > > > > > col=redgreen(75), > > > > > > > > > scale="row", trace="none", key=T, cellnote=mysamples) > > > > > > > > > > > > > > > > > > Thomas > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Dec 10, 2009 at 03:13:31PM +0100, Maxim wrote: > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm stuck with parsing data into R for heatmap > > > representation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The data looks like: > > > > > > > > > > > > > > > > > > > > 1 id1 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > > > > > 2 id1 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > > > > > 3 id1 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > > > > > 4 id1 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > > > > > ......... > > > > > > > > > > > > > > > > > > > > 348 id2 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > > > > > 349 id2 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > > > > > 350 id2 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > > > > > 351 id2 x1 x2 x3 .... x20 > > > > > > > > > > > > > > > > > > > > ......... 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The data is sorted for the IDs (id1,id2 .....id40) and I > like > > > to > > > > > > > produce > > > > > > > > > 40 > > > > > > > > > > heatmaps thereof, 1 heatmap per data corresponding to a > > > single > > > > > ID. > > > > > > > The > > > > > > > > > data > > > > > > > > > > that has to be plotted is 20 values (x1 to x20). There is > > > > > different > > > > > > > > > amounts > > > > > > > > > > of data for respective IDs. In the end I'd like to have > the > > > 40 > > > > > > > heatmaps > > > > > > > > > > stacked on top of each other sorted by ID and heatmap > heights > > > > > > > according > > > > > > > > > to > > > > > > > > > > the amount (number of rows) of data. Unfortunately the > > > individual > > > > > > > data > > > > > > > > > lines > > > > > > > > > > have to be sorted with respect to the maximum of the > values > > > X1 to > > > > > x20 > > > > > > > in > > > > > > > > > > individual rows. Actually this not that important as I > guess > > > this > > > > > > > might > > > > > > > > > be > > > > > > > > > > easier to realize in upstream Perl scripts producing the > > > data. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The data is available as data per ID in individual files > or > > > as a > > > > > > > sorted > > > > > > > > > file > > > > > > > > > > with the complete dataset (as shown above). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Is it possible in R to break a file as above into > distinct > > > blocks > > > > > > > > > (depending > > > > > > > > > > on ID) and then to process it (sorting according to > maximum, > > > > > > > heatmap)? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Which commands do I have to issue for the manipulation of > the > > > > > > > data.frame? 
> > > > > > > > > I > > > > > > > > > > tried the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd be glad if someone could help me finding the correct > > > > > direction > > > > > > > to > > > > > > > > > solve > > > > > > > > > > my problem! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best regards > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Maxim > > > > > > > > > > > > > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > Bioconductor mailing list > > > > > > > i> > > Bioconductor@stat.math.ethz.ch > > > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > > > > > > > Search the archives: > > > > > > > > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [[alternative HTML version deleted]]
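The approach discussed in this exchange (cluster the sites of one factor with hclust, then sort the other factors by that result via `hr$order`) can be sketched roughly as follows. This is only an illustration on random data: the matrix `imp`, the sizes `n_factors` and `n_sites`, and the row labels are assumptions made up for the example, not the original data set.

```r
## Hypothetical stand-in data: 5 factors, 20 sites each, 8 positions per site.
## Rows are stacked factor by factor (rows 1-20 = factor 1, 21-40 = factor 2, ...).
set.seed(42)
n_factors <- 5    # assumption for the example
n_sites   <- 20   # assumption for the example
imp <- matrix(rnorm(n_factors * n_sites * 8), n_factors * n_sites, 8)
rownames(imp) <- paste(rep(paste("factor", 1:n_factors, sep = ""), each = n_sites),
                       paste("site", 1:n_sites, sep = ""), sep = "_")

## Cluster the sites of the first factor only
y  <- imp[1:n_sites, ]
d  <- as.dist(1 - cor(t(y)))           # Pearson correlation distance
hr <- hclust(d, method = "complete")

## Row numbers of every factor block, one row of 'index' per factor
index <- matrix(seq_len(nrow(imp)), nrow = n_factors, byrow = TRUE)

## Apply the same site order (hr$order) to each factor block
sorted <- lapply(seq_len(n_factors), function(i) imp[index[i, hr$order], ])
names(sorted) <- paste("factor", 1:n_factors, sep = "")

## Site labels now line up across all factors
sapply(sorted, function(x) sub("^factor[0-9]+_", "", rownames(x)))[1:3, ]
```

Each element of `sorted` can then be passed to image() or heatmap.2 with row reordering switched off, so that a given row shows the same site in every panel. Note that this only orders the sites by the first factor's pattern; it does not by itself find patterns shared between factors.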

ADD REPLY • link 15.8 years ago Maxim ▴ 170

Maxim,

Could you provide a simple example data set along with your code? That would make it easier to reproduce what you are trying to do. I no longer have the original data set that you sent by email. Adapting your code to a simple random example would be best. Something like:

y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("loc", 1:10, sep=""), paste("pos", 1:5, sep="")))

Thomas

On Wed, Dec 16, 2009 at 04:01:18PM +0100, Maxim wrote:
> Hi,
>
> after having addressed most of my questions I have a minor problem with the
> indexing of my data. After I visualized all of my data in heatmaps I'd like
> to plot average profiles of each cluster for each factor. I added a new
> (1st) column to my data indicating the cluster ID. Right now I do this
> like:
>
> myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3], sep="_")
> rownames(impDF) <- myrownames
> colnames(impDF, do.NULL = TRUE, prefix = "col")
> total <- nrow(impDF)
> clusters <- subset(impDF, V1 == 1)
> clusters <- clusters[,5:45]
> clust <- as.matrix(clusters)
> a <- nrow(clust)
> sizefactor <- a/total
> col = (a/count)
> index <- matrix(1:a, nrow=count, ncol=col, byrow=T)[,-101]
>
> Now I'd like to bind the individual factors' data together in a way that I get
> a continuous profile for factors 1-5 (meanwhile 6). I do this like
>
> g <- clust[index[1,],]
> h <- clust[index[2,],]
> i <- clust[index[3,],]
> j <- clust[index[4,],]
> k <- clust[index[5,],]
> l <- clust[index[6,],]
>
> m <- cbind(g,h,i,j,k,l)
>
> walk <- seq(1, ncol(m), by=10)
> pos <- c()
> sig <- c()
> for (k in 1:ncol(m)) {
>   pos <- append(pos, rep(walk[k], nrow(m)))
>   sig <- append(sig, as.vector(m[,k]))
> }
>
> par(mar = c(5, 4, 4, 2) + 0.3)
> avg <- apply(m, 2, mean)
> plot(avg, col=0, lty=2, lwd=1)
> lines(avg, col=8, lwd=2, type = "l")
>
> Question 1: the cbind portion looks a bit complicated, how can this be
> substituted by simpler code?
> Question 2: each time I do the plot, I get circles plotted, irrespective of
> what I choose for lty (therefore I plot with col=0 and then do lines(...)).
> What is the explanation for this strange behavior?
>
> Maxim
>
> 2009/12/13 Thomas Girke <thomas.girke@ucr.edu>
>
> > Maxim,
> >
> > Certainly, the sorted heatmap example is only useful for visualization
> > purposes, while the clustering example is included to show how to perform
> > clustering on your data in R in general, not to provide a final solution
> > for your research problem.
> >
> > If your intention is to cluster the final clustering results of your
> > factors then you could compute a similarity/distance metric among their
> > clustering results using a set similarity measure such as the Jaccard
> > partitioning index. This way you can cluster entire clustering data
> > sets. For hierarchical clustering results you would first need to
> > generate discrete clusters from the dendrograms. R's cutree function is
> > very useful for this. Some examples and links to related R libraries for
> > clustering partitioning results are available here:
> >
> > http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#clustering_jaccard
> >
> > Similarly, you could compute membership similarities among the clusters
> > obtained for all your factors using again the Jaccard partitioning index
> > or the variation of information criterion. This could be used to cluster
> > all the groupings obtained from your clustering results ("clustering of
> > clusters").
> >
> > Since I do not know enough about your data sets or the genomics technology
> > you are using in this case, I am not sure if any of this makes much sense.
> > But perhaps it is useful for exploring your options...
> >
> > Thomas
> >
> > On Sun, Dec 13, 2009 at 10:28:11PM +0100, Maxim wrote:
> > > This works nicely.
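A minimal sketch of the kind of random example requested above, which also touches on the two questions in the quoted message. Everything here is an assumption for illustration: the sizes `n_factors` and `n_sites`, the stand-in matrix `clust`, and the layout of `index` (one row of row numbers per factor) are made up, not taken from the original data.

```r
## Hypothetical stand-in data: 6 factors x 10 sites, 5 positions each.
## Rows are stacked factor by factor, as in the 'clust' matrix of the question.
set.seed(1)
n_factors <- 6    # assumption for the example
n_sites   <- 10   # assumption for the example
clust <- matrix(rnorm(n_factors * n_sites * 5), n_factors * n_sites, 5,
                dimnames = list(paste("loc", 1:(n_factors * n_sites), sep = ""),
                                paste("pos", 1:5, sep = "")))

## One row of 'index' per factor, holding that factor's row numbers
index <- matrix(seq_len(nrow(clust)), nrow = n_factors, byrow = TRUE)

## Question 1: bind all factor blocks side by side without naming each one
blocks <- lapply(seq_len(nrow(index)),
                 function(i) clust[index[i, ], , drop = FALSE])
m <- do.call(cbind, blocks)

## Question 2: plot() draws points by default (type = "p"), and lty only
## affects lines, so ask for a line plot explicitly with type = "l"
avg <- colMeans(m)
plot(avg, type = "l", lty = 2, lwd = 2, col = 8,
     xlab = "factor / position", ylab = "mean signal")
```

With `do.call(cbind, ...)` the code no longer depends on the number of factors, and `colMeans(m)` gives the same values as `apply(m, 2, mean)`.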

ADD REPLY • link 15.8 years ago Thomas Girke ★ 1.7k