Hi there,
I'm working with 363 samples and about 10K genes. My workflow is: load the data, transpose it, set the gene names, and run hclust. I plot the dendrogram once to see where the abline should be drawn, then redraw the clustering plot with that eyeballed abline.
I'm lost with cutHeight and minSize while cleaning samples. Below are my code and my doubts:
library(WGCNA)  # goodSamplesGenes, printFlush, cutreeStatic
library(Cairo)  # CairoJPEG

exprs_data <- read.table("complete_genes_mapped", header = TRUE)
# Remove the gene column and transpose so that rows are samples and columns are genes
data_exprs.cleaned <- as.data.frame(t(exprs_data[, -c(1)]))
# Add row names and column names
names(data_exprs.cleaned) = exprs_data$gene
rownames(data_exprs.cleaned) = names(exprs_data)[-c(1)]

# Check data for excessive missing values and identification of outlier microarray samples
gsg = goodSamplesGenes(data_exprs.cleaned, verbose = 3)  # everything OK with mapped genes
if (!gsg$allOK) {
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes) > 0)
    printFlush(paste("Removing genes:", paste(names(data_exprs.cleaned)[!gsg$goodGenes], collapse = ", ")))
  if (sum(!gsg$goodSamples) > 0)
    printFlush(paste("Removing samples:", paste(rownames(data_exprs.cleaned)[!gsg$goodSamples], collapse = ", ")))
  # Remove the offending genes and samples from the data:
  data_exprs.cleaned = data_exprs.cleaned[gsg$goodSamples, gsg$goodGenes]
}

# Check outliers: cluster the samples
sampleTree = hclust(dist(data_exprs.cleaned), method = "average")
# Plot the sample tree:
# The user should change the dimensions if the window is too large or too small.
CairoJPEG("sample_outliers_tree.jpeg", width = 1200, height = 900)
par(cex = 0.6)
par(mar = c(0, 4, 2, 0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub = "", xlab = "",
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)
abline(h = 90, col = "red")
dev.off()
But now comes the foggy part:
labels_min10 = cutreeStatic(sampleTree, cutHeight = 90, minSize = 10)
table(labels_min10)
labels
  0   1   2   3
  3 298  34  28

labels_def = cutreeStatic(sampleTree, cutHeight = 90)  # default minSize is 50
table(labels_def)
labels
  0   1
 65 298
With cutHeight 90 and the default minSize of 50 I lose 65 samples (those labelled 0), which is about 18% of the input samples. I don't know what to do.
Also, I can't work out how the minimum cluster size works here. My questions:
- Does it mean that samples in clusters with fewer than N (50, 10) members below cutHeight are dropped?
- What do labels 2 and 3 mean for labels_min10?
Tutorial link: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf
1. Yes. The threshold removes "outlier" samples: those in groups with too few members following the same pattern to be reliable.
2. Each label is a group of samples, so there are two other groups (besides group 1) that behave differently.
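To make the two points above concrete, here is a minimal base-R sketch of the labelling logic as I understand it: cut the tree at a height, give label 0 to every branch with fewer than minSize members, and number the remaining branches by decreasing size. The helper `static_cut_sketch` and the toy data are hypothetical, made up for illustration; this is not the package's own code, and `cutreeStatic` is what you would call in practice.

```r
# Toy data (made up): three well-separated groups of 12, 5 and 2 "samples" (rows)
set.seed(1)
toy <- rbind(
  matrix(rnorm(12 * 4, mean = 0,  sd = 0.1), ncol = 4),
  matrix(rnorm(5 * 4,  mean = 10, sd = 0.1), ncol = 4),
  matrix(rnorm(2 * 4,  mean = 20, sd = 0.1), ncol = 4)
)
tree <- hclust(dist(toy), method = "average")

# Hypothetical helper approximating the cutreeStatic labelling (an assumption)
static_cut_sketch <- function(dendro, cutHeight, minSize) {
  branches <- cutree(dendro, h = cutHeight)        # one branch id per sample
  sizes <- table(branches)
  size_vec <- setNames(as.vector(sizes), names(sizes))
  big <- size_vec[size_vec >= minSize]             # branches that are large enough
  ranked <- names(sort(big, decreasing = TRUE))    # largest branch becomes label 1
  # Samples on branches that are too small match nothing and get label 0
  match(as.character(branches), ranked, nomatch = 0L)
}

# minSize = 10: only the group of 12 survives, the other 7 samples get label 0
table(static_cut_sketch(tree, cutHeight = 5, minSize = 10))

# minSize = 3: the group of 5 now becomes label 2, only the pair stays at label 0
table(static_cut_sketch(tree, cutHeight = 5, minSize = 3))
```

This mirrors the pattern in the question: lowering minSize from 50 to 10 does not move the cut, it only lets the smaller branches (34 and 28 samples) keep their own labels instead of being folded into label 0.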
Hi Lluis,
Thank you very much. That helps. :)
Is it advisable to keep the samples that fall in clusters other than cluster 1?
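Whichever way that decision goes, the mechanics are just a row subset on the label vector. A minimal sketch, using the label counts reported for labels_min10 above and a placeholder data frame (`toy_exprs` and its contents are made up for illustration; in the real workflow it would be data_exprs.cleaned):

```r
# Label counts copied from the table(labels_min10) output in the question
labels <- rep(c(0, 1, 2, 3), times = c(3, 298, 34, 28))

# Placeholder expression data (made up): 363 samples x 5 genes
set.seed(42)
toy_exprs <- as.data.frame(matrix(rnorm(363 * 5), nrow = 363))

# Option A: keep only the main cluster, as the linked tutorial does
keep_main <- labels == 1
exprs_main <- toy_exprs[keep_main, ]

# Option B: keep every cluster that passed minSize and drop only label 0
keep_nonzero <- labels != 0
exprs_nonzero <- toy_exprs[keep_nonzero, ]

nrow(exprs_main)     # 298
nrow(exprs_nonzero)  # 360
```

Option A discards 65 samples and option B discards 3, so the choice depends on whether clusters 2 and 3 look like real biological groups or like a technical artefact.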