Hi there,
I'm working with 363 samples and about 10K genes. My workflow is: load the data, transpose it, set the gene names, and run hclust. I plot the dendrogram once to see where the abline should go, then redraw the clustering plot with an eyeballed abline.
I'm lost with cutHeight and minSize while cleaning samples. Below are my code and my doubts:
library(WGCNA) # goodSamplesGenes, printFlush, cutreeStatic
library(Cairo) # CairoJPEG
exprs_data<-read.table("complete_genes_mapped",header=TRUE)
data_exprs.cleaned<-as.data.frame(t(exprs_data[, -c(1)])); #remove gene column
#add row names, and col names
names(data_exprs.cleaned) = exprs_data$gene
rownames(data_exprs.cleaned) = names(exprs_data)[-c(1)]
#check data for excessive missing values and identification of outlier microarray samples
gsg = goodSamplesGenes(data_exprs.cleaned, verbose = 3);
#--everything OK with mapped genes
if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0)
    printFlush(paste("Removing genes:", paste(names(data_exprs.cleaned)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0)
    printFlush(paste("Removing samples:", paste(rownames(data_exprs.cleaned)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  data_exprs.cleaned= data_exprs.cleaned[gsg$goodSamples, gsg$goodGenes]
}
#Check outliers
sampleTree = hclust(dist(data_exprs.cleaned ), method = "average"); #do clustering
# Plot the sample tree:
# The user should change the dimensions if the window is too large or too small.
CairoJPEG("sample_outliers_tree.jpeg",width=1200,height=900)
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)
abline(h=90, col = "red")
dev.off()
But now comes the foggy part:
labels_min10 = cutreeStatic(sampleTree, cutHeight = 90, minSize = 10)
table(labels_min10)
labels
  0   1   2   3
  3 298  34  28
labels_def = cutreeStatic(sampleTree, cutHeight = 90) #min size is 50
table(labels_def)
labels
  0   1
 65 298
With cutHeight 90 and the default minSize of 50, I lose 65 samples (the ones labelled 0), which is about 18% of the input samples. I don't know what to do.
I also don't understand how the minimum cluster size works here. My questions and doubts:
- Does it mean that samples are dropped when the cluster they fall into below cutHeight has fewer than N (50 or 10) members?
- What do labels 2 and 3 mean in labels_min10?
Tutorial link: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf

1. Yes. The threshold removes samples that are "outliers": branches below cutHeight with fewer than minSize samples are too small for their pattern to be considered reliable, and those samples get label 0.
2. Each label is a group of samples, so with minSize = 10 there are two other groups (besides group 1) that behave differently.
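To make the two points above concrete, here is a minimal base-R sketch of the labelling rule on toy data. It is not WGCNA's own code: static_labels is a hypothetical helper written only for this illustration, and it assumes the rule is "cut the tree at cutHeight, give label 0 to every branch with fewer than minSize samples, and number the remaining branches 1, 2, ... from largest to smallest". The toy data, group sizes and cut height are made up.

```r
# Hypothetical helper approximating the labelling rule (an assumption, not WGCNA source).
static_labels <- function(tree, cut_height, min_size) {
  branch <- cutree(tree, h = cut_height)            # branch id for every sample
  sizes  <- sort(table(branch), decreasing = TRUE)  # branch sizes, largest first
  kept   <- names(sizes)[sizes >= min_size]         # branches large enough to keep
  labels <- match(as.character(branch), kept)       # 1 = largest kept branch, 2 = next, ...
  labels[is.na(labels)] <- 0L                       # samples on too-small branches get 0
  labels
}

# Toy data: three well-separated groups of 30, 12 and 3 samples (5 features each).
set.seed(1)
toy <- rbind(
  matrix(rnorm(30 * 5, mean = 0,  sd = 0.5), ncol = 5),
  matrix(rnorm(12 * 5, mean = 20, sd = 0.5), ncol = 5),
  matrix(rnorm(3  * 5, mean = 40, sd = 0.5), ncol = 5)
)
tree <- hclust(dist(toy), method = "average")

# Same cut height, two different minimum sizes.
labels_min10 <- static_labels(tree, cut_height = 10, min_size = 10)
labels_min20 <- static_labels(tree, cut_height = 10, min_size = 20)
table(labels_min10)
table(labels_min20)
```

With min_size = 10 only the 3-sample group ends up with label 0 and the 12-sample group becomes label 2. With min_size = 20 the 12-sample group is also too small, so it is folded into label 0. That is the same effect as going from minSize = 10 to the default of 50 in the question: groups 2 and 3 (34 and 28 samples) drop below the threshold and join the 3 samples already labelled 0.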
Hi Lluis,
Thank you very much. That helps. :)
Is it advisable to keep the samples that fall into groups other than cluster 1?