Question: WGCNA: cleaning for sample outliers
gravatar for GENOMIC_region
11 months ago by
GENOMIC_region0 wrote:

Hi there,  

I'm working with 363 samples with 10K genes. My workflow is: load data, transpose, get gene names. use hclust. Plot data once and see where abline has to be drawn. I draw plot for clustering with eye-balled abline.

I'm lost with cutheight and min size while cleaning samples. Below are code and my doubts:

data_exprs.cleaned<[, -c(1)])); #remove gene column

#add row names, and col names
names(data_exprs.cleaned) = exprs_data$gene
rownames(data_exprs.cleaned) = names(exprs_data)[-c(1)]

#check data for excessive missing values and identi_cation of outlier microarray
gsg = goodSamplesGenes(data_exprs.cleaned, verbose = 3);

#--everything OK with mapped genes
if (!gsg$allOK)
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(data_exprs.cleaned)[!gsg$goodGenes], collapse = ", ")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(data_exprs.cleaned)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending genes and samples from the data:
data_exprs.cleaned= data_exprs.cleaned[gsg$goodSamples, gsg$goodGenes]

#Check outliers
sampleTree = hclust(dist(data_exprs.cleaned ), method = "average"); #do clustering 

# Plot the sample tree: 
# The user should change the dimensions if the window is too large or too small.

par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)
abline(h=90, col = "red")

But now comes the foggy part:

labels_min10 = cutreeStatic(sampleTree, cutHeight = 90,minSize=10)

  0   1   2   3
  3 298  34  28

labels_def = cutreeStatic(sampleTree, cutHeight = 90) #min size is 50
  0   1
 65 298

I lose 65 samples (throwing samples with label as 0) with cutheight 90 which is ~20% of input sample size with min size as 50. Don't know what to do.?

Also, I cannot decide on how the min size of cluster works here. Following are my questions and doubts:

  1. Does it mean I drop samples that have cluster size less than N (50,10) below cutHeight? 
  2. What do labels 2 and 3 tell for labels_min10 ?

Tutorial link:


microarray wgcna gene network • 428 views
ADD COMMENTlink modified 11 months ago by Bioconductor Community ♦♦ 0 • written 11 months ago by GENOMIC_region0

1. Yes, the threshold is to remove samples that are "outliers", that too few follow the same pattern to be reliable

2.Each label is a group of samples, so there are two other groups (besides group 1) that behave differently

ADD REPLYlink written 11 months ago by Lluís Revilla Sancho530

Hi Lluis,  

Thank you very much. That helps. :)

ADD REPLYlink written 11 months ago by GENOMIC_region0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 214 users visited in the last hour