Question: WGCNA: cleaning for sample outliers
9 months ago, GENOMIC_region wrote:

Hi there,  

I'm working with 363 samples and 10K genes. My workflow is: load the data, transpose it, set the gene names, run hclust, plot the dendrogram once to see where the abline should go, then re-plot the clustering with an eyeballed abline.

I'm lost with cutHeight and minSize while cleaning samples. Below are my code and my doubts:

data_exprs.cleaned <- as.data.frame(t(exprs_data[, -c(1)])); #remove gene column

#add row names, and col names
names(data_exprs.cleaned) = exprs_data$gene
rownames(data_exprs.cleaned) = names(exprs_data)[-c(1)]

#check data for excessive missing values and identification of outlier microarray samples
gsg = goodSamplesGenes(data_exprs.cleaned, verbose = 3);

#--everything OK with mapped genes
if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0)
    printFlush(paste("Removing genes:", paste(names(data_exprs.cleaned)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0)
    printFlush(paste("Removing samples:", paste(rownames(data_exprs.cleaned)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  data_exprs.cleaned = data_exprs.cleaned[gsg$goodSamples, gsg$goodGenes]
}

#Check outliers
sampleTree = hclust(dist(data_exprs.cleaned ), method = "average"); #do clustering 

# Plot the sample tree: 
# The user should change the dimensions if the window is too large or too small.

par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)
abline(h=90, col = "red")

But now comes the foggy part:

labels_min10 = cutreeStatic(sampleTree, cutHeight = 90, minSize = 10)
table(labels_min10)

  0   1   2   3
  3 298  34  28

labels_def = cutreeStatic(sampleTree, cutHeight = 90) #default minSize is 50
table(labels_def)

  0   1
 65 298

With cutHeight = 90 and the default minSize = 50, I lose 65 samples (those labelled 0), which is about 18% of my 363 input samples. I don't know what to do.

Also, I cannot work out how the minimum cluster size (minSize) works here. My questions are:

  1. Does it mean I drop samples whose cluster below cutHeight has fewer than N members (N = 50 or 10)?
  2. What do labels 2 and 3 mean for labels_min10?

Tutorial link:


microarray wgcna gene network

1. Yes. The threshold is there to remove samples that are "outliers": clusters with fewer than minSize samples are labelled 0, because too few samples follow the same pattern for the group to be reliable.

2. Each label is a group of samples, so there are two other groups (besides group 1) that behave differently.
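
A minimal base-R sketch of the behaviour described in the two points above, run on toy one-dimensional data rather than the poster's expression matrix. `static_cut_sketch` is a hypothetical helper written only for illustration; it mimics the idea of a static cut with a minimum cluster size and is not WGCNA's `cutreeStatic`. The toy values, the cut height of 10 and the two minimum sizes are all assumptions chosen to make the effect visible.

```r
# Hypothetical helper: cut the tree at a fixed height, keep clusters with at
# least min_size members (numbered 1, 2, ... from largest to smallest) and
# give every other sample the label 0.
static_cut_sketch <- function(tree, cut_height, min_size) {
  branches <- cutree(tree, h = cut_height)        # raw cluster id per sample
  sizes <- table(branches)                        # number of samples per cluster
  big <- names(sizes)[sizes >= min_size]          # clusters large enough to keep
  big <- big[order(-as.integer(sizes[big]))]      # largest cluster first
  labels <- match(as.character(branches), big, nomatch = 0L)
  names(labels) <- names(branches)
  labels
}

# Toy data: a group of 12 samples, a group of 6, and two isolated samples.
x <- c(seq(0, 1.1, length.out = 12), seq(50, 50.5, length.out = 6), 100, 200)
toy <- matrix(x, ncol = 1, dimnames = list(paste0("S", seq_along(x)), "gene1"))
tree <- hclust(dist(toy), method = "average")

# Small minimum size: the group of 6 survives as label 2.
labels_small <- static_cut_sketch(tree, cut_height = 10, min_size = 5)
print(table(labels_small))

# Larger minimum size: the group of 6 is folded into label 0.
labels_large <- static_cut_sketch(tree, cut_height = 10, min_size = 10)
print(table(labels_large))

# Keeping only the main cluster versus keeping every labelled cluster.
toy_main <- toy[labels_small == 1, , drop = FALSE]
toy_all_labelled <- toy[labels_small != 0, , drop = FALSE]
```

This mirrors the two tables in the question: raising the minimum size does not change the tree, it only moves the smaller groups into label 0. Whether samples in groups 2 and 3 should be kept depends on whether they are a real biological or batch group, which the clustering alone cannot tell you.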

written 9 months ago by Lluís Revilla Sancho

Hi Lluis,  

Thank you very much. That helps. :)

written 9 months ago by GENOMIC_region