How to select top 10% highly variable genes in microarray data?
Biologist ▴ 110

I am working with microarray data and use the "oligo" R package for background correction and normalisation of the expression values. After normalisation I want to calculate Z-scores to generate a heatmap.

As there are around 25,000 genes with expression values in the matrix, I want to create the heatmap with only the top 10% most variable genes.

I am looking for the best statistical way to select the top 10% most variable genes for the heatmap.

After some Google searching I found the following:

"normdata" is a matrix with 25,000 genes after background correction and normalisation.

        x <- apply(normdata, 1, IQR) #Calculate IQR
        y <- normdata[x > quantile(x, 0.9), ] #selecting top 10% highly variable genes

Do you think the above code is the right way to select the top 10% most variable genes?

Thank you

r microarray snp6.0 oligo • 3.7k views

That's a way to do it, so long as you also account for NA values. Or you could use varFilter in the genefilter package, which will be much faster.

> library(genefilter)
> z <- matrix(rnorm(1e6), ncol = 10)
> system.time(varFilter(z, var.cutoff = 0.9))
   user  system elapsed
   0.05    0.00    0.05

> fun <- function(z){y <- apply(z, 1, IQR); z[y > quantile(y, 0.9),]}
> system.time(fun(z))
   user  system elapsed
   6.08    0.00    6.14

But even with 1e5 'genes' your way only requires you to wait six extra seconds...
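
On the NA point: as far as I know, IQR() stops with an error if a row contains NA unless you pass na.rm = TRUE, so the apply() version needs a small tweak for data with missing values. Here is a minimal sketch of that; the helper name select_top_variable and its top_frac argument are made up for illustration, and the toy matrix just stands in for "normdata".

```r
# Hypothetical helper (not from the thread): keep the top fraction of rows
# ranked by IQR, ignoring missing values within each row.
select_top_variable <- function(mat, top_frac = 0.1) {
  # Per-gene IQR; na.rm = TRUE is passed through apply() to IQR()
  row_iqr <- apply(mat, 1, IQR, na.rm = TRUE)
  # Cutoff at the (1 - top_frac) quantile of the per-gene IQRs
  cutoff <- quantile(row_iqr, probs = 1 - top_frac, na.rm = TRUE)
  # Rows that are entirely NA get an NA IQR and are dropped here
  keep <- !is.na(row_iqr) & row_iqr > cutoff
  mat[keep, , drop = FALSE]
}

# Toy data (assumption): 1000 genes x 10 samples with a few missing values
set.seed(1)
normdata <- matrix(rnorm(1000 * 10), ncol = 10,
                   dimnames = list(paste0("gene", 1:1000), paste0("s", 1:10)))
normdata[sample(length(normdata), 50)] <- NA

top <- select_top_variable(normdata)
dim(top)
```

The drop = FALSE keeps the result a matrix even if only one gene passes the cutoff.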


Dear James,

Thanks for the reply. I'm not asking which is faster. I'm asking whether the code given above can be used to select the top 10% most variable genes or not.

One more question: should I select the top 10% most variable genes before or after normalisation?

Thank you

SamGG ▴ 300

I am not an expert, but IMHO your code is correct and does what you want.

Selection should take place AFTER normalisation, but if your samples are roughly similar there should not be much difference either way.

Just a word concerning the Z-score. It divides each gene by its own standard deviation, so every gene ends up with the same spread in the heatmap, and the differences in variability that the IQR selection was based on are no longer visible. I always look at row-centred data before using the Z-score.
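
To make the row-centring versus Z-score point concrete, here is a small base R sketch. The toy matrix and the names mat, centred and zscored are assumptions for illustration only, not anything from the original poster's data.

```r
set.seed(1)
# Toy matrix (assumption): 5 genes x 6 samples; gene1 has the smallest
# spread and gene5 the largest (matrix() fills column by column)
mat <- matrix(rnorm(30, sd = rep(c(0.2, 0.5, 1, 2, 4), times = 6)),
              nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("s", 1:6)))

# Row-centring: subtract each gene's mean, keep its original spread
centred <- sweep(mat, 1, rowMeans(mat), "-")

# Row Z-score: centre and divide by each gene's standard deviation;
# scale() works on columns, hence the double transpose
zscored <- t(scale(t(mat)))

apply(centred, 1, sd)  # spreads still differ between genes
apply(zscored, 1, sd)  # every gene now has standard deviation 1
```

If you plot these with base R heatmap(), note that it scales rows by default, so pass scale = "none" when the matrix has already been centred or Z-scored.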


