Question

How to select top 10% highly variable genes in microarray data?

Biologist ▴ 120

@biologist-9801

Last seen 5.7 years ago

Hi,

I am working with microarray data and use the "oligo" R package for background correction and normalisation of the expression values. After normalisation, I want to calculate Z-scores to generate a heatmap.

As there are around 25,000 genes with expression values in the matrix, I want to create a heatmap with only the top 10% most variable genes.

I am looking for the best statistical way to select the top 10% most variable genes for plotting the heatmap.

After some Google searching I found the following:

"normdata" is a matrix with 25,000 genes after background correction and normalisation.

x <- apply(normdata, 1, IQR)            # calculate the IQR of each gene (row)
y <- normdata[x > quantile(x, 0.9), ]   # select the top 10% most variable genes

Do you think the above code is the right way to select the top 10% most variable genes?

Thank you

r microarray snp6.0 oligo • 5.3k views

ADD COMMENT • link updated 7.9 years ago by SamGG ▴ 360 • written 7.9 years ago by Biologist ▴ 120

score 0 · Answer 1 · 2018-01-09

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 3 hours ago

United States

That's a way to do it, so long as you also account for NA values. Or you could use varFilter in the genefilter package, which will be much faster.

> library(genefilter)
> z <- matrix(rnorm(1e6), ncol = 10)
> system.time(varFilter(z, var.cutoff = 0.9))
   user  system elapsed
   0.05    0.00    0.05

> fun <- function(z){y <- apply(z, 1, IQR); z[y > quantile(y, 0.9),]}
> system.time(fun(z))
   user  system elapsed
   6.08    0.00    6.14

But even with 1e5 'genes' your way only requires you to wait six extra seconds...
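As a minimal sketch of what "accounting for NA values" could look like with the apply/quantile approach from the question: the matrix below is simulated stand-in data, and the names gene_iqr, cutoff, keep and top_var are only illustrative, not part of any package.

```r
# Simulated stand-in for the normalised matrix (genes in rows, samples in columns).
set.seed(1)
normdata <- matrix(rnorm(1000 * 10), ncol = 10,
                   dimnames = list(paste0("gene", 1:1000), paste0("s", 1:10)))
normdata[5, 3] <- NA  # one missing value, to show the NA handling

# Per-gene IQR, ignoring missing values within each row.
gene_iqr <- apply(normdata, 1, IQR, na.rm = TRUE)

# Keep genes whose IQR lies above the 90th percentile of all per-gene IQRs.
cutoff <- quantile(gene_iqr, 0.9, na.rm = TRUE)
keep <- !is.na(gene_iqr) & gene_iqr > cutoff
top_var <- normdata[keep, , drop = FALSE]

dim(top_var)
```

Without na.rm = TRUE, IQR() stops with an error as soon as a row contains a missing value, which is why the argument is passed through apply() here.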

ADD COMMENT • link 7.9 years ago James W. MacDonald 68k

Dear James,

Thanks for the reply. I'm not asking which approach is faster. I'm asking whether the code given above can be used to select the top 10% most variable genes.

One more question: should I select the top 10% most variable genes before or after normalisation?

Thank you

ADD REPLY • link 7.9 years ago Biologist ▴ 120

score 0 · Answer 2 · 2018-01-09

Hi,

I am not an expert, but IMHO your code is correct for what you want to achieve.

Selection should take place AFTER normalization, but if your samples are roughly similar there should not be much difference between selecting before and after.

Just a word concerning the Z-score: it divides each gene's values by that gene's own dispersion, so every gene appears on the same scale in the heatmap, while the IQR selection ranks genes by their unscaled spread. I always look at row-centred data before using the Z-score.
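To make the difference concrete, here is a small sketch comparing row-centred values with row Z-scores. The matrix is simulated, and top_var, centred and zscore are just illustrative names standing in for the selected genes.

```r
# Simulated stand-in for the selected genes (rows) across 6 samples (columns).
set.seed(2)
top_var <- matrix(rnorm(50 * 6, mean = 8), ncol = 6)

# Row-centred: subtract each gene's mean, keeping its original spread.
centred <- sweep(top_var, 1, rowMeans(top_var, na.rm = TRUE), "-")

# Row Z-score: centre, then divide by each gene's standard deviation.
# scale() works on columns, so transpose before and after.
zscore <- t(scale(t(top_var)))

# After Z-scoring every gene has the same spread; after centring it does not.
range(apply(centred, 1, sd))
range(apply(zscore, 1, sd))
```

Plotting centred first shows which genes really vary the most, since the Z-score puts a weakly varying gene on the same colour scale as a strongly varying one.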

Best.