Question

How to find the number (or percentage) of patients in a given population with high expression of gene A in microarray data?

v.yuvanesh • 0

@vyuvanesh-20922

Last seen 6.7 years ago

I have microarray gene expression data from GEO and I would like to find how many samples have high expression of Gene A.

This is the method I am currently following:
1) Download the raw CEL files
2) Normalize with rma
3) Cluster the samples with a heatmap based on Gene A
4) From the heatmap clusters, pick the samples with high, intermediate, and low expression of Gene A

1) Is there any other method to identify the % of samples with high expression of Gene A in a given dataset?

2) How to answer the same question using datasets from multiple studies?

Example: if I have 10 studies from GEO, all microarray data:
a) Should I cluster each dataset individually, find the % of samples with high expression in each study, and take an average? Or
b) Should I merge the datasets using ComBat or limma and then cluster the merged data to find the % of samples with high expression of Gene A?

microarray r clustering high expressed genes • 1.4k views

ADD COMMENT • link updated 6.7 years ago by James W. MacDonald 68k • written 6.7 years ago by v.yuvanesh • 0

score 0 · Answer 1 · 2019-05-31

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 7 hours ago

United States

It appears to me that you are over-thinking this. If you just want to know which samples are highly expressed for a single gene, why are you doing anything but looking at the expression levels of that gene? Why would you need to cluster or merge or any of that?

You could use something like cut to split it into quantiles:

> GeneA <- sample(6:12, 20, TRUE)
> cuts <- cut(GeneA, quantile(GeneA), include.lowest = TRUE)
> samples <- paste0("Sample_", 1:20)
> split(samples, cuts)
$`[6,7]`
[1] "Sample_2"  "Sample_6"  "Sample_8"  "Sample_10" "Sample_11" "Sample_12"
[7] "Sample_13" "Sample_14" "Sample_17"

$`(7,8]`
[1] "Sample_4"  "Sample_20"

$`(8,10]`
[1] "Sample_1"  "Sample_3"  "Sample_16" "Sample_18" "Sample_19"

$`(10,12]`
[1] "Sample_5"  "Sample_7"  "Sample_9"  "Sample_15"

## or if you just want low/middle/high

> cuts <- cut(GeneA, quantile(GeneA, seq(0,1,1/3)), include.lowest = TRUE)
> split(samples, cuts)
$`[6,7]`
[1] "Sample_2"  "Sample_6"  "Sample_8"  "Sample_10" "Sample_11" "Sample_12"
[7] "Sample_13" "Sample_14" "Sample_17"

$`(7,9]`
[1] "Sample_1"  "Sample_4"  "Sample_18" "Sample_19" "Sample_20"

$`(9,12]`
[1] "Sample_3"  "Sample_5"  "Sample_7"  "Sample_9"  "Sample_15" "Sample_16"
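To get the percentage of samples per group rather than the sample names, the same cut can be tabulated. This is only a minimal sketch on simulated values: the vector GeneA, the seed, and the group labels are assumptions made for illustration, not data from a real experiment.

```r
set.seed(1)  # fixed seed so the made-up values are reproducible

# Simulated expression values for 20 samples (assumption, not real data)
GeneA <- runif(20, min = 6, max = 12)
names(GeneA) <- paste0("Sample_", 1:20)

# Tertile cut points, labelled low/middle/high
breaks <- quantile(GeneA, probs = c(0, 1/3, 2/3, 1))
group <- cut(GeneA, breaks = breaks,
             labels = c("low", "middle", "high"),
             include.lowest = TRUE)

# Percentage of samples falling in each group
pct <- round(100 * prop.table(table(group)), 1)
pct

# Names of the samples in the high group
high_samples <- names(GeneA)[group == "high"]
high_samples
```

Note that with quantile-based cut points the share of samples in each group is fixed by construction (about a third each for tertiles, apart from ties), so the percentage only becomes informative if the cutoff is chosen some other way.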

But maybe it's more complicated than that? In which case, please expound.

ADD COMMENT • link 6.7 years ago James W. MacDonald 68k

Thanks for your response.

My situation is this: I have 2 datasets. The expression range of GeneA is 2 to 6 in dataset1 and 5 to 12 in dataset2.

If I go by the quantile method, a sample with an expression value of 6 will be marked as high expression in dataset1, whereas in dataset2 it will be marked as low expression.

Should I consider a patient with an expression value of 6 as high expression or low expression?

Is there a way to get the expected range (minimum and maximum) of GeneA?

My primary objective is to create a cluster of samples with high expression of GeneA from multiple datasets.

ADD REPLY • link 6.7 years ago v.yuvanesh • 0

Expression values aren't directly comparable between data sets. The expression numbers are just relative indications of the underlying gene expression within an experiment, and don't really have any inherent meaning (e.g., a 2 in one experiment isn't necessarily larger or smaller than a 5 in another experiment).

I would probably just sort the samples into, say, tertiles within each experiment and call them low/medium/high.
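A rough sketch of the within-experiment tertile idea, assuming each data set is available as a numeric vector of GeneA values. The two vectors below are simulated to mimic the ranges mentioned above (2 to 6 and 5 to 12), and tertile_label is a hypothetical helper written for this example, not a function from any package.

```r
set.seed(2)  # fixed seed so the made-up values are reproducible

# Simulated GeneA values for two experiments (assumption, not real data)
datasets <- list(
  dataset1 = runif(30, min = 2, max = 6),
  dataset2 = runif(42, min = 5, max = 12)
)

# Hypothetical helper: label samples by tertile within one data set
tertile_label <- function(x) {
  cut(x, breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
      labels = c("low", "medium", "high"),
      include.lowest = TRUE)
}

# Apply the labelling separately to each experiment
groups <- lapply(datasets, tertile_label)

# Pool the samples labelled "high" in their own experiment
high <- do.call(rbind, lapply(names(datasets), function(nm) {
  idx <- which(groups[[nm]] == "high")
  data.frame(dataset = nm, sample = idx,
             expression = datasets[[nm]][idx])
}))
head(high)
```

Because the labels are assigned within each experiment, a value near 6 can be "high" in dataset1 and "low" in dataset2, which is consistent with the values having no meaning across experiments.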

ADD REPLY • link 6.7 years ago James W. MacDonald 68k