Question: How to find the number of patients (percentage of a given population) with high expression of gene A in given microarray data?
6 months ago by
v.yuvanesh0 wrote:

I have microarray gene expression data from GEO and I would like to find how many samples have high expression of Gene A.

This is the method I am currently following:

1) Download the raw CEL files
2) Normalize using rma
3) Cluster with a heatmap based on Gene A
4) From the heatmap clusters, pick the samples with high, low, and intermediate expression of Gene A

1) Is there any other method to identify the % of samples that have high expression of Gene A in the given dataset?

2) How to answer the same question using datasets from multiple studies?

Example: If I have 10 studies from GEO, all of them microarray data:

a) Should I cluster each dataset individually, find the % of samples in each study, and take an average? Or
b) Should I merge the datasets using ComBat or limma and then cluster to find the % of samples in which Gene A is highly expressed?

modified 6 months ago by James W. MacDonald52k • written 6 months ago by v.yuvanesh0
Answer: How to find number of patients (percentage of given population) having high expr
6 months ago by
United States
James W. MacDonald52k wrote:

It appears to me that you are over-thinking this. If you just want to know which samples are highly expressed for a single gene, why are you doing anything but looking at the expression levels of that gene? Why would you need to cluster or merge or any of that?
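As a minimal sketch of "just looking at the expression levels of that gene": the matrix below is simulated and stands in for a normalized probesets-by-samples matrix (with an ExpressionSet returned by rma you would get one via exprs()). The probeset ID "geneA_at", the sample names, and the upper-quartile cutoff are assumptions for illustration only.

```r
## Simulated stand-in for a normalized expression matrix (probesets x samples).
## "geneA_at" and the sample names are hypothetical.
set.seed(1)
expr <- matrix(rnorm(5 * 20, mean = 8, sd = 1.5), nrow = 5,
               dimnames = list(c("geneA_at", paste0("other_", 1:4, "_at")),
                               paste0("Sample_", 1:20)))

## Pull out the single row for gene A: a named numeric vector, one value per sample
GeneA <- expr["geneA_at", ]

## Samples ordered from highest to lowest expression of gene A
ranked <- sort(GeneA, decreasing = TRUE)
head(ranked)

## Illustrative cutoff (an assumption): the upper quartile within this dataset
cutoff <- quantile(GeneA, 0.75)
high <- names(GeneA)[GeneA > cutoff]

## Percentage of samples called "high" under that cutoff
pct_high <- 100 * length(high) / length(GeneA)
pct_high
```

Any cutoff defined this way is relative to the samples in this one dataset, so the percentage is fixed by the quantile you pick rather than discovered from the data.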

You could use something like cut to split it into quantiles:

> GeneA <- sample(6:12, 20, TRUE)
> cuts <- cut(GeneA, quantile(GeneA), include.lowest = TRUE)
> samples <- paste0("Sample_", 1:20)
> split(samples, cuts)
$`[6,7]`
[1] "Sample_2"  "Sample_6"  "Sample_8"  "Sample_10" "Sample_11" "Sample_12"
[7] "Sample_13" "Sample_14" "Sample_17"

$`(7,8]`
[1] "Sample_4"  "Sample_20"

$`(8,10]`
[1] "Sample_1"  "Sample_3"  "Sample_16" "Sample_18" "Sample_19"

$`(10,12]`
[1] "Sample_5"  "Sample_7"  "Sample_9"  "Sample_15"

## or if you just want low/middle/high

> cuts <- cut(GeneA, quantile(GeneA, seq(0,1,1/3)), include.lowest = TRUE)
> split(samples, cuts)
$`[6,7]`
[1] "Sample_2"  "Sample_6"  "Sample_8"  "Sample_10" "Sample_11" "Sample_12"
[7] "Sample_13" "Sample_14" "Sample_17"

$`(7,9]`
[1] "Sample_1"  "Sample_4"  "Sample_18" "Sample_19" "Sample_20"

$`(9,12]`
[1] "Sample_3"  "Sample_5"  "Sample_7"  "Sample_9"  "Sample_15" "Sample_16"


But maybe it's more complicated than that? In which case, please expound.

My situation is that I have 2 datasets: the expression range of GeneA is 2 to 6 in dataset1 and 5 to 12 in dataset2.

If I use the quantile method, a sample with an expression value of 6 will be marked as high expression in dataset1, whereas in dataset2 it will be marked as low expression.

Should I consider a patient with an expression value of 6 as high expression or low expression?

Is there a way I can get a range of GeneA (minimum and maximum)?

My primary objective is to create a cluster of samples that have high expression of GeneA from multiple datasets.

Expression values aren't directly comparable between data sets. The expression numbers are just relative indications of the underlying gene expression within an experiment, and don't really have any inherent meaning (e.g., a 2 in one experiment isn't necessarily larger or smaller than a 5 in another experiment).

I would probably just sort into, say, tertiles within each experiment and call them low/medium/high.
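A rough sketch of that per-experiment approach, using simulated values whose ranges mimic the two datasets described above (about 2 to 6 and about 5 to 12). The helper name tertile_label, the sample names, and the dataset sizes are all hypothetical.

```r
## Simulated GeneA values only; the ranges mirror the two datasets described above
set.seed(123)
dataset1 <- runif(30, min = 2, max = 6)
dataset2 <- runif(45, min = 5, max = 12)
names(dataset1) <- paste0("D1_Sample_", seq_along(dataset1))
names(dataset2) <- paste0("D2_Sample_", seq_along(dataset2))
datasets <- list(dataset1 = dataset1, dataset2 = dataset2)

## Hypothetical helper: label samples low/medium/high by tertile WITHIN one dataset
tertile_label <- function(x) {
  cut(x, quantile(x, seq(0, 1, 1/3)), include.lowest = TRUE,
      labels = c("low", "medium", "high"))
}

## Apply the helper to each dataset separately; raw values are never compared across datasets
labels <- lapply(datasets, tertile_label)

## Counts per category in each dataset
sapply(labels, table)

## Pool the sample names that were called "high" in their own dataset
high_samples <- unlist(lapply(names(datasets), function(nm) {
  names(datasets[[nm]])[labels[[nm]] == "high"]
}))
length(high_samples)
```

Because each call is made relative to the other samples in the same experiment, a raw value near 6 can be "high" in one dataset and "low" in the other without any contradiction.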