I have microarray gene expression data from GEO and I would like to find out how many samples have high expression of Gene A.
This is the method I am currently following: 1) Download the raw CEL files. 2) Normalize with RMA. 3) Cluster the samples using a heatmap based on Gene A. 4) Looking at the clusters in the heatmap, pick the samples that have high, intermediate, and low expression of Gene A.
1) Is there any other method to identify the percentage of samples with high expression of Gene A in a given dataset?
2) How can I answer the same question using datasets from multiple studies?
Example: If I have 10 studies from GEO, all of them microarray data, a) should I cluster each dataset individually, find the percentage of samples in each study, and take an average, or b) should I merge the datasets using ComBat or limma and then cluster to find the percentage of samples in which Gene A is highly expressed?
Thanks for your response.
My situation is this: I have 2 datasets. The expression range of GeneA in dataset1 is 2 to 6, and the expression range of GeneA in dataset2 is 5 to 12.
If I go by the quantile method, a sample with an expression value of 6 will be marked as high expression in dataset1, whereas in dataset2 it will be marked as low expression.
Should I consider a patient with an expression value of 6 as high expression or low expression?
Is there a way to get a reference range (minimum and maximum) for GeneA?
My primary objective is to create a cluster of samples with high expression of GeneA from multiple datasets.
Expression values aren't directly comparable between datasets. The expression numbers are just relative indications of the underlying gene expression within an experiment, and don't really have any inherent meaning (e.g., a 2 in one experiment isn't necessarily larger or smaller than a 5 in another experiment).
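To make the point concrete, here is a minimal sketch that turns each value into a rank within its own dataset. The numbers are made up to match the ranges mentioned in the question (2 to 6 and 5 to 12), and `percentile_ranks` is a hypothetical helper, not a function from any package.

```python
def percentile_ranks(values):
    """Fraction of samples in the same dataset with a value <= each value."""
    n = len(values)
    return [sum(other <= v for other in values) / n for v in values]

# Hypothetical GeneA values, chosen only to match the ranges in the question.
dataset1 = [2.0, 3.0, 4.0, 5.0, 6.0]
dataset2 = [5.0, 6.0, 8.0, 10.0, 12.0]

# The same raw value (6.0) sits at a different relative position in each dataset.
rank_in_1 = percentile_ranks(dataset1)[dataset1.index(6.0)]
rank_in_2 = percentile_ranks(dataset2)[dataset2.index(6.0)]
print(rank_in_1, rank_in_2)
```

Here the value 6 is the top sample in dataset1 but in the lower half of dataset2, so the raw number alone says nothing about "high" or "low".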
I would probably just sort the samples into, say, tertiles within each experiment and call them low/medium/high.
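A rough sketch of that idea, assuming the GeneA values have already been normalized within each dataset. The sample IDs and values are invented, `tertile_labels` is a hypothetical helper, and the "inclusive" cut-point method is just one reasonable choice.

```python
from statistics import quantiles

def tertile_labels(expression):
    """Map sample ID -> 'low'/'medium'/'high' using cut points from this dataset only."""
    lower, upper = quantiles(expression.values(), n=3, method="inclusive")
    labels = {}
    for sample, value in expression.items():
        if value <= lower:
            labels[sample] = "low"
        elif value <= upper:
            labels[sample] = "medium"
        else:
            labels[sample] = "high"
    return labels

# Hypothetical GeneA values per sample, one dict per dataset.
datasets = {
    "dataset1": {"d1_s1": 2.0, "d1_s2": 3.0, "d1_s3": 3.5,
                 "d1_s4": 4.5, "d1_s5": 5.0, "d1_s6": 6.0},
    "dataset2": {"d2_s1": 5.0, "d2_s2": 6.0, "d2_s3": 7.5,
                 "d2_s4": 9.0, "d2_s5": 10.5, "d2_s6": 12.0},
}

# Label within each dataset, then collect the "high" samples across datasets.
high_samples = [
    (name, sample)
    for name, expression in datasets.items()
    for sample, label in tertile_labels(expression).items()
    if label == "high"
]
print(high_samples)
```

The labels are assigned before anything is pooled, so the combined "high" group is built from each sample's relative position in its own experiment rather than from raw values.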