Question: Algorithm for between and within microarray normalisation (for meta-analysis).
2.8 years ago by
StephK70 wrote:

I apologise if this is not the right place to ask; please advise.

The aim of my experiment is to look at gene expression changes in healthy ageing. I therefore want all of the data sets from GEO that studied gene expression at different ages (either comparing young to old, or as a time series), and then I want to remove all of the non-wild-type, non-healthy and mutant samples, so that I am left with gene expression data for healthy individuals in human, rat and mouse at different time points.

So my specific question is how to correctly build a meta-data-set from different data sets in GEO, and do all necessary filtering/normalising etc to make the expression data from the samples comparable both within and between data sets. So this is what I think I need to do:

  1. In NCBI GEO DataSets, I searched for age[subset_variable_type], which returned 209 ageing-related data sets. I downloaded the summary file, in which I can see the GSEXXXX number for each GDSXXXX number.

  2. Using a Python script and FTP, I can pull down the CEL files for each GDS file (e.g. one FTP address would be /geo/series/GSE52nnn/GSE52550/suppl; I cd to suppl and get GSE52550_RAW.tar). When I untar the RAW file, there is a CEL file for each sample.

  3. I tried to look at a CEL file in Excel, but I think it is not human-readable.

  4. Next I want to normalise the data within each data set, between samples. I have read that I can do this by reading all of the CEL files into R and running the rma() function from the Bioconductor affy package.

  5. Then I will run hierarchical clustering and PCA to identify potential outliers in each data set, using hclust() and prcomp() in R. From my brief reading about the software, I think that for this I read in all of the GSMXXXX files for one particular GSEXXXX data set at a time.

  6. The GDS_full.soft files have all of the probes (ID_ref) and the Entrez Gene Name (GeneID). I make an Excel sheet with three columns: probe name, gene name and the inter-quartile range (IQR) of each probe's expression values across the samples in the data set. For each gene, I select the probe that has the largest IQR across all of the samples in that data set.

  7. Using this, I can make an Excel spreadsheet with one row per gene and one column per sample, where each cell is a normalised expression value. At this point, I have the data normalised within each data set. So I have 209 gene/sample matrices, but I have not yet tried to combine the data between different data sets.
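For what it's worth, the probe selection in step 6 can be sketched in a few lines of Python rather than in a spreadsheet. This is only a minimal sketch using the standard library: the probe and gene IDs are made up, the input layout (a dict of probe to per-sample values) is an assumption, and the quartile method ("inclusive") is one of several reasonable choices.

```python
from statistics import quantiles


def iqr(values):
    """Inter-quartile range (Q3 - Q1) of a list of expression values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def pick_probe_per_gene(expr, probe_to_gene):
    """For each gene, keep the probe whose values have the largest IQR.

    expr:          {probe_id: [normalised value per sample]}
    probe_to_gene: {probe_id: gene_id}
    Returns {gene_id: probe_id}.
    """
    best = {}  # gene_id -> (iqr, probe_id)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe has no gene annotation, skip it
        score = iqr(values)
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, probe)
    return {gene: probe for gene, (_, probe) in best.items()}


# Toy data: hypothetical probe IDs and made-up expression values.
expr = {
    "probe_a": [5.0, 5.1, 5.0, 5.2],
    "probe_b": [4.0, 6.0, 5.0, 7.0],
    "probe_c": [8.0, 8.5, 9.0, 9.5],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}

selected = pick_probe_per_gene(expr, probe_to_gene)
print(selected)
```

Here GENE1 has two probes and the more variable one (probe_b) is kept. Ties are resolved in favour of the probe seen first, which is an arbitrary choice.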

So now, if I do the above steps separately for each data set, are the data sets comparable with each other? I.e. can I compare the expression value of gene 1 in data set 1 with the expression value of gene 1 in data set 2? In that case, is the next step to make a matrix with all of the genes as rows and all of the samples from all of the data sets combined as columns? Then each cell is either an expression value or, if a particular gene wasn't in a particular data set, I can just assign "-" to that cell for all of the affected samples. What quality filtering do I do next to check the integrity of this data set?
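The combined matrix described above could be built roughly as follows. This is a sketch of the bookkeeping only, with hypothetical data set, gene and sample IDs; it uses NaN instead of "-" for genes that a data set did not measure, and it does nothing to make the values comparable between data sets.

```python
import math


def combine_matrices(datasets):
    """Merge per-data-set tables into one gene x sample matrix.

    datasets: {dataset_id: {gene_id: {sample_id: value}}}
    Returns (genes, samples, matrix), where matrix[i][j] is the value of
    genes[i] in samples[j], or NaN if that gene was not measured in the
    data set that samples[j] belongs to.
    """
    genes = sorted({gene for table in datasets.values() for gene in table})

    samples = []  # column order: data set by data set
    owner = {}    # sample_id -> dataset_id
    for ds_id, table in datasets.items():
        ds_samples = sorted({s for row in table.values() for s in row})
        for sample in ds_samples:
            samples.append(sample)
            owner[sample] = ds_id

    matrix = [
        [datasets[owner[s]].get(gene, {}).get(s, math.nan) for s in samples]
        for gene in genes
    ]
    return genes, samples, matrix


# Toy input: IDs and values are invented for illustration.
datasets = {
    "GSE_A": {
        "GENE1": {"GSM_1": 7.1, "GSM_2": 7.4},
        "GENE2": {"GSM_1": 3.2, "GSM_2": 3.0},
    },
    "GSE_B": {
        "GENE1": {"GSM_3": 9.8},
    },
}

genes, samples, matrix = combine_matrices(datasets)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Keeping a record of which data set each column came from (the `owner` dict) is useful later, because any between-study correction needs that grouping.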

Basically, I would appreciate it if someone could (1) confirm that I am processing each of the data sets correctly, (2) confirm or tell me how to combine all of the different data sets together and (3) since I want to remove some samples from each data set (i.e. I only want the healthy individuals from each data set), tell me whether I should do all of the quality filtering on the full data sets and then remove the samples at the end, or remove the samples at the start before I quality filter the rest of the data set.


modified 4 months ago by Bioconductor Community • written 2.8 years ago by StephK70
Answer: Algorithm for between and within microarray normalisation (for meta-analysis).
2.8 years ago by
alexvpickering110 wrote:

My recommendation is that you use the crossmeta package, which I wrote specifically for the purpose of doing cross-platform and cross-species meta-analyses of public microarray data from GEO.

If you try to do it from scratch, you will run into a multitude of issues including:

  1. Annotation:
    •  Different species use different identifiers for homologous genes.
    •  Raw data will give you the intensity of the measured probes. These probes must be mapped to genes.
    •  Single probes can map to multiple genes. Many probes can map to the same gene.
    •  Probe identifiers are different between platforms. As such, different probe-to-gene maps are needed for each platform.
  2. Raw Data Formats:
    • You mentioned CEL files, which are from Affymetrix. These files need to be parsed in order to get intensity values for each probe. Illumina and Agilent (the other major manufacturers) use different formats for their raw data.
  3. Differential Expression:
    • Raw intensity values within the same group are comparable within a single study (after background correction, and normalisation). However, intensity values for similar groups across studies are unlikely to be comparable. It's more likely that differences in expression between two groups (treatment vs control) will be comparable between studies.
  4. Meta-analysis:
    • Once you have differences in expression for all genes in all studies, then you need to combine these results somehow. One approach (my preference) is to combine effect size values (see here).
    • Not all genes will be measured in each study. Do you analyse only those genes common to all studies (a requirement of most currently available software for meta-analysis)? This would discard most of your data. With crossmeta, you can specify the fraction of studies in which a gene must have been measured in order to be included in the analysis.
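To make points 3 and 4 concrete, here is a rough Python sketch of combining per-study effect sizes while only keeping genes measured in a minimum fraction of studies. It is not crossmeta's implementation: the choice of Hedges' g with fixed-effect inverse-variance pooling, the function names, the `min_fraction` parameter and the toy values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def hedges_g(test, control):
    """Standardised mean difference (Hedges' g) and its variance.

    Needs at least two samples per group and non-zero pooled variance.
    """
    n_t, n_c = len(test), len(control)
    df = n_t + n_c - 2
    pooled_sd = sqrt(((n_t - 1) * variance(test)
                      + (n_c - 1) * variance(control)) / df)
    d = (mean(test) - mean(control)) / pooled_sd
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor
    var_d = (n_t + n_c) / (n_t * n_c) + d * d / (2 * (n_t + n_c))
    return j * d, j * j * var_d


def pool_fixed(effects):
    """Inverse-variance weighted mean of (effect, variance) pairs."""
    weights = [1 / var for _, var in effects]
    total = sum(weights)
    pooled = sum(w * g for w, (g, _) in zip(weights, effects)) / total
    return pooled, 1 / total


def meta_analyse(studies, min_fraction=0.5):
    """studies: {study_id: {gene_id: (test_values, control_values)}}.

    Returns {gene_id: (pooled_effect, pooled_variance)} for genes that
    were measured in at least min_fraction of the studies.
    """
    all_genes = {gene for table in studies.values() for gene in table}
    result = {}
    for gene in sorted(all_genes):
        effects = [hedges_g(*table[gene])
                   for table in studies.values() if gene in table]
        if len(effects) / len(studies) >= min_fraction:
            result[gene] = pool_fixed(effects)
    return result


# Toy data: invented study IDs, gene IDs and expression values.
studies = {
    "study_1": {"GENE1": ([8, 9, 10], [5, 6, 7]),
                "GENE2": ([4.0, 4.2, 4.1], [4.1, 4.0, 4.3])},
    "study_2": {"GENE1": ([7.5, 9, 10.5], [5, 6, 7])},
    "study_3": {"GENE1": ([8, 9, 10], [6, 6.5, 7.5])},
}

result = meta_analyse(studies, min_fraction=0.5)
for gene, (effect, var) in result.items():
    print(gene, round(effect, 3), round(var, 3))
```

In this toy input GENE2 is measured in only one of three studies, so it is dropped at `min_fraction=0.5`, while GENE1 is pooled across all three. A random-effects model would be a common alternative when studies are heterogeneous.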

crossmeta intelligently handles all the above issues, making meta-analysis straightforward. The basic workflow is:


library(crossmeta)

# studies from GEO
gse_names  <- c("GSE9601", "GSE15069")

# get raw data for specified studies
get_raw(gse_names)

# load and annotate raw data
esets <- load_raw(gse_names)

# perform differential expression analysis
anals <- diff_expr(esets)

# perform meta-analysis
es <- es_meta(anals)

# contribute your results
contribute(anals, "subject_of_analysis")


If crossmeta is useful to you, please contribute your meta-analysis signature (I wrote crossmeta without pay - return the favor!). Your contribution will be used to build a public database of microarray meta-analyses.

You can also check out my blog, where I have some posts that evaluate the benefits achieved through meta-analysis with crossmeta, as well as some potential applications of the signature that you obtain (longevity drugs!).

modified 2.8 years ago • written 2.8 years ago by alexvpickering110

Hi Alex,

Can you please indicate whether crossmeta could be used to compare gene expression in a selection of samples from different series?

I need to study gene expression between two classes, A (4 samples) and B (6 samples). The samples of class A belong to 4 different series (GSE44711, GSE22526, GSE25906). The samples of class B belong to GSE12767.

Thanks, Kaouther

written 10 weeks ago

Hi Kaouther,

I don't think that will work. There have to be both control and test samples within each study.
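As a quick illustration of that requirement, a check like the following Python sketch shows why this layout fails. The function name is hypothetical and the per-series sample counts for class A are assumed (the question only gives the totals); this is not part of crossmeta.

```python
def studies_with_both_groups(sample_groups, required=("control", "test")):
    """Return the studies that contain at least one sample of every group.

    sample_groups: {study_id: [group label for each sample]}
    """
    needed = set(required)
    return [study for study, labels in sample_groups.items()
            if needed <= set(labels)]


# Layout from the question: class A and class B come from different series.
# How the 4 class A samples split across series is assumed for illustration.
sample_groups = {
    "GSE44711": ["A", "A"],
    "GSE22526": ["A"],
    "GSE25906": ["A"],
    "GSE12767": ["B"] * 6,
}

usable = studies_with_both_groups(sample_groups, required=("A", "B"))
print(usable)  # prints []
```

No series contains both classes, so there is no within-study contrast to compute and nothing to combine across studies.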

written 10 weeks ago by alexvpickering110