Question

What type of data normalization for multiple microarrays?

Ewelina Dratkiewicz • 0

@ewelina-dratkiewicz-11421

Last seen 9.2 years ago

Poland

Hello,

I'm new to R and I have a little problem with data analysis. I was asked to create a correlation plot for 2 genes expressed in melanoma cells. I downloaded data from GEO (14 data sets, 2 types of similar microarrays), made ExpressionSets, normalized them with RMA, substracted the data for the 2 genes and compiled it into one matrix. To every sample I assigned two traits: cell type (normal, primary, metastasis and so on) and the number of the data set it was substracted from. Then I made a simple plot to see what my data look like (without normalization across data sets, Spearman's correlation coefficient is above 0.5 with a really low p-value). Now I would like to remove any differences between data sets; if I understood correctly, I should remove the batch effect with e.g. ComBat. And here's my question: should I assume one batch equals one data set, or can one data set contain several batches (differences in data collection dates and so on)? Is ComBat or SVA the best method for this particular case? And should I perform this normalization on the whole data matrices (how?) and then extract the data for my 2 genes of interest?

I'm sorry if my post is a little chaotic, but I'm still learning how to use R. I will be really grateful for your advice.

normalization microarray combat • 3.3k views

updated 9.3 years ago by manimaran_1975 ▴ 30 • written 9.3 years ago by Ewelina Dratkiewicz • 0

score 2 · Answer 1 · 2016-09-07

Hi,

Please check out the new Shiny App R-package called BatchQC, which will let you easily do what you want. You can adjust for Batch using ComBat or SVA and compare the results, all with a click of a few buttons. Please check out the following application note in Bioinformatics journal that we just published:

“BatchQC: interactive software for evaluating sample and batch effects in genomic data”
Solaiappan Manimaran, Heather Marie Selby, Kwame Okrah, Claire Ruberman, Jeffrey T. Leek, John Quackenbush, Benjamin Haibe-Kains, Hector Corrada Bravo and W. Evan Johnson
http://bioinformatics.oxfordjournals.org/content/early/2016/08/30/bioinformatics.btw538

BatchQC is a software tool that streamlines batch preprocessing and evaluation by providing interactive diagnostics, visualizations, and statistical analyses to explore the extent to which batch variation impacts the data. BatchQC diagnostics help determine whether batch adjustment needs to be done, and how correction should be applied before proceeding with a downstream analysis. BatchQC can also apply existing adjustment tools and allow users to evaluate their benefits interactively.

BatchQC is available from Bioconductor at the following link: http://bioconductor.org/packages/BatchQC

Best,

Mani (Solaiappan Manimaran)

score 1 · Answer 2 · 2016-09-06

Hello,

People define a "batch" differently depending on what they are able to do and how cautious they are: it can be the chip scan date*, the data set, or the microarray platform. Batch adjustment may significantly bias group (phenotype) differences if your study design is unbalanced (groups are not evenly distributed across batches). From your description I deduce that you will have a highly unbalanced design. There are many combinations of steps you could take, and none of them guarantees success.
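A quick way to see how unbalanced a design is: cross-tabulate phenotype against the candidate batch. This is a minimal base R sketch; the `pheno` data frame and its values are made up for illustration and are not the poster's data.

```r
# Hypothetical sample annotation: one row per sample.
pheno <- data.frame(
  dataset   = c("set1", "set1", "set1", "set2", "set2", "set3", "set3", "set3"),
  cell_type = c("normal", "normal", "primary", "primary", "metastasis",
                "metastasis", "metastasis", "metastasis")
)

# Rows are phenotypes, columns are candidate batches (here: data sets).
design_table <- table(pheno$cell_type, pheno$dataset)
print(design_table)

# A batch containing a single phenotype is fully confounded with it:
# no adjustment can separate biology from batch in that data set.
n_types_per_batch <- colSums(design_table > 0)
confounded <- names(n_types_per_batch)[n_types_per_batch == 1]
print(confounded)
```

Empty or near-empty cells in `design_table` are the warning sign; the more of them there are, the more a batch adjustment risks removing or creating phenotype differences.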

It will be necessary for you to monitor your data (by using plots) after each important step.
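One common monitoring plot is a PCA of the samples coloured by batch, drawn before and after each processing step. The sketch below uses base R only and simulated data with an artificial per-data-set shift; all object names and values are illustrative assumptions.

```r
set.seed(1)

# Simulated log2 expression matrix: genes in rows, samples in columns.
n_genes <- 100
batch <- factor(rep(c("set1", "set2", "set3"), each = 6))
expr <- matrix(rnorm(n_genes * length(batch), mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", seq_along(batch))))

# Add an artificial shift per data set to mimic a batch effect.
shift <- c(set1 = 0, set2 = 1.5, set3 = -1)
expr <- sweep(expr, 2, shift[as.character(batch)], "+")

# PCA on samples (prcomp expects samples in rows, hence t()).
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

plot(pca$x[, 1], pca$x[, 2], col = as.integer(batch), pch = 19,
     xlab = sprintf("PC1 (%.0f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.0f%%)", 100 * var_explained[2]))
legend("topright", legend = levels(batch),
       col = seq_along(levels(batch)), pch = 19)
```

If samples cluster by data set rather than by cell type, batch is dominating the signal. Repeating the same plot on the processed matrix shows whether a given step helped.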

Assuming that you have an unbalanced design and two platforms, I would suggest one of the following approaches:

- normalize each platform (with RMA); use BrainArray mappings

- use ComBat** to merge the data from both platforms (I assume that the phenotypes on both platforms are more or less equal in terms of quantity); here the platform is the batch.

or

- normalize each platform (with RMA); use BrainArray mappings

- apply ComBat to each platform separately (define "batch" as "data set"; again, be aware of the phenotype composition of your "batches")

- perform scaling between platforms to merge data

or

You may try my suggestions without batch correction (e.g. normalize and then scale). Sometimes batch correction does more harm to the data than leaving it out.
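"Scaling between platforms" can be done in several ways. The sketch below shows one simple possibility, gene-wise z-scoring within each platform on the shared probe sets, and is not necessarily the scaling meant here. The matrices are simulated and `scale_rows` is a hypothetical helper.

```r
set.seed(2)

# Simulated RMA-like log2 matrices; the second platform has extra probe sets
# and a different overall level and spread.
plat_a <- matrix(rnorm(50 * 10, mean = 7), nrow = 50,
                 dimnames = list(paste0("g", 1:50), paste0("a", 1:10)))
plat_b <- matrix(rnorm(60 * 8, mean = 9, sd = 2), nrow = 60,
                 dimnames = list(paste0("g", 1:60), paste0("b", 1:8)))

# Keep only the probe sets present on both platforms.
common <- intersect(rownames(plat_a), rownames(plat_b))

# Hypothetical helper: centre and scale every gene (row) within one platform.
scale_rows <- function(m) t(scale(t(m)))

# Scale each platform separately, then merge the samples column-wise.
merged <- cbind(scale_rows(plat_a[common, ]), scale_rows(plat_b[common, ]))
dim(merged)
```

Gene-wise scaling removes platform offsets, but it also removes any real difference caused by unequal phenotype composition between platforms, so the same caution about unbalanced designs applies.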

Cheers,

Pawel

*You can extract the scan date from each .CEL file.
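A possible way to read scan dates is sketched below. It assumes the `affyio` package is installed and uses its `read.celfile.header()` function; `parse_scan_date` and `read_scan_dates` are hypothetical helpers, and the two date formats handled are the ones commonly seen in CEL headers, so other formats would come back as `NA`.

```r
# Hypothetical helper: turn raw ScanDate strings into calendar dates.
# Handles "MM/DD/YY HH:MM:SS" and ISO-style "YYYY-MM-DDTHH:MM:SSZ" strings.
parse_scan_date <- function(x) {
  iso <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}", x)
  out <- rep(as.Date(NA), length(x))
  out[iso]  <- as.Date(substr(x[iso], 1, 10), format = "%Y-%m-%d")
  out[!iso] <- as.Date(substr(x[!iso], 1, 8), format = "%m/%d/%y")
  out
}

# Hypothetical helper: read the ScanDate field from each CEL file header.
read_scan_dates <- function(cel_files) {
  if (!requireNamespace("affyio", quietly = TRUE)) {
    stop("Package 'affyio' is required to read CEL headers")
  }
  raw <- vapply(cel_files, function(f) {
    d <- affyio::read.celfile.header(f, info = "full")$ScanDate
    if (length(d) == 0) NA_character_ else as.character(d[1])
  }, character(1))
  parse_scan_date(raw)
}

# The parser can be checked without any CEL files:
parse_scan_date(c("05/01/08 12:34:56", "2012-03-04T10:00:00Z"))
```

With real files, something like `batch <- factor(read_scan_dates(list.files("cel_dir", full.names = TRUE)))` would give one batch per scan day; `"cel_dir"` is a placeholder path.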

**Perform batch correction on the whole matrices. It is easier to get a batch-corrected matrix with ComBat (that is what you want for correlation analysis). SVA works well in a differential expression setting and is considered more "hard-core" on the data than ComBat, which makes it dangerous in the hands of an inexperienced user.
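The order of operations (adjust the whole matrix first, extract the two genes afterwards) might look like the sketch below. It uses simulated, balanced data and assumes the Bioconductor package `sva` is installed; if it is not, the adjustment step is skipped. The gene names `gene1` and `gene2` are placeholders.

```r
set.seed(3)

# Simulated log2 expression matrix: 200 genes, 24 samples from 3 data sets.
n_genes <- 200
batch <- factor(rep(c("set1", "set2", "set3"), each = 8))
cell_type <- factor(rep(c("normal", "metastasis"), times = 12))
expr <- matrix(rnorm(n_genes * 24, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", 1:24)))

# Artificial per-data-set shift; it inflates correlations between genes.
shift <- c(set1 = 0, set2 = 1, set3 = -1)
expr <- sweep(expr, 2, shift[as.character(batch)], "+")

# Adjust the WHOLE matrix, keeping cell type as a covariate to preserve.
adjusted <- NULL
if (requireNamespace("sva", quietly = TRUE)) {
  mod <- model.matrix(~ cell_type)
  adjusted <- sva::ComBat(dat = expr, batch = batch, mod = mod)
}

# Only now extract the two genes of interest and correlate them.
mat <- if (is.null(adjusted)) expr else adjusted
res <- cor.test(mat["gene1", ], mat["gene2", ], method = "spearman")
print(res$estimate)
```

ComBat pools information across genes when estimating batch parameters, which is why it is run on the full matrix rather than on a two-row extract. With a strongly unbalanced design, `mod` can be confounded with `batch`, in which case ComBat may stop with an error or give unreliable results.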

score 0 · Answer 3 · 2016-09-06

polemiraza ▴ 70

@polemiraza-11428

Last seen 4.1 years ago

Poland

Hi Ewelina,

Please clarify:

Does "substracted" mean "extracted"?

Does "14 data sets" mean 14 independent experiments, each consisting of some number of samples?

What kind of platforms do they represent?

Best,

Pawel

written 9.3 years ago by polemiraza ▴ 70

Hello,

Yes, sorry for the bad word choice: I meant "extracted" (according to the names of the probes corresponding to my genes of interest). And yes, these are 14 independent experiments (maybe only 2 of them were performed by the same group of scientists), each containing from 6 to about 80 samples. All experiments were performed using the Affymetrix Human Genome U133A Array (GPL96) or its derivatives (Plus, 2.0).

Hope this clarifies it a little bit,

Ewelina

written 9.3 years ago by Ewelina Dratkiewicz • 0