Question: What type of data normalization for multiple microarrays?
Ewelina Dratkiewicz (Poland) wrote 2.3 years ago:

Hello,

I'm new to R and I have a small problem with data analysis. I was asked to create a correlation plot for 2 genes expressed in melanoma cells. I downloaded data from GEO (14 data sets, 2 types of similar microarrays), made ExpressionSets, normalized them with RMA, subtracted the data for my 2 genes and compiled it into one matrix. To every sample I assigned two traits: cell type (normal, primary, metastasis and so on) and the number of the data set it was subtracted from. Then I created a simple plot to see what my data look like (without any normalization across data sets, Spearman's correlation coefficient is above 0.5 with a really low p-value). Now I would like to remove any differences between data sets. If I understood correctly, I should remove the batch effect with e.g. ComBat. And here are my questions: should I assume that one batch equals one data set, or can one data set contain several batches (differences in data collection dates and so on)? Is ComBat or SVA the best method for this particular case? And should I perform this normalization on the whole data matrices (how?) and then extract the data for my 2 genes of interest?

I'm sorry if my post is a little chaotic but I'm still learning how to use R. I will be really grateful for your advice.

manimaran_1975 (United States) wrote 2.3 years ago:
Hi,

Please check out the new Shiny App R-package called BatchQC, which will let you easily do what you want. You can adjust for batch using ComBat or SVA and compare the results, all with a click of a few buttons.

Please check out the following application note in the journal Bioinformatics that we just published:

“BatchQC: interactive software for evaluating sample and batch effects in genomic data”
Solaiappan Manimaran, Heather Marie Selby, Kwame Okrah, Claire Ruberman, Jeffrey T. Leek, John Quackenbush, Benjamin Haibe-Kains, Hector Corrada Bravo and W. Evan Johnson
http://bioinformatics.oxfordjournals.org/content/early/2016/08/30/bioinformatics.btw538

BatchQC is a software tool that streamlines batch preprocessing and evaluation by providing interactive diagnostics, visualizations, and statistical analyses to explore the extent to which batch variation impacts the data. BatchQC diagnostics help determine whether batch adjustment needs to be done, and how correction should be applied before proceeding with a downstream analysis. BatchQC can also apply existing adjustment tools and allow users to evaluate their benefits interactively.

BatchQC is available from Bioconductor at the following link: http://bioconductor.org/packages/BatchQC

Best,

Mani (Solaiappan Manimaran)

Thank you very much for this great suggestion! I will try to apply it to my data.

Best,

Ewelina

(reply written 2.3 years ago by Ewelina Dratkiewicz)
polemiraza wrote 2.3 years ago:

Hello,

What people treat as a "batch" depends on their resources and on how cautious they want to be: it can be the chip scan date*, the data set, or the microarray platform. Batch adjustment may significantly bias group (phenotype) differences if your study design is unbalanced (groups are not evenly distributed across batches). From your description I deduce that you will have a highly unbalanced design. There are many combinations of steps you could take, and none of them guarantees success.

It will be necessary for you to monitor your data (by using plots) after each important step.
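A quick way to see how unbalanced a design is, sketched in base R: cross-tabulate phenotype against batch and look for empty or very uneven cells. The `pheno` data frame, its column names and the set labels below are made up for illustration; substitute your own sample annotation.

```r
# Hypothetical sample annotation: one row per sample.
pheno <- data.frame(
  cell_type = c("normal", "normal", "primary", "primary",
                "metastasis", "metastasis", "metastasis", "primary"),
  dataset   = c("set1", "set1", "set1", "set2",
                "set2", "set2", "set3", "set3"),
  stringsAsFactors = TRUE
)

# Cross-tabulate phenotype (rows) against batch (columns).
design_table <- table(pheno$cell_type, pheno$dataset)
print(design_table)

# A zero cell means that phenotype is absent from that batch, so
# phenotype and batch are partly confounded.
n_zero <- sum(design_table == 0)
cat("Empty phenotype/batch combinations:", n_zero, "\n")
```

If many cells are empty, any batch adjustment that treats the data set as the batch can also remove or distort real phenotype differences.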

Assuming that you have an unbalanced design and two platforms, I would suggest one of the following approaches:

- normalize each platform (with RMA); use BrainArray mappings

- use ComBat** to merge the data from both platforms (I assume that the phenotypes on both platforms are more or less equal in terms of quantity) - here the platform is the batch.

or

- normalize each platform (with RMA); use BrainArray mappings

- apply ComBat to each platform separately (define "batch" as "data set"; again, be aware of the phenotype composition of your "batches")

- perform scaling between platforms to merge the data

or

You may try my suggestions without batch correction (e.g. normalize and then scale). Sometimes batch correction does more harm to the data than leaving it out.
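As a rough illustration of the last option (normalize, then scale, with no batch correction), here is a minimal base R sketch on simulated data. It assumes that "scaling" means centering and scaling each gene within each data set, which is only one possible reading; all object names are hypothetical. Note that per-set scaling also removes real differences between sets that differ in phenotype composition.

```r
set.seed(1)

# Simulated log2 expression for two genes across three data sets, with an
# artificial per-set offset standing in for a batch effect.
n_per_set <- c(10, 15, 20)
dataset   <- factor(rep(c("set1", "set2", "set3"), times = n_per_set))
offset    <- rep(c(0, 2, -1), times = n_per_set)
signal    <- rnorm(sum(n_per_set))
gene_a    <- 8 + signal + offset + rnorm(sum(n_per_set), sd = 0.5)
gene_b    <- 7 + 0.8 * signal + offset + rnorm(sum(n_per_set), sd = 0.5)

# Center and scale a gene within each data set (z-scores per set).
scale_within <- function(x, group) {
  ave(x, group, FUN = function(v) (v - mean(v)) / sd(v))
}
gene_a_scaled <- scale_within(gene_a, dataset)
gene_b_scaled <- scale_within(gene_b, dataset)

# Spearman correlation before and after scaling.
rho_raw    <- cor(gene_a, gene_b, method = "spearman")
rho_scaled <- cor(gene_a_scaled, gene_b_scaled, method = "spearman")
cat("raw:", round(rho_raw, 2), " scaled:", round(rho_scaled, 2), "\n")
```

Plotting `gene_a_scaled` against `gene_b_scaled`, colored by `dataset`, is a simple way to monitor the data after this step.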

Cheers,

Pawel

*You can extract the scan date from each .cel file.

**Perform batch correction on the whole matrices. It is easier to get a batch-corrected matrix from ComBat (which is what you want for correlation analysis). SVA works well in a differential expression setting and is considered more "hard-core" on the data than ComBat, which makes it dangerous in the hands of an inexperienced user.
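To make "correct the whole matrix, then extract" concrete, here is a hedged sketch using `ComBat()` from the Bioconductor package sva on a toy matrix. ComBat's empirical Bayes step pools information across genes, which is one reason to run it on the full matrix rather than on two probes. The simulated data, the balanced toy design and the probe IDs are placeholders, not real data; the ComBat step is skipped if sva is not installed.

```r
set.seed(1)
n_genes <- 200
batch   <- factor(rep(c("set1", "set2", "set3"), each = 8))
group   <- factor(rep(c("normal", "tumor"), times = 12))  # balanced toy design

# Toy log2 expression matrix (probes x samples) with a per-batch shift.
expr <- matrix(rnorm(n_genes * length(batch), mean = 8),
               nrow = n_genes,
               dimnames = list(paste0("probe", seq_len(n_genes)),
                               paste0("sample", seq_along(batch))))
expr <- sweep(expr, 2, c(0, 1.5, -1)[as.integer(batch)], "+")

if (requireNamespace("sva", quietly = TRUE)) {
  # Include the phenotype in the model so ComBat tries to preserve it.
  mod <- model.matrix(~ group)
  expr_adj <- sva::ComBat(dat = expr, batch = batch, mod = mod)

  # Only now pull out the two probes of interest (placeholder IDs).
  two_genes <- expr_adj[c("probe1", "probe2"), ]
  print(cor.test(two_genes[1, ], two_genes[2, ], method = "spearman"))
}
```

With a real, unbalanced design the `mod` argument deserves care: if a phenotype occurs in only one batch, ComBat cannot separate the two.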

 

 


Thank you! I will try to apply your advice to my data and see what happens.

Best regards,

Ewelina

(reply written 2.3 years ago by Ewelina Dratkiewicz)

Can I ask another question? I tried to find some tools to merge my data, but VirtualArray is outdated and I'm unable to install the package, and InsilicoDB takes ages to process even one data set. Can you recommend anything?

(reply written 2.2 years ago by Ewelina Dratkiewicz)

Try

https://www.bioconductor.org/packages/release/bioc/html/inSilicoMerging.html

Best,

Pawel

(reply written 2.2 years ago by polemiraza)
polemiraza wrote 2.3 years ago:

Hi Ewelina,

Please clarify:

Does "subtracted" mean "extracted"?

Does "14 data sets" mean 14 independent experiments (each consisting of some number of samples)?

What kind of platforms do they represent?

Best,

Pawel


Hello,

Yes, sorry for the bad word choice - I meant extracted (according to the names of the probes corresponding to my genes of interest). And yes, these are 14 independent experiments (maybe only 2 of them were performed by the same group of scientists), each containing from 6 to about 80 samples. All experiments were performed using the Affymetrix Human Genome U133A Array (GPL96) or its derivatives (Plus, 2.0).

Hope this clarifies it a little bit,

Ewelina

(reply written 2.3 years ago by Ewelina Dratkiewicz)