Combining data from multiple Illumina microarray experiments
chris86 ▴ 390
@chris86-8408
UCL, United Kingdom

Hi

We have two experiments run on Illumina arrays by the same company, and I want to combine the data. I normally use limma's functions (read.ilmn and neqc) to normalise the data. If I combine these two experiments, what is the best way of going about it? I am aware that they need background correction and quantile normalisation using the negative and positive control probes on the arrays.

Edit: the platform is the same, but I need to compare healthy samples in one group with diseased samples in the other group. As for why it was done like this, I didn't have any say in it and I can't go into that. The company we use is highly efficient and everything is standardised, so batch should not be much of a problem.

Best,

Chris

normalization limma
@gordon-smyth
WEHI, Melbourne, Australia

You can simply combine the datasets before normalization like this:

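library(limma)
# File names are placeholders: substitute the probe and control profile exports for each run
x1 <- read.ilmn(files = "run1_probe_profile.txt", ctrlfiles = "run1_control_profile.txt")
x2 <- read.ilmn(files = "run2_probe_profile.txt", ctrlfiles = "run2_control_profile.txt")
x <- cbind(x1, x2)   # combine the raw data from both runs
y <- neqc(x)         # background correct and quantile normalise using the control probes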

etc. This assumes that the probes for both runs are identical and in the same order.
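
The assumption about identical probes in the same order can be checked before calling cbind. Below is a minimal sketch in base R, using made-up probe IDs in place of the real ones (for example the row names of x1$E and x2$E, assuming the probe IDs are stored there). The helper name same_probes is hypothetical, not a limma function.

```r
# Hypothetical helper: TRUE only if both ID vectors hold the same probes in the same order
same_probes <- function(ids1, ids2) {
  length(ids1) == length(ids2) && all(ids1 == ids2)
}

# Toy stand-ins for the probe IDs of run 1 and run 2
ids_run1 <- c("ILMN_1", "ILMN_2", "ILMN_3")
ids_run2 <- c("ILMN_3", "ILMN_1", "ILMN_2")

same_probes(ids_run1, ids_run2)        # FALSE: same probes, different order

# Find the row order that makes run 2 match run 1
idx <- match(ids_run1, ids_run2)
stopifnot(!anyNA(idx))                 # every run-1 probe must be present in run 2
same_probes(ids_run1, ids_run2[idx])   # TRUE after reordering
```

With real data the same index could, as far as I know, be used to reorder the whole object (x2[idx, ]) before cbind, but that should be verified on your own files.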


Thanks, that is what I wanted.


My results from this look way off (too much DE). I deleted probes in the control files that did not match between the two control probe exports. Could this have had that effect, or is this a batch issue?


You said above "My results from this look way off (too much DE)", but that sounds rather arbitrary. What do you mean by "too much DE"? By which cutoffs or comparisons? What is your biological question of interest? Did you also create any diagnostic plots, such as an MDS plot, to see whether your samples separate according to your biological question and whether real differences in expression are plausible? Or whether any batch effect is present (which could also be investigated with a hierarchical clustering of your samples, a PCA plot, etc.)? A brief look at the code you used would make it easier to offer further help or suggestions.
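
As a rough illustration of the diagnostics mentioned above, here is a minimal sketch in base R on simulated data. limma's plotMDS is the usual choice for an MDS plot; this sketch uses cmdscale and hclust instead so that it runs without any packages. The matrix, group labels and effect size are all made up for illustration.

```r
set.seed(1)

# Simulated log-expression matrix: 200 probes x 8 samples (made-up data)
group <- rep(c("healthy", "arthritis"), each = 4)
expr <- matrix(rnorm(200 * 8), nrow = 200,
               dimnames = list(paste0("probe", 1:200), paste0("s", 1:8)))
# Shift 50 probes in the arthritis samples to mimic a group difference
expr[1:50, group == "arthritis"] <- expr[1:50, group == "arthritis"] + 2

# Classical MDS on Euclidean distances between samples
d <- dist(t(expr))
mds <- cmdscale(d, k = 2)
plot(mds, col = ifelse(group == "healthy", "blue", "red"), pch = 19,
     xlab = "Dimension 1", ylab = "Dimension 2")
text(mds, labels = colnames(expr), pos = 3)

# Hierarchical clustering of the same samples
hc <- hclust(d, method = "average")
plot(hc, labels = paste(colnames(expr), group))

# Cross-tabulate a two-cluster cut against the known groups
clusters <- cutree(hc, k = 2)
table(clusters, group)
```

If the samples split cleanly by run, these plots alone cannot say whether biology or batch is responsible when the two coincide.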

svlachavas ▴ 780
@svlachavas-7225
Germany/Heidelberg/German Cancer Resear…

Dear Chris,

actually, the task of combining two or more datasets is always challenging, with many parameters and assumptions to take into account! Firstly, are the Illumina arrays on the same platform or on different ones? Secondly, is the phenotype information or the experimental design the same? For instance, was the same cell line used, or different ones? Then there is the possible presence of a batch effect, which you will have to take into account. Thus, in my personal opinion, the safest way is to analyze the two datasets separately first, and then:

1) Compare your final DE lists, to identify common genuine DE genes among your "similar comparisons";

2) Identify common pathways or ontologies "enriched" in both datasets;

3) Also produce other exploratory plots of the DE statistics, such as a scatter plot of the log-FCs from the two datasets.
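
Steps 1 and 3 can be sketched in base R roughly as follows. The two result tables are simulated and only mimic the logFC and adj.P.Val columns that limma's topTable returns. The 0.05 cutoff is an arbitrary choice for illustration.

```r
set.seed(2)
probes <- paste0("probe", 1:100)

# Made-up per-dataset results, shaped like the logFC / adj.P.Val columns of topTable
res1 <- data.frame(logFC = rnorm(100), adj.P.Val = runif(100), row.names = probes)
res2 <- data.frame(logFC = res1$logFC + rnorm(100, sd = 0.5),
                   adj.P.Val = runif(100), row.names = probes)

# Step 1: probes called DE in both analyses (illustrative cutoff)
de1 <- rownames(res1)[res1$adj.P.Val < 0.05]
de2 <- rownames(res2)[res2$adj.P.Val < 0.05]
common_de <- intersect(de1, de2)

# Step 3: compare effect sizes on the probes present in both tables
common <- intersect(rownames(res1), rownames(res2))
plot(res1[common, "logFC"], res2[common, "logFC"],
     xlab = "logFC, dataset 1", ylab = "logFC, dataset 2")
abline(0, 1, lty = 2)   # line of perfect agreement
logfc_cor <- cor(res1[common, "logFC"], res2[common, "logFC"])
```

Matching on probe IDs rather than on row position keeps the comparison valid even if the two tables are sorted differently.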

Then, it might be interesting for other reasons (e.g. to gain, in theory, more statistical power) to combine both datasets. But there are still a lot of issues that you will have to take into account for the combination even to be valid. I hope that helps.

Best,

Efstathios


Sorry, I should have said that the platform is identical. So far we have not seen batch effects come out of the service we are provided with, although between entirely different experiments we cannot establish this, because one group is diseased (arthritis) and the other is healthy. We need to do a direct comparison between healthy and arthritis. The reasons why the experiment was done like this are complex, and I suppose we would have to ignore any possible batch effects and do it anyway.


Well, do you mean that one dataset consists only of patient samples, while the other consists only of healthy individuals? Ignoring a batch effect would generally be a mistake, but in any case you should create some EDA plots, such as an MDS plot or a hierarchical clustering of your samples, to inspect how your groups cluster.


Yes, that is right. As far as I understand, I will not be able to tell the difference between a batch effect and healthy/disease status anyway, because they are confounded? At the moment I am trying to integrate the files from GenomeStudio, not doing clustering, PCAs, etc. Thanks.
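
The confounding can be seen directly in the design matrix. This is a minimal sketch in base R with a hypothetical layout of four samples per run, not the actual sample sheet. When every healthy sample is in one run and every arthritis sample in the other, the status and batch columns are identical, so the two effects cannot be estimated separately.

```r
# Hypothetical layout: all healthy samples in run 1, all arthritis samples in run 2
status <- factor(rep(c("healthy", "arthritis"), each = 4),
                 levels = c("healthy", "arthritis"))
batch <- factor(rep(c("run1", "run2"), each = 4))

design <- model.matrix(~ status + batch)
ncol(design)       # 3 coefficients requested
qr(design)$rank    # 2: the batch column duplicates the status column

# For contrast, a balanced layout with both groups present in each run
status_bal <- factor(rep(c("healthy", "arthritis"), times = 4),
                     levels = c("healthy", "arthritis"))
design_bal <- model.matrix(~ status_bal + batch)
qr(design_bal)$rank   # 3: status and batch are separately estimable
```

In the confounded case, any difference between the groups is the sum of the disease effect and the run effect, and no model can split the two.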


Well, it is odd to run the healthy and the diseased individuals in different experiments. Anyway, a first, naive solution is to just merge both datasets on the common probe IDs, after normalization and given that you have the same platform, and then produce some EDA plots to examine the possible batch effect. It is not a great solution, but again it might be something.
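
A minimal sketch of merging on common probe IDs in base R, using two small made-up matrices in place of the separately normalised expression matrices from each run. It assumes the probe IDs are stored as row names.

```r
set.seed(3)

# Toy normalised log-expression matrices with partly overlapping probes
y1 <- matrix(rnorm(12), nrow = 4,
             dimnames = list(c("ILMN_1", "ILMN_2", "ILMN_3", "ILMN_4"),
                             c("h1", "h2", "h3")))
y2 <- matrix(rnorm(9), nrow = 3,
             dimnames = list(c("ILMN_4", "ILMN_2", "ILMN_5"),
                             c("a1", "a2", "a3")))

# Keep only probes present in both runs, in one consistent order
common <- intersect(rownames(y1), rownames(y2))
merged <- cbind(y1[common, , drop = FALSE], y2[common, , drop = FALSE])

dim(merged)        # 2 probes x 6 samples
rownames(merged)   # "ILMN_2" "ILMN_4"
```

Indexing both matrices by the same ID vector guarantees that each row of the merged matrix refers to one probe, whatever the original row order was.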


We were not intending to compare them originally. There is a long story behind it. It is all a bit annoying for me!