Merge different datasets and perform differential expression analysis in limma
Entering edit mode
svlachavas ▴ 780
Last seen 2 days ago
Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

after some older posts of individually analyzing 2 affymetrix microarray colorectal datasets in R/Bioconductor, we decided to combine and merge the two datasets, in order to get one unified dataset to perform the same analysis, which is the gene expression alteration of cancer vs control paired samples in each patient. Im familiar with literature and also other posts that mention that combining different datasets it is not wise and can cause different problems in downstream analysis-but as the two platforms are both affymetrix(hgu133a & hgu133plus2), we used the inSilicoMerging R package ( to combine the two datasets based on the common probesets after the normalization of each dataset and perform batch effect correction with Combat method which is also performed from the package. Finally,the merged expression set is:

ExpressionSet (storageMode: lockedEnvironment)
assayData: 22277 features, 60 samples 
  element names: exprs 
protocolData: none
  sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... 1554_03_Gemmer_1.CEL (60 total)
  varLabels: Disease Meta_factor Study
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2 hgu133a 

                                   Disease Meta_factor       Study
St_1_WL57.CEL           Normal           0               hgu133plus2
St_2_WL57.CEL           Cancer           0               hgu133plus2
St_N_EC59.CEL           Normal           0               hgu133plus2
St_T_EC59.CEL           Cancer           0                hgu133plus2
St_N_EJ58.CEL           Normal           0                 hgu133plus2
St_T_EJ58.CEL           Cancer           0                 hgu133plus2
St_N_FD58.CEL           Normal           0                hgu133plus2
St_T_FD58.CEL           Cancer           0                hgu133plus2
St_N_FG39.CEL           Normal           1                hgu133plus2
St_T_FG39.CEL           Cancer           1                 hgu133plus2
St_1_GG52.CEL           Normal           1                hgu133plus2
St_2_GG52.CEL           Cancer           1                hgu133plus2
St_N_HJ33.CEL           Normal           0                 hgu133plus2
St_T_HJ33.CEL           Cancer           0                  hgu133plus2
St_N_HK57.CEL           Normal           0                 hgu133plus2
St_T_HK57.CEL           Cancer           0                 hgu133plus2
St_N_KA49.CEL           Normal           0                 hgu133plus2
St_T_KA49.CEL           Cancer           0                 hgu133plus2
St_N_KW40.CEL           Normal           0                hgu133plus2
St_T_KW40.CEL           Cancer           0                hgu133plus2
St_1_LH53.CEL           Normal           0                  hgu133plus2
St_2_LH53.CEL           Cancer           0                  hgu133plus2
St_N_LW46.CEL           Normal           0                 hgu133plus2
St_T_LW46.CEL           Cancer           0                 hgu133plus2
St_N_NH45.CEL           Normal           0                 hgu133plus2
St_T_NH45.CEL           Cancer           0                 hgu133plus2
St_N_PH42.CEL           Normal           0                 hgu133plus2
St_T_PH42.CEL           Cancer           0                  hgu133plus2
St_1_SSch60.CEL         Normal           0                hgu133plus2
St_2_SSch60.CEL         Cancer           0                hgu133plus2
St_N_WH37.CEL           Normal           0                hgu133plus2
St_T_WH37.CEL           Cancer           0                hgu133plus2
St_N_WM60.CEL           Normal           0               hgu133plus2
St_T_WM60.CEL           Cancer           0                hgu133plus2
7_Tesch.CEL             Normal           0                    hgu133a
8_Tesch.CEL             Cancer           0                    hgu133a
0687_04_Blauth.CEL      Normal           0                hgu133a
0687_04_Blauth_1.CEL    Cancer           0              hgu133a
0863_03_Schmidt.CEL     Normal           0              hgu133a
0863_03_Schmidt_1.CEL   Cancer           0            hgu133a
0948_04_Leiber.CEL      Normal           0                hgu133a
0948_04_Leiber_1.CEL    Cancer           0              hgu133a
1043_04_Nagel.CEL       Normal           0               hgu133a
1043_04_Nagel_1.CEL     Cancer           0             hgu133a
1103_03_Braun.CEL       Normal           1              hgu133a
1103_03_Braun_1.CEL     Cancer           1             hgu133a
1234_06_Floersch.CEL    Normal           0             hgu133a
1234_06_Floersch_1.CEL  Cancer           0            hgu133a
1235_06_Hey.CEL         Normal           0               hgu133a
1235_06_Hey_1.CEL       Cancer           0              hgu133a
1236_06_Liebich.CEL     Normal           1               hgu133a
1236_06_Liebich_1.CEL   Cancer           1              hgu133a
1410_03_Urbaniak.CEL    Normal           1             hgu133a
1410_03_Urbaniak_1.CEL  Cancer           1            hgu133a
1430_04_Patschke.CEL    Normal           0             hgu133a
1430_04_Patschke_1.CEL  Cancer           0           hgu133a
1518_03_Dege.CEL        Normal           0               hgu133a
1518_03_Dege_1.CEL      Cancer           0              hgu133a
1554_03_Gemmer.CEL      Normal           1            hgu133a
1554_03_Gemmer_1.CEL    Cancer           1          hgu133a

To check visually the effect of the batch effect correction of the merged eset i used the function plotMDS,

and with some exceptions, the majority of the samples group together-that is, normal samples from both studies and cancer samples from both studies.

Thus, my main question is for the construction of design matrix to perform a paired analysis:

condition <- factor(eset.2$Disease, levels=c("Normal","Cancer")) 
pairs <- factor(rep(1:30, each = 2))
design <- model.matrix(~condition +pairs)
fit <- lmFit(eset.2,design)

fit2 <- eBayes(fit, trend=TRUE) # also have performed an initial non-specific intensity filtering(final number of probesets 17192

Or should i also include the different study variable as the batch variable indicator into the linear model ? :

batch <- factor(eset.2$Study)

design <- model.matrix(~condition +pairs +batch)

fit <- lmFit(eset.2,design)

fit2 <- eBayes(fit, trend=TRUE)

I can provide via dropbox the output of plotMDS for further confirmation

Any suggestion or help would be grateful !!!

insilicomerging batcheffect limma differential gene expression bioconductor • 2.9k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 8 hours ago
The city by the bay

I can't speak for inSilicoMerging; but as for your linear model, you don't need to include the study variable as a blocking factor. This is because any differences between batches will be absorbed into the differences between pairs, i.e., the effect of the different chip will be included in the estimate of the patient-specific effect. Inclusion of batch would be redundant and result in a design matrix that is not of full rank, as we can't distinguish between effect of the patient and that of the chip.

Entering edit mode

I see. So in your opinion (and please correct me if i still havent understand fully)the batch variable implementation is unecessary in the model matrix even there are different chips, as far the study is paired and then we could not distinguish these effects

Finally, regarding the general approach: if i also perform some more diagnostic plots like PCA, or a correlation dendrogram and i dont see a serious batch effect regarding the different chip, should i follow with limma and the first design matrix without the batch variable ?

Entering edit mode

You should use the first design even if there is a large batch effect between chips, as this will be fully absorbed into the patient-specific effects.

Entering edit mode

Thank you for your consideration on this matter !!!


Login before adding your answer.

Traffic: 522 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6