Question

Merge different datasets and perform differential expression analysis in limma

2

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

after some older posts of individually analyzing 2 affymetrix microarray colorectal datasets in R/Bioconductor, we decided to combine and merge the two datasets, in order to get one unified dataset to perform the same analysis, which is the gene expression alteration of cancer vs control paired samples in each patient. Im familiar with literature and also other posts that mention that combining different datasets it is not wise and can cause different problems in downstream analysis-but as the two platforms are both affymetrix(hgu133a & hgu133plus2), we used the inSilicoMerging R package (http://www.bioconductor.org/packages/release/bioc/html/inSilicoMerging.html) to combine the two datasets based on the common probesets after the normalization of each dataset and perform batch effect correction with Combat method which is also performed from the package. Finally,the merged expression set is:

eset_COMBAT
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22277 features, 60 samples
element names: exprs
protocolData: none
phenoData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... 1554_03_Gemmer_1.CEL (60 total)
varLabels: Disease Meta_factor Study
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2 hgu133a

pData(eset.2)
Disease Meta_factor Study
St_1_WL57.CEL Normal 0 hgu133plus2
St_2_WL57.CEL Cancer 0 hgu133plus2
St_N_EC59.CEL Normal 0 hgu133plus2
St_T_EC59.CEL Cancer 0 hgu133plus2
St_N_EJ58.CEL Normal 0 hgu133plus2
St_T_EJ58.CEL Cancer 0 hgu133plus2
St_N_FD58.CEL Normal 0 hgu133plus2
St_T_FD58.CEL Cancer 0 hgu133plus2
St_N_FG39.CEL Normal 1 hgu133plus2
St_T_FG39.CEL Cancer 1 hgu133plus2
St_1_GG52.CEL Normal 1 hgu133plus2
St_2_GG52.CEL Cancer 1 hgu133plus2
St_N_HJ33.CEL Normal 0 hgu133plus2
St_T_HJ33.CEL Cancer 0 hgu133plus2
St_N_HK57.CEL Normal 0 hgu133plus2
St_T_HK57.CEL Cancer 0 hgu133plus2
St_N_KA49.CEL Normal 0 hgu133plus2
St_T_KA49.CEL Cancer 0 hgu133plus2
St_N_KW40.CEL Normal 0 hgu133plus2
St_T_KW40.CEL Cancer 0 hgu133plus2
St_1_LH53.CEL Normal 0 hgu133plus2
St_2_LH53.CEL Cancer 0 hgu133plus2
St_N_LW46.CEL Normal 0 hgu133plus2
St_T_LW46.CEL Cancer 0 hgu133plus2
St_N_NH45.CEL Normal 0 hgu133plus2
St_T_NH45.CEL Cancer 0 hgu133plus2
St_N_PH42.CEL Normal 0 hgu133plus2
St_T_PH42.CEL Cancer 0 hgu133plus2
St_1_SSch60.CEL Normal 0 hgu133plus2
St_2_SSch60.CEL Cancer 0 hgu133plus2
St_N_WH37.CEL Normal 0 hgu133plus2
St_T_WH37.CEL Cancer 0 hgu133plus2
St_N_WM60.CEL Normal 0 hgu133plus2
St_T_WM60.CEL Cancer 0 hgu133plus2
7_Tesch.CEL Normal 0 hgu133a
8_Tesch.CEL Cancer 0 hgu133a
0687_04_Blauth.CEL Normal 0 hgu133a
0687_04_Blauth_1.CEL Cancer 0 hgu133a
0863_03_Schmidt.CEL Normal 0 hgu133a
0863_03_Schmidt_1.CEL Cancer 0 hgu133a
0948_04_Leiber.CEL Normal 0 hgu133a
0948_04_Leiber_1.CEL Cancer 0 hgu133a
1043_04_Nagel.CEL Normal 0 hgu133a
1043_04_Nagel_1.CEL Cancer 0 hgu133a
1103_03_Braun.CEL Normal 1 hgu133a
1103_03_Braun_1.CEL Cancer 1 hgu133a
1234_06_Floersch.CEL Normal 0 hgu133a
1234_06_Floersch_1.CEL Cancer 0 hgu133a
1235_06_Hey.CEL Normal 0 hgu133a
1235_06_Hey_1.CEL Cancer 0 hgu133a
1236_06_Liebich.CEL Normal 1 hgu133a
1236_06_Liebich_1.CEL Cancer 1 hgu133a
1410_03_Urbaniak.CEL Normal 1 hgu133a
1410_03_Urbaniak_1.CEL Cancer 1 hgu133a
1430_04_Patschke.CEL Normal 0 hgu133a
1430_04_Patschke_1.CEL Cancer 0 hgu133a
1518_03_Dege.CEL Normal 0 hgu133a
1518_03_Dege_1.CEL Cancer 0 hgu133a
1554_03_Gemmer.CEL Normal 1 hgu133a
1554_03_Gemmer_1.CEL Cancer 1 hgu133a

To check visually the effect of the batch effect correction of the merged eset i used the function plotMDS,

and with some exceptions, the majority of the samples group together-that is, normal samples from both studies and cancer samples from both studies.

Thus, my main question is for the construction of design matrix to perform a paired analysis:

condition <- factor(eset.2$Disease, levels=c("Normal","Cancer"))
pairs <- factor(rep(1:30, each = 2))
design <- model.matrix(~condition +pairs)
fit <- lmFit(eset.2,design)

fit2 <- eBayes(fit, trend=TRUE) # also have performed an initial non-specific intensity filtering(final number of probesets 17192

Or should i also include the different study variable as the batch variable indicator into the linear model ? :

batch <- factor(eset.2$Study)

design <- model.matrix(~condition +pairs +batch)

fit <- lmFit(eset.2,design)

fit2 <- eBayes(fit, trend=TRUE)

I can provide via dropbox the output of plotMDS for further confirmation

Any suggestion or help would be grateful !!!

insilicomerging batcheffect limma differential gene expression bioconductor • 4.0k views

ADD COMMENT • link updated 8.8 years ago by Aaron Lun ★ 28k • written 8.8 years ago by svlachavas ▴ 830

score 2 · Answer 1 · 2015-07-08

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

I can't speak for inSilicoMerging; but as for your linear model, you don't need to include the study variable as a blocking factor. This is because any differences between batches will be absorbed into the differences between pairs, i.e., the effect of the different chip will be included in the estimate of the patient-specific effect. Inclusion of batch would be redundant and result in a design matrix that is not of full rank, as we can't distinguish between effect of the patient and that of the chip.

ADD COMMENT • link 8.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

I see. So in your opinion (and please correct me if i still havent understand fully)the batch variable implementation is unecessary in the model matrix even there are different chips, as far the study is paired and then we could not distinguish these effects

Finally, regarding the general approach: if i also perform some more diagnostic plots like PCA, or a correlation dendrogram and i dont see a serious batch effect regarding the different chip, should i follow with limma and the first design matrix without the batch variable ?

ADD REPLY • link 8.8 years ago svlachavas ▴ 830

2

Entering edit mode

You should use the first design even if there is a large batch effect between chips, as this will be fully absorbed into the patient-specific effects.