Limma: advice needed for best approach for joint paired and unpaired analyses in unbalanced intervention study
Entering edit mode
Guido Hooiveld ★ 3.2k
Last seen 19 hours ago
Wageningen University, Wageningen, the …


I am struggling with the analysis of an unbalanced intervention study, and would appreciate any advice on this.

I got a dataset from a coworker in which he analyzed gene expression in tissue biopsies from healthy (N=14) and obese (N=29) subjects at baseline (B), and after 2 interventions that were performed in the obese subjects only [Treat1 and Treat2]. In total my dataset consists of 69 arrays, comprised of 14 healthy baseline arrays, and for the obese subjects 13 complete array pairs (baseline vs after) for Treat1, 13 complete pairs for Treat2, and 3 incomplete pairs (2 arrays for obese subjects baseline only, and 1 array for Treat2 only).


Of interest are:

  1. genes differentially expressed at baseline between all 14 healthy samples vs all 28 obese baseline samples.

  2. genes differentially expressed in obese subjects after Treatment1 (after vs before in subset of 13 ppl (=26 arrays) + 2 unpaired baseline arrays.

  3. genes differentially expressed in obese subjects after Treatment2 (after vs before in subset of 13 ppl +26 arrays) + 1 unpaired Treat2 array + 2 unpaired baseline arrays)

  4. genes that respond different between the 2 treatments (“delta” of the 2 “delta’s”, i.e. comparison 2 vs 3).


In this experiment I thus would like to make comparisons within and between subjects.

AFAIK i do have two options:

  1. analyze using a paired design, i.e. block on SubjectID

  2. analyze according the multilevel example in the limma user guide (thus using the function duplicateCorrelation).

However, what would be the best approach in mycase?

Advantage of A is that paired (sub-)structure of the design is optimally used, disadvantage is that design is not of full rank and that some coefficients are thus not estimable. The latter is not really an issue because it are two coefs for two subjects?

Advantage of B is that design is of full rank, but the paired substructure is not optimally used, and hence is less powerful?


Any insights are appreciated!




> targets
                              Filename Status   Time SubjectID
1        DELTA_G131_A01_1_PP002_T0.CEL   lean   base        s2
2        DELTA_G131_A03_3_PP003_T0.CEL   lean   base        s3
3        DELTA_G131_A05_5_PP004_T0.CEL   lean   base        s4
4      DELTA_G131_A07_7_PP005_T0_2.CEL   lean   base        s5
5        DELTA_G131_A09_9_PP006_T0.CEL   lean   base        s6
6       DELTA_G131_A11_11_PP007_T0.CEL   lean   base        s7
7       DELTA_G131_B01_15_PP009_T0.CEL   lean   base        s9
8       DELTA_G131_B03_17_PP010_T0.CEL   lean   base       s10
9       DELTA_G131_B05_23_PP013_T0.CEL   lean   base       s13
10      DELTA_G131_B07_27_PP015_T0.CEL   lean   base       s15
11   DELTA_G131_B09_29_PP101_BL_T0.CEL  obese baseT2      s101
12   DELTA_G131_B11_31_PP101_FU_T0.CEL  obese     T2      s101
13   DELTA_G131_C01_33_PP102_BL_T0.CEL  obese baseT1      s102
14   DELTA_G131_C03_35_PP102_FU_T0.CEL  obese     T1      s102
15   DELTA_G131_C05_41_PP104_BL_T0.CEL  obese baseT1      s104
16   DELTA_G131_C07_43_PP104_FU_T0.CEL  obese     T1      s104
17   DELTA_G131_C09_45_PP105_BL_T0.CEL  obese baseT2      s105
18   DELTA_G131_C11_47_PP105_FU_T0.CEL  obese     T2      s105
19   DELTA_G131_D01_49_PP106_BL_T0.CEL  obese baseT1      s106
20   DELTA_G131_D03_51_PP106_FU_T0.CEL  obese     T1      s106
21   DELTA_G131_D05_53_PP107_BL_T0.CEL  obese baseT2      s107
22   DELTA_G131_D07_55_PP107_FU_T0.CEL  obese     T2      s107
23   DELTA_G131_D09_63_PP110_BL_T0.CEL  obese baseT2      s110
24   DELTA_G131_D11_65_PP110_FU_T0.CEL  obese     T2      s110
25   DELTA_G131_E01_67_PP111_BL_T0.CEL  obese baseT1      s111
26   DELTA_G131_E03_69_PP111_FU_T0.CEL  obese     T1      s111
27   DELTA_G131_E05_71_PP112_BL_T0.CEL  obese baseT2      s112
28   DELTA_G131_E07_73_PP112_FU_T0.CEL  obese     T2      s112
29   DELTA_G131_E09_75_PP113_BL_T0.CEL  obese baseT1      s113
30   DELTA_G131_E11_77_PP113_FU_T0.CEL  obese     T1      s113
31   DELTA_G131_F01_82_PP115_BL_T0.CEL  obese baseT1      s115
32   DELTA_G131_F03_84_PP115_FU_T0.CEL  obese     T1      s115
33   DELTA_G131_F09_90_PP117_BL_T0.CEL  obese baseT2      s117
34   DELTA_G131_F11_92_PP117_FU_T0.CEL  obese     T2      s117
35   DELTA_G131_G01_94_PP118_BL_T0.CEL  obese baseT1      s118
36   DELTA_G131_G03_96_PP118_FU_T0.CEL  obese     T1      s118
37  DELTA_G131_G05_104_PP121_BL_T0.CEL  obese baseT1      s121
38  DELTA_G131_G07_106_PP121_FU_T0.CEL  obese     T1      s121
39  DELTA_G131_G09_108_PP122_BL_T0.CEL  obese baseT2      s122
40  DELTA_G131_G11_110_PP122_FU_T0.CEL  obese     T2      s122
41  DELTA_G131_H01_112_PP123_BL_T0.CEL  obese baseT2      s123
42  DELTA_G131_H03_114_PP123_FU_T0.CEL  obese     T2      s123
43  DELTA_G131_H05_116_PP124_BL_T0.CEL  obese baseT1      s124
44  DELTA_G131_H07_118_PP124_FU_T0.CEL  obese     T1      s124
45  DELTA_G131_H09_128_PP127_BL_T0.CEL  obese baseT2      s127
46  DELTA_G131_H11_130_PP127_FU_T0.CEL  obese     T2      s127
47      DELTA_G132_A05_13_PP008.T0.CEL   lean   base        s8
48      DELTA_G132_A09_19_PP011.T0.CEL   lean   base       s11
49      DELTA_G132_B07_21_PP012.T0.CEL   lean   base       s12
50      DELTA_G132_C05_25_PP014.T0.CEL   lean   base       s14
51   DELTA_G132_C09_37_PP103.BL.T0.CEL  obese baseT2      s103
52 DELTA_G132_D07_39_PP103.FU.T0_2.CEL  obese     T2      s103
53   DELTA_G132_E05_57_PP108.BL.T0.CEL  obese baseT1      s108
54   DELTA_G132_E09_59_PP108.FU.T0.CEL  obese     T1      s108
55  DELTA_G132_F07_120_PP125.BL.T0.CEL  obese baseT1      s125
56  DELTA_G132_G05_122_PP125.FU.T0.CEL  obese     T1      s125
57  DELTA_G132_G09_124_PP126.BL.T0.CEL  obese baseT1      s126
58  DELTA_G132_H07_126_PP126.FU.T0.CEL  obese     T1      s126
59   DELTA_G133_A05_61_PP109.BL.T0.CEL  obese baseT1      s109
60   DELTA_G133_A09_79_PP114.BL.T0.CEL  obese     T2      s114
61   DELTA_G133_B09_98_PP119.BL.T0.CEL  obese baseT2      s119
62  DELTA_G133_C07_100_PP119.FU.T0.CEL  obese     T2      s119
63  DELTA_G133_D05_102_PP120.BL.T0.CEL  obese baseT2      s120
64  DELTA_G133_D09_132_PP128.BL.T0.CEL  obese baseT2      s128
65  DELTA_G133_E07_134_PP128.FU.T0.CEL  obese     T2      s128
66  DELTA_G133_F05_136_PP129.BL.T0.CEL  obese baseT1      s129
67  DELTA_G133_F09_138_PP129.FU.T0.CEL  obese     T1      s129
68   DELTA_G133_G07_86_PP116.BL.T0.CEL  obese baseT2      s116
69   DELTA_G133_H05_88_PP116.FU.T0.CEL  obese     T2      s116
> # in yellow the 3 incomplete pairs

> Status <- factor(targets$Status)
> Time <- factor(targets$Time)
> Subject <- factor(targets$SubjectID)
> TS <- as.factor(paste(Status, Time, sep="."))
> TS
 [1] lean.base    lean.base    lean.base    lean.base    lean.base    lean.base  
 [7] lean.base    lean.base    lean.base    lean.base    obese.baseT2 obese.T2   
[13] obese.baseT1 obese.T1     obese.baseT1 obese.T1     obese.baseT2 obese.T2   
[19] obese.baseT1 obese.T1     obese.baseT2 obese.T2     obese.baseT2 obese.T2   
[25] obese.baseT1 obese.T1     obese.baseT2 obese.T2     obese.baseT1 obese.T1   
[31] obese.baseT1 obese.T1     obese.baseT2 obese.T2     obese.baseT1 obese.T1   
[37] obese.baseT1 obese.T1     obese.baseT2 obese.T2     obese.baseT2 obese.T2   
[43] obese.baseT1 obese.T1     obese.baseT2 obese.T2     lean.base    lean.base  
[49] lean.base    lean.base    obese.baseT2 obese.T2     obese.baseT1 obese.T1   
[55] obese.baseT1 obese.T1     obese.baseT1 obese.T1     obese.baseT1 obese.T2   
[61] obese.baseT2 obese.T2     obese.baseT2 obese.baseT2 obese.T2     obese.baseT1
[67] obese.T1     obese.baseT2 obese.T2   
Levels: lean.base obese.baseT1 obese.baseT2 obese.T1 obese.T2

Option A; include Subject in model, but design is not full rank...

> design <- model.matrix((~0+TS+Subject))
> is.fullrank(design)
> nonEstimable(design)
[1] "Subjects128" "Subjects129"

Option B; use duplicateCorrelation()

> design <- model.matrix(~0+TS) #thus not including the subjects
> is.fullrank(design)
[1] TRUE
> nonEstimable(design)
> corfit <- duplicateCorrelation(x.norm,design,block=Subject)

Define contrasts

> cont.matrix <- makeContrasts(
+ ObeseBLvsLeanBL = (TSobese.baseT1 + TSobese.baseT2 )/2 - TSlean.base,  #combine all obese baseline samples
+ EffectT1 = (TSobese.T1 - TSobese.baseT1),
+ EffectT2 = (TSobese.T2 - TSobese.baseT2),
+ deltaDelta =  ( (TSobese.T2 - TSobese.baseT2) -  (TSobese.T1 - TSobese.baseT1)),
+ levels=design)
> cont.matrix
Levels           ObeseBLvsLeanBL EffectT1 EffectT2 deltaDelta
  TSlean.base               -1.0        0        0          0
  TSobese.baseT1             0.5       -1        0          1
  TSobese.baseT2             0.5        0       -1         -1
  TSobese.T1                 0.0        1        0         -1
  TSobese.T2                 0.0        0        1          1

Option A:

> fit <- lmFit(x.norm,design)
Coefficients not estimable: Subjects128 Subjects129
Warning message:
Partial NA coefficients for 19654 probe(s)
> fit2 <-, cont.matrix)
> fit2 <- eBayes(fit2, trend=TRUE)
Loading required package: splines
> topTable(fit2)
Error in topTableF(fit, number = number, genelist = genelist, adjust.method = adjust.method,  :
  F-statistics not found in fit
> topTable(fit2,coef=1)
               logFC      AveExpr         t      P.Value adj.P.Val          B
388595_at -1.2992596 -0.045251715 -5.402161 9.033393e-06 0.1775423 2.24502824
93463_at   1.0838054 -0.033891183  4.762181 5.223624e-05 0.3696189 1.05278388
10589_at  -0.7425038  0.002192174 -4.711510 5.950177e-05 0.3696189 0.96356396
5075_at   -0.7834098  0.017699312 -4.424018 1.317521e-04 0.3696189 0.41041130
8394_at   -0.5483462 -0.007113591 -4.396152 1.398632e-04 0.3696189 0.36951056
5111_at   -0.6544413  0.044793358 -4.389222 1.425396e-04 0.3696189 0.35623703
150280_at -0.8668590  0.016500401 -4.368414 1.533131e-04 0.3696189 0.30437320
3850_at   -0.7982704  0.000636201 -4.313728 1.779197e-04 0.3696189 0.20002697
29100_at   0.5633564 -0.028783088  4.260558 2.025360e-04 0.3696189 0.10966118
4286_at   -0.7714238  0.035521729 -4.250192 2.114479e-04 0.3696189 0.07875228


Option B (note that the statistical signifcance for coef=1 is much 'better' than above):

> design <- model.matrix(~0+TS)
> fit <- lmFit(x.norm,design,block=Subject,correlation=corfit$consensus)
> fit2 <-, cont.matrix)
> fit2 <- eBayes(fit2, trend=TRUE)
> topTable(fit2)
             ObeseBLvsLeanBL     EffectT1    EffectT2    deltaDelta       AveExpr
5768_at           0.27094472  0.102441091 -0.03123944 -0.1336805291  0.0131168620
100129455_at     -0.26193499  0.368394971  0.10007335 -0.2683216244 -0.0002613564
11316_at          0.16576403  0.131316402  0.13082179 -0.0004946125  0.0003524714
2357_at           0.35247065 -0.003587640 -0.03853153 -0.0349438859 -0.0179288959
57533_at          0.12907861  0.041385510 -0.17499073 -0.2163762354 -0.0194017923
9645_at           0.28792718  0.004117408 -0.10544675 -0.1095641611  0.0226279000
55022_at          0.40774391  0.046476910 -0.09294465 -0.1394215643  0.0688917460
81553_at          0.25856635  0.010798014 -0.07423535 -0.0850333635 -0.0243488464
91544_at          0.17854136  0.078838131  0.01881281 -0.0600253259  0.0441462794
22891_at         -0.06227555 -0.029604714 -0.20926173 -0.1796570190 -0.0234380020
                     F      P.Value  adj.P.Val
5768_at      12.948352 8.272455e-07 0.01625868
100129455_at  9.932042 1.574822e-05 0.08406391
11316_at      9.915626 1.601338e-05 0.08406391
2357_at       9.850647 1.710876e-05 0.08406391
57533_at      9.177903 3.417815e-05 0.10274447
9645_at       9.125528 3.608937e-05 0.10274447
55022_at      8.835055 4.887136e-05 0.10274447
81553_at      8.640098 5.998237e-05 0.10274447
91544_at      8.619696 6.128604e-05 0.10274447
22891_at      8.541327 6.657034e-05 0.10274447
> topTable(fit2,coef=1)
             logFC     AveExpr        t      P.Value  adj.P.Val        B
5768_at  0.2709447  0.01311686 5.176671 2.102617e-06 0.02330378 4.581468
2357_at  0.3524707 -0.01792890 5.144169 2.383376e-06 0.02330378 4.471316
9645_at  0.2879272  0.02262790 5.017521 3.872730e-06 0.02330378 4.044732
81553_at 0.2585663 -0.02434885 4.845884 7.418486e-06 0.02330378 3.473728
1848_at  0.4078608  0.03655012 4.831124 7.841526e-06 0.02330378 3.425026
54472_at 0.1947691  0.06115485 4.796373 8.932633e-06 0.02330378 3.310631
1462_at  0.2264794 -0.01457601 4.785774 9.293975e-06 0.02330378 3.275814
55022_at 0.4077439  0.06889175 4.780315 9.485613e-06 0.02330378 3.257895
4688_at  0.2567551  0.01126808 4.714461 1.212391e-05 0.02647592 3.042476
51099_at 0.3188907 -0.05336680 4.553293 2.196215e-05 0.04159022 2.521259

limma paired unpaired duplicateCorrelation • 2.8k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 1 hour ago
The city by the bay

For aims 2 to 4, blocking on the SubjectID might be more appropriate as it directly accounts for subject-specific effects on expression. You can then estimate the fold-change between baseline and treated samples for each treatment, based on the difference within the pair for the relevant set of obese subjects. This means that you can get a sensible answer for your contrasts Effect1, Effect2 and deltaDelta.

Of course, you're not using any information from the unpaired obese samples, but that'll be inevitable in a paired-sample design. I think that the loss of information from these few samples is an acceptable cost of this strategy, given that that the paired-sample design will handle the subject-specific expression more cleanly than using duplicateCorrelation.

However, aim 1 is a bit different. If you try to use the blocking approach, you'll end up with fewer DE genes. This is because a subject-specific factor will still be present in the null model for each obese subject, and can absorb some (if not all) of the differences in expression between the normal and obese baseline groups. This is obviously undesirable if you want to see differences between those groups.

Thus, for aim 1, it may be more appropriate to use duplicateCorrelation in a one-way layout with five groups defined by TS. In fact, this could be cut down to four groups if you're willing to put the baseline samples for all obese subjects in a single group (e.g., groups become lean baseline, obese baseline, obese treated 1 and obese treated 2). This isn't quite the same as your ObeseBLvsLeanBL contrast, as the four-group design affects the variance estimation as well as the final comparison.

Entering edit mode

Thanks Aaron for your insights, much appreciated.

To be clear, do you suggest to split the analysis in 2 parts; i.e. one analysis for aim 1 using duplicateCorrelation, and a second analysis blocking by SubjectID for aim 2-4? And in both cases use the complete data set (all arrays)?

Entering edit mode

That's right. Of course, even if you do use the complete data set, the unpaired obese samples won't be useful for the SubjectID-blocked analysis and will be automatically ignored.

Entering edit mode
bcbio_uk ▴ 10
Last seen 5.1 years ago
United Kingdom

Hi Aaron and Guido,

I have a similar issue in my dataset. My experimental design is similar to the "9.7 Multi-Level Experiments" design in the Limma user guide ( Except I have some subject that are paired and some that are unpaired within diseased and within healthy and it looks something like this.

Sample Subject Condition Tissue
1 1 Diseased A
2 1 Diseased B
3 2 Diseased A
4 2 Diseased B
5 3 Diseased A
6 4 Diseased A
7 5 Healthy B
8 6 Healthy A
9 6 Healthy B


I'm interested in finding what are the differences between Tissue A and Tissue B in the disease setting i.e. Diseased.A-Diseased.B and compare these results to Healthy.A-Healthy.B. I used duplicateCorrelation for the subjects and pretty much followed the code from the limma manual used in 9.7. It ran without errors and gave me some interesting results. 

What I'm trying to understand is: 

1) Is the use of duplicateCorrelation valid in this setting?

2) Is the code utilising the paired structure of the data, while also considering the samples that don't have a corresponding pair?

3) Or is it simply only considering the samples that have pairs?


Many thanks for all the help.



Entering edit mode

Yes, the use of duplicateCorrelation is valid in this setting.

I'm guessing that your analysis is using a four-group design, i.e., Diseased/A, Diseased/B, Healthy/A and Healthy/B. This means that information will be used from all samples during model fitting. The paired nature of the data will also be taken into account when you use duplicateCorrelation, as correlations will be computed between paired samples from the same subject. In short, your unpaired samples should be ignored during correlation calculations but not during model fitting.

P.S. For future reference, your question could, perhaps, be posted as a separate thread. You'd probably get more coverage that way.

Entering edit mode

Hi Aaron,



Thanks very much for your quick reply. My analysis is indeed using the four-group design as you pointed out. This is exactly the kind of analysis I was after, so I could utilise all of my samples, but still keep the paired structure for samples that do have their corresponding pairs. 

I will remember to post as a separate thread in the future.

Thanks again!



Login before adding your answer.

Traffic: 207 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6