Limma: technical and biological replicates in proteomics data
Entering edit mode
Last seen 3.2 years ago

Dear All,

I have a question about taking into account technical replication in my model. I will greatly appreciate any help from your side!

About my data set:

  • These are proteomics data (originating from TMT-labeling experiment), so each row corresponds to a particular protein and columns contain log-transformed protein intensities for each replicate.

  • There are in total 8 biological replicates, 4 treated and 4 control.

  • 4 samples (2 treated and 2 control) were processed and measured in one experiment and 4 other samples in another experiment, thus creating a batch effect that has to be taken into account.

  • Furthermore, each  sample was measured twice on two different mass spectrometers using two different acquisition methods. First method (MS2) is more sensitive resulting in a greater number of proteins quantified as compared to another method (MS3) which can quantify less proteins but with higher accuracy. It means, in each biological replicate > 70% proteins will have intensities originating from MS2 and MS3 acquisition methods, and < 30% proteins being quantified in either of those two methods.

I would like to incorporate the technical replication in my model taking into account the batch effect but I am not sure how to properly set up the design matrix for this. 

Say, I have following levels in the data:


    Acquisition  |    MS2                        |    MS3                        |   
    Experiment   |    1          |    2          |    1          |    2          |
    Treatment    |  Tr   |  Con  |  Tr   | Con   |  Tr   | Con   |  Tr   |  Con  |
    Replicate    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |

tr <- as.factor(rep(c(2,2,1,1), 4))                                 # treatment, 2 = Treated, 1 = Control 
ex <- as.factor(c(rep(1,4), rep(2,4), rep(1,4), rep(2,4)))          # experiment 
ms <- as.factor(c(rep(1,8), rep(2,8)))                              # acquisition method, 1 = MS2, 2 = MS3


Is it correct to define block variable and design matrix as follows?


block <- c(1:8, 1:8)

design <- model.matrix(~ ex + ms + tr)
dupcor = duplicateCorrelation(dat, design = design,  block = block)
fit <- lmFit(dat, design, block = block, correlation = dupcor$consensus)

I am not sure if I have to put both ex and ms into the model and if I have to use the design matrix in duplicateCorrelation function. Could you please correct me if I am wrong?


Thank you very much for your help!

Best wishes,



limma proteomics replicates • 1.5k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 3 hours ago
The city by the bay

Your set-up looks fine to me. To address your specific concerns:

  • Yes, you should put ex and ms in the model. If you don't put in ex, systematic differences between experiments would affect the correlation estimated by duplicateCorrelation, i.e., the correlation wouldn't just capture the fact that replicates derive from the same sample. Not putting in ms would fail to model any systematic differences between the machines, leading to inflated variances and loss of power.
  • Yes, you need to give the design matrix to duplicateCorrelation. Otherwise, the effects of the treatment and other factors would be absorbed into the correlation estimate.

Of course, this assumes that the mean-variance relationship is comparable across machines. I don't know whether this is a strong assumption for MS data. If it is, an alternative strategy might be to analyze the results from each machine separately, and then perform a meta-analysis to combine the p-values for each protein afterwards. This would be more robust to systematic differences in the mean-variance relationship between machines, which would probably interfere with the empirical Bayes shrinkage in limma. Note that the p-values would be correlated between machines, which affects how you can combine them; I would suggest either Simes' method (DE in either machine) or Berger's intersection-union test (DE in both machines).

Entering edit mode

Thank you very much for your reply!
I think one can expect a decent amount of correlation between the two acquisition methods, so I think it should be possible to incorporate this into a model. However, thank you for the suggestions, I will see if I can use them for other data sets :)

Best wishes,



Login before adding your answer.

Traffic: 295 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6