Hello, Would someone please help me verify if i'm on the right track in correcting for technical replicates using limma and also if there is a better approach to use (duplicateCorrelation() vs lmer()) for my protein expression data.
I've got 8 disease patients per group (DiseaseA, DiseaseB). There are 2 biological samples(_1,_2) with 4 technical replicates each (rep1-4).
Quite confused on what to include in the block when handling technical replicates - is it the column - patientID or replicates_1 or replicates_2 below. Please note I have manually created replicate_2 column
> sample.sheet
group patientID replicates_1 replicates_2
258736519273_1 Disease_A Disease_A_1 1 1
258736519272_4 Disease_A Disease_A_1 2 2
258736519276_1 Disease_A Disease_A_1 3 3
258736519273_6 Disease_A Disease_A_1 4 4
258736519273_2 Disease_A Disease_A_2 1 1
258736519272_1 Disease_A Disease_A_2 2 2
258736519275_3 Disease_A Disease_A_2 3 3
258736519272_8 Disease_A Disease_A_2 4 4
258736519273_5 Disease_B Disease_B_1 1 5
258736519273_3 Disease_B Disease_B_1 2 6
258736519274_2 Disease_B Disease_B_1 3 7
258736519275_7 Disease_B Disease_B_1 4 8
258736519274_3 Disease_B Disease_B_2 1 5
258736519274_7 Disease_B Disease_B_2 2 6
258736519275_1 Disease_B Disease_B_2 3 7
258736519272_2 Disease_B Disease_B_2 4 8
258736519273_8 Healthy Healthy_1 1 9
258736519273_4 Healthy Healthy_1 2 10
258736519272_5 Healthy Healthy_1 3 11
258736519274_4 Healthy Healthy_1 4 12
258736519274_8 Healthy Healthy_2 1 9
258736519276_3 Healthy Healthy_2 2 10
258736519272_7 Healthy Healthy_2 3 11
258736519274_5 Healthy Healthy_2 4 12
group <- factor(sample.sheet$group);
patient <- factor(sample.sheet$patientID); **# unsure whether to use patientID or replicates_1 or replicates_2**
#> table(sample.sheet$group)
#Disease_A Disease_B Healthy
# 8 8 8
#> table(sample.sheet$patientID)
#Disease_A_1 Disease_A_2 Disease_B_1 Disease_B_2 Healthy_1 Healthy_2
# 4 4 4 4 4 4
# Create a design matrix for your comparison (e.g., control vs treatment)
design <- model.matrix(~0 + group);
colnames(design) <- gsub('group', '', colnames(design));
# Use duplicateCorrelation to estimate within-patient correlation
corfit <- duplicateCorrelation(exp.data.subset, design, block = patient);
fit <- lmFit(
exp.data.subset,
design,
block = patient,
correlation = corfit$consensus.correlation
);
# create contrast matrix
contrast.matrix <- makeContrasts(
'Disease_AvsHealthy' = Disease_A-Healthy,
'Disease_BvsHealthy' = Disease_B-Healthy,
'Disease_A-HealthyvsDisease_B-Healthy' = (Disease_A-Healthy) - (Disease_B-Healthy),
levels = design
);
fit2 <- contrasts.fit(fit, contrast.matrix);
fit2 <- eBayes(fit2, trend = TRUE); **# robust = TRUE, trend = TRUE? what is more appropriate for proteomics **
Hi Gordon, Thank you very much for the prompt response.
If i understand correctly, wouldnt this remove the need to use duplicateCorrelation(). Since averaging condenses these replicates into a single value per biological replicate, would that mean we no longer need to account for correlations between them? Thanks for clarifying!
Gordon Smyth Thank you again for your response, and apologies as I think I might still be confused by your suggestion "The technical samples should be averaged. Then use duplicateCorrelation() with block=PatientID." I'd prefer to retain the technical replicates instead of averaging them, as I'm aiming to capture the nuances in each set (there seems to be quite a lot of variation between 2 of my biological replicates in one condition on the PCA plot).
Am I correct to assume my initial approach of using duplicateCorrelation() with block = patient is appropriate in this case (i.e., allowing the function to calculate the consensus correlation of technical replicates)? Or is there another setup you'd recommend to account for both biological and technical repeats effectively?
Thanks once more for your insights!
I am not familiar with the scientific and procedural details of your experiment, so I cannot tell you exactly how to analyse it. In particular, it is not clear to me what the replicates (
replicates_1
andreplicates_2
) represent. You have called them "technical replicates", but I suspect that they might not be. If they really were technical replicates, then you wouldn't be wanting to capture their nuances.I can only give you general principles and limma syntax. Repeated measurements on the same patients would be handled by duplicateCorrelation with block=PatientID. Technical replicates, by definition, do not contain any biological signal. If you really did have technical replicates, then they would be averaged. Repeated measurements are not technical replicates.
Your experiment appears to have some sort of blocking/pairing structure associated with replicate. Different mass spec runs perhaps? Different samples from each patient? I don't know.
Gordon Smyth Thank you for getting back to me and apologies for the confusion
replicates_1
column indicates for a patient (cell line) (e.g.: Disease_A_1 the 4 repeated measurements labelled 1, 2, 3, 4). Please disregard thereplicates_2
column as I manually created this to clarify the correct application of duplicateCorrelation().Thanks for the further explanation. It seems you should simply use
block=PatientID
, so your original analysis is correct.The
replicate_1
column does not really mean anything. In reality, every sample is a new culture. The column would only have a meaning if you were using the same four culture dishes over and over again, or something like that.However, the PCA plot strongly suggests that there is little or no significant DE in the data, regardless of what statistical test you might use.
I will also add that running lmer on this data, as you mention at the start of your question, would be exactly equivalent to averaging the replicates for each patientID and then running ordinary anova.
Hi Gordon Smyth, Thank you very much for the clarification, including the details about lmer() and for confirming that the original analysis is correct.
Interestingly, despite the PCA result, I'm still identifying a small number of biologically relevant differential genes in DiseaseA and DiseaseB with
|logFC| > 1 and Q < 0.05
, alongside with meaningful pathways.For this dataset, I was also recommended to consider a
|logFC|
threshold in the range of0.2-0.5
.Thanks again for your help!
I discourage the use of logFC cutoffs, especially larger cutoffs. There is no scientific or statistical justification for arbitrary combinations of logFC and FDR cutoffs in limma. See https://www.biostars.org/p/9603855/ or DecideTest vs topTAGs in edgeR.
Also suggest you try plotMDS(exp.data) instead of the PCA plot function you are currently using.
Thank you very much for the clarification and recommendations. I will prioritise my hits with FDR ranking and check plotMDS too.
Hi Gordon Smyth,
After consulting with the SomaScan team, I've been reflecting on whether limma's assumptions are fully compatible with SomaScan proteomics data, and I'd greatly appreciate your insights on the following concerns:
As a side note, the SomaScan team followed an lmer and emmeans approach:
log(apt) ~ group + (1|SubjectId)
However, this method resulted in no FDR-significant genes in our data, unlike limma. Interestingly, limma and lmer identified distinct sets of genes, although both approaches revealed overlapping pathways. To address these differences, I'm considering a consensus approach for experimental validation.
Here's what I've tried in limma so far, and I'd like your feedback on whether these are appropriate:
VSN Normalization with Limma
Double Voom Normalization