Hello, Would someone please help me verify if i'm on the right track in correcting for technical replicates using limma and also if there is a better approach to use (duplicateCorrelation() vs lmer()) for my protein expression data.
I've got 8 disease patients per group (DiseaseA, DiseaseB). There are 2 biological samples(_1,_2) with 4 technical replicates each (rep1-4).
Quite confused on what to include in the block when handling technical replicates - is it the column - patientID or replicates_1 or replicates_2 below. Please note I have manually created replicate_2 column
> sample.sheet
group patientID replicates_1 replicates_2
sample_1 Disease_A Disease_A_1 1 1
sample_2 Disease_A Disease_A_1 2 2
sample_3 Disease_A Disease_A_1 3 3
sample_4 Disease_A Disease_A_1 4 4
sample_5 Disease_A Disease_A_2 1 1
sample_6 Disease_A Disease_A_2 2 2
sample_7 Disease_A Disease_A_2 3 3
sample_8 Disease_A Disease_A_2 4 4
sample_9 Disease_B Disease_B_1 1 5
sample_10 Disease_B Disease_B_1 2 6
sample_11 Disease_B Disease_B_1 3 7
sample_12 Disease_B Disease_B_1 4 8
sample_13 Disease_B Disease_B_2 1 5
sample_14 Disease_B Disease_B_2 2 6
sample_15 Disease_B Disease_B_2 3 7
sample_16 Disease_B Disease_B_2 4 8
sample_17 Healthy Healthy_1 1 9
sample_18 Healthy Healthy_1 2 10
sample_19 Healthy Healthy_1 3 11
sample_20 Healthy Healthy_1 4 12
sample_21 Healthy Healthy_2 1 9
sample_22 Healthy Healthy_2 2 10
sample_23 Healthy Healthy_2 3 11
sample_24 Healthy Healthy_2 4 12
group <- factor(sample.sheet$group);
patient <- factor(sample.sheet$patientID); **# unsure whether to use patientID or replicates_1 or replicates_2**
#> table(sample.sheet$group)
#Disease_A Disease_B Healthy
# 8 8 8
#> table(sample.sheet$patientID)
#Disease_A_1 Disease_A_2 Disease_B_1 Disease_B_2 Healthy_1 Healthy_2
# 4 4 4 4 4 4
# Create a design matrix for your comparison (e.g., control vs treatment)
design <- model.matrix(~0 + group);
colnames(design) <- gsub('group', '', colnames(design));
# Use duplicateCorrelation to estimate within-patient correlation
corfit <- duplicateCorrelation(exp.data.subset, design, block = patient);
fit <- lmFit(
exp.data.subset,
design,
block = patient,
correlation = corfit$consensus.correlation
);
# create contrast matrix
contrast.matrix <- makeContrasts(
'Disease_AvsHealthy' = Disease_A-Healthy,
'Disease_BvsHealthy' = Disease_B-Healthy,
'Disease_A-HealthyvsDisease_B-Healthy' = (Disease_A-Healthy) - (Disease_B-Healthy),
levels = design
);
fit2 <- contrasts.fit(fit, contrast.matrix);
fit2 <- eBayes(fit2, trend = TRUE); **# robust = TRUE, trend = TRUE? what is more appropriate for proteomics **
Hi Gordon, Thank you very much for the prompt response.
If i understand correctly, wouldnt this remove the need to use duplicateCorrelation(). Since averaging condenses these replicates into a single value per biological replicate, would that mean we no longer need to account for correlations between them? Thanks for clarifying!
Gordon Smyth Thank you again for your response, and apologies as I think I might still be confused by your suggestion "The technical samples should be averaged. Then use duplicateCorrelation() with block=PatientID." I'd prefer to retain the technical replicates instead of averaging them, as I'm aiming to capture the nuances in each set (there seems to be quite a lot of variation between 2 of my biological replicates in one condition on the PCA plot).
Am I correct to assume my initial approach of using duplicateCorrelation() with block = patient is appropriate in this case (i.e., allowing the function to calculate the consensus correlation of technical replicates)? Or is there another setup you'd recommend to account for both biological and technical repeats effectively?
Thanks once more for your insights!
I am not familiar with the scientific and procedural details of your experiment, so I cannot tell you exactly how to analyse it. In particular, it is not clear to me what the replicates (
replicates_1
andreplicates_2
) represent. You have called them "technical replicates", but I suspect that they might not be. If they really were technical replicates, then you wouldn't be wanting to capture their nuances.I can only give you general principles and limma syntax. Repeated measurements on the same patients would be handled by duplicateCorrelation with block=PatientID. Technical replicates, by definition, do not contain any biological signal. If you really did have technical replicates, then they would be averaged. Repeated measurements are not technical replicates.
Your experiment appears to have some sort of blocking/pairing structure associated with replicate. Different mass spec runs perhaps? Different samples from each patient? I don't know.
Gordon Smyth Thank you for getting back to me and apologies for the confusion
replicates_1
column indicates for a patient (cell line) (e.g.: Disease_A_1 the 4 repeated measurements labelled 1, 2, 3, 4). Please disregard thereplicates_2
column as I manually created this to clarify the correct application of duplicateCorrelation().Thanks for the further explanation. It seems you should simply use
block=PatientID
, so your original analysis is correct.The
replicate_1
column does not really mean anything. In reality, every sample is a new culture. The column would only have a meaning if you were using the same four culture dishes over and over again, or something like that.However, the PCA plot strongly suggests that there is little or no significant DE in the data, regardless of what statistical test you might use.
I will also add that running lmer on this data, as you mention at the start of your question, would be exactly equivalent to averaging the replicates for each patientID and then running ordinary anova.
Hi Gordon Smyth, Thank you very much for the clarification, including the details about lmer() and for confirming that the original analysis is correct.
Interestingly, despite the PCA result, I'm still identifying a small number of biologically relevant differential genes in DiseaseA and DiseaseB with
|logFC| > 1 and Q < 0.05
, alongside with meaningful pathways.For this dataset, I was also recommended to consider a
|logFC|
threshold in the range of0.2-0.5
.Thanks again for your help!
I discourage the use of logFC cutoffs, especially larger cutoffs. There is no scientific or statistical justification for arbitrary combinations of logFC and FDR cutoffs in limma. See https://www.biostars.org/p/9603855/ or DecideTest vs topTAGs in edgeR.
Also suggest you try plotMDS(exp.data) instead of the PCA plot function you are currently using.
Thank you very much for the clarification and recommendations. I will prioritise my hits with FDR ranking and check plotMDS too.
Hi Gordon Smyth,
I've been reflecting on whether limma's assumptions are fully compatible with SomaScan proteomics data, and I'd greatly appreciate your insights on the following concerns:
the SomaScan team followed an lmer and emmeans approach. Using lmer resulted in no FDR-significant genes in our data, unlike limma. Interestingly, limma and lmer identified distinct sets of genes, although both approaches revealed overlapping pathways. To address these differences, I'm considering a consensus approach for experimental validation.
Here's what I've tried in limma so far, and I'd like your feedback on whether these are appropriate:
VSN Normalization with Limma
Double Voom Normalization
I have never analysed SomaScan data, so I cannot make any specific recommendations about how to analyse such data. I have however analysed data from Olink, which is SomaScan's main competitor, and have found that the use of limma and edgeR improves enormously on the statistical analysis recommended by Olink itself. I have also read a journal article by the SomaScan team comparing their data to Olink's. My experience over the past 25 years is that commercial companies and biomedical engineers have quite conservative views of how data should be analysed, and they don't take into account the statistical advances of the last two decades.
limma does not assume equal variances between bins. Olink data is also conducted in bins, and it caused no problems for limma's empirical Bayes procedure.
I doubt there are substantial differences in variance between bins that is not accounted by the mean-variance trend but, if there were, it could be handled by precision weights.
limma models the mean-variance trend by setting
trend=TRUE
in theeBayes()
step. This is one of the features that makes limma so much more powerful that running individual protein analyses withlm
orlmer
.I don't think they appreciate the importance of using empirical Bayes for high-throughput data.
There is no need for variance stabilization. As already mentioned, limma has no trouble handling the mean-variance relationship.
voom() is only for counts, but you have not mentioned anything in your question about having count data.
You have to choose normalization methods that are appropriate for the scale of the data. If you have counts, then you shouldn't use normalizeVSN() (or anything like it). If you don't have counts, you can't use voom(). normalizeVSN() etc is only for data on the unlogged scale whereas lmFit() is only for data after logging. The nature of the data you have from SomaScan is unclear to me.
Hi Gordon Smyth, Thank you very much for your response. The normalization process involves multiple steps and they provide a normalised file FILENAME.{hybNorm}.{medNormInt}.{plateScale.medNormSMP}.adat in Relative Flourescence Units (added references).
I've noticed some publications applying
normalizeVSN
, which only seemed to yield slightly more differentially expressed results in my analysis. For the normalised dataset provided, I appliedlog2(exp.data + 1)
for limma analysis, but for approaches like vsn or voom, I reverted to unlogged data before analysis.The readout in relative fluorescent units (RFU). So, from what I understand, using
normalizeVSN
isn't strictly necessary sincetrend=TRUE
in limma accounts for the unequal variances, and we should not usevoom
hereReferences