limma technical repeat correction
1
0
Entering edit mode
john • 0
@cb255338
Last seen 1 hour ago
United Kingdom

Hello, Would someone please help me verify if i'm on the right track in correcting for technical replicates using limma and also if there is a better approach to use (duplicateCorrelation() vs lmer()) for my protein expression data.

I've got 8 disease patients per group (DiseaseA, DiseaseB). There are 2 biological samples(_1,_2) with 4 technical replicates each (rep1-4).

Quite confused on what to include in the block when handling technical replicates - is it the column - patientID or replicates_1 or replicates_2 below. Please note I have manually created replicate_2 column


> sample.sheet
                   group   patientID replicates_1 replicates_2
258736519273_1 Disease_A Disease_A_1            1            1
258736519272_4 Disease_A Disease_A_1            2            2
258736519276_1 Disease_A Disease_A_1            3            3
258736519273_6 Disease_A Disease_A_1            4            4
258736519273_2 Disease_A Disease_A_2            1            1
258736519272_1 Disease_A Disease_A_2            2            2
258736519275_3 Disease_A Disease_A_2            3            3
258736519272_8 Disease_A Disease_A_2            4            4
258736519273_5 Disease_B Disease_B_1            1            5
258736519273_3 Disease_B Disease_B_1            2            6
258736519274_2 Disease_B Disease_B_1            3            7
258736519275_7 Disease_B Disease_B_1            4            8
258736519274_3 Disease_B Disease_B_2            1            5
258736519274_7 Disease_B Disease_B_2            2            6
258736519275_1 Disease_B Disease_B_2            3            7
258736519272_2 Disease_B Disease_B_2            4            8
258736519273_8   Healthy   Healthy_1            1            9
258736519273_4   Healthy   Healthy_1            2           10
258736519272_5   Healthy   Healthy_1            3           11
258736519274_4   Healthy   Healthy_1            4           12
258736519274_8   Healthy   Healthy_2            1            9
258736519276_3   Healthy   Healthy_2            2           10
258736519272_7   Healthy   Healthy_2            3           11
258736519274_5   Healthy   Healthy_2            4           12

group <- factor(sample.sheet$group);
patient <- factor(sample.sheet$patientID); **# unsure whether to use patientID or replicates_1 or replicates_2**

#> table(sample.sheet$group)
#Disease_A Disease_B   Healthy
#        8         8         8

#> table(sample.sheet$patientID)
#Disease_A_1 Disease_A_2 Disease_B_1 Disease_B_2   Healthy_1   Healthy_2
#          4           4           4           4           4           4

# Create a design matrix for your comparison (e.g., control vs treatment)
design <- model.matrix(~0 + group);
colnames(design) <-  gsub('group', '', colnames(design));

# Use duplicateCorrelation to estimate within-patient correlation
corfit <- duplicateCorrelation(exp.data.subset, design, block = patient);

fit <- lmFit(
    exp.data.subset,
    design,
    block = patient,
    correlation = corfit$consensus.correlation
    );

# create contrast matrix
contrast.matrix <- makeContrasts(
    'Disease_AvsHealthy' = Disease_A-Healthy,
    'Disease_BvsHealthy' = Disease_B-Healthy,
    'Disease_A-HealthyvsDisease_B-Healthy' = (Disease_A-Healthy) - (Disease_B-Healthy),
    levels = design
);

fit2 <- contrasts.fit(fit, contrast.matrix);
fit2 <- eBayes(fit2, trend = TRUE); **# robust = TRUE, trend = TRUE? what is more appropriate for proteomics **
limma lmer duplicateCorrelation technicalrepeats • 1.3k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia

Technical samples should be averaged. Then use duplicateCorrelation() with block=PatientID.

ADD COMMENT
0
Entering edit mode

Hi Gordon, Thank you very much for the prompt response.

If i understand correctly, wouldnt this remove the need to use duplicateCorrelation(). Since averaging condenses these replicates into a single value per biological replicate, would that mean we no longer need to account for correlations between them? Thanks for clarifying!

         sampleID     group   patientID
1  258736519273_1 Disease_A Disease_A_1
5  258736519273_2 Disease_A Disease_A_2
9  258736519273_5 Disease_B Disease_B_1
13 258736519274_3 Disease_B Disease_B_2
17 258736519273_8   Healthy   Healthy_1
21 258736519274_8   Healthy   Healthy_2
ADD REPLY
0
Entering edit mode

Gordon Smyth Thank you again for your response, and apologies as I think I might still be confused by your suggestion "The technical samples should be averaged. Then use duplicateCorrelation() with block=PatientID." I'd prefer to retain the technical replicates instead of averaging them, as I'm aiming to capture the nuances in each set (there seems to be quite a lot of variation between 2 of my biological replicates in one condition on the PCA plot).

Am I correct to assume my initial approach of using duplicateCorrelation() with block = patient is appropriate in this case (i.e., allowing the function to calculate the consensus correlation of technical replicates)? Or is there another setup you'd recommend to account for both biological and technical repeats effectively?

Thanks once more for your insights!

ADD REPLY
0
Entering edit mode

I am not familiar with the scientific and procedural details of your experiment, so I cannot tell you exactly how to analyse it. In particular, it is not clear to me what the replicates (replicates_1 and replicates_2) represent. You have called them "technical replicates", but I suspect that they might not be. If they really were technical replicates, then you wouldn't be wanting to capture their nuances.

I can only give you general principles and limma syntax. Repeated measurements on the same patients would be handled by duplicateCorrelation with block=PatientID. Technical replicates, by definition, do not contain any biological signal. If you really did have technical replicates, then they would be averaged. Repeated measurements are not technical replicates.

Your experiment appears to have some sort of blocking/pairing structure associated with replicate. Different mass spec runs perhaps? Different samples from each patient? I don't know.

ADD REPLY
0
Entering edit mode

Gordon Smyth Thank you for getting back to me and apologies for the confusion

  • replicates_1 column indicates for a patient (cell line) (e.g.: Disease_A_1 the 4 repeated measurements labelled 1, 2, 3, 4). Please disregard the replicates_2 column as I manually created this to clarify the correct application of duplicateCorrelation().
  • To provide context, we have 3 groups : Healthy, Disease_A and Disease_B. Each group includes 2 different cell lines. On the day of experiment, each cell line was plated in 4 separate culture dishes. The samples were collected and processed individually in the lab. In the end we have 8 samples per group which were analysed on somascan proteomics platform in a single run and we are using their provided normalised file as our starting point of analysis. I've attached the PCA plot below for clarity. Please let me know what you think. enter image description here
ADD REPLY
1
Entering edit mode

Thanks for the further explanation. It seems you should simply use block=PatientID, so your original analysis is correct.

The replicate_1 column does not really mean anything. In reality, every sample is a new culture. The column would only have a meaning if you were using the same four culture dishes over and over again, or something like that.

However, the PCA plot strongly suggests that there is little or no significant DE in the data, regardless of what statistical test you might use.

I will also add that running lmer on this data, as you mention at the start of your question, would be exactly equivalent to averaging the replicates for each patientID and then running ordinary anova.

ADD REPLY
0
Entering edit mode

Hi Gordon Smyth, Thank you very much for the clarification, including the details about lmer() and for confirming that the original analysis is correct.

Interestingly, despite the PCA result, I'm still identifying a small number of biologically relevant differential genes in DiseaseA and DiseaseB with |logFC| > 1 and Q < 0.05, alongside with meaningful pathways.

For this dataset, I was also recommended to consider a |logFC| threshold in the range of 0.2-0.5.

Thanks again for your help!

ADD REPLY
1
Entering edit mode

I discourage the use of logFC cutoffs, especially larger cutoffs. There is no scientific or statistical justification for arbitrary combinations of logFC and FDR cutoffs in limma. See https://www.biostars.org/p/9603855/ or DecideTest vs topTAGs in edgeR.

Also suggest you try plotMDS(exp.data) instead of the PCA plot function you are currently using.

ADD REPLY
0
Entering edit mode

Thank you very much for the clarification and recommendations. I will prioritise my hits with FDR ranking and check plotMDS too.

ADD REPLY
0
Entering edit mode

Hi Gordon Smyth,

After consulting with the SomaScan team, I've been reflecting on whether limma's assumptions are fully compatible with SomaScan proteomics data, and I'd greatly appreciate your insights on the following concerns:

  1. Variance Borrowing Across Genes: The SomaScan assay are run in three separate dilution bins, and each bin is normalized individually, separate from the other dilution bins. This approach seems to violate the assumption of equivalent variance across genes, which I think underpins limma's eBayes step. Is there a way to adjust limma to handle these non-equivalent variances between dilution bins?
  2. Non-Uniform Variance (S-shaped Dilution Curves): SomaScan dilution curves are S-shaped, so the variance of proteins near both the limit of detection (LoD) and near saturation are higher than those in the linear range. Do you have suggestions for addressing this issue within the limma framework?

As a side note, the SomaScan team followed an lmer and emmeans approach: log(apt) ~ group + (1|SubjectId)

However, this method resulted in no FDR-significant genes in our data, unlike limma. Interestingly, limma and lmer identified distinct sets of genes, although both approaches revealed overlapping pathways. To address these differences, I'm considering a consensus approach for experimental validation.

Here's what I've tried in limma so far, and I'd like your feedback on whether these are appropriate:

  1. VSN Normalization with Limma

    exp.data.subset <- normalizeVSN(exp.data.subset);
    group <- factor(metadata.subset$group);
    patient <- factor(metadata.subset$patientID);
    # Create a design matrix for your comparison 
    design <- model.matrix(~0 + group);
    colnames(design) <-  gsub('group', "", colnames(design));
    # Use duplicateCorrelation to estimate within-patient correlation
    corfit <- duplicateCorrelation(exp.data.subset, design, block = patient);
    fit <- lmFit(
     exp.data.subset,
     design,
     block = patient,
     correlation = corfit$consensus.correlation
     );
    
  2. Double Voom Normalization

    y <- DGEList(counts = exp.data.subset);
    group <- factor(metadata.subset$group);
    patient <- factor(metadata.subset$patientID);
    # Create a design matrix for your comparison 
    design <- model.matrix(~0 + group);
    colnames(design) <-  gsub('group', "", colnames(design));
    # normalise using double voom with TMM (Section 18.1.9 from LimmaUserGuide.pdf)
    y <- calcNormFactors(y);
    v <- voom(y, design, plot = TRUE);
    # Use duplicateCorrelation to estimate within-patient correlation
    corfit <- duplicateCorrelation(v, design, block = patient);
    # call voom again
    v <- voom(y, design, plot = TRUE, block = patient, correlation = corfit$consensus.correlation);
    corfit <- duplicateCorrelation(v, design, block = patient);
    fit <- lmFit(
       v,
       design,
       block = patient,
       correlation = corfit$consensus.correlation
       );
    
ADD REPLY

Login before adding your answer.

Traffic: 591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6