limma technical repeat correction
1
0
Entering edit mode
john • 0
@cb255338
Last seen 13 hours ago
United Kingdom

Hello, Would someone please help me verify if i'm on the right track in correcting for technical replicates using limma and also if there is a better approach to use (duplicateCorrelation() vs lmer()) for my protein expression data.

I've got 8 disease patients per group (DiseaseA, DiseaseB). There are 2 biological samples(_1,_2) with 4 technical replicates each (rep1-4).

Quite confused on what to include in the block when handling technical replicates - is it the column - patientID or replicates_1 or replicates_2 below. Please note I have manually created replicate_2 column


> sample.sheet
                   group   patientID replicates_1 replicates_2
258736519273_1 Disease_A Disease_A_1            1            1
258736519272_4 Disease_A Disease_A_1            2            2
258736519276_1 Disease_A Disease_A_1            3            3
258736519273_6 Disease_A Disease_A_1            4            4
258736519273_2 Disease_A Disease_A_2            1            1
258736519272_1 Disease_A Disease_A_2            2            2
258736519275_3 Disease_A Disease_A_2            3            3
258736519272_8 Disease_A Disease_A_2            4            4
258736519273_5 Disease_B Disease_B_1            1            5
258736519273_3 Disease_B Disease_B_1            2            6
258736519274_2 Disease_B Disease_B_1            3            7
258736519275_7 Disease_B Disease_B_1            4            8
258736519274_3 Disease_B Disease_B_2            1            5
258736519274_7 Disease_B Disease_B_2            2            6
258736519275_1 Disease_B Disease_B_2            3            7
258736519272_2 Disease_B Disease_B_2            4            8
258736519273_8   Healthy   Healthy_1            1            9
258736519273_4   Healthy   Healthy_1            2           10
258736519272_5   Healthy   Healthy_1            3           11
258736519274_4   Healthy   Healthy_1            4           12
258736519274_8   Healthy   Healthy_2            1            9
258736519276_3   Healthy   Healthy_2            2           10
258736519272_7   Healthy   Healthy_2            3           11
258736519274_5   Healthy   Healthy_2            4           12

group <- factor(sample.sheet$group);
patient <- factor(sample.sheet$patientID); **# unsure whether to use patientID or replicates_1 or replicates_2**

#> table(sample.sheet$group)
#Disease_A Disease_B   Healthy
#        8         8         8

#> table(sample.sheet$patientID)
#Disease_A_1 Disease_A_2 Disease_B_1 Disease_B_2   Healthy_1   Healthy_2
#          4           4           4           4           4           4

# Create a design matrix for your comparison (e.g., control vs treatment)
design <- model.matrix(~0 + group);
colnames(design) <-  gsub('group', '', colnames(design));

# Use duplicateCorrelation to estimate within-patient correlation
corfit <- duplicateCorrelation(exp.data.subset, design, block = patient);

fit <- lmFit(
    exp.data.subset,
    design,
    block = patient,
    correlation = corfit$consensus.correlation
    );

# create contrast matrix
contrast.matrix <- makeContrasts(
    'Disease_AvsHealthy' = Disease_A-Healthy,
    'Disease_BvsHealthy' = Disease_B-Healthy,
    'Disease_A-HealthyvsDisease_B-Healthy' = (Disease_A-Healthy) - (Disease_B-Healthy),
    levels = design
);

fit2 <- contrasts.fit(fit, contrast.matrix);
fit2 <- eBayes(fit2, trend = TRUE); **# robust = TRUE, trend = TRUE? what is more appropriate for proteomics **
limma lmer duplicateCorrelation technicalrepeats • 724 views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 49 minutes ago
WEHI, Melbourne, Australia

Technical samples should be averaged. Then use duplicateCorrelation() with block=PatientID.

ADD COMMENT
0
Entering edit mode

Hi Gordon, Thank you very much for the prompt response.

If i understand correctly, wouldnt this remove the need to use duplicateCorrelation(). Since averaging condenses these replicates into a single value per biological replicate, would that mean we no longer need to account for correlations between them? Thanks for clarifying!

         sampleID     group   patientID
1  258736519273_1 Disease_A Disease_A_1
5  258736519273_2 Disease_A Disease_A_2
9  258736519273_5 Disease_B Disease_B_1
13 258736519274_3 Disease_B Disease_B_2
17 258736519273_8   Healthy   Healthy_1
21 258736519274_8   Healthy   Healthy_2
ADD REPLY
0
Entering edit mode

Gordon Smyth Thank you again for your response, and apologies as I think I might still be confused by your suggestion "The technical samples should be averaged. Then use duplicateCorrelation() with block=PatientID." I'd prefer to retain the technical replicates instead of averaging them, as I'm aiming to capture the nuances in each set (there seems to be quite a lot of variation between 2 of my biological replicates in one condition on the PCA plot).

Am I correct to assume my initial approach of using duplicateCorrelation() with block = patient is appropriate in this case (i.e., allowing the function to calculate the consensus correlation of technical replicates)? Or is there another setup you'd recommend to account for both biological and technical repeats effectively?

Thanks once more for your insights!

ADD REPLY
0
Entering edit mode

I am not familiar with the scientific and procedural details of your experiment, so I cannot tell you exactly how to analyse it. In particular, it is not clear to me what the replicates (replicates_1 and replicates_2) represent. You have called them "technical replicates", but I suspect that they might not be. If they really were technical replicates, then you wouldn't be wanting to capture their nuances.

I can only give you general principles and limma syntax. Repeated measurements on the same patients would be handled by duplicateCorrelation with block=PatientID. Technical replicates, by definition, do not contain any biological signal. If you really did have technical replicates, then they would be averaged. Repeated measurements are not technical replicates.

Your experiment appears to have some sort of blocking/pairing structure associated with replicate. Different mass spec runs perhaps? Different samples from each patient? I don't know.

ADD REPLY
0
Entering edit mode

Gordon Smyth Thank you for getting back to me and apologies for the confusion

  • replicates_1 column indicates for a patient (cell line) (e.g.: Disease_A_1 the 4 repeated measurements labelled 1, 2, 3, 4). Please disregard the replicates_2 column as I manually created this to clarify the correct application of duplicateCorrelation().
  • To provide context, we have 3 groups : Healthy, Disease_A and Disease_B. Each group includes 2 different cell lines. On the day of experiment, each cell line was plated in 4 separate culture dishes. The samples were collected and processed individually in the lab. In the end we have 8 samples per group which were analysed on somascan proteomics platform in a single run and we are using their provided normalised file as our starting point of analysis. I've attached the PCA plot below for clarity. Please let me know what you think. enter image description here
ADD REPLY
1
Entering edit mode

Thanks for the further explanation. It seems you should simply use block=PatientID, so your original analysis is correct.

The replicate_1 column does not really mean anything. In reality, every sample is a new culture. The column would only have a meaning if you were using the same four culture dishes over and over again, or something like that.

However, the PCA plot strongly suggests that there is little or no significant DE in the data, regardless of what statistical test you might use.

I will also add that running lmer on this data, as you mention at the start of your question, would be exactly equivalent to averaging the replicates for each patientID and then running ordinary anova.

ADD REPLY
0
Entering edit mode

Hi Gordon Smyth, Thank you very much for the clarification, including the details about lmer() and for confirming that the original analysis is correct.

Interestingly, despite the PCA result, I'm still identifying a small number of biologically relevant differential genes in DiseaseA and DiseaseB with |logFC| > 1 and Q < 0.05, alongside with meaningful pathways.

For this dataset, I was also recommended to consider a |logFC| threshold in the range of 0.2-0.5.

Thanks again for your help!

ADD REPLY
0
Entering edit mode

I discourage the use of logFC cutoffs, especially larger cutoffs. There is no scientific or statistical justification for arbitrary combinations of logFC and FDR cutoffs in limma. See https://www.biostars.org/p/9603855/ .

Also suggest you try plotMDS(exp.data) instead of the PCA plot function you are currently using.

ADD REPLY

Login before adding your answer.

Traffic: 467 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6