Combining repeated measurements and duplicateCorrelation in LIMMA
1
2
Entering edit mode
Jeff ▴ 20
@acaa643d
Last seen 11 days ago

I have two populations (normal and diseased). For each subject I have two measurements (baseline and 3 months later).

My primary interest is the difference in expression between normal and diseased subjects, regardless of time-point.

Using the LIMMA user guide I have used sections 9.5.2 and 9.7 to guide me.

Questions:

1. Looking at the code I wrote below, I have a feeling the contrasts matrix I created is not actually measuring the difference in expression between normal and diseased subjects. As I said earlier, my primary interest is the difference in expression between normal and diseased subjects, regardless of time-point. I think the contrasts matrix I currently have set-up will end up telling me the genes that respond differently in expression between healthy and diseased subjects over time - this is not what i want - I just want to know which genes are differently expressed between healthy and diseased regardless of the changing of time. I was hoping to get advice on correct way of conducting the analysis for my purposes.

2. Is it possible to use a continuous phenotype and still perform the analysis? For instance I may not be satisfied with normal vs diseased and may want to use for instance viral cells/mL as my primary predictor. In that case can I just run these lines of code?

Treat <- targets$viral_concentration design <- model.matrix(~Treat) corfit <- duplicateCorrelation(eset,design,block=targets$Subject)

fit <- lmFit(eset,design,block=targets$Subject,correlation=corfit$consensus)

fit <- eBayes(fit)

topTable(fit, coef=”Treat”)

...

Subject          Condition      time_point
1                 disease        0_months
1                 disease        3_months
2                 normal         0_months
2                 normal         3_months
3                 disease        0_months
3                 disease        3_months
...

#From 9.7 in the LIMMA user guide
Treat <- factor(paste(targets$Condition,targets$time_point,sep="."))
design <- model.matrix(~0+Treat)
colnames(design) <- levels(Treat)

#estimate the correlation between measurements made on the same subject
corfit <- duplicateCorrelation(eset,design,block=targets$Subject) corfit$consensus

# inter-subject correlation is input into the linear model fit:
fit <- lmFit(eset,design,block=targets$Subject,correlation=corfit$consensus)

# Comparing diseased and normal subjects, using 9.5.2 in LIMMA user guide,
**##I have a feeling this is the incorrect way of comparing diseased and normal subjects**##**
cm <- makeContrasts( DiseasedvsNormal = (disease.3_months-disease.0_months) - (normal.3_months-normal.0_months), levels=design)

# compute these contrasts and moderated t-tests
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)

topTable(fit2, coef="DiseasedvsNormal")
limma • 200 views
0
Entering edit mode

Hi everyone, Im the OP, I believe I have figured out the answer to question 1 and question 2. However, I would still appreciate input from LIMMA experts out there.

Question 1. If I'm interested in differential expression between healthy and diseased patients regardless of the time-point, my analysis would simply become an ordinary two group comparison but I would still incorporate patient as a random effect via duplicateCorrelation. So

Treat <- factor(targets$Condition) design <- model.matrix(~0+Treat) corfit <- duplicateCorrelation(eset,design,block=targets$Subject)

fit <- lmFit(eset,design,block=targets$Subject,correlation=corfit$consensus)

cm <- makeContrasts( DiseasedvsNormal = disease-normal, levels=design)

fit2 <- contrasts.fit(fit, cm)

fit2 <- eBayes(fit2)

topTable(fit2, coef="DiseasedvsNormal")

Question 2 - I have found an interesting answer from Aaron Lun in the past (Further clarification on when not to use duplicateCorrelation with technical replicates (RNA-seq)), whereby if I want to use a continuous predictor and duplicateCorrelation this situation resembles "Repeated samples with different interesting predictors". i guess I just want expert opinion if my code below is methodologically sound.

Treat <- targets$viral_concentration design <- model.matrix(~Treat) corfit <- duplicateCorrelation(eset,design,block=targets$Subject)

fit <- lmFit(eset,design,block=targets$Subject,correlation=corfit$consensus)

fit <- eBayes(fit)

topTable(fit, coef=”Treat”)

1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

It's not clear what you mean by differential expression regardless of the time point. That could mean different things. But anyway, I am not sure I see a reason to use duplicateCorrelation or to block on subject for this study.

This part:

cm <- makeContrasts( DiseasedvsNormal = (disease.3_months-disease.0_months) - (normal.3_months-normal.0_months), levels=design)

Is an interaction, and is intended to find genes that are affected differently by time depending on the subject status. And this automatically accounts for subject-specific differences (because you are subtracting the two timepoints, within subject). This definitely doesn't select genes that are different at any time point - it finds genes where the differences in the parentheses are different. This can encompass many things. No change for disease, big change for normal, no change for normal, big change for disease, smaller changes for each disease status with different directions, etc. You could do

cm <- makeContrasts(DiseasedvsNormal = (disease.3_months + disease.0_months)/2 - (normal.3_months + normal.0_months)/2, levels = design)

Which will test that the average expression in diseased is different than the average in normal, which is one way of finding differential expression regardless of time point. Or you could do

cm <- makeContrasts(DiseasedvsNormal3mo = disease.3_months  - normal.3_months,  DiseasedvsNormal0mo = disease.0_months -  normal.0_months, levels = design)

which will test for any differences at either time. And then you could just select genes that are significant in either contrast, or do an F-test for both contrasts simultaneously. Although the latter tends to be less powerful and not exactly the same as the former.

And finally you could do what you suggest in your reply, which is to ignore time completely, and use duplicateCorrelation to account for repeated measures. If you do this you are assuming that time really doesn't make any difference, because you will end up treating the diseased and normal samples at different times as if they are indistinguishable. Which may be a reasonable assumption, but I would tend to check that assumption first.

0
Entering edit mode

I want to move forward with your second suggestion by comparing the average expression in diseased and normal. However, will I need to block for subject when I run the analysis this way? It's not clear to me LIMMA will automatically block for subject.

I have similar confusion for your comment on interaction analysis as well ("And this automatically accounts for subject-specific differences (because you are subtracting the two timepoints, within subject"). After reading sections 9.5.1 and 9.5.2 in the LIMMA user guide, It's not clear to me that LIMMA will automatically account for subject specific differences if I run an interaction analysis as shown in section 9.5.2.

0
Entering edit mode

I think the easiest thing to do if you want to follow suggestion 2 is to use duplicateCorrelation, because you are averaging between repeated measurements. You might be able to block on subject like in section 3.5 of the edgeR User's Guide, but I would have to think about that more. Maybe Gordon will be along later (tomorrow morning his time) and have something further to say.

For your second question, if you do a paired analysis, algebraically what you do is compute the sum of the within-subject differences. As an example, if you wanted to know if there were differences between the two times for the diseased group, algebraically what you would do is ((subject1_3mo - subject1_0mo) + (subject2_3m0 - subject2_0mo) + ...)/N, in which case you are removing any subject-specific differences, and just comparing the changes between time points. Like if Subject1 had inherently much higher gene expression than the others it wouldn't matter because it would be high at both timepoints, and you only end up with the difference between those time points.

In the interaction you are doing the same exact thing - you subtract the two timepoints, within each subject and then compare the mean of the subject-corrected differences. Or you don't really do that exactly, but algebraically it's the same thing.