Question: edgeR - how to build design for matched case-control study with repeated measurement
0
14 months ago by
1832278510 wrote:

Please help me build a design that accounts for both the matched case-control design (there were 1 case [disease=1] and 2 controls [disease=0] in each pair [having the same matchid], matched for age and gender) and the correlation between multiple measurements at different timepoints (time 1, 2, 3) in a given subject. My objective is to compare the microbial differential abundance between cases and controls over all timepoints (not treating each timepoint separately). Thanks a lot!

My dataset is as following:

personid

matchid

time

disease

1

1

1

1

1

1

2

1

1

1

3

1

2

1

1

0

2

1

2

0

2

1

3

0

3

1

1

0

3

1

2

0

3

1

3

0

4

2

1

1

4

2

2

1

4

2

3

1

5

2

1

0

5

2

2

0

5

2

3

0

6

2

1

0

6

2

2

0

6

2

3

0

7

3

1

1

7

3

2

1

7

3

3

1

8

3

1

0

8

3

2

0

8

3

3

0

9

3

1

0

9

3

2

0

9

3

3

0

If I use edgeR, is the design correct?

design <- model.matrix(~matchid+personid+disease+time+disease:time)

(There are paired examples and repeated measurement examples in the edgeRUsersGuide. However, in repeated measurement examples, repeated measurements at different timepoints were obtained from independent subjects. By contrast, in my study, repeated measurements at different timepoints were obtained from the same subjects.)

edger • 220 views
modified 14 months ago by Aaron Lun24k • written 14 months ago by 1832278510
Answer: edgeR - how to build design for matched case-control study with repeated measure
1
14 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

Unfortunately, setting up the design to answer your question is much more complicated than you might think, because personid is nested within disease. This means that you cannot estimate the log-fold change between disease and healthy individuals and block on the person effect at the same time (which would be bad if there are strong patient-to-patient effects, which are probably present in any kind of human patient data).

There are two solutions to this, but one of them involves subsetting the dataset by timepoint and is not appropriate here given that you want to use information from all timepoints at once. So this leaves us with the only other solution, which is to switch to limma and voom. Basically, set up your design matrix like so:

personid <- factor(rep(1:9, each=3))
matchid <- factor(rep(1:3, each=9))
time <- rep(1:3, 3)
disease <- factor(rep(rep(c(1,0,0), each=3), 3))
design <- model.matrix(~matchid + time + disease)


... and run an analysis using duplicateCorrelation with the blocking factor set to personid. This will use information from all timepoints, account for any person-to-person correlations and allow you to compare directly between disease states.

• Note the factor, which is important to ensure that the numbers are not treated as real-valued covariates.
• You're assuming that the effect of time on microbial abundance is linear.
• personid is nested within matchid, and conventionally (i.e., when blocking on all factors in the design matrix) you wouldn't need the latter. In this case, though, I've used both of them, to avoid violating homogeneity assumptions in duplicateCorrelation when there are systematic differences between matchid levels.
ADD COMMENTlink modified 14 months ago • written 14 months ago by Aaron Lun24k

Thanks a lot for your professional and quick reply!! I appreciate it very much. There are 3 more questions:

1) If I need to adjust for some confounding factors such as smoking and drinking. Can I just add these variables into the model.matrix? Like design <- model.matrix(~matchid + time +smoking+drinking+ disease) (using duplicateCorrelation with the blocking factor set to personid).

2) Incomplete data further complicated this study. There were heterogeneity in the number and timing of measurements from different subjects in our study (for example, some pairs had only one control [not 2 controls]; some pairs had only 2 measurements [not 3 measurements]). Is the above design still suitable for this data?

3) As I am not familiar with limma, is it suitable for analyzing 16s data (in this study, microbiome composition and specific bacterial abundances were determined through bacterial 16S rRNA gene sequencing)? Compared to edgeR and Deseq2 (more frequently used for the microbiome data in the published papers), does limma have any disadvantages, although it can take the complicated design into account?

Thanks again for your precious time!

1) Yes, you can do that, provided the extra variables are not confounded with disease. If they are confounded with matchid, then they do not need to be included at all.

2) Yes, that should be fine; information from all "pairs" (triplets, really) and time points will be used to estimate the disease effect, so a few missing observations should not cause a problem.

3) I don't know, but if you can model the mean-variance trend accurately, you should be fine. See if the plot produced by voom is similar to what you would get for RNA-seq data.

Thank you very much!

For the 1st question, actually, the extra variables such as smoking may correlate with disease (these variables are not confounded with matchid), thus we want to obtain a pure correlation between microbiome and disease by adjusting these variables (For example, in multivariate logistic regression models, we can both add interested variables and confounders into the multivariate model).Is it OK for me to add these confounders (may correlate with disease) into the design matrix?