Multi-factorial, longitudinal disease progression analysis with unbalanced patient sampling - duplicateCorrelation or blocking?
Entering edit mode
Last seen 4 days ago


I have quite a complex experimental design from a retrospective study of human disease progression and would like some advice on making appropriate contrasts with limma-voom.

We have two cohorts of patients, A and B, sampled longitudinally (4 times on average). Cohort B develop disease at a later timepoint. For the purposes of DE analysis, samples have been grouped into 4 time intervals. The categorisation is such that some patients have more than one sample in a given time interval, some will have one sample at each interval, some will have samples missing at certain time intervals and some only have one sample across the entire study.

Cohort        Timepoint            Samples         Repeats
    B               1                 10              2
    B               2                 10              2
    B               3                  6              0
    B               4                 13              7
    A               1                 30              6
    A               2                 28              2
    A               3                 25              0
    A               4                 25             10

While the neither cohort has a disease at early time intervals, they cannot be combined as we are interested in early differences that precede disease onset at later time intervals. We would like to make the following comparisons:

  1. Comparing against the earliest time interval within each cohort, i.e., cohort A timepoint 2 vs cohort A timepoint 1, cohort A timepoint 3 vs cohort A timepoint 1 etc.

  2. Comparing between cohorts at each time interval, i.e., cohort B timepoint 1 vs cohort A timepoint 1, cohort B timepoint 2 vs cohort A timepoint 2 etc.

My approach for the above comparisons has been:

  1. design <- model.matrix(~0 + cohort_time + subjectID), which is not full rank (due to unbalanced sampling) but accounts for subject-specific effects on expression, then the standard voom, lmfit, and eBayes

  2. design <- model.matrix(~0 + cohort_time), voom, duplicateCorrelation blocking on SubjectID to estimate intra-patient correlation, then voom again including corfit and blocking on SubjectID, repeat duplicateCorrelation using second voom object, lmfit including second corfit and blocking on SubjectID, and eBayes.

Do the above strategies adequately account for 1. unbalanced sampling between time intervals within cohorts and 2. intra-patient variation when some patients are overrepresented in each cohort?

I would like to maximise our power to detect DE by making use of the number of samples we have, as some of the current groupings may be underpowered. I would therefore like to know whether the above strategies can be robustly applied when time intervals are combined as follows:

  1. Combining late time intervals in cohort B (timepoint 3 and 4, proximal to disease progression) and early time intervals in cohort B (timepoint 1 and 2, distal to disease progression) to compare late vs early - design <- model.matrix(~0 + cohort_combinedtime + time + SubjectID), to account for subject-specific and time effects on expression.

  2. Combining cohort A in the same manner to compare cohortB_early vs cohortA_early and cohortB_late vs cohortA_late - design <- model.matrix(~0 + cohort_combinedtime + time), to duplicateCorrelation blocking on SubjectID to estimate intra-patient correlation.

  3. Potentially also contrasting the change across time between cohorts - (cohortB_late - cohortB_early) - (cohortA_late - cohortA_early).

If the strategy for 3. and 4. is acceptable, what is the best way to deal with 5.?

Thanks in advance!

rna-seq duplicateCorrelation limma voom • 309 views
Entering edit mode
Last seen 1 hour ago
WEHI, Melbourne, Australia

Including subjectID in the design matrix always accounts for unbalanced sampling and patient variation but subjects with incomplete records will only contribute to comparisons between time points that they are observed for. Subjects with only one record will not contribute to the analysis at all. It is a safe but conservative statistical approach.

If you want to make use of all the data then it is better to use voomLmFit() with block=subjectID. That will treat subject as a random effect and will recover as much information as possible from the incomplete subject records. voomLmFit will automatically estimate the intra-subject correlation and will iterate the analysis to get the voom weights so there is no need for you to do so.

If you want to combine time intervals or cohort subsets, then that should be done using contrasts in the usual way. Just use (time3+time4)/2 in the contrast. You should not refit the linear model with an artificial combinedtime factor. That would misrepresent the data to limma and will result in an efficient analysis.

Entering edit mode

Thank you very much, Gordon. I'll give voomLmFit a go!


Login before adding your answer.

Traffic: 818 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6