Hello,

I have quite a complex experimental design from a retrospective study of human disease progression and would like some advice on making appropriate contrasts with limma-voom.

We have two cohorts of patients, A and B, sampled longitudinally (4 times on average). Cohort B develop disease at a later timepoint. For the purposes of DE analysis, samples have been grouped into 4 time intervals. The categorisation is such that some patients have more than one sample in a given time interval, some will have one sample at each interval, some will have samples missing at certain time intervals and some only have one sample across the entire study.

```
Cohort  Timepoint  Samples  Repeats
B       1          10       2
B       2          10       2
B       3          6        0
B       4          13       7
A       1          30       6
A       2          28       2
A       3          25       0
A       4          25       10
```

While neither cohort has disease at the early time intervals, they cannot be combined, as we are interested in early differences that precede disease onset at later time intervals. We would like to make the following comparisons:

Comparing against the earliest time interval within each cohort, i.e., cohort A timepoint 2 vs cohort A timepoint 1, cohort A timepoint 3 vs cohort A timepoint 1 etc.

Comparing between cohorts at each time interval, i.e., cohort B timepoint 1 vs cohort A timepoint 1, cohort B timepoint 2 vs cohort A timepoint 2 etc.
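To make the two sets of comparisons concrete, here is a minimal sketch of how I would write them with makeContrasts. The level names (A_T1 ... B_T4) and the toy cohort_time factor are hypothetical stand-ins for illustration, not my real sample sheet.

```r
library(limma)

# Hypothetical group labels of the form <cohort>_T<interval>; the real
# cohort_time factor would come from the sample sheet.
groups <- c(paste0("A_T", 1:4), paste0("B_T", 1:4))
cohort_time <- factor(rep(groups, each = 2), levels = groups)

# Cell-means design: one column per cohort/time-interval group
design <- model.matrix(~0 + cohort_time)
colnames(design) <- levels(cohort_time)

contrast_matrix <- makeContrasts(
  # Within each cohort: every later interval vs the earliest interval
  A_T2vsT1 = A_T2 - A_T1,
  A_T3vsT1 = A_T3 - A_T1,
  A_T4vsT1 = A_T4 - A_T1,
  B_T2vsT1 = B_T2 - B_T1,
  B_T3vsT1 = B_T3 - B_T1,
  B_T4vsT1 = B_T4 - B_T1,
  # Between cohorts at the same interval
  BvsA_T1 = B_T1 - A_T1,
  BvsA_T2 = B_T2 - A_T2,
  BvsA_T3 = B_T3 - A_T3,
  BvsA_T4 = B_T4 - A_T4,
  levels = design
)
contrast_matrix
```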

My approach for the above comparisons has been:

1. design <- model.matrix(~0 + cohort_time + SubjectID), which is not full rank (due to unbalanced sampling) but accounts for subject-specific effects on expression, followed by the standard voom, lmFit, contrasts.fit and eBayes.

2. design <- model.matrix(~0 + cohort_time), then voom, then duplicateCorrelation blocking on SubjectID to estimate the intra-patient correlation, then voom again including corfit and blocking on SubjectID, then duplicateCorrelation again on the second voom object, then lmFit including the second corfit and blocking on SubjectID, followed by contrasts.fit and eBayes.
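In case the description of the second approach is hard to follow, this is roughly the sequence of calls I mean. It is only a sketch: the counts are simulated, and the 12 subjects, the dropped samples and the group names are all made up to mimic an unbalanced layout.

```r
library(limma)
set.seed(1)

# Simulated stand-in for the real data: 12 hypothetical subjects (6 per
# cohort), up to 4 time intervals each, with a few samples dropped so the
# layout is unbalanced.
meta <- expand.grid(time = 1:4, subject = 1:12)
meta$cohort <- ifelse(meta$subject <= 6, "A", "B")
meta <- meta[-c(3, 10, 17, 30, 41), ]
meta$cohort_time <- factor(paste0(meta$cohort, "_T", meta$time))
SubjectID <- factor(meta$subject)

# Negative binomial counts with a per-gene, per-subject random effect
n_genes <- 500
subject_effect <- matrix(rnorm(n_genes * 12, sd = 0.3), n_genes, 12)
mu <- 200 * exp(subject_effect[, meta$subject])
counts <- matrix(rnbinom(length(mu), size = 5, mu = mu), nrow = n_genes)
rownames(counts) <- paste0("gene", seq_len(n_genes))

design <- model.matrix(~0 + cohort_time, data = meta)
colnames(design) <- levels(meta$cohort_time)

# Round 1: voom without blocking, then estimate the intra-subject correlation
v <- voom(counts, design)
corfit <- duplicateCorrelation(v, design, block = SubjectID)

# Round 2: voom again using that correlation, then re-estimate it
v <- voom(counts, design, block = SubjectID,
          correlation = corfit$consensus.correlation)
corfit <- duplicateCorrelation(v, design, block = SubjectID)

# Final fit with blocking, followed by contrasts and moderated statistics
fit <- lmFit(v, design, block = SubjectID,
             correlation = corfit$consensus.correlation)
cm <- makeContrasts(
  A_T2vsT1 = A_T2 - A_T1,
  BvsA_T1  = B_T1 - A_T1,
  levels = design
)
fit <- eBayes(contrasts.fit(fit, cm))
topTable(fit, coef = "BvsA_T1", number = 5)
```

Only two example contrasts are shown here to keep the sketch short; the full set would go into the same makeContrasts call.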

Do the above strategies adequately account for (a) unbalanced sampling between time intervals within cohorts and (b) intra-patient variation when some patients are overrepresented in each cohort?

I would like to maximise our power to detect DE by making use of the number of samples we have, as some of the current groupings may be underpowered. I would therefore like to know whether the above strategies can be robustly applied when time intervals are combined as follows:

3. Combining the late time intervals in cohort B (timepoints 3 and 4, proximal to disease progression) and the early time intervals in cohort B (timepoints 1 and 2, distal to disease progression) to compare late vs early: design <- model.matrix(~0 + cohort_combinedtime + time + SubjectID), to account for subject-specific and time effects on expression.

4. Combining cohort A in the same manner to compare cohortB_early vs cohortA_early and cohortB_late vs cohortA_late: design <- model.matrix(~0 + cohort_combinedtime + time), then duplicateCorrelation blocking on SubjectID to estimate the intra-patient correlation.

5. Potentially also contrasting the change across time between cohorts: (cohortB_late - cohortB_early) - (cohortA_late - cohortA_early).
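For the last item, this is how I imagine the difference-of-differences contrast would be written. It is a sketch only: the toy factor is hypothetical, and I have left the extra time term out of the design here to keep the example small.

```r
library(limma)

# Hypothetical level names for the combined cohort/time grouping
groups <- c("cohortA_early", "cohortA_late", "cohortB_early", "cohortB_late")
cohort_combinedtime <- factor(rep(groups, each = 3), levels = groups)

# Cell-means design with one column per combined group
design <- model.matrix(~0 + cohort_combinedtime)
colnames(design) <- levels(cohort_combinedtime)

# Change over time in cohort B minus change over time in cohort A
interaction_contrast <- makeContrasts(
  LateVsEarly_BvsA = (cohortB_late - cohortB_early) -
                     (cohortA_late - cohortA_early),
  levels = design
)
interaction_contrast
```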

If the strategy for 3. and 4. is acceptable, what is the best way to deal with 5.?
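One thing I can check myself is whether adding time on top of cohort_combinedtime gives a full-rank design. The sketch below uses a balanced toy layout with hypothetical samples, not my real data. Because time intervals 3 and 4 together make up "late", I would expect the time columns to be partly redundant with the combined groups.

```r
# Balanced toy layout: 2 cohorts x 4 time intervals x 3 hypothetical samples
meta <- expand.grid(rep = 1:3, time = 1:4, cohort = c("A", "B"))
meta$time <- factor(meta$time)
meta$combined <- ifelse(meta$time %in% c("1", "2"), "early", "late")
meta$cohort_combinedtime <- factor(
  paste0("cohort", meta$cohort, "_", meta$combined)
)

design <- model.matrix(~0 + cohort_combinedtime + time, data = meta)

# Compare the number of columns with the numerical rank of the design
ncol(design)     # 7 columns: 4 combined groups + 3 time indicators
qr(design)$rank  # 6, so one coefficient is not estimable
```

In this toy layout the indicators for time 3 and time 4 sum to the two "late" columns, so the design is rank deficient even before SubjectID is added.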

Thanks in advance!

Thank you very much, Gordon. I'll give voomLmFit a go!