If weight loss and time are well correlated, you're in trouble. With your experimental design, there's no way to distinguish between the effect of time - due to aging or whatever - and the effect of weight loss. The correct way to do it would be to have a control group without any diet, which would provide a baseline for the time effect.

Nonetheless, you might be able to get something out of this data if time and weight loss are not well correlated (e.g., some individuals lose more weight than others, or weight loss is not linear with respect to time.) You can then put both factors into the model to identify significant effects associated with weight loss that are conditional on the time effect. This reduces the chance that the resulting DE genes are driven by the confounding time effect. However, this comes at the cost of a loss of power, which is more severe as time and weight loss become more correlated.

Note that if time and weight loss are perfectly correlated, then you can't put them in the same model together, as the coefficients will not be estimable. And `TP:perc_weightlost`

doesn't make sense, just use an additive model.

## Edit:

I hadn't appreciated that you were modeling time as a factor, not a covariate. That's fine, and probably the correct approach, but it changes my answer a bit. Let's mock up a sample table for demonstration purposes.

```
patient <- gl(8, 2)
time <- rep(c("T0", "T1"), 8) # Two time points, for simplicity.
set.seed(100)
weight0 <- rnorm(8, mean=70)
weight1 <- weight0 - runif(8, 0, 10)
wloss <- as.vector(rbind(0, 1 - weight0/weight1)) # interleaved
```

Your second proposed design was almost correct. I would instead do:

```
design <- model.matrix(~ 0 + patient + time + time:wloss)
design <- design[,-10] # remove the all-zero column.
```

The first 8 coefficients are patient blocking factors, representing the fitted expression of each patient at time zero. The next coefficient is the average log-fold change due to time within each patient. The final coefficient represents the "effect" (i.e., association) of expression with weight loss at time point 1. As I mentioned before, the last two coefficients are likely to be almost completely confounding, which will reduce your power to detect changes compared to a more carefully designed experiment. This is the **correct behavior**, because otherwise you would get false positives associated with weight loss that are actually caused by time.

The above example is fairly easy to extend to three time points. Note that if `age`

and `sex`

are the same for all samples derived from a single patient, they will be redundant with the patient blocking factors - you don't need them. Same for sequencing center, if all samples from a single patient were sequenced at the same location.

Hi Aaron,

Thank you so much for the guidance. Yes, some individuals lose more weight than others and weight loss is not linear with respect to time, plus between timepoints 2 and 3, the diet is different than between timepoints 1 and 2. I realise all this complicates matters and I am quite confused on how to model the data.

If I “put both factors into the model to identify significant effects associated with weight loss that are conditional on the time effect” would it be a simple additive model with a contrast matrix? I'm not quite sure how to do a contrast matrix with a continuous and categorical variable. Or can I make the timepoint the intercept?

I am also unsure about my independent variable (percentage weight lost). The reason I use percentage weight lost is because the heavier one is, the easier it is to lose weight. So, kgs lost is not comparable across the individuals. But for me to use percentage weight lost, I have made the value zero at TP1. In hind-sight, I am thinking this is wrong, because then the model cannot find the association between percentage weight lost and gene expression at TP1. Is this correct?

If I use weight (kgs) at TP1, TP2, TP3, how do I model the effects of starting weight?

Thank you so much, Mahes Muniandy

See my updated answer. Don't worry about the fact that the weight loss is zero at the first time point. There's no need for the model to account for this, because there's nothing there!

Now, the appropriate model would be different if you wanted to account for the association of expression with the starting weight for each patient, in which case you would need to put in weights at the first time point. But this makes the model more complicated (i.e., the coefficients become more difficult to interpret) and only helps if the weight-related association at the first time point disappears at later time points (otherwise the patient blocking factors would be sufficient).

Hi Aaron, Thank you once again. I have tried the below model as suggested and as expected I get fewer significant genes compared to not having the time points in the model.

Just one more question, if I want to find the effects of age and sex between the individuals, can I extend this to the multi-level model with duplicateCorrelation? Many Thanks, Mahes

Yes, if you wanted to do that, block on

`patient`

in`duplicateCorrelation()`

and put in`age`

and`sex`

as additive terms in the design matrix. Those things don't seem that interesting to me, though; perhaps an interaction between`wloss`

and`sex`

would be more fun.