If weight loss and time are well correlated, you're in trouble. With your experimental design, there's no way to distinguish between the effect of time - due to aging or whatever - and the effect of weight loss. The correct way to do it would be to have a control group without any diet, which would provide a baseline for the time effect.

Nonetheless, you might be able to get something out of this data if time and weight loss are not well correlated (e.g., some individuals lose more weight than others, or weight loss is not linear with respect to time). You can then put both factors into the model to identify significant effects associated with weight loss that are conditional on the time effect. This reduces the chance that the resulting DE genes are driven by the confounding time effect. However, it comes at the cost of power, and the loss becomes more severe as time and weight loss become more correlated.
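To get a feel for that loss of power, here is a minimal sketch in base R. Everything in it is made up for illustration: `time` is treated as a numeric covariate, `noise` stands in for patient-to-patient variability, and `mix` controls how much of the weight loss is explained by time. The quantity computed is the diagonal entry of (X'X)^-1 for the weight loss coefficient, which is proportional to the variance of its estimate.

```r
# Hypothetical illustration, not the poster's data.
set.seed(1)
n <- 50
time <- rnorm(n)    # time as a numeric covariate (assumption)
noise <- rnorm(n)   # part of weight loss that is unrelated to time

# Unscaled variance of the wloss coefficient for a given mixing weight 'w'.
# w = 0: wloss is unrelated to time; w near 1: wloss is almost a copy of time.
coef_var <- function(w) {
    wloss <- as.numeric(scale(w * time + (1 - w) * noise))  # same spread for every w
    X <- model.matrix(~ time + wloss)
    solve(crossprod(X))[3, 3]   # third column of X is wloss
}

mix <- c(0, 0.5, 0.9, 0.99)
var_by_mix <- sapply(mix, coef_var)
names(var_by_mix) <- mix
var_by_mix
```

The values grow as `mix` approaches 1, which is the power loss described above: the more weight loss tracks time, the less independent information is left to estimate its coefficient.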

Note that if time and weight loss are perfectly correlated, then you can't put them in the same model together, as the coefficients will not be estimable. Also, `TP:perc_weightlost` doesn't make sense; just use an additive model.
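A quick way to see the estimability problem is to check the rank of the design matrix. The sketch below uses a made-up layout (four patients, three visits) in which weight loss is an exact multiple of time; the variable names are placeholders, not the ones from your data.

```r
# Hypothetical toy data: weight loss is an exact linear function of time.
time <- rep(0:2, 4)
wloss <- 5 * time
design <- model.matrix(~ time + wloss)

# A rank below the number of columns means some coefficients are not estimable.
design_rank <- qr(design)$rank
design_rank
ncol(design)

# lm() signals the same problem by returning NA for the aliased coefficient.
set.seed(1)
y <- rnorm(length(time))
fit_coefs <- coef(lm(y ~ time + wloss))
fit_coefs
```

If you are using limma, `nonEstimable(design)` should report the offending column names directly.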

## Edit:

I hadn't appreciated that you were modeling time as a factor, not a covariate. That's fine, and probably the correct approach, but it changes my answer a bit. Let's mock up a sample table for demonstration purposes.

```
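patient <- gl(8, 2) # Eight patients, two samples each.
time <- rep(c("T0", "T1"), 8) # Two time points, for simplicity.
set.seed(100)
weight0 <- rnorm(8, mean=70) # Baseline weight.
weight1 <- weight0 - runif(8, 0, 10) # Weight at T1.
# Fraction of baseline weight lost: zero at T0, positive at T1 (interleaved).
wloss <- as.vector(rbind(0, 1 - weight1/weight0))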
```

Your second proposed design was almost correct. I would instead do:

```
design <- model.matrix(~ 0 + patient + time + time:wloss)
design <- design[,-10] # remove the all-zero timeT0:wloss column (wloss is zero at T0).
```

The first 8 coefficients are patient blocking factors, representing the fitted expression of each patient at T0. The next coefficient is the average log-fold change due to time within each patient, i.e., the T1 versus T0 change expected at zero weight loss. The final coefficient represents the "effect" (i.e., association) of expression with weight loss at T1. As I mentioned before, the last two coefficients are likely to be almost completely confounded with each other, which will reduce your power to detect changes compared to a more carefully designed experiment. This is the **correct behavior**, because otherwise you would get false positives associated with weight loss that are actually caused by time.
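If you want to check this on your own design, the sketch below rebuilds a mock sample table of the same shape and inspects the matrix. The definition of `wloss` here (fraction of baseline weight lost) is my assumption for the mock data, and the all-zero column is dropped by content rather than by position, so it doesn't depend on column 10 being the right one.

```r
# Mock sample table (hypothetical values).
patient <- gl(8, 2)
time <- factor(rep(c("T0", "T1"), 8))
set.seed(100)
weight0 <- rnorm(8, mean = 70)
weight1 <- weight0 - runif(8, 0, 10)
# Assumed definition: fraction of baseline weight lost, zero at T0.
wloss <- as.vector(rbind(0, (weight0 - weight1) / weight0))

design <- model.matrix(~ 0 + patient + time + time:wloss)

# Drop any all-zero column, whatever its position.
keep <- colSums(design != 0) > 0
dropped <- colnames(design)[!keep]
dropped
design <- design[, keep]

# All remaining coefficients should be estimable.
qr(design)$rank == ncol(design)

# How strongly the time and weight loss columns track each other.
cor(design[, "timeT1"], design[, "timeT1:wloss"])
```

The closer that last correlation is to 1, the less power you have to separate the weight loss association from the time effect.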

The above example is fairly easy to extend to three time points. Note that if `age` and `sex` are the same for all samples derived from a single patient, they will be redundant with the patient blocking factors, so you don't need them. The same applies to sequencing center, if all samples from a single patient were sequenced at the same location.
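Here is one way the three-time-point version might look, together with a check of the redundancy. The layout (four patients, visits `T0` to `T2`), the `sex` assignments and the weight loss values are all invented for the example.

```r
# Hypothetical layout: 4 patients, 3 time points each.
patient <- gl(4, 3)
time <- factor(rep(c("T0", "T1", "T2"), 4))
sex <- factor(rep(c("F", "M", "F", "M"), each = 3))  # constant within patient

set.seed(100)
# Assumed fraction of baseline weight lost: zero at T0, then increasing.
wloss <- as.vector(rbind(0, runif(4, 0, 0.05), runif(4, 0.05, 0.15)))

# Same formula as for two time points; only the all-zero T0 column is dropped.
design <- model.matrix(~ 0 + patient + time + time:wloss)
design <- design[, colSums(design != 0) > 0]
colnames(design)

# Adding sex on top of the patient blocking factors adds a column but no rank.
with_sex <- model.matrix(~ 0 + patient + sex + time + time:wloss)
with_sex <- with_sex[, colSums(with_sex != 0) > 0]
redundant <- ncol(with_sex) - qr(with_sex)$rank
redundant
```

The design now has one time coefficient and one weight loss coefficient per non-baseline time point, and `redundant` counts the columns that carry no information beyond the patient factors.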