Question

Modeling a combination of synchronized and unsynchronized time points?

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Scripps Research, La Jolla, CA

I have a dataset that consists of repeated RNA-seq samples from a set of individuals at multiple time points. The early time points are all uniform: every individual was sampled at 0, 7, 14, and 28 days. After that, individuals were sampled at roughly periodic intervals, but each on different days. An MDS plot of the data indicates that time point effects explain at least the first 2 principal coordinates of the data:

PCoA Plot colored by time point

In the MDS plot, each line represents the trajectory of an individual over time, and the points represent each of the samples for that individual. The color represents the range of the first 30 days (i.e. all the synchronized time points) and each time point after that is roughly another month. (POD stands for "post-operation day".) As you can see, the general trend starts on the left, moves right, then up, then slowly drifts left again. There are other experimental factors as well, but modelling them is more straightforward, and they don't show up in PC1 or PC2, so I'm just focusing on time point for this question.

Anyway, I mostly care about the early time points (0 through 28), but I still want to include the later time points in my model to improve the variance estimation, as is generally recommended. My problem is that the early time points are more easily modelled as a factor of discrete time points, while the later time points pretty much have to be modelled as a continuous numeric variable, probably using the natural spline basis method. I'd like to avoid using the natural spline approach for the entire time point variable, since I want to make specific contrasts between specific early time points, and this is difficult or impossible with a spline basis. So is it valid to split the time variable into two stages like this and model them using different methods? How would I evaluate whether such a design is a good fit for the data?

Here is the table of all the samples' individual ID and time point, if you want to look at it: https://dl.dropboxusercontent.com/u/1581949/table.xlsx

design matrix spline ns limma • 1.0k views

updated 8.5 years ago by Aaron Lun ★ 28k • written 8.5 years ago by Ryan C. Thompson ★ 7.9k

score 0 · Answer 1 · 2015-11-06

You've got plenty of residual d.f., and I don't think you need to worry about squeezing out more from the later time points if you're not going to use them for DE testing. For example, you've got around 30 patients, so that's about 90 d.f. for variance estimation if you use an additive model (i.e., ~Individual + Timepoint) including only the samples for time points 0 to 28. I think that'd be more than enough.
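
To put rough numbers on that estimate, here is a back-of-the-envelope check in Python. It assumes exactly 30 individuals and that every one of them has all four synchronized samples; the real counts are only approximate ("around 30"), so treat the result as a ballpark figure rather than the exact residual d.f. for this dataset.

```python
# Residual d.f. for an additive model (~Individual + Timepoint),
# using only the synchronized time points.
n_individuals = 30                 # assumption: "around 30 patients"
early_timepoints = [0, 7, 14, 28]  # days at which everyone was sampled

# Assumption: no missing samples, so one sample per individual per time point.
n_samples = n_individuals * len(early_timepoints)

# One coefficient per individual, plus one per non-baseline time point.
n_coefficients = n_individuals + (len(early_timepoints) - 1)

residual_df = n_samples - n_coefficients
print(n_samples, n_coefficients, residual_df)  # 120 33 87
```

So 120 samples minus 33 coefficients leaves 87 residual d.f., which is where the "about 90" comes from. Any additional experimental factors in the model would reduce this slightly.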

If you still want to use the later time points, one issue that comes to mind is whether you want individual-specific splines, i.e., separate spline coefficients for each individual rather than a common basis matrix for all individuals. This may be advisable if each individual starts to respond differently at the later times. However, if you're going down that road, you'll probably need to use splines with different d.f. for each patient, as the number of time points isn't consistent. Any DE testing will also be limited to testing for a time effect within patients, as it'd be hard to match up coefficients for different time points between patients.

Statistically speaking, I don't think there's any inherent problem with modelling the time points differently. It's just whether it's worth the effort of doing so. As for the fit, you can try modelling with and without the later time points, and see whether you get more DE genes. You can also compare the size of the variances between the models; I'd expect them to increase if the spline fit was poor.
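
As a rough illustration of what "modelling the time points differently" could look like, the sketch below builds only the time-related columns of such a design for a single sample. Everything in it is hypothetical: the names `time_columns` and `contrast_value`, the column layout, and in particular the single linear term for the late phase, which is a placeholder for whatever spline basis columns you would actually use. The individual effects and other experimental factors are left out.

```python
EARLY_DAYS = [0, 7, 14, 28]  # synchronized time points, treated as factor levels
LAST_EARLY_DAY = 28          # assumption: anything after day 28 counts as "late"

def time_columns(day):
    """Time-related design columns for one sample.

    Layout: [day7, day14, day28, late, late_days]
    Day 0 is the baseline, so it has no indicator of its own.
    """
    if day in EARLY_DAYS:
        indicators = [1.0 if day == d else 0.0 for d in EARLY_DAYS[1:]]
        return indicators + [0.0, 0.0]
    if day > LAST_EARLY_DAY:
        # Late phase: its own offset plus a continuous term.
        # The linear term is a placeholder for spline basis columns.
        return [0.0, 0.0, 0.0, 1.0, float(day - LAST_EARLY_DAY)]
    raise ValueError(f"day {day} is neither a synchronized nor a late time point")

def contrast_value(coefficients, contrast):
    """Apply a contrast vector to a vector of fitted coefficients."""
    return sum(c * w for c, w in zip(coefficients, contrast))

# Contrast between two early time points: day 14 minus day 7.
contrast_14_vs_7 = [-1.0, 1.0, 0.0, 0.0, 0.0]

print(time_columns(14))  # [0.0, 1.0, 0.0, 0.0, 0.0]
print(time_columns(63))  # [0.0, 0.0, 0.0, 1.0, 35.0]
```

The point of this layout is that contrasts between early time points only touch the factor columns, so they stay as simple as in a purely factorial design, while the late samples still contribute to the residual variance through their own columns.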