Question

edgeR: time series analysis

BharathAnanth ▴ 80

@bharathananth-10049

Hi

I have RNA-seq time course data consisting of 11 individual time points, but I do not have replicates at any time point. To detect oscillations, I am trying to fit a simple linear model of the form:

time <- seq(2,22,by=2)

in.phase <- cos(2*pi/22*time)
out.phase <- sin(2*pi/22*time)

design <- model.matrix(~in.phase + out.phase)

My question is whether my large residual degrees of freedom can compensate for my lack of biological replicates at each time point. In other words, can I use the standard pipeline with estimateDisp(y, design, robust=TRUE) to process my data, or do I need to (a) choose a reasonable BCV value (as suggested in the manual) or (b) estimate only the trended dispersion?

Following the standard pipeline, I was wondering whether the oscillating genes (which are obviously also the ones with a lot of sample-to-sample variability in my case) get assigned larger than "reasonable" tagwise dispersions. I do not have problems identifying them with the standard pipeline, but I am trying to understand what assumptions I am making.

Thank you.

edger time course • 2.0k views

updated 7.2 years ago by Gordon Smyth 50k • written 7.2 years ago by BharathAnanth ▴ 80

score 2 · Answer 1 · 2017-02-20

As long as your model has non-zero residual d.f., you can estimate a dispersion for each gene. With time series, the general assumption is that expression follows some smooth trend with respect to time, so deviations from that trend can be used for dispersion estimation. Obviously, the more residual d.f. you have, the more precise your dispersion estimates are, and the more reliable your downstream analyses will be. This is easiest to achieve with more replicates, as it avoids the need to make strong assumptions about your response to time.
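To make the residual d.f. concrete for the design in the question, here is a minimal sketch in base R. It rebuilds the design matrix from the question and subtracts the number of estimable coefficients from the number of samples. The names n.samples, n.coefs and residual.df are mine and are only for illustration.

```r
# Covariates exactly as defined in the question
time <- seq(2, 22, by = 2)
in.phase <- cos(2 * pi / 22 * time)
out.phase <- sin(2 * pi / 22 * time)
design <- model.matrix(~ in.phase + out.phase)

# Residual d.f. = number of samples - number of estimable coefficients
n.samples <- nrow(design)      # one library per time point
n.coefs <- qr(design)$rank     # intercept + in.phase + out.phase
residual.df <- n.samples - n.coefs
print(residual.df)
```

With 11 libraries and 3 coefficients this leaves 8 residual d.f. per gene, so the model is estimable even without replicates at individual time points.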

In your case, you've applied the cosine and sine functions under the assumption that one cycle takes exactly 22 time units. A linear combination of a cosine and a sine with the same period can represent any amplitude and phase, but the result always has that same period, so I don't think these two terms can represent situations where cycles are faster or slower. If a gene had a different cycling time, its expression profile with respect to time would not be modelled well, resulting in an inflated dispersion.
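A quick way to check this is with a toy example in base R. No edgeR is involved, and the signals y22 and y11 are made-up noise-free curves, not real expression data. The sketch fits the same two covariates to a phase-shifted curve with a 22-unit period and to a curve with an 11-unit period, then compares the residual sums of squares.

```r
# Covariates exactly as defined in the question
time <- seq(2, 22, by = 2)
in.phase <- cos(2 * pi / 22 * time)
out.phase <- sin(2 * pi / 22 * time)

# Hypothetical noise-free profiles (illustration only)
y22 <- 3 * cos(2 * pi / 22 * time - 1)  # period 22, arbitrary phase shift
y11 <- 3 * cos(2 * pi / 11 * time - 1)  # period 11, i.e. cycling twice as fast

# Ordinary least squares with the same two covariates
fit22 <- lm(y22 ~ in.phase + out.phase)
fit11 <- lm(y11 ~ in.phase + out.phase)

# Residual sum of squares: what the model fails to explain
rss22 <- sum(resid(fit22)^2)
rss11 <- sum(resid(fit11)^2)
print(c(period22 = rss22, period11 = rss11))
```

The 22-unit curve is fitted essentially exactly whatever its phase, whereas the 11-unit curve leaves a large residual. In a real analysis that leftover variation would end up in the gene's dispersion estimate.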