Question: edgeR: time series analysis
20 months ago by
BharathAnanth30 wrote:

Hi

I have RNA-seq time course data consisting of 11 individual time points. However, I do not have replicates at any time point. To detect oscillations, I am trying to fit a simple linear model of the form:

time <- seq(2,22,by=2)

in.phase <- cos(2*pi/22*time)
out.phase <- sin(2*pi/22*time)

design <- model.matrix(~in.phase + out.phase)

My question is whether my large residual degrees of freedom can compensate for my lack of biological replicates at each time point. In other words, can I use the standard pipeline with estimateDisp(y, design, robust=TRUE) to process my data, or do I need to (a) choose a reasonable BCV value (as suggested in the manual) or (b) estimate only the trended dispersion?

Following the standard pipeline, I was wondering whether the oscillating genes (which, in my case, are obviously also the ones with a lot of sample-to-sample variability) get assigned larger than "reasonable" tagwise dispersions. I do not have problems identifying them with the standard pipeline, but I am trying to understand what assumptions I am making.

Thank you.

modified 20 months ago by Gordon Smyth35k • written 20 months ago by BharathAnanth30
20 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

As long as your model has non-zero residual d.f., you can estimate a dispersion for each gene. With time series, the general assumption is that expression follows some smooth trend with respect to time - deviations from that trend can be used for dispersion estimation. Obviously, the more residual d.f. you have, the more precise your dispersion estimates are, and the more reliable your downstream analyses will be. This is easiest to achieve with more replicates, as it avoids the need to make strong assumptions about your response to time.
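To make the residual d.f. concrete, here is a minimal base-R sketch that rebuilds the design matrix from the question and counts what is left over for dispersion estimation. The variable names n.samples, n.coefs and residual.df are only illustrative and are not part of edgeR.

```r
# Time points and harmonic covariates, as defined in the question.
time <- seq(2, 22, by = 2)
in.phase <- cos(2 * pi / 22 * time)
out.phase <- sin(2 * pi / 22 * time)
design <- model.matrix(~ in.phase + out.phase)

# Residual d.f. = number of samples minus the rank of the design matrix.
n.samples <- nrow(design)        # one library per time point
n.coefs <- qr(design)$rank       # intercept + cosine + sine terms
residual.df <- n.samples - n.coefs
residual.df
```

With one library per time point, every residual d.f. comes from deviations around the fitted cosine/sine curve rather than from replicate-to-replicate differences, so the dispersion estimates are only as trustworthy as the assumed curve.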

In your case, you've applied the cosine and sine functions under the assumption that one cycle takes exactly 22 time units. I can't remember all my trigonometric identities, but I don't think that linear sums of these functions can be used to represent situations where cycles are faster or slower. If a gene had a different cycling time, its expression profile with respect to time would not be modelled well, resulting in an inflated dispersion.
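The relevant identity is a*cos(w*t) + b*sin(w*t) = R*cos(w*t - phi), with R = sqrt(a^2 + b^2) and phi = atan2(b, a). A linear sum of the two terms can therefore reproduce any amplitude and phase, but only at the single frequency w. The sketch below illustrates this with two made-up, noise-free profiles and an ordinary least-squares fit via lm(). This is a simplification for illustration only: edgeR fits a log-linear negative binomial model to counts, and the profiles, the phase shift of 1.3 and the helper rss() are all hypothetical.

```r
time <- seq(2, 22, by = 2)
omega <- 2 * pi / 22                 # angular frequency for a 22-unit cycle
in.phase <- cos(omega * time)
out.phase <- sin(omega * time)

# Hypothetical noise-free profiles (not real data).
same.period <- 5 + 2 * cos(omega * time - 1.3)      # 22-unit cycle, shifted phase
half.period <- 5 + 2 * cos(2 * omega * time - 1.3)  # 11-unit cycle

# Residual sum of squares after fitting the cosine/sine model from the question.
rss <- function(y) sum(resid(lm(y ~ in.phase + out.phase))^2)

rss.same <- rss(same.period)   # essentially zero: the phase shift is absorbed
rss.half <- rss(half.period)   # clearly non-zero: the faster cycle is not captured
c(same = rss.same, half = rss.half)
```

In a real analysis, the unexplained variation in the second profile would be attributed to the dispersion, which is the inflation described above.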
