I have used edgeR and limoRhyde to predict differential expression in a circadian experiment. Below ZT_cos and ZT_sin is sin and cos transformations of the time variable as calculated by limoRhyde.
The model is of the form model.matrix(~ 0 + Genotype + Genotype:(ZT_cos + ZT_sin))
which results in a model matrix that looks like:
Control KO Control:ZT_cos Control:ZT_sin KO:ZT_cos KO:ZT_sin 1 0 0.82 0.5 0 0 0 1 0 0 0.82 0.5
etc...
After running the differential expression analysis I would like to calculate the expected expression level at an arbitrary timepoint, I tried generating ZT_cos and ZT_sin for a large number of timepoints and then calculating expected expression in the control condition as:
expression = logFC_Control + logFC_Control:ZT_sin * ZT_sin + logFC_Control:ZT_cos * ZT_cos
The problem is that my model is made without an intercept as this makes differential expression more straightforward (the actual setup is more complicated than indicated above). This means that logFC_KO and logFC_Control have values that are meaningful in relation to each other but not on their own. By playing around with the results I noticed that all that is missing is a constant to get what seems like a meaningful fit - but I cannot figure out what the constant is. I tried the average offset, but that quite far off.
I guess my question boils down to:
Given a model without an intercept, what is the value that the logFC's are calculated relative to?
Yes, that is what I was hoping to model - good to hear that it is correct. My question was more: "how do I use the estimated logFCs to calculate the expected CPMs at a time point I did not observe?"
Today I realised I could calculate the expression as AveLogCPM + (logFC_control - logFC_KO) + time_effect for the expression in the control condition. I believe the first two terms are the constant factor I was looking for above.
A second question: If I have multiple conditions, how do I calculate how an individual condition deviate from the average expression? Is it AveLogCPM + logFC_cond1 - mean(logFC_cond1 + ... + logFC_condN)?
There's no need to use the log-fold changes for what you want to do. You have the coefficients of the fitted GLM, so you can just compute a linear combination of terms for the given genotype. Specifically:
where
Control
,Control:ZT_cos
andControl:ZT_sin
are the estimated values of the coefficients of the same name for a given gene, and X and Y are the (co)sine values for the time point that you want. It's simple to do the same for KO.This will give you a per-observation value equivalent to (i) dividing the corresponding count by the library size, assuming the offsets are log-library sizes; and (ii) log-transforming. To obtain log-CPMs, simply add
log(1e6)
to the computed value - you can then divide bylog(2)
orlog(10)
or whatever you want the base of the log-transform to be.Great! That was exactly what I wanted. I tried using the model coefficients at first but I did not realise I needed to scale them so they were all off.