Hello,
I am trying to use edgeR with a custom offset matrix (i.e., a different offset value in the GLM for each gene). I'm doing something like this:
y = DGEList(counts=toc,group=groups)
design = model.matrix(~groups)
y$offsets = too #Where too is a matrix of the same dimensions as toc
y = estimateDisp(y,design)
fit = glmFit(y,design)
tst = glmTreat(fit,lfc=lfc_cut)
This works fine and seems to give sensible results. However, I'm unclear if the tagwise dispersion values calculated by estimateDisp use just the offset matrix, the lib.size values or some combination of the two. Essentially, if I pass estimateDisp a DGEList with a matrix offset entry, will these values be used in calculating the common and tagwise dispersions? If not, how should I calculate dispersion when a matrix offset is needed?
Would it ever make sense, then, to have the
cpm
function (or an analogous function, whatever it might be called) to have the option to usey$offset
if it is available, instead of always relying on thelib.size
andnorm.factors
?Possibly, depending on what you're going to use the CPMs for. If you're going to do something semi-quantitative (e.g., clustering on the log-CPMs), then it makes sense to incorporate the appropriate normalization information stored in
y$offset
. However, I usually just usecpm
to get some values to plot, in which case the simplicity of the standard definition is more appealing. There is the extra complication that the offsets should be of roughly the same absolute scale as the log-library sizes, otherwise the output can't be interpreted as CPMs. This makes it difficult to guarantee that sensible values are returned bycpm
in the presence of arbitrary user-defined offset matrices. (The last point isn't much a problem with the rest of edgeR, where the important thing is the relative size of the offsets between samples.)Just to emphasize the last part of what Aaron has said. edgeR places no constraints on what users can input as the matrix of offsets. In fact one can add an additive constant to the offset values for each gene without changing any of the downstream results, except for the intercept coefficient estimate for each gene. This means that absolute sizes of the offsets have no meaning. Hence we cannot compute cpm (a quantity whose size has an absolute meaning) from offsets (whose size has no meaning).
I suppose we could go down a whole new development path of interpreting offsets in a stricter fashion, but no one has presented us with good reason why this would be worth spending time on.
Thank you both for the help. I have indeed been using y$offsets rather than offset, but as Gordon says it seems to be working fine. At least, the co-effecients produced by the fit are consistent with what I would expect if the offsets were being applied, but I will correct the error to be on the safe side though.
I started worrying about the offsets not being used in calculating dispersion when looking at the code for estimateDisp.default. To me it seems like the trended dispersion calculation uses the output of aveLogCPM, which uses only lib.size and not the offsets. I understand that this estimate is then shrunk towards the common.dispersion value in produced the tagwise.dispersion values, which clearly does use the offsets, but I was confused as to why the trended.dispersion estimates did not need to explicitly know about the offsets.
Gordon's explanation of the cpm values not being meaningful in combination with an arbitrary offset explains why this is the case. Given this is the case, am I better off using only the common dispersion estimates (and ignoring the tagwise.dispersion values) when using the negative binomial model to test for DE between groups?
Are you better off just using common dispersion? No! That would be throwing the baby out with the bath water.
The offsets are used to compute all the dispersions including the dispersion trend. The only quantity they are not used for is the aveLogCPM, but this is not something to be concerned about. If the offsets were used for aveLogCPM it would be make so little difference you would hardly notice it in most cases. You are worrying about something that is not important.
Thank you Gordon. I'll stop worrying about things I don't understand well enough to worry about.
The actual dispersion calculations all use the offsets if you provide them. The aveLogCPM calculation does not use the offsets, but this is only used to arrange the genes in order of abundance when fitting the dispersion.
Actually I think edgeR will recognise y$offsets. It will look for y$offset but R will return y$offsets instead because of automatic name completion in R.