Question

edgeR: Tagwise dispersion when using offset matrix

0

Entering edit mode

my4 ▴ 10

@my4-10129

Last seen 6.4 years ago

Hello,

I am trying to use edgeR with a custom offset matrix (i.e., a different offset value in the GLM for each gene). I'm doing something like this:

y = DGEList(counts=toc,group=groups)
design = model.matrix(~groups)
y$offsets = too #Where too is a matrix of the same dimensions as toc
y = estimateDisp(y,design)
fit = glmFit(y,design)
tst = glmTreat(fit,lfc=lfc_cut)

This works fine and seems to give sensible results. However, I'm unclear if the tagwise dispersion values calculated by estimateDisp use just the offset matrix, the lib.size values or some combination of the two. Essentially, if I pass estimateDisp a DGEList with a matrix offset entry, will these values be used in calculating the common and tagwise dispersions? If not, how should I calculate dispersion when a matrix offset is needed?

edger • 2.4k views

ADD COMMENT • link updated 7.4 years ago by Aaron Lun ★ 28k • written 7.4 years ago by my4 ▴ 10

score 3 · Answer 1 · 2016-11-17

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 9 hours ago

The city by the bay

edgeR won't recognise y$offsets, you should be using y$offset. If you do so, the offsets will be used and the normalization factors/library sizes will be ignored. In general, this is the case for most edgeR functions, where the former is prioritized over the latter when it is available. If you want to convince yourself, run it again after setting the offsets to NULL and you should see a difference in the results (provided your offsets are not just log-transformed effective library sizes). Also check out ?getOffset, which is called by estimateDisp.

ADD COMMENT • link 7.4 years ago Aaron Lun ★ 28k

1

Entering edit mode

Would it ever make sense, then, to have the cpm function (or an analogous function, whatever it might be called) to have the option to use y$offset if it is available, instead of always relying on the lib.size and norm.factors?

ADD REPLY • link 7.4 years ago Steve Lianoglou ★ 13k

1

Entering edit mode

Possibly, depending on what you're going to use the CPMs for. If you're going to do something semi-quantitative (e.g., clustering on the log-CPMs), then it makes sense to incorporate the appropriate normalization information stored in y$offset. However, I usually just use cpm to get some values to plot, in which case the simplicity of the standard definition is more appealing. There is the extra complication that the offsets should be of roughly the same absolute scale as the log-library sizes, otherwise the output can't be interpreted as CPMs. This makes it difficult to guarantee that sensible values are returned by cpm in the presence of arbitrary user-defined offset matrices. (The last point isn't much a problem with the rest of edgeR, where the important thing is the relative size of the offsets between samples.)

ADD REPLY • link 7.4 years ago Aaron Lun ★ 28k

1

Entering edit mode

Just to emphasize the last part of what Aaron has said. edgeR places no constraints on what users can input as the matrix of offsets. In fact one can add an additive constant to the offset values for each gene without changing any of the downstream results, except for the intercept coefficient estimate for each gene. This means that absolute sizes of the offsets have no meaning. Hence we cannot compute cpm (a quantity whose size has an absolute meaning) from offsets (whose size has no meaning).

I suppose we could go down a whole new development path of interpreting offsets in a stricter fashion, but no one has presented us with good reason why this would be worth spending time on.

ADD REPLY • link 7.4 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you both for the help. I have indeed been using y$offsets rather than offset, but as Gordon says it seems to be working fine. At least, the co-effecients produced by the fit are consistent with what I would expect if the offsets were being applied, but I will correct the error to be on the safe side though.

I started worrying about the offsets not being used in calculating dispersion when looking at the code for estimateDisp.default. To me it seems like the trended dispersion calculation uses the output of aveLogCPM, which uses only lib.size and not the offsets. I understand that this estimate is then shrunk towards the common.dispersion value in produced the tagwise.dispersion values, which clearly does use the offsets, but I was confused as to why the trended.dispersion estimates did not need to explicitly know about the offsets.

Gordon's explanation of the cpm values not being meaningful in combination with an arbitrary offset explains why this is the case. Given this is the case, am I better off using only the common dispersion estimates (and ignoring the tagwise.dispersion values) when using the negative binomial model to test for DE between groups?

ADD REPLY • link 7.4 years ago my4 ▴ 10

2

Entering edit mode

Are you better off just using common dispersion? No! That would be throwing the baby out with the bath water.

The offsets are used to compute all the dispersions including the dispersion trend. The only quantity they are not used for is the aveLogCPM, but this is not something to be concerned about. If the offsets were used for aveLogCPM it would be make so little difference you would hardly notice it in most cases. You are worrying about something that is not important.

ADD REPLY • link 7.4 years ago Gordon Smyth 50k

0

Entering edit mode

Thank you Gordon. I'll stop worrying about things I don't understand well enough to worry about.

ADD REPLY • link 7.4 years ago my4 ▴ 10

0

Entering edit mode

The actual dispersion calculations all use the offsets if you provide them. The aveLogCPM calculation does not use the offsets, but this is only used to arrange the genes in order of abundance when fitting the dispersion.

ADD REPLY • link 7.4 years ago Ryan C. Thompson ★ 7.9k

1

Entering edit mode

Actually I think edgeR will recognise y$offsets. It will look for y$offset but R will return y$offsets instead because of automatic name completion in R.

ADD REPLY • link 7.4 years ago Gordon Smyth 50k