6 months ago by

Cambridge, United Kingdom

Firstly, I should point out that using pseudo-counts (i.e., "classic" *edgeR*) is not the recommended way to do things anymore. Use `estimateDisp`

instead of the old dispersion estimation functions, and use `glmQLFit`

and `glmQLFTest`

instead of `glmFit`

and `glmLRT`

.

Now, let's correct some of your misconceptions. The log-fold change of the row means is not guaranteed to match the reported `logFC`

from *edgeR*'s GLM-based functions:

```
set.seed(0)
y <- matrix(rnbinom(10000, mu=10, size=10), ncol=10)
groups <- gl(2, 5)
design <- model.matrix(~groups)
fit <- glmFit(y, design, dispersion=0.1) # glmFit(), as dispersion is known.
res <- glmLRT(fit)
adj <- cpm(y) # accounting for slight differences in library size.
custom <- log2(rowMeans(adj[,6:10])/rowMeans(adj[,1:5]))
summary(res$table$logFC-custom) # non-zero.
```

This is partly because the estimate of the "average expression" in a negative binomial model is only proportional to the mean count when the library sizes are equal across samples. This is never true in real data sets. Computing library size-normalized values with `cpm`

doesn't help - this is a phenomenon related to the mean-variance relationship of count data. So using `rowMeans`

is already a step in the wrong direction. In addition, `glmFit()`

automatically adds a `prior.count=0.125`

to avoid undefined log-fold changes when one or the other group has zero counts, which further shifts the reported value from your simple estimate.

If we wanted to get average expression values that match up to the log-fold changes, I would prefer:

```
design2 <- model.matrix(~0 + groups)
fit2 <- glmFit(y, design2, dispersion=0.1)
res2 <- glmLRT(fit2, contrast=c(-1, 1))
g1 <- fit2$coefficients[,1]/log(2) # coefficients are natural log.
g2 <- fit2$coefficients[,2]/log(2)
custom <- g2 - g1
summary(res2$table$logFC - custom) # basically all-zero
```

Here, `g1`

and `g2`

represent the log-average expression in each group, adjusted by the library size. You might add the log-transformed mean library size to each of these values to get them back to the same magnitude as the raw counts, which should make things a bit easier to interpret in whatever downstream application you have planned.

Okay, so how does this relate to your example? The way you've set up your `design`

means that the `"Tumor"`

and `"Control"`

coefficients already represent the equivalents to `g1`

and `g2`

in my example above. This is because your log-fold change is being computed by `cont.matrix`

as `"Tumor - Control"`

, so you can just do the same thing as what I've done with `g1`

and `g2`

.

```
batch <- factor(rep(1:5, 2))
design3 <- model.matrix(~0 + groups + batch)
fit3 <- glmFit(y, design3, dispersion=0.1)
res3 <- glmLRT(fit3, contrast=c(-1, 1, 0, 0, 0, 0))
control <- fit3$coefficients[,1]/log(2)
tumor <- fit3$coefficients[,2]/log(2)
custom <- tumor - control
summary(res3$table$logFC - custom) # basically all-zero
```

The slight caveat is that these values are not really "averages", they represent the fitted values for each condition in the first level of `batch`

. This is probably fine but if you really need absolute values that are more "average-y", you could do some *ad hoc* adjustments:

```
actual.ave <- aveLogCPM(y)
dummy.ave <- (control + tumor)/2
diff <- actual.ave - dummy.ave
control <- control + diff
tumor <- tumor + diff
```

I should stress that you are lucky in the sense that your design can still be parametrized as a difference between groups. When you're dealing with GLMs, there may not be a concept of a group at all. For example, I could reparametrize your design:

```
model.matrix(~batch + factor_names)
```

... and now there are no "groups". The last coefficient is the log-fold change and is estimated directly during the model fit, without going through an intermediate step of estimating an average abundance per group.