Question: Average read counts and batch-effect correction in edgeR
0
4 months ago by
Roman 10
Roman 10 wrote:

Hi everyone, I am using edgeR to identify differentially expressed transcripts based on a simple RNA-seq setup (tumor vs control samples).

The DGELRT object with results provided by edgeR contains, among others, logFC values calculated for the contrast that I used (Tumor vs Control). I additionally need the average read counts for both sample groups, used to calculate the logFC values. For a model matrix based on a single factor I can do that by averaging pseudo.counts from the DGEList. However if I use a model matrix with an additional factor for batch-effect correction this method no longer provides average reads used to calculate logFC values in the DGELRT object. Could you please guide me to an appropriate way on how to either get the batch adjusted reads of just the average read counts used to get the logFC values?

Below is a simplified version of my code:

#create DGEList object
CountsTableEdge <- DGEList(counts=CountsTable, genes=rownames(CountsTable), group=factor_names)

#normalize and calculate dispersions
CountsTableEdge <- calcNormFactors(CountsTableEdge, method="TMM")
CountsTableEdge <- estimateCommonDisp(CountsTableEdge)
CountsTableEdge <- estimateTagwiseDisp(CountsTableEdge)

#create design matrix
design <- model.matrix(~0+factor_names+batch)
colnames(design) = gsub('factor_names','',colnames(design))

#fit the model
fit <- glmFit(CountsTableEdge,design)

#make the contract matrix and test DE
cont.matrix <- makeContrasts(contrasts="Tumor - Control", levels=design)
tRes = glmLRT(fit,contrast=cont.matrix)

#calulate the average values in each group
TumorMean = rowMeans(CountsTableEdge$pseudo.counts[,TumorCols]) ControlMean = rowMeans(CountsTableEdge$pseudo.counts[,ControlCols])
logFC=log2(TumorMean/ControlMean)

### logFC matches tRes$table$logFC only for a simplified model matrix without the batch factor

modified 4 months ago by Aaron Lun23k • written 4 months ago by Roman 10
2
4 months ago by
Aaron Lun23k
Cambridge, United Kingdom
Aaron Lun23k wrote:

Firstly, I should point out that using pseudo-counts (i.e., "classic" edgeR) is not the recommended way to do things anymore. Use estimateDisp instead of the old dispersion estimation functions, and use glmQLFit and glmQLFTest instead of glmFit and glmLRT.

Now, let's correct some of your misconceptions. The log-fold change of the row means is not guaranteed to match the reported logFC from edgeR's GLM-based functions:

set.seed(0)
y <- matrix(rnbinom(10000, mu=10, size=10), ncol=10)
groups <- gl(2, 5)

design <- model.matrix(~groups)
fit <- glmFit(y, design, dispersion=0.1) # glmFit(), as dispersion is known.
res <- glmLRT(fit)

adj <- cpm(y) # accounting for slight differences in library size.
summary(res$table$logFC-custom) # non-zero.


This is partly because the estimate of the "average expression" in a negative binomial model is only proportional to the mean count when the library sizes are equal across samples. This is never true in real data sets. Computing library size-normalized values with cpm doesn't help - this is a phenomenon related to the mean-variance relationship of count data. So using rowMeans is already a step in the wrong direction. In addition, glmFit() automatically adds a prior.count=0.125 to avoid undefined log-fold changes when one or the other group has zero counts, which further shifts the reported value from your simple estimate.

If we wanted to get average expression values that match up to the log-fold changes, I would prefer:

design2 <- model.matrix(~0 + groups)
fit2 <- glmFit(y, design2, dispersion=0.1)
res2 <- glmLRT(fit2, contrast=c(-1, 1))

g1 <- fit2$coefficients[,1]/log(2) # coefficients are natural log. g2 <- fit2$coefficients[,2]/log(2)
custom <- g2 - g1
summary(res2$table$logFC - custom) # basically all-zero


Here, g1 and g2 represent the log-average expression in each group, adjusted by the library size. You might add the log-transformed mean library size to each of these values to get them back to the same magnitude as the raw counts, which should make things a bit easier to interpret in whatever downstream application you have planned.

Okay, so how does this relate to your example? The way you've set up your design means that the "Tumor" and "Control" coefficients already represent the equivalents to g1 and g2 in my example above. This is because your log-fold change is being computed by cont.matrix as "Tumor - Control", so you can just do the same thing as what I've done with g1 and g2.

batch <- factor(rep(1:5, 2))
design3 <- model.matrix(~0 + groups + batch)
fit3 <- glmFit(y, design3, dispersion=0.1)
res3 <- glmLRT(fit3, contrast=c(-1, 1, 0, 0, 0, 0))

control <- fit3$coefficients[,1]/log(2) tumor <- fit3$coefficients[,2]/log(2)
custom <- tumor - control
summary(res3$table$logFC - custom) # basically all-zero


The slight caveat is that these values are not really "averages", they represent the fitted values for each condition in the first level of batch. This is probably fine but if you really need absolute values that are more "average-y", you could do some ad hoc adjustments:

actual.ave <- aveLogCPM(y)
dummy.ave <- (control + tumor)/2
diff <- actual.ave - dummy.ave
control <- control + diff
tumor <- tumor + diff


I should stress that you are lucky in the sense that your design can still be parametrized as a difference between groups. When you're dealing with GLMs, there may not be a concept of a group at all. For example, I could reparametrize your design:

model.matrix(~batch + factor_names)


... and now there are no "groups". The last coefficient is the log-fold change and is estimated directly during the model fit, without going through an intermediate step of estimating an average abundance per group.

Thank you very much! this is exactly what I needed.

The reason why I stick to the "classic" edgeR is because it provides the pseudo-counts that I needed to create visualizations for individual features (requested by my collaborator). I am aware that this is not the best way of doing the analysis and that pseudo-counts are intended mainly for internal use by the edgeR, as stated in the manual, but what other option there is to visualize individual samples? The average counts were actually requested by one of the reviewers of our manuscript. It seems that despite some things have changed people are still used to the classic statistics from the microarray age.

EDIT: for the second part I made a small modification, instead of:

actual.ave <- aveLogCPM(CountsTableEdge)


I used:

actual.ave <- tRes$table$logCPM


since it appears they provide something different.

1

For any visualization that you are using pseudo-counts for, you can replace it directly with the output of cpm, which is a more common measure of expression anyway. As a reviewer, I wouldn't even bat an eye upon seeing CPMs; pseudo-counts, on the other hand, make me wonder... And once a reviewer has to think about something, you're in trouble.