Question

Is Limma's removeBatchEffect() and log2() commutative?

3

Entering edit mode

Ekarl2 ▴ 80

@ekarl2-7202

Last seen 8.1 years ago

Sweden

I am working on an RNA-seq dataset and have been using the EdgeR differential expression analysis methods with GLMs. Here, I have specified batch as a factor in the design matrix. The batch factor is uncorrelated with the other experimental factors and each batch is evenly represented among all environments.

But I have been asked to get batch-adjusted RPKM expression values for a select number of genes to make a bar chart (I of course want to use another chart type since bar charts can be deceptive) because if we just show the rpkm-normalized values that are not batch-adjusted, the results might be compromised by batch.

My general approach is as follows. I start with the entire dataset of about 100k genes. I make the design matrix from the experimental factors in accordance with the experimental design, and then define a batch factor. I then put the counts into a DGEList, normalize, log2 transform with rpkm() using effective gene lengths and then use the removeBatchEffect() function.

factorA <- factor(substring(colnames(data), 1, 6))
factorB <- factor(substring(colnames(data), 7, 11))
batch <- factor(substring(colnames(data), 13, 13))

# Making the design matrix. Should only contain factors to be saved.
design <- model.matrix(~factorA*factorB)

# Filtering
keep <- rowSums(cpm(data)>0.1) >= 8
data <- data[keep, ]

# Put counts into DGEList container needed for latter steps.
y <- DGEList(counts=data)

# TMM-normalization.
y <- calcNormFactors(y)

# Convert to RPKM and log2 transformation.
data <- rpkm(y, gene.length=length$effective_length, log=TRUE, prior.count=0.01)

# Remove batch effect.
data <- removeBatchEffect(data, batch=batch, design=design)

The reason for such a shallow filtering and prior count is because those genes that are of interest are very, very lowly expressed, so they might otherwise get removed in the filtering step or given a prior count that, proportionally speaking, would increase their normalized expression value substantially.

So for my questions:

- Can I remove the 2 logarithm and get back non-logarithm batch-adjusted normalized values? In other words, is Limma's removeBatchEffect() and log2() commutative? I am guessing no since log2() is rarely commutative with other operations.

- When running rpkm() as above, is log2() applied before or after rpkm normalization?

- Anything that looks questionable in the above approach I have taken?

limma rna-seq removebatcheffect() • 3.9k views

ADD COMMENT • link updated 6.5 years ago by alakatos ▴ 130 • written 8.3 years ago by Ekarl2 ▴ 80

score 10 · Answer 1 · 2016-04-21

10

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 8 hours ago

The city by the bay

There's a way around this that doesn't involve log-transformation. Let's mock up some data first:

grouping <- factor(rep(1:2, each=10))
batch <- factor(rep(1:2, 10))
counts <- matrix(rnbinom(2000, mu=100, size=20), ncol=length(grouping))

Here I'm using a fairly simple set-up, but you should be able to generalize this to your design. Now, let's do:

y <- DGEList(counts)
y <- calcNormFactors(y)

The next step involves fitting a GLM to the counts, using the full design matrix. I'll assume that the dispersion is 0.05, though you can use the estimated dispersion for each gene if you run estimateDisp beforehand.

design.all <- model.matrix(~grouping + batch)
fit <- glmFit(y, design.all, dispersion=0.05)
old.fitted <- fit$fitted.values

We then identify the batch coefficients and recompute the fitted values as if the batch effect were zero. That is:

batch.coefs <- 3 # change according to your design matrix
new.coefs <- fit$unshrunk.coefficients
new.coefs[,batch.coefs] <- 0
new.fitted <- exp(new.coefs %*% t(design.all) + as.matrix(fit$offset))

Finally, we perform quantile-quantile mapping from the old means to the new means:

pseudo.counts <- q2qnbinom(y$counts, old.fitted, new.fitted, dispersion=0.05)

This gives you batch-corrected pseudo-counts that you can use for downstream visualization, e.g., in rpkm with log=FALSE. This is probably better than log-transforming and removing the batch effect with removeBatchEffect, which would be too sensitive to fluctuations at low and zero counts when your prior count is so low.

ADD COMMENT • link 8.3 years ago • updated 6.5 years ago Aaron Lun ★ 28k

1

Entering edit mode

This will work even better if you set

contrasts(batch) <- contr.sum(levels(batch))

before forming the design matrix. That will ensure that setting the batch effects to zero does not change the size of the counts on average.

ADD REPLY • link 6.5 years ago • updated 5.1 years ago Gordon Smyth 51k

0

Entering edit mode

Dear Gordon,

By any chance, should contr.sum set to be 2 instead of 10 as it should be the length of batch levels according to the example given by Aaron?

PS: I know this is an old post, but I hope you will see this.

ADD REPLY • link 5.1 years ago altintas.ali • 0

0

Entering edit mode

Yes, you're right. I've edited my comment to fix it. The use of levels(batch) will ensure the correct number of levels is used, whatever that is.

ADD REPLY • link 5.1 years ago Gordon Smyth 51k

0

Entering edit mode

Thanks for this neat technique!

ADD REPLY • link 8.3 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Aaron, I followed the above code (together with the recommendations from https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#edgeR to incorporate the length offsets from tximport) and I get negative values for the pseudo.counts. Any recommendation on how to deal with this?

keep <- filterByExpr(y, design = Design)
y <- y[keep, , keep.lib.sizes=FALSE]  
disp <- estimateDisp(y, design = Design)
fit <- glmFit(y, Design, dispersion = disp$common.dispersion)
old.fitted <- fit$fitted.values
new.coefs <- fit$unshrunk.coefficients
new.coefs[,BatchCoeff] <- 0
new.fitted <- exp(new.coefs %*% t(Design) + as.matrix(fit$offset))
pseudo.counts <- q2qnbinom(y$counts, old.fitted, new.fitted, dispersion = disp$common.dispersion)

Negative values occur at greatest magnitude in samples where batch in the design matrix is 1 rather than 0.

>Design
    group1  group2  group3  group4  batch2
1   1   0   0   0   0
2   1   0   0   0   1
3   1   0   0   0   1
4   1   0   0   0   0
5   0   1   0   0   0
6   0   1   0   0   1
7   0   1   0   0   1
8   0   1   0   0   0
9   0   0   1   0   0
10  0   0   1   0   1
11  0   0   1   0   1
12  0   0   1   0   0
13  0   0   0   1   0
14  0   0   0   1   1
15  0   0   0   1   1
16  0   0   0   1   0
attr(,"assign")                 
[1] 1   1   1   1   2
attr(,"contrasts")                  
attr(,"contrasts")$Group                    
[1] contr.treatment             

attr(,"contrasts")$Batch                    
[1] contr.treatment             

>colMins(pseudo.counts)
 [1] -0.0000000000000035527137 -1.5251902206364846836806 -2.3256109649740057676581 -0.0000000000000008881784 -0.0000000000000035527137 -1.9248305668560199421790 -2.1904227934311872871831
 [8] -0.0000000000000284217094 -0.0000000000000035527137 -1.9332348577044555781868 -1.4595762667571228199392 -0.0000000000004547473509 -0.0000000000000035527137 -1.8116796948588387294876
[15] -1.8251452891036841208461 -0.0000000000000017763568

ADD REPLY • link 4.0 years ago ATpoint ★ 4.2k

0

Entering edit mode

Probably just floor it at zero. The procedure in q2qnbinom is approximate (check out the documentation as to why) so there's always a chance of this happening. Mind, this requires the counts to be pretty low and most of the time you'd have filtered out low-abundance genes already.

ADD REPLY • link 4.0 years ago Aaron Lun ★ 28k

score 0 · Answer 2 · 2018-01-24

0

Entering edit mode

alakatos ▴ 130

@alakatos-6983

Last seen 4.8 years ago

United States

Aaron,

I tried your example code at and recevied an error message.
(I changed desing to design.all)

pseudo.counts <- q2qnbinom(y$counts, old.fitted, new.fitted, dispersion=0.05)
Error in raw.mat[i, , drop = FALSE] : 
  (subscript) logical subscript too long

I am not sure what I was doin wrong. I just copied the code.

Thank you for you help.

Anita

ADD COMMENT • link 6.5 years ago alakatos ▴ 130

0

Entering edit mode

Please use the "Add comment" button rather than the "Add answer" button, as you're responding to an existing answer rather than adding your own answer. The problem is caused by a series of unfortunate coincidences after updates to the internal edgeR machinery. You can solve this by using as.matrix(new.fitted) instead.

ADD REPLY • link 6.5 years ago Aaron Lun ★ 28k