I would like to generate a batch-corrected expression matrix from RNA-seq data which is NOT on a log2 scale.
After looking at similar posts, it is commonly advised to use limma's removeBatchEffect on the output generated by the rlog function from DESeq2:
dds <- makeExampleDESeqDataSet()
dds$batch <- factor(rep(c("A","B"), each = 3))
rld <- rlog(dds)
mod <- model.matrix(~ condition, colData(rld))
assay(rld) <- limma::removeBatchEffect(assay(rld), batch = rld$batch, design = mod)
Would it be correct to simply inverse the log2 transformation to get the desired matrix?
normcounts <- counts(dds, normalized = TRUE) # normalized counts with batch
normcounts.batch <- 2 ^ assay(rld) # normalized counts without batch
Doing it this way, the batch-corrected counts seem to be on a different scale than the normalized counts matrix. Therefore, would it be better to run the batch effect removal on the normalized counts matrix:
mod <- model.matrix(~ condition, colData(rld))
normcounts.batch <- limma::removeBatchEffect(normcounts, batch = rld$batch, design = mod)
I'm unsure as to whether the removeBatchEffect function requires the input values to be on a log2 scale?
Should data be on log-scale if using
removeBatchEffect
=> yes, according to e.g. this exchange on the mailing list and?limma::removeBatchEffect()
(the text explaining what the input datax
should be) for reasons explained in that first link. Both the authors of limma/edgeR (Batch-corrected expression matrix NOT on a log2 scale) and DESeq2 (see its vignette) seem to recommend that correcting batch effects for DE analysis should be part of the model so unless one have comparable experts knowledge I'd follow that advise. For everything else however, e.g. plots and other downstream applications, you need something else that directly corrects the counts. I personally made good experience with ComBat-Seq which lives in thesva
package of Biocoductor. The advantages over other methods that have originally be developped for array-like data is that a) it has been developped specifically for RNA-seq, b) it takes raw counts and returns corrected raw counts so you can plug it into any downstream normalization you want/need, c) it preserves zeros as zeros and does not introduce negative values and d) it returns values not on log scale so it is completely on you how to transform data and you do not have to bother with conversion from log scale back to normal.