I am confused about the log2FC when I use the limma to analyze the RNA-seq data. In the paper (Law et al. 2016; https://f1000research.com/articles/5-1408 ), Log2FC of the gene (ENTREZID:12759) is -5.44 in basal.vs.lp, getting from topTable (Page 15). However, the mean expression getting from voom is 5.58 and 10.89 in group basal and lp, respectively. 5.58 - 10.89 = -5.31, is not equal to -5.44. How to get the final expression value match to the log2FC of topTable ?
Note also that the analysis in Law et al (2016), from which OP has taken the logFC, included a batch correction term, another reason why naive subtraction of means will not match the actual computed logFC.
With a batch effect, there is no way to compute logFCs other than to fit the weighted linear model.
Dear Gordon,
Thank you very much for your reply. I have removed the batch effect, but the subtraction of means still does not matches the actual logFC. I have displayed the test code. Is it ok?
Dear Aaron,
Thank you very much for your answer. I saw the code of lm.wfit. I can get the coefficient base on the v$E and v$weight. But I do not know how to get the final expression for each sample matching to logFC of topTable. In Law et al(2016) paper, the v$E was considered as the final expression to do the next analysis such as heatmap (Page 16). It was confused for me. Base on the Manual, the v$E should remove the batch effect using the function removeBatchEffect. Moreover, I follow your code to do the manual calculation of logFC using the data in Law et al(2016) paper. The manual logFC is also not equal to the actual logFC of topTable. I also used the removeBatchEffect to remove the batch effect, but after removing the batch effects, the subtraction of means still not matches the actual computed logFC. Below is the test code using the same data in Law's paper. Thanks again.
files <- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt",
"GSM1545538_purep53.txt","GSM1545539_JMS8-2.txt",
"GSM1545540_JMS8-3.txt","GSM1545541_JMS8-4.txt",
"GSM1545542_JMS8-5.txt","GSM1545544_JMS9-P7c.txt",
"GSM1545545_JMS9-P8c.txt")
read.delim(files[1], nrow=5)
x <- readDGE(files, columns=c(1,3))
class(x)
samplenames <- substring(colnames(x), 12, nchar(colnames(x)))
samplenames
colnames(x) <- samplenames
group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"))
x$samples$group <- group
lane <- as.factor(rep(c("L004","L006","L008"), c(3,4,2)))
x$samples$lane <- lane
x$samples
library(Mus.musculus)
geneid <- rownames(x)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"),
keytype="ENTREZID")
dim(genes)
head(genes)
genes <- genes[!duplicated(genes$ENTREZID),]
x$genes <- genes
cpm <- cpm(x)
lcpm <- cpm(x, log=TRUE)
table(rowSums(x$counts==0)==9)
keep.exprs <- rowSums(cpm>1)>=3
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
dim(x)
x <- calcNormFactors(x, method = "TMM")
design <- model.matrix(~0+group+lane)
colnames(design) <- gsub("group", "", colnames(design))
contr.matrix <- makeContrasts(
BasalvsLP = Basal-LP,
BasalvsML = Basal - ML,
LPvsML = LP - ML,
levels = colnames(design))
contr.matrix
v <- voom(x, design, plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
basal.vs.lp <- topTable(efit, coef=1, sort.by="none")
head(basal.vs.lp)
ENTREZID SYMBOL TXCHROM logFC AveExpr t P.Value adj.P.Val B
497097 497097 Xkr4 chr1 7.45743870 -1.7775010 13.9455600 7.111517e-07 1.332370e-05 6.0632317
27395 27395 Mrpl15 chr1 0.48326620 4.2082183 2.3879867 4.414848e-02 7.017880e-02 -5.2737955
18777 18777 Lypla1 chr1 0.29313803 5.2818178 1.9143098 9.210726e-02 1.321884e-01 -6.0961967
#basal.vs.lp
weighted=v$E*v$weights
head(rowSums(weighted[,c(1,6,9)])/rowSums(v$weights[,c(1,6,9)])
- rowSums(weighted[,c(3,4,7)])/rowSums(v$weights[,c(3,4,7)]))
#Output
ENTREZID 497097 27395 18777
logFC -7.5681082 -0.4334995 -0.2368517
#reomve the Batch Effect
rv <- voom(x, design, plot=FALSE)
design.BatchEffect=model.matrix(~0+group)
rBE.exp =removeBatchEffect(rv$E,covariates = = x$samples$lane,design=design.BatchEffect)
head(rowMeans(rBE.exp[,c(1,6,9)])-rowMeans(rBE.exp[,c(3,4,7)]))
#output
ENTREZID 497097 27395 18777
logFC -7.5033660 -0.4798565 -0.3034289
Again, weights. Your naive application of
removeBatchEffect
doesn't account for the observation-level weights. These weights are considered bylmFit
and affect the estimation of all coefficients (including the batch effect).I manually remove the batch effect with weights base on the function of removeBatchEffect. However, the result still does not match to the autal logFC of topTable.
rv <- voom(x, design, plot=FALSE)
design.BatchEffect=model.matrix(~0+group)
batch <- as.factor(x$samples$lane)
contrasts(batch) <- contr.sum(levels(batch))
batch <- model.matrix(~batch)[, -1, drop = FALSE]
batch2 = NULL
covariates = NULL
X.batch <- cbind(batch, batch2, covariates)
rm.fit <- lmFit(rv$E, cbind(design.BatchEffect, X.batch), weights = v$weights)
beta <- rm.fit$coefficients[, -(1:ncol(design.BatchEffect)), drop = FALSE]
beta[is.na(beta)] <- 0
rm.w.adj=as.matrix(v$E) - beta %*% t(X.batch)
head(rowMeans(rm.w.adj[,c(3,4,7)])-rowMeans(rm.w.adj[,c(1,6,9)]))
#output
ENTREZID 497097 27395 18777 21399 58175 108664
logFC 7.4586913 0.4733701 0.3017084 0.3788963 -5.4797058 -0.8135124
logFC of topTable:
ENTREZID 497097 27395 18777 21399 58175 108664
logFC 7.4574387 0.4832662 0.2931380 0.3711194 -5.4879042 -0.8143231
Why do you insist on using
rowMeans
? As Gordon has said, naive subtraction of means will not give you the correct result. This is because part of the residuals are driven by the batch effects, which will not cancel out when the number of samples in each batch are not the same between groups.I just want to know why base on the definition of logFC. You and Gordon are right. the actual logFC is base on the coefficient. Thanks again.