Limma: The log2FC of topTable is not equal to Mean(log2CPM) of Basal - Mean(log2CPM) of LP
1
0
Entering edit mode
zhouqz • 0
@zhouqz-17283
Last seen 3.7 years ago

I am confused about the log2FC when I use the limma to analyze the RNA-seq data. In the paper (Law et al. 2016; https://f1000research.com/articles/5-1408 ),  Log2FC of the gene (ENTREZID:12759) is -5.44 in basal.vs.lp, getting from topTable (Page 15).  However, the mean expression getting from voom is 5.58 and 10.89 in group basal and lp, respectively. 5.58 - 10.89 = -5.31, is not equal to -5.44. How to get the final expression value match to the log2FC of topTable ? 

limma • 2.0k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 8 hours ago
The city by the bay

Weights.

set.seed(1000)
y <- matrix(rnbinom(10000, mu=1:1000, size=1), ncol=10)
design <- model.matrix(~gl(2, 5))
v <- voom(y, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
topTable(fit, coef=2, sort.by="none")

# Comparing to manual calculation:
head(rowMeans(v$E[,6:10]) - rowMeans(v$E[,1:5]))
weighted <- v$E * v$weights
head(rowSums(weighted[,6:10])/rowSums(v$weights[,6:10])
    - rowSums(weighted[,1:5])/rowSums(v$weights[,1:5]))

I can only assume you're asking out of curiosity. Don't do the above in actual analyses, lmFit and topTable have been very thoroughly tested and will be much less buggy than any manual calculation.

ADD COMMENT
0
Entering edit mode

Note also that the analysis in Law et al (2016), from which OP has taken the logFC, included a batch correction term, another reason why naive subtraction of means will not match the actual computed logFC.

With a batch effect, there is no way to compute logFCs other than to fit the weighted linear model.

ADD REPLY
0
Entering edit mode

Dear Gordon,

Thank you very much for your reply. I have removed the batch effect, but the subtraction of means still does not matches the actual logFC. I have displayed the test code. Is it ok? 

 

ADD REPLY
0
Entering edit mode

Dear Aaron,

Thank you very much for your answer. I saw the code of lm.wfit. I can get the coefficient base on the v$E and v$weight. But I do not know how to get the final expression for each sample matching to logFC of topTable.  In Law et al(2016) paper, the v$E was considered as the final expression to do the next analysis such as heatmap (Page 16). It was confused for me. Base on the Manual, the v$E should remove the batch effect using the function removeBatchEffect. Moreover, I follow your code to do the manual calculation of logFC using the data in Law et al(2016) paper. The manual logFC is also not equal to the actual logFC of topTable.  I also used the removeBatchEffect to remove the batch effect, but after removing the batch effects, the subtraction of means still not matches the actual computed logFC. Below is the test code using the same data in Law's paper. Thanks again.


files <- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt",
           "GSM1545538_purep53.txt","GSM1545539_JMS8-2.txt",
           "GSM1545540_JMS8-3.txt","GSM1545541_JMS8-4.txt",
           "GSM1545542_JMS8-5.txt","GSM1545544_JMS9-P7c.txt",
           "GSM1545545_JMS9-P8c.txt")
read.delim(files[1], nrow=5)

x <- readDGE(files, columns=c(1,3))
class(x)

samplenames <- substring(colnames(x), 12, nchar(colnames(x)))
samplenames

colnames(x) <- samplenames
group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"))
x$samples$group <- group
lane <- as.factor(rep(c("L004","L006","L008"), c(3,4,2)))
x$samples$lane <- lane
x$samples

library(Mus.musculus)
geneid <- rownames(x)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"),
                keytype="ENTREZID")
dim(genes)
head(genes)
genes <- genes[!duplicated(genes$ENTREZID),]

x$genes <- genes
cpm <- cpm(x)
lcpm <- cpm(x, log=TRUE)
table(rowSums(x$counts==0)==9)

keep.exprs <- rowSums(cpm>1)>=3
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
dim(x)
x <- calcNormFactors(x, method = "TMM")

design <- model.matrix(~0+group+lane)
colnames(design) <- gsub("group", "", colnames(design))

contr.matrix <- makeContrasts(
  BasalvsLP = Basal-LP,
  BasalvsML = Basal - ML,
  LPvsML = LP - ML,
  levels = colnames(design))
contr.matrix
v <- voom(x, design, plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)

basal.vs.lp <- topTable(efit, coef=1, sort.by="none")

head(basal.vs.lp)

            ENTREZID  SYMBOL TXCHROM       logFC    AveExpr           t      P.Value    adj.P.Val          B
497097   497097    Xkr4    chr1  7.45743870 -1.7775010  13.9455600 7.111517e-07 1.332370e-05  6.0632317
27395     27395  Mrpl15    chr1  0.48326620  4.2082183   2.3879867 4.414848e-02 7.017880e-02 -5.2737955
18777     18777  Lypla1    chr1  0.29313803  5.2818178   1.9143098 9.210726e-02 1.321884e-01 -6.0961967

#basal.vs.lp
weighted=v$E*v$weights
head(rowSums(weighted[,c(1,6,9)])/rowSums(v$weights[,c(1,6,9)])
     - rowSums(weighted[,c(3,4,7)])/rowSums(v$weights[,c(3,4,7)]))

#Output

ENTREZID    497097      27395      18777 
logFC       -7.5681082 -0.4334995 -0.2368517 

#reomve the Batch Effect
rv <- voom(x, design, plot=FALSE)

design.BatchEffect=model.matrix(~0+group)

rBE.exp =removeBatchEffect(rv$E,covariates =  = x$samples$lane,design=design.BatchEffect)
head(rowMeans(rBE.exp[,c(1,6,9)])-rowMeans(rBE.exp[,c(3,4,7)]))
#output
ENTREZID      497097      27395       18777
logFC       -7.5033660 -0.4798565 -0.3034289

ADD REPLY
0
Entering edit mode

Again, weights. Your naive application of removeBatchEffect doesn't account for the observation-level weights. These weights are considered by lmFit and affect the estimation of all coefficients (including the batch effect).

ADD REPLY
0
Entering edit mode

I manually remove the batch effect with weights base on the function of removeBatchEffect. However, the result still does not match to the autal logFC of topTable. 

rv <- voom(x, design, plot=FALSE)
design.BatchEffect=model.matrix(~0+group)
batch <- as.factor(x$samples$lane)
contrasts(batch) <- contr.sum(levels(batch))
batch <- model.matrix(~batch)[, -1, drop = FALSE]
batch2 = NULL
covariates = NULL
X.batch <- cbind(batch, batch2, covariates)
rm.fit <- lmFit(rv$E, cbind(design.BatchEffect, X.batch), weights = v$weights)
beta <- rm.fit$coefficients[, -(1:ncol(design.BatchEffect)), drop = FALSE]
beta[is.na(beta)] <- 0
rm.w.adj=as.matrix(v$E) - beta %*% t(X.batch)
head(rowMeans(rm.w.adj[,c(3,4,7)])-rowMeans(rm.w.adj[,c(1,6,9)]))

#output

ENTREZID   497097      27395      18777      21399      58175     108664 
logFC       7.4586913  0.4733701  0.3017084  0.3788963 -5.4797058 -0.8135124

logFC of topTable:

ENTREZID   497097      27395      18777      21399      58175     108664 

logFC    7.4574387  0.4832662  0.2931380  0.3711194 -5.4879042 -0.8143231

 

ADD REPLY
0
Entering edit mode

Why do you insist on using rowMeans? As Gordon has said, naive subtraction of means will not give you the correct result. This is because part of the residuals are driven by the batch effects, which will not cancel out when the number of samples in each batch are not the same between groups.

ADD REPLY
0
Entering edit mode

I just want to know why base on the definition of logFC. You and Gordon are right. the actual logFC is base on the coefficient. Thanks again.

ADD REPLY

Login before adding your answer.

Traffic: 1076 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6