Limma model fitting: including and excluding replicates in the model gives different results in RNAseq Analysis
1
0
Entering edit mode
anver • 0
@anver-8131
Last seen 8.8 years ago
Germany

Hi there,

I analyzed my single-end RNA-seq data (Illumina) with limma. I have two treatments (treatment and mock) and three time points (1h, 9h and 24h) each with 3 replicates (total 18 libraries). I tried two models - one including the replicates (Model1) and another excluding the replicates (Model 2). Rest of the analysis is similar for both designs. Unfortunately both give very different results. Model1 where I included the replicates in the model gives 2000 more differentially expressed genes than Model2. My understanding is the limma analysis itself takes variance in the replicates in to account and doesn't need to include replicates in the model as in Model1. Accordance with that edgeR analysis without any model fitting and model fitting as in Model3 gives similar results to Model3. Could someone explain why is this difference and please advice which is better. Please see the code below

Thank you

*****************************
#design of linear model
****************************

#####Model1

#Define the variables

treatment = factor(rep(c('01_mock', '02_trt'), times=9))
time = factor(rep(c('01_1h', '02_9h', '03_24h'), each=2, times=3))
replicate = factor(rep(c("r1", "r2", "r3"), each=6))

comb.names = paste(treatment, time, replicate, sep='_')

design_M1 = model.matrix( ~ C(treatment:time) -1 + replicate)
colnames(design_M1) = c("mock_1h", "trt_1h" , "mock_9h", "trt_9h", "mock_24h", "trt_24h", "rep2", "rep3")
rownames(design_M1) = comb.names

##############
####Model2

##Same variables and and comb.names used and only the design is different
#This time didn't include the replicates in the model

design_M2 = model.matrix( ~ C(time:treatment) -1)
colnames(design_M2) = c("mock_1h", "trt_1h" , "mock_9h", "trt_9h", "mock_24h", "trt_24h")
rownames(design_M2) = comb.names

##############Model3
# bit different
# designed with a sample info file which looked like this
#     Sample Treat  Time
# 1 mk_1h_r1  mock   1h
# 2 mk_1h_r2  mock   1h
# 3 mk_1h_r3  mock   1h
# 4 mk_9h_r1  mock   9h
# 5 mk_9h_r2  mock   9h
# 6 mk_9h_r3  mock   9h

Groups <- factor(paste(targets$Treat, targets$Time, sep="."))
cbind (targets, Group=Groups)

design <- model.matrix(~0+Groups)
colnames(design) <- levels(Groups)

##############################################
#Rest was performed the same way as follows for both models
#############################################################

##############
## voom transformation
#####################

lib.size = colSums(data)*nf$samples$norm.factors
v <- voom(dat.mat, design, lib.size=lib.size)

###############
## fitting the model
##############
fit = lmFit (v, design)

###################
#specify the contrasts
contrast.matrix <- makeContrasts (trt_1h - mock_1h, trt_9h - mock_9h, trt_24h - mock_24h, levels=design)
fit = contrasts.fit (fit, contrast.matrix)

fit.pval = 2*pt( abs(fit$coef / fit$stdev.unscaled / fit$sigma), fit$df.residual, lower.tail=F )


# empirical Bayes for variance shrinkage
eb.fit = eBayes(fit)

#########################
#qValues
###########
q.val = qvalue(eb.fit$p.val)$qvalues

q.val_trt_mock_1h = q.val [, 'trt_1h - mock_1h']
q.val_trt_mock_9h = q.val [, 'trt_9h - mock_9h']
q.val_trt_mock_24h = q.val [, 'trt_24h - mock_24h']

#Number of DEGs using Model1
sum(q.va_trt_mock_1h < 0.01)
[1] 6029
sum(q.val_trt_mock_9h < 0.01)
[1] 6008
sum(q.val_trt_mock_24h < 0.01)
[1] 3785

#Number of DEGs using Model 2 where the replicates were not included in the model
sum(q.val_trt_mock_1h < 0.01)
[1] 3882
sum(q.val_trt_mock_9h < 0.01)
[1] 3905
sum(q.val_trt_mock_24h < 0.01)
[1] 2385

#Number of DEGs using Model 3 variables defined with a sample info file==similar to Model2
sum(q.val_trt_mock_1h < 0.01)
[1] 3882
sum(q.val_trt_mock_9h < 0.01)
[1] 3905
sum(q.val_trt_mock_24h < 0.01)
[1] 2385

####################END of the file###########

 

limma RNAseq_Analysis model.matrix • 1.7k views
ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 6 hours ago
The city by the bay

Is there any reason to include the replicate number as an additional factor in the design? For example, are the first replicates of each time/treatment combination sequenced in the same batch, and the second replicates in another batch, and the third replicates in yet another batch? If so, then it makes sense to include the replicate number in replicate as a blocking factor to account for the batch effect. If not, then you should probably leave replicate out of the design, as there isn't really any experimental justification to put it in.

As to why it is different; if you have a batch effect, the first model will have coefficients that absorb it during fitting. This means that systematic differences between samples from different batches will not affect the variance estimates. However, in the second model, there are no terms to absorb the batch effect. This means that differences between batches are incorporated into the variance estimate, inflating the variances and reducing power to detect DE genes (which manifests as a reduced number of DE genes). Of course, this is assuming that you have a batch effect in your data.

ADD COMMENT
0
Entering edit mode

Thank you very much Aaron. It makes sense.

Sha

ADD REPLY

Login before adding your answer.

Traffic: 886 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6