edgeR and batch effect
5
1
Entering edit mode
@abrantespatricia-13177
Last seen 7.3 years ago

Hi all,
I am using EdgeR to perform a DE analysis on 3 different treatments with 3 replicates each but I have a strong batch effect on the day of experiments.
I've been reading the manual and the example exercises (although none is exactly like my case...) and I performed the following analyses:

library(limma)
library(edgeR)
library(ggplot2)

new<-read.delim("Trinity_trans.counts.matrix")
group<-factor(c("A","A","A","B","B","B","C","C","C"))
batch<-factor(c(4,2,3,1,2,3,1,2,3))
y<-DGEList(counts=new[,2:10], genes=new[,1:1],group=group)
y
design<-model.matrix(~0+group+batch, data=y$samples)
design
                        groupA groupB groupC batch2 batch3 batch4
A10                        1          0          0            0        0        1
Exp2-4                   1          0          0            1        0        0
A18                        1          0          0            0        1        0
Exp1_2                  0          1          0            0        0        0
Exp2_5                  0          1          0            1        0        0
A17                        0          1          0            0        1        0
Exp1_3                  0          0          1            0         0       0
Exp2_6                  0          0          1            1         0       0
A13                        0          0          1            0         1       0

keep<-rowSums(cpm(y)>1)>=3
table(keep)
y<-y[keep,,keep.lib.sizes=FALSE]

z<-calcNormFactors(y)
z$samples
plotMDS(z)

z<-estimateDisp(z,design)
z$common.dispersion
plotBCV(z)

fit2<-glmQLFit(z,design,robust=TRUE)
plotQLDisp(fit2)
qlf<-glmQLFTest(fit2, coef=1:3)
topTags(qlf)

qlf1 <- glmQLFTest(fit2, contrast=c(-1,1,0,0,0,0))
topTags(qlf1)

qlf2 <- glmQLFTest(fit2, contrast=c(-1,0,1,0,0,0))
topTags(qlf2)

qlf3 <- glmQLFTest(fit2, contrast=c(-0,-1,1,0,0,0))
topTags(qlf3)

My problem is that I'm not completely sure that I'm removing the batch effect (since I'm not sure that the model considers the information on the 3 batch columns) and also because I don't have any differential expressed gene when performing the different pairwise treatment comparisons.
- Could you please see if this is the correct script for removing the batch effect (4 days of experiments)?
- Is there any step that I could be doing wrong that is causing zero DE genes?
Any help would be greatly appreciated.

Thanks in advance,

Patrícia Abrantes, PhD

GHTM, Instituto de Higiene e Medicina Tropical,

Universidade Nova de Lisboa, Portugal

edger • 8.9k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 16 minutes ago
United States

You can only fit a batch effect if you have complete replication of all sample types in each batch. So your design won't work because the first sample (which is the only one in batch 4) will be ignored. You could use limma voom and account for the batch effect using duplicateCorrelation, however.
 

ADD COMMENT
1
Entering edit mode

A minor correction: the design in the original post will still work (in the sense that it's of full rank with non-zero degrees of freedom, and you can fit the model and get estimates for the coefficients and dispersion). It just means that the first sample won't be used for anything, resulting in some loss of power for the comparisons involving the first group.

As for the lack of DE genes, I would suggest looking at a MDS plot and seeing whether there are samples from different groups mixing together. If so... well, there might not be any differences between groups. You'll also lose power from high dispersion estimates if the expression is highly variable; while this depends a lot on the context, my rule of thumb is that anything higher than 0.1 is concerning for RNA-seq experiments performed in laboratory (i.e., well-controlled) conditions.

ADD REPLY
0
Entering edit mode

Good point - it will 'work', it just won't be doing what the OP thought it was doing.

ADD REPLY
0
Entering edit mode
@abrantespatricia-13177
Last seen 7.3 years ago

Hi,

thanks again for you comments.

Best regards,

Patrícia

ADD COMMENT
0
Entering edit mode
@abrantespatricia-13177
Last seen 7.3 years ago

Hi again,

if I want to analyse treatment but want to adjust for batch effect what is the correct way of using model.matrix?

design<-model.matrix(~0+group+batch, data=y$samples)
or

design<-model.matrix(~0+batch+group, data=y$samples)?

Sorry about so simple question but I get completely different results depending on the order of batch+group or group+batch and from the examples in the manual it is not clear for me.

I have already seen examples with the 2 options...

Thanks in advance,

Patrícia Abrantes

 


 

ADD COMMENT
0
Entering edit mode

The two models are mathematically identical, differing only in the order of coefficients. Obviously, this means that you'll have to change the coef that you drop in glmQLFTest, otherwise you'll be testing for significant differences between batches rather than for a significant difference between groups. For example:

batch <- factor(c(1,2,1,2))
group <- factor(LETTERS[c(1,1,2,2)])
model.matrix(~0 + batch + group)  # last coefficient = difference between groups
model.matrix(~0 + group + batch)  # last coefficient = difference between batches

P.S. Don't post a reply as an answer unless you're answering your own question; use comments instead.

ADD REPLY
0
Entering edit mode
@abrantespatricia-13177
Last seen 7.3 years ago

Hi Aaron,

thanks for your quick reply.

Patrícia

 

ADD COMMENT
0
Entering edit mode
sutturka • 0
@sutturka-14580
Last seen 19 months ago
United States

RNASeq experiment:

Control (3 replicates) and 3 Treatments (3 replicates each).

sequencing for 2 replicates of Control sample was performed 2 years later which adds very high sequencing depth and data is generated on different instrument. 

Sample_Name Treatment Sample
NORM1 NORM Batch1
NORM2 NORM Batch2
NORM3 NORM Batch2
TRT1_1 TRT1 Batch1
TRT1_2 TRT1 Batch1
TRT1_3 TRT1 Batch1
TRT2_1 TRT2 Batch1
TRT2_2 TRT2 Batch1
TRT2_3 TRT2 Batch1
TRT3_1 TRT3 Batch1
TRT3_2 TRT3 Batch1
TRT3_3 TRT3 Batch1

Can you please help to make sure below code is correct?

#Prepare the design Matrix
TM = factor(samples$treatment)
TM = relevel(TM, ref="NORM")
Batch = factor(samples$batch)

ALL = DGEList(counts=as.matrix(countData), group= TM)
ALL <- calcNormFactors(ALL)

# Design matrix with considering batch effect
design <- model.matrix(~0+Batch+TM)
rownames(design) = colnames(ALL)

#calculate dispersion

ALL <- estimateDisp(ALL, design, robust=TRUE)
fit <- glmQLFit(ALL, design, robust=TRUE)

Design is as below:

  Batchbatch1 Batchbatch2 TRT1 TRT2 TRT3
NORM1      1 0 0 0 0
NORM2      0 1 0 0 0
NORM3      0 1 0 0 0
TRT1      1 0 1 0 0
TRT1      1 0 1 0 0
TRT1      1 0 1 0 0
TRT2      1 0 0 1 0
TRT2        1 0 0 1 0
TRT2      1 0 0 1 0
TRT3      1 0 0 0 1
TRT3      1 0 0 0 1
TRT3         1 0 0 0 1
         

#calculate Differential Expression

#TRT1

qlf_TRT1 = glmQLFTest(fit, coef = 3)

#TRT2

qlf_TRT2 = glmQLFTest(fit, coef = 4)

#TRT3

qlf_CARC = glmQLFTest(fit, coef = 6)

I want to make sure that above statements compare the TRT1, TRT2, and TRT3 against the normal because I used the relevel parameter in variable TM and also account for the batch effect as it is included in the design matrix and subsequent steps.

 

ADD COMMENT
0
Entering edit mode

Your design matrix is fine (though qlf_CARFC should use coef=5). I also hope you have performed filtering on abundance at some point. Note that NORM2 and NORM3 will not participate in the comparisons at all, as their fitted values are fully determined by the Batch2 coefficient; these samples are only used in dispersion estimation.

For future reference, new questions should be posted separately, rather than resurrecting old posts.

ADD REPLY

Login before adding your answer.

Traffic: 922 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6