Question

edgeR and batch effect

1

Entering edit mode

abrantes.patricia ▴ 10

@abrantespatricia-13177

Last seen 6.8 years ago

Hi all,
I am using EdgeR to perform a DE analysis on 3 different treatments with 3 replicates each but I have a strong batch effect on the day of experiments.
I've been reading the manual and the example exercises (although none is exactly like my case...) and I performed the following analyses:

library(limma)
library(edgeR)
library(ggplot2)

new<-read.delim("Trinity_trans.counts.matrix")
group<-factor(c("A","A","A","B","B","B","C","C","C"))
batch<-factor(c(4,2,3,1,2,3,1,2,3))
y<-DGEList(counts=new[,2:10], genes=new[,1:1],group=group)
y
design<-model.matrix(~0+group+batch, data=y$samples)
design
            groupA groupB groupC batch2 batch3 batch4
A10                        1          0          0            0        0        1
Exp2-4                   1          0          0            1        0        0
A18                        1        0          0            0        1        0
Exp1_2                  0          1          0            0        0        0
Exp2_5                  0          1          0            1        0        0
A17                        0          1          0            0        1        0
Exp1_3                  0          0          1            0         0       0
Exp2_6                  0          0          1            1         0       0
A13                        0          0          1            0         1       0

keep<-rowSums(cpm(y)>1)>=3
table(keep)
y<-y[keep,,keep.lib.sizes=FALSE]

z<-calcNormFactors(y)
z$samples
plotMDS(z)

z<-estimateDisp(z,design)
z$common.dispersion
plotBCV(z)

fit2<-glmQLFit(z,design,robust=TRUE)
plotQLDisp(fit2)
qlf<-glmQLFTest(fit2, coef=1:3)
topTags(qlf)

qlf1 <- glmQLFTest(fit2, contrast=c(-1,1,0,0,0,0))
topTags(qlf1)

qlf2 <- glmQLFTest(fit2, contrast=c(-1,0,1,0,0,0))
topTags(qlf2)

qlf3 <- glmQLFTest(fit2, contrast=c(-0,-1,1,0,0,0))
topTags(qlf3)

My problem is that I'm not completely sure that I'm removing the batch effect (since I'm not sure that the model considers the information on the 3 batch columns) and also because I don't have any differential expressed gene when performing the different pairwise treatment comparisons.
- Could you please see if this is the correct script for removing the batch effect (4 days of experiments)?
- Is there any step that I could be doing wrong that is causing zero DE genes?
Any help would be greatly appreciated.

Thanks in advance,

Patrícia Abrantes, PhD

GHTM, Instituto de Higiene e Medicina Tropical,

Universidade Nova de Lisboa, Portugal

edger • 8.0k views

ADD COMMENT • link updated 6.4 years ago by sutturka • 0 • written 6.9 years ago by abrantes.patricia ▴ 10

score 1 · Answer 1 · 2017-06-02

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 20 minutes ago

United States

You can only fit a batch effect if you have complete replication of all sample types in each batch. So your design won't work because the first sample (which is the only one in batch 4) will be ignored. You could use limma voom and account for the batch effect using duplicateCorrelation, however.

ADD COMMENT • link 6.9 years ago James W. MacDonald 65k

1

Entering edit mode

A minor correction: the design in the original post will still work (in the sense that it's of full rank with non-zero degrees of freedom, and you can fit the model and get estimates for the coefficients and dispersion). It just means that the first sample won't be used for anything, resulting in some loss of power for the comparisons involving the first group.

As for the lack of DE genes, I would suggest looking at a MDS plot and seeing whether there are samples from different groups mixing together. If so... well, there might not be any differences between groups. You'll also lose power from high dispersion estimates if the expression is highly variable; while this depends a lot on the context, my rule of thumb is that anything higher than 0.1 is concerning for RNA-seq experiments performed in laboratory (i.e., well-controlled) conditions.

ADD REPLY • link 6.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

Good point - it will 'work', it just won't be doing what the OP thought it was doing.

ADD REPLY • link 6.9 years ago James W. MacDonald 65k

score 0 · Answer 2 · 2017-06-05

0

Entering edit mode

abrantes.patricia ▴ 10

@abrantespatricia-13177

Last seen 6.8 years ago

Hi,

thanks again for you comments.

Best regards,

Patrícia

ADD COMMENT • link 6.9 years ago abrantes.patricia ▴ 10

score 0 · Answer 3 · 2017-06-08

0

Entering edit mode

abrantes.patricia ▴ 10

@abrantespatricia-13177

Last seen 6.8 years ago

Hi again,

if I want to analyse treatment but want to adjust for batch effect what is the correct way of using model.matrix?

design<-model.matrix(~0+group+batch, data=y$samples)
or

design<-model.matrix(~0+batch+group, data=y$samples)?

Sorry about so simple question but I get completely different results depending on the order of batch+group or group+batch and from the examples in the manual it is not clear for me.

I have already seen examples with the 2 options...

Thanks in advance,

Patrícia Abrantes

ADD COMMENT • link 6.9 years ago abrantes.patricia ▴ 10

0

Entering edit mode

The two models are mathematically identical, differing only in the order of coefficients. Obviously, this means that you'll have to change the coef that you drop in glmQLFTest, otherwise you'll be testing for significant differences between batches rather than for a significant difference between groups. For example:

batch <- factor(c(1,2,1,2))
group <- factor(LETTERS[c(1,1,2,2)])
model.matrix(~0 + batch + group)  # last coefficient = difference between groups
model.matrix(~0 + group + batch)  # last coefficient = difference between batches

P.S. Don't post a reply as an answer unless you're answering your own question; use comments instead.

ADD REPLY • link 6.9 years ago Aaron Lun ★ 28k

score 0 · Answer 4 · 2017-06-08

0

Entering edit mode

abrantes.patricia ▴ 10

@abrantespatricia-13177

Last seen 6.8 years ago

Hi Aaron,

thanks for your quick reply.

Patrícia

ADD COMMENT • link 6.9 years ago abrantes.patricia ▴ 10

score 0 · Answer 5 · 2017-12-07

RNASeq experiment:

Control (3 replicates) and 3 Treatments (3 replicates each).

sequencing for 2 replicates of Control sample was performed 2 years later which adds very high sequencing depth and data is generated on different instrument.

Sample_Name	Treatment	Sample
NORM1	NORM	Batch1
NORM2	NORM	Batch2
NORM3	NORM	Batch2
TRT1_1	TRT1	Batch1
TRT1_2	TRT1	Batch1
TRT1_3	TRT1	Batch1
TRT2_1	TRT2	Batch1
TRT2_2	TRT2	Batch1
TRT2_3	TRT2	Batch1
TRT3_1	TRT3	Batch1
TRT3_2	TRT3	Batch1
TRT3_3	TRT3	Batch1

Can you please help to make sure below code is correct?

#Prepare the design Matrix
TM = factor(samples$treatment)
TM = relevel(TM, ref="NORM")
Batch = factor(samples$batch)

ALL = DGEList(counts=as.matrix(countData), group= TM)
ALL <- calcNormFactors(ALL)

# Design matrix with considering batch effect
design <- model.matrix(~0+Batch+TM)
rownames(design) = colnames(ALL)

#calculate dispersion

ALL <- estimateDisp(ALL, design, robust=TRUE)
fit <- glmQLFit(ALL, design, robust=TRUE)

Design is as below:

	Batchbatch1	Batchbatch2	TRT1	TRT2	TRT3
NORM1	1	0	0	0	0
NORM2	0	1	0	0	0
NORM3	0	1	0	0	0
TRT1	1	0	1	0	0
TRT1	1	0	1	0	0
TRT1	1	0	1	0	0
TRT2	1	0	0	1	0
TRT2	1	0	0	1	0
TRT2	1	0	0	1	0
TRT3	1	0	0	0	1
TRT3	1	0	0	0	1
TRT3	1	0	0	0	1

#calculate Differential Expression

#TRT1

qlf_TRT1 = glmQLFTest(fit, coef = 3)

#TRT2

qlf_TRT2 = glmQLFTest(fit, coef = 4)

#TRT3

qlf_CARC = glmQLFTest(fit, coef = 6)

I want to make sure that above statements compare the TRT1, TRT2, and TRT3 against the normal because I used the relevel parameter in variable TM and also account for the batch effect as it is included in the design matrix and subsequent steps.