edgeR and batch effect
5
1
Entering edit mode
@abrantespatricia-13177
Last seen 5.4 years ago

Hi all,
I am using EdgeR to perform a DE analysis on 3 different treatments with 3 replicates each but I have a strong batch effect on the day of experiments.
I've been reading the manual and the example exercises (although none is exactly like my case...) and I performed the following analyses:

library(limma)
library(edgeR)
library(ggplot2)

group<-factor(c("A","A","A","B","B","B","C","C","C"))
batch<-factor(c(4,2,3,1,2,3,1,2,3))
y<-DGEList(counts=new[,2:10], genes=new[,1:1],group=group)
y
design<-model.matrix(~0+group+batch, data=y$samples) design groupA groupB groupC batch2 batch3 batch4 A10 1 0 0 0 0 1 Exp2-4 1 0 0 1 0 0 A18 1 0 0 0 1 0 Exp1_2 0 1 0 0 0 0 Exp2_5 0 1 0 1 0 0 A17 0 1 0 0 1 0 Exp1_3 0 0 1 0 0 0 Exp2_6 0 0 1 1 0 0 A13 0 0 1 0 1 0 keep<-rowSums(cpm(y)>1)>=3 table(keep) y<-y[keep,,keep.lib.sizes=FALSE] z<-calcNormFactors(y) z$samples
plotMDS(z)

z<-estimateDisp(z,design)
z$common.dispersion plotBCV(z) fit2<-glmQLFit(z,design,robust=TRUE) plotQLDisp(fit2) qlf<-glmQLFTest(fit2, coef=1:3) topTags(qlf) qlf1 <- glmQLFTest(fit2, contrast=c(-1,1,0,0,0,0)) topTags(qlf1) qlf2 <- glmQLFTest(fit2, contrast=c(-1,0,1,0,0,0)) topTags(qlf2) qlf3 <- glmQLFTest(fit2, contrast=c(-0,-1,1,0,0,0)) topTags(qlf3) My problem is that I'm not completely sure that I'm removing the batch effect (since I'm not sure that the model considers the information on the 3 batch columns) and also because I don't have any differential expressed gene when performing the different pairwise treatment comparisons. - Could you please see if this is the correct script for removing the batch effect (4 days of experiments)? - Is there any step that I could be doing wrong that is causing zero DE genes? Any help would be greatly appreciated. Thanks in advance, Patrícia Abrantes, PhD GHTM, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa, Portugal edger • 6.2k views ADD COMMENT 1 Entering edit mode @james-w-macdonald-5106 Last seen 59 minutes ago United States You can only fit a batch effect if you have complete replication of all sample types in each batch. So your design won't work because the first sample (which is the only one in batch 4) will be ignored. You could use limma voom and account for the batch effect using duplicateCorrelation, however. ADD COMMENT 1 Entering edit mode A minor correction: the design in the original post will still work (in the sense that it's of full rank with non-zero degrees of freedom, and you can fit the model and get estimates for the coefficients and dispersion). It just means that the first sample won't be used for anything, resulting in some loss of power for the comparisons involving the first group. As for the lack of DE genes, I would suggest looking at a MDS plot and seeing whether there are samples from different groups mixing together. If so... well, there might not be any differences between groups. You'll also lose power from high dispersion estimates if the expression is highly variable; while this depends a lot on the context, my rule of thumb is that anything higher than 0.1 is concerning for RNA-seq experiments performed in laboratory (i.e., well-controlled) conditions. ADD REPLY 0 Entering edit mode Good point - it will 'work', it just won't be doing what the OP thought it was doing. ADD REPLY 0 Entering edit mode @abrantespatricia-13177 Last seen 5.4 years ago Hi, thanks again for you comments. Best regards, Patrícia ADD COMMENT 0 Entering edit mode @abrantespatricia-13177 Last seen 5.4 years ago Hi again, if I want to analyse treatment but want to adjust for batch effect what is the correct way of using model.matrix? design<-model.matrix(~0+group+batch, data=y$samples)
or

design<-model.matrix(~0+batch+group, data=y$samples)? Sorry about so simple question but I get completely different results depending on the order of batch+group or group+batch and from the examples in the manual it is not clear for me. I have already seen examples with the 2 options... Thanks in advance, Patrícia Abrantes ADD COMMENT 0 Entering edit mode The two models are mathematically identical, differing only in the order of coefficients. Obviously, this means that you'll have to change the coef that you drop in glmQLFTest, otherwise you'll be testing for significant differences between batches rather than for a significant difference between groups. For example: batch <- factor(c(1,2,1,2)) group <- factor(LETTERS[c(1,1,2,2)]) model.matrix(~0 + batch + group) # last coefficient = difference between groups model.matrix(~0 + group + batch) # last coefficient = difference between batches P.S. Don't post a reply as an answer unless you're answering your own question; use comments instead. ADD REPLY 0 Entering edit mode @abrantespatricia-13177 Last seen 5.4 years ago Hi Aaron, thanks for your quick reply. Patrícia ADD COMMENT 0 Entering edit mode sutturka • 0 @sutturka-14580 Last seen 9 months ago United States RNASeq experiment: Control (3 replicates) and 3 Treatments (3 replicates each). sequencing for 2 replicates of Control sample was performed 2 years later which adds very high sequencing depth and data is generated on different instrument.  Sample_Name Treatment Sample NORM1 NORM Batch1 NORM2 NORM Batch2 NORM3 NORM Batch2 TRT1_1 TRT1 Batch1 TRT1_2 TRT1 Batch1 TRT1_3 TRT1 Batch1 TRT2_1 TRT2 Batch1 TRT2_2 TRT2 Batch1 TRT2_3 TRT2 Batch1 TRT3_1 TRT3 Batch1 TRT3_2 TRT3 Batch1 TRT3_3 TRT3 Batch1 Can you please help to make sure below code is correct? #Prepare the design Matrix TM = factor(samples$treatment)
TM = relevel(TM, ref="NORM")
Batch = factor(samples\$batch)

ALL = DGEList(counts=as.matrix(countData), group= TM)
ALL <- calcNormFactors(ALL)

# Design matrix with considering batch effect
design <- model.matrix(~0+Batch+TM)
rownames(design) = colnames(ALL)

#calculate dispersion

ALL <- estimateDisp(ALL, design, robust=TRUE)
fit <- glmQLFit(ALL, design, robust=TRUE)

Design is as below:

 Batchbatch1 Batchbatch2 TRT1 TRT2 TRT3 NORM1 1 0 0 0 0 NORM2 0 1 0 0 0 NORM3 0 1 0 0 0 TRT1 1 0 1 0 0 TRT1 1 0 1 0 0 TRT1 1 0 1 0 0 TRT2 1 0 0 1 0 TRT2 1 0 0 1 0 TRT2 1 0 0 1 0 TRT3 1 0 0 0 1 TRT3 1 0 0 0 1 TRT3 1 0 0 0 1

#calculate Differential Expression

#TRT1

qlf_TRT1 = glmQLFTest(fit, coef = 3)

#TRT2

qlf_TRT2 = glmQLFTest(fit, coef = 4)

#TRT3

qlf_CARC = glmQLFTest(fit, coef = 6)

I want to make sure that above statements compare the TRT1, TRT2, and TRT3 against the normal because I used the relevel parameter in variable TM and also account for the batch effect as it is included in the design matrix and subsequent steps.

0
Entering edit mode

Your design matrix is fine (though qlf_CARFC should use coef=5). I also hope you have performed filtering on abundance at some point. Note that NORM2 and NORM3 will not participate in the comparisons at all, as their fitted values are fully determined by the Batch2 coefficient; these samples are only used in dispersion estimation.

For future reference, new questions should be posted separately, rather than resurrecting old posts.