Hi all,
I am using EdgeR to perform a DE analysis on 3 different treatments with 3 replicates each but I have a strong batch effect on the day of experiments.
I've been reading the manual and the example exercises (although none is exactly like my case...) and I performed the following analyses:
library(limma)
library(edgeR)
library(ggplot2)
new<-read.delim("Trinity_trans.counts.matrix")
group<-factor(c("A","A","A","B","B","B","C","C","C"))
batch<-factor(c(4,2,3,1,2,3,1,2,3))
y<-DGEList(counts=new[,2:10], genes=new[,1:1],group=group)
y
design<-model.matrix(~0+group+batch, data=y$samples)
design
groupA groupB groupC batch2 batch3 batch4
A10 1 0 0 0 0 1
Exp2-4 1 0 0 1 0 0
A18 1 0 0 0 1 0
Exp1_2 0 1 0 0 0 0
Exp2_5 0 1 0 1 0 0
A17 0 1 0 0 1 0
Exp1_3 0 0 1 0 0 0
Exp2_6 0 0 1 1 0 0
A13 0 0 1 0 1 0
keep<-rowSums(cpm(y)>1)>=3
table(keep)
y<-y[keep,,keep.lib.sizes=FALSE]
z<-calcNormFactors(y)
z$samples
plotMDS(z)
z<-estimateDisp(z,design)
z$common.dispersion
plotBCV(z)
fit2<-glmQLFit(z,design,robust=TRUE)
plotQLDisp(fit2)
qlf<-glmQLFTest(fit2, coef=1:3)
topTags(qlf)
qlf1 <- glmQLFTest(fit2, contrast=c(-1,1,0,0,0,0))
topTags(qlf1)
qlf2 <- glmQLFTest(fit2, contrast=c(-1,0,1,0,0,0))
topTags(qlf2)
qlf3 <- glmQLFTest(fit2, contrast=c(-0,-1,1,0,0,0))
topTags(qlf3)
My problem is that I'm not completely sure that I'm removing the batch effect (since I'm not sure that the model considers the information on the 3 batch columns) and also because I don't have any differential expressed gene when performing the different pairwise treatment comparisons.
- Could you please see if this is the correct script for removing the batch effect (4 days of experiments)?
- Is there any step that I could be doing wrong that is causing zero DE genes?
Any help would be greatly appreciated.
Thanks in advance,
Patrícia Abrantes, PhD
GHTM, Instituto de Higiene e Medicina Tropical,
Universidade Nova de Lisboa, Portugal
A minor correction: the
design
in the original post will still work (in the sense that it's of full rank with non-zero degrees of freedom, and you can fit the model and get estimates for the coefficients and dispersion). It just means that the first sample won't be used for anything, resulting in some loss of power for the comparisons involving the first group.As for the lack of DE genes, I would suggest looking at a MDS plot and seeing whether there are samples from different groups mixing together. If so... well, there might not be any differences between groups. You'll also lose power from high dispersion estimates if the expression is highly variable; while this depends a lot on the context, my rule of thumb is that anything higher than 0.1 is concerning for RNA-seq experiments performed in laboratory (i.e., well-controlled) conditions.
Good point - it will 'work', it just won't be doing what the OP thought it was doing.