Hello Bioconductor support,
I have read through several other posts similar to this topic but haven't found one quite like my case.
I'm currently analyzing a group of RNA-Seq samples of knockdown genes for differential expression of genes. The experimental setup is simple--four knockdown gene models and a wild type control (2 or 3 biological replicates/sample, depending on removal of outliers), where we want to see pairwise differential expression between each knockdown model and the control. However, the control samples were sequenced at a different time than the knockdown samples, leaving us vulnerable to confounding batch effect. When I ran a standard edgeR pipeline (generalized linear model option, with pairwise contrasts between each knockdown model and the control), I find ~75% overlap in the differentially expressed genes in the four knockdown models--much higher than we would have expected. Looking back at the raw read counts (generated from Tophat and HTSeq-count) I can clearly see a difference in read number between the controls and all four knockdowns in the differentially-expressed genes.
I put together a combined edgeR-SVA script to try and correct for the batch effect (shown below). I can clearly see the batch effect variable in the output of svaseq$sv (the first surrogate variable), where the coefficients for the wild-type samples are >0.35 while all other coefficients are between -0.35 and 0.06. However, when I try to incorporate those terms into my edgeR model for the glmQLFTest, I still get around 75% overlap in the differentially expressed genes in each model.
So, my questions are:
- Is it possible/valid to account for batch effect when doing a pairwise comparison between groups of samples with both experimental (knockdown vs. wild-type) and batch effects?
- If so, is my incorporation of the SVA batch effect coefficients into my design matrix for edgeR correct?
Thank you very much for your help!
Code:
#Load libraries
library("edgeR")
library("sva")
#Load raw counts file
biol_reps_data<-read.table("biol_reps_filter.txt", sep="\t", stringsAsFactors=FALSE,row.names=1,header=T)
#Convert to matrix and convert characters to numeric type
biol_reps_matrix<-as.matrix(biol_reps_data)
class(biol_reps_matrix)<-"numeric"
#Load data matrix into DGEList format to begin edgeR pipepline
groups<-c(rep("WT",3),rep("KD4",3),rep("KD2",2),rep("KD3",2),rep("KD1",3))
#Note: We removed outliers based on the unfiltered dendrogram.
y<-DGEList(biol_reps_matrix, group=groups)
#Filter variants: Recommended defaults here are 5 counts in smallest library, translated to counts/million (here ~23 million = 0.2 cpm), and in at least 3 samples (or expressed in all members of one group).
keep<-rowSums(cpm(y)>0.2)>=3
y<-y[keep, , keep.lib.sizes=FALSE] #Recalculate library sizes afterwards
#Calculate normalization factor
y<-calcNormFactors(y)
#Create full model matrix (all knockdown variables) and null model matrix (no variables, just intercept)
group<-y$samples$group
group<-factor(group,levels=c("WT","KD1","KD2","KD3","KD4")) #Put WT first so it can be replaced by intercept
mod = model.matrix(~group)
mod0 = cbind(mod[,1])
#Run SVA-Seq to identify nunber of latent variables and prepare effect statistics
svseq = svaseq(cpm(y),mod,mod0)
#Add SVA output to edgeR design matrix
design<-model.matrix(~ 0 + group)
design<-cbind(design,svseq$sv)
#Estimate dispersion of variances along model
y <- estimateDisp(y, design, robust=TRUE)
#Model the quasi-likelihood dispersions of variation within model
fit <- glmQLFit(y, design, robust=TRUE)
#Perform differential comparison tests. Include the first SVA component for batch effect
KD1<-glmQLFTest(fit, contrast=c(-1,1,0,0,0,-1,0,0))
KD2<-glmQLFTest(fit, contrast=c(-1,0,1,0,0,-1,0,0))
KD3<-glmQLFTest(fit, contrast=c(-1,0,0,1,0,-1,0,0))
KD4<-glmQLFTest(fit, contrast=c(-1,0,0,0,1,-1,0,0))
And here is my design matrix after combining SVA with edgeR:
groupWT groupKD1 groupKD2 groupKD3 groupKD4 1 1 0 0 0 0 0.43971823 0.2538267 -0.34694067 2 1 0 0 0 0 0.41845801 0.3303754 0.07421955 3 1 0 0 0 0 0.35834159 0.2728753 0.23699350 4 0 0 0 0 1 0.05039491 -0.3770501 0.08139615 5 0 0 0 0 1 0.06050462 -0.4323167 0.30362393 6 0 0 0 0 1 -0.04206468 -0.2853763 0.06568937 7 0 0 1 0 0 -0.30751819 0.1202135 -0.43753883 8 0 0 1 0 0 -0.21037194 -0.1273101 -0.39692925 9 0 0 0 1 0 0.07644279 -0.2255518 -0.33624544 10 0 0 0 1 0 0.14552734 -0.2711915 0.13777360 11 0 1 0 0 0 -0.28544004 0.2024568 0.30216039 12 0 1 0 0 0 -0.35634657 0.2767993 -0.05730942 13 0 1 0 0 0 -0.34764607 0.2622495 0.37310710

Hi Aaron,
Thank you very much for your quick response! I was afraid that would be the case, but it's good to hear this directly from someone else with a lot more experience.