Question

Pairwise differential expression analysis with batch effect using edgeR and SVA

0

Entering edit mode

mpj5142 • 0

@mpj5142-12838

Last seen 7.0 years ago

Penn State University, University Park,…

Hello Bioconductor support,

I have read through several other posts similar to this topic but haven't found one quite like my case.

I'm currently analyzing a group of RNA-Seq samples of knockdown genes for differential expression of genes. The experimental setup is simple--four knockdown gene models and a wild type control (2 or 3 biological replicates/sample, depending on removal of outliers), where we want to see pairwise differential expression between each knockdown model and the control. However, the control samples were sequenced at a different time than the knockdown samples, leaving us vulnerable to confounding batch effect. When I ran a standard edgeR pipeline (generalized linear model option, with pairwise contrasts between each knockdown model and the control), I find ~75% overlap in the differentially expressed genes in the four knockdown models--much higher than we would have expected. Looking back at the raw read counts (generated from Tophat and HTSeq-count) I can clearly see a difference in read number between the controls and all four knockdowns in the differentially-expressed genes.

I put together a combined edgeR-SVA script to try and correct for the batch effect (shown below). I can clearly see the batch effect variable in the output of svaseq$sv (the first surrogate variable), where the coefficients for the wild-type samples are >0.35 while all other coefficients are between -0.35 and 0.06. However, when I try to incorporate those terms into my edgeR model for the glmQLFTest, I still get around 75% overlap in the differentially expressed genes in each model.

So, my questions are:

Is it possible/valid to account for batch effect when doing a pairwise comparison between groups of samples with both experimental (knockdown vs. wild-type) and batch effects?
If so, is my incorporation of the SVA batch effect coefficients into my design matrix for edgeR correct?

Thank you very much for your help!

Code:

#Load libraries
library("edgeR")
library("sva")

#Load raw counts file
biol_reps_data<-read.table("biol_reps_filter.txt", sep="\t", stringsAsFactors=FALSE,row.names=1,header=T)

#Convert to matrix and convert characters to numeric type
biol_reps_matrix<-as.matrix(biol_reps_data)
class(biol_reps_matrix)<-"numeric"

#Load data matrix into DGEList format to begin edgeR pipepline
groups<-c(rep("WT",3),rep("KD4",3),rep("KD2",2),rep("KD3",2),rep("KD1",3))
#Note: We removed outliers based on the unfiltered dendrogram.

y<-DGEList(biol_reps_matrix, group=groups)

#Filter variants: Recommended defaults here are 5 counts in smallest library, translated to counts/million (here ~23 million = 0.2 cpm), and in at least 3 samples (or expressed in all members of one group).
keep<-rowSums(cpm(y)>0.2)>=3
y<-y[keep, , keep.lib.sizes=FALSE] #Recalculate library sizes afterwards

#Calculate normalization factor
y<-calcNormFactors(y)

#Create full model matrix (all knockdown variables) and null model matrix (no variables, just intercept)
group<-y$samples$group
group<-factor(group,levels=c("WT","KD1","KD2","KD3","KD4")) #Put WT first so it can be replaced by intercept
mod = model.matrix(~group)
mod0 = cbind(mod[,1])

#Run SVA-Seq to identify nunber of latent variables and prepare effect statistics
svseq = svaseq(cpm(y),mod,mod0)

#Add SVA output to edgeR design matrix
design<-model.matrix(~ 0 + group)
design<-cbind(design,svseq$sv)
#Estimate dispersion of variances along model
y <- estimateDisp(y, design, robust=TRUE)
#Model the quasi-likelihood dispersions of variation within model
fit <- glmQLFit(y, design, robust=TRUE)

#Perform differential comparison tests. Include the first SVA component for batch effect
KD1<-glmQLFTest(fit, contrast=c(-1,1,0,0,0,-1,0,0))
KD2<-glmQLFTest(fit, contrast=c(-1,0,1,0,0,-1,0,0))
KD3<-glmQLFTest(fit, contrast=c(-1,0,0,1,0,-1,0,0))
KD4<-glmQLFTest(fit, contrast=c(-1,0,0,0,1,-1,0,0))

And here is my design matrix after combining SVA with edgeR:

         groupWT groupKD1 groupKD2 groupKD3 groupKD4                                   
1        1         0          0         0         0  0.43971823  0.2538267 -0.34694067
2        1         0          0         0         0  0.41845801  0.3303754  0.07421955
3        1         0          0         0         0  0.35834159  0.2728753  0.23699350
4        0         0          0         0         1  0.05039491 -0.3770501  0.08139615
5        0         0          0         0         1  0.06050462 -0.4323167  0.30362393
6        0         0          0         0         1 -0.04206468 -0.2853763  0.06568937
7        0         0          1         0         0 -0.30751819  0.1202135 -0.43753883
8        0         0          1         0         0 -0.21037194 -0.1273101 -0.39692925
9        0         0          0         1         0  0.07644279 -0.2255518 -0.33624544
10       0         0          0         1         0  0.14552734 -0.2711915  0.13777360
11       0         1          0         0         0 -0.28544004  0.2024568  0.30216039
12       0         1          0         0         0 -0.35634657  0.2767993 -0.05730942
13       0         1          0         0         0 -0.34764607  0.2622495  0.37310710

edger rna-seq sva differential expression • 2.2k views

ADD COMMENT • link updated 7.0 years ago by Aaron Lun ★ 28k • written 7.0 years ago by mpj5142 • 0

score 2 · Accepted Answer · 2017-04-13

The smaller problem is that your contrasts are wrong. You want to block on the latent factors, so they shouldn't be involved in any of the contrasts. You should be doing something like this instead:

KD1 <- glmQLFTest(fit, contrast=c(-1,1,0,0,0,0,0,0))

However, this only masks the larger problem, which is that the experimental design is fundamentally broken. If all your WT samples were prepared/sequenced in one batch, and all of your KD samples were prepared/sequenced in another batch, it is mathematically impossible to determine whether the differences between WT/KD groups are driven by biology or batch effects. No amount of statistical magic will change this. Indeed, it could be said that sva and friends aren't even appropriate here. Finding the surrogate variable/latent factor is not the problem - you already know the factor of variation, which is completely confounded with your conditions of interest. This is reflected quite nicely in the factors from sva, which correlate well to your KD predictors (i.e., all samples in the same group have similar values).

If I got a data set like this, I would consider it to be a very good reason to be angry. At my institute, anyone who wants to do genomics/sequencing must attend an experimental design session, explicitly to avoid this kind of problem with batch effects.