Hello Bioconductor support,
I have read through several other posts similar to this topic but haven't found one quite like my case.
I'm currently analyzing a group of RNA-Seq samples of knockdown genes for differential expression of genes. The experimental setup is simple--four knockdown gene models and a wild type control (2 or 3 biological replicates/sample, depending on removal of outliers), where we want to see pairwise differential expression between each knockdown model and the control. However, the control samples were sequenced at a different time than the knockdown samples, leaving us vulnerable to confounding batch effect. When I ran a standard edgeR pipeline (generalized linear model option, with pairwise contrasts between each knockdown model and the control), I find ~75% overlap in the differentially expressed genes in the four knockdown models--much higher than we would have expected. Looking back at the raw read counts (generated from Tophat and HTSeq-count) I can clearly see a difference in read number between the controls and all four knockdowns in the differentially-expressed genes.
I put together a combined edgeR-SVA script to try and correct for the batch effect (shown below). I can clearly see the batch effect variable in the output of svaseq$sv
(the first surrogate variable), where the coefficients for the wild-type samples are >0.35 while all other coefficients are between -0.35 and 0.06. However, when I try to incorporate those terms into my edgeR model for the glmQLFTest
, I still get around 75% overlap in the differentially expressed genes in each model.
So, my questions are:
- Is it possible/valid to account for batch effect when doing a pairwise comparison between groups of samples with both experimental (knockdown vs. wild-type) and batch effects?
- If so, is my incorporation of the SVA batch effect coefficients into my design matrix for edgeR correct?
Thank you very much for your help!
Code:
#Load libraries library("edgeR") library("sva") #Load raw counts file biol_reps_data<-read.table("biol_reps_filter.txt", sep="\t", stringsAsFactors=FALSE,row.names=1,header=T) #Convert to matrix and convert characters to numeric type biol_reps_matrix<-as.matrix(biol_reps_data) class(biol_reps_matrix)<-"numeric" #Load data matrix into DGEList format to begin edgeR pipepline groups<-c(rep("WT",3),rep("KD4",3),rep("KD2",2),rep("KD3",2),rep("KD1",3)) #Note: We removed outliers based on the unfiltered dendrogram. y<-DGEList(biol_reps_matrix, group=groups) #Filter variants: Recommended defaults here are 5 counts in smallest library, translated to counts/million (here ~23 million = 0.2 cpm), and in at least 3 samples (or expressed in all members of one group). keep<-rowSums(cpm(y)>0.2)>=3 y<-y[keep, , keep.lib.sizes=FALSE] #Recalculate library sizes afterwards #Calculate normalization factor y<-calcNormFactors(y) #Create full model matrix (all knockdown variables) and null model matrix (no variables, just intercept) group<-y$samples$group group<-factor(group,levels=c("WT","KD1","KD2","KD3","KD4")) #Put WT first so it can be replaced by intercept mod = model.matrix(~group) mod0 = cbind(mod[,1]) #Run SVA-Seq to identify nunber of latent variables and prepare effect statistics svseq = svaseq(cpm(y),mod,mod0) #Add SVA output to edgeR design matrix design<-model.matrix(~ 0 + group) design<-cbind(design,svseq$sv) #Estimate dispersion of variances along model y <- estimateDisp(y, design, robust=TRUE) #Model the quasi-likelihood dispersions of variation within model fit <- glmQLFit(y, design, robust=TRUE) #Perform differential comparison tests. Include the first SVA component for batch effect KD1<-glmQLFTest(fit, contrast=c(-1,1,0,0,0,-1,0,0)) KD2<-glmQLFTest(fit, contrast=c(-1,0,1,0,0,-1,0,0)) KD3<-glmQLFTest(fit, contrast=c(-1,0,0,1,0,-1,0,0)) KD4<-glmQLFTest(fit, contrast=c(-1,0,0,0,1,-1,0,0))
And here is my design matrix after combining SVA with edgeR:
groupWT groupKD1 groupKD2 groupKD3 groupKD4 1 1 0 0 0 0 0.43971823 0.2538267 -0.34694067 2 1 0 0 0 0 0.41845801 0.3303754 0.07421955 3 1 0 0 0 0 0.35834159 0.2728753 0.23699350 4 0 0 0 0 1 0.05039491 -0.3770501 0.08139615 5 0 0 0 0 1 0.06050462 -0.4323167 0.30362393 6 0 0 0 0 1 -0.04206468 -0.2853763 0.06568937 7 0 0 1 0 0 -0.30751819 0.1202135 -0.43753883 8 0 0 1 0 0 -0.21037194 -0.1273101 -0.39692925 9 0 0 0 1 0 0.07644279 -0.2255518 -0.33624544 10 0 0 0 1 0 0.14552734 -0.2711915 0.13777360 11 0 1 0 0 0 -0.28544004 0.2024568 0.30216039 12 0 1 0 0 0 -0.35634657 0.2767993 -0.05730942 13 0 1 0 0 0 -0.34764607 0.2622495 0.37310710
Hi Aaron,
Thank you very much for your quick response! I was afraid that would be the case, but it's good to hear this directly from someone else with a lot more experience.