design and contrast matrix for RNA seq analysis with three different controls
1
1
Entering edit mode
lidixu.se • 0
@lidixuse-11370
Last seen 7.7 years ago

Dear all!

I am analysing RNA sequencing data by using edgeR package, but I am not sure about how should I make the design matrix. The counts table are obtained by using Rsubread package.

Firstly, I will describe what my samples look like. I have two stable doxycycline (Dox) inducible cell clones expressing either shCtrl(shC) or shTargetgene (shT). They were treated with Dox or without Dox. By adding Dox, shCtrl cells should not be affected, while shTargetgene cells will have knockdown of the target gene. My samples are in four different groups: C (control), C plus Dox, T and T plus Dox and I have prepared triplicates. Therefore, in total 12 samples: Group <- c(rep("C",3),rep("C+Dox",3),rep("T",3),rep("T+Dox",3)). Actually, all the three groups : "C", "C+Dox" and "T" are all considered to be controls for group "T+Dox".

My idea is that I want to get a list of genes that are differentially expressed between W plus and W groups but not between C plus and C group.

Another thing need to be mentioned is that: since I have collected the cells and prepared RNA in different days, the batch effect exits. Therefore, I used batch = as.factor(c(1,2,3,1,2,3,4,5,6,4,5,6)) in my case. 

How could I make a correct design? Should I make a contrast between T and T+Dox group and get top tags, subset the first list of genes with FDR < 0.05 and abs(logFC)>1. Then make another contrast between C and C+Dox and also subset the second list of genes with FDR < 0.05 and abs(logFC)>1. After that I substract the genes in the second list from the first gene lists.

Is that good to do the analysis in this way or is there any design can make things easier? Can you be nice and help me with it?

Bellow is the pipeline I want to use for the RNA seq analysis. This is a simplified pipeline. I have not added contrasts in the pipeline. I also attached sessionInfo().

 

#read csv file containing rawcounts

counts <- read.csv (file="rawcounts.csv",sep = ",")
colnames( counts ) <- paste(c("C_1","C_2","C_3","C plus Dox_1","C plus Dox_2","C plus Dox_3","T_1","T_2","T_3","T plus Dox_1","T plus Dox_2","T plus Dox_3"),sep=",")

#building the edgeR Object

Group <- c(rep("C",3),rep("C+Dox",3),rep("T",3),rep("T+Dox",3))
cds <- DGEList(counts, group = Group)
names( cds )
dim(cds)

#Filtering

keep <- rowSums(cpm(cds) >0.5) >= 2
cdsfilter <- cds[keep,,keep.lib.sizes= FALSE]
cdsfilter$samples
table(keep)
cdsNorm <- calcNormFactors(cdsfilter)
cdsNorm$samples

#Batch effect removal

batch = as.factor(c(1,2,3,1,2,3,4,5,6,4,5,6))
logFC <- predFC(cdsNorm,design,prior.count = 1, dispersion = 0.05)
cor(logFC[,1:6])
logCPM <- cpm(cdsNorm,log =TRUE, prior.count =5)
logCPM <- removeBatchEffect(logCPM,batch =batch)

# MDS plot

plotMDS(logCPM,lables= cdsNorm$Group, ylim = c(-1,1), col=c(1,1,1,2,2,2,3,3,3,4,4,4),main = "MDS plot")
legend("bottomleft", legend = c("shC-Dox","shC+Dox","shT-Dox","shT+Dox"),col= 1:4, pch=16,cex=1)

#The design matrix before fit GLMs

design <- model.matrix(~Group+batch)
rownames(design) <- colnames(cdsNorm)
design

# estimate the dispersion

cdsGLM <- estimateGLMCommonDisp(cdsNorm,design)
cdsGLM$common.dispersion
cdsTag <- estimateGLMTagwiseDisp(cdsGLM,design)
plotBCV(cdsTag,main="BCV plot")
plotMeanVar(cdsTag,show.raw =TRUE,show.tagwise= TRUE, show.binned = TRUE)

#Differentiatl expression(GLM model for multiple factor)

fit<-glmQLFit(cdsTag,design,robust =TRUE)
head(fit$coefficients)
plotQLDisp(fit)
lrt <- glmLRT(fit, coef =2)
topTags (lrt)
res.sig <- topTags (lrt)

 

> seesionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] biomaRt_2.28.0             org.Mm.eg.db_3.3.0         AnnotationDbi_1.34.4     
 [4] edgeR_3.14.0               DESeq2_1.12.4              BiocInstaller_1.22.3           
 [7]limma_3.28.20   GenomicAlignments_1.8.4   SummarizedExperiment_1.2.3 Biobase_2.32.0                     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6         RColorBrewer_1.1-2  plyr_1.8.4          bitops_1.0-6      
 [5] tools_3.3.1         zlibbioc_1.18.0     statmod_1.4.25      rpart_4.1-10      
 [9] RSQLite_1.0.0       annotate_1.50.0     gtable_0.2.0        lattice_0.20-33   
[13] Matrix_1.2-6        DBI_0.5             gridExtra_2.2.1     genefilter_1.54.2 
[17] cluster_2.0.4       locfit_1.5-9.1      grid_3.3.1          nnet_7.3-12       
[21] data.table_1.9.6    XML_3.98-1.4        survival_2.39-5     BiocParallel_1.6.6
[25] foreign_0.8-66      latticeExtra_0.6-28 Formula_1.2-1       org.Hs.eg.db_3.3.0
[29] GO.db_3.3.0         geneplotter_1.50.0  ggplot2_2.1.0       Hmisc_3.17-4      
[33] scales_0.4.0        splines_3.3.1       xtable_1.8-2        colorspace_1.2-6  
[37] acepack_1.3-3.3     RCurl_1.95-4.8      munsell_0.4.3       chron_2.3-47
bioconductor edgeR design and contrast matrix rnaseq differential gene expression • 2.6k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen just now
The city by the bay

First, I would set up the design like this (some renaming required to keep makeContrasts happy):

Group <- c(rep("C",3),rep("C_Dox",3),rep("T",3),rep("T_Dox",3))
batch <- as.factor(c(1,2,3,1,2,3,4,5,6,4,5,6))
design <- model.matrix(~0+Group+batch)
design <- design[,colnames(design)!="batch4"] # get to full rank

... and then perform the contrast by supplying the following to the contrast argument of glmLRT:

con <- makeContrasts((GroupT_Dox - GroupT) - (GroupC_Dox - GroupC), 
    levels=design) # i.e., differential log-fold change between T and C

This looks for a differential response to doxycycline between the sh-control and sh-target cell lines. In the simplest case, genes will be detected if they are DE upon Dox treatment in sh-target but not in sh-control. However, it's more general than that - for example, genes will also be picked up if they are DE upon Dox treatment in sh-target, but DE in the opposite direction upon Dox treatment in sh-control. In this manner, you can find all genes that behave differently between the two cell lines upon Dox treatment.

Now, onto some comments about your actual code:

  1. Consider filtering for CPMs above 0.5 in at least 3 libraries, given that you have 3 replicates per group.
  2. Consider using estimateDisp instead of estimateGLMCommonDisp, etc., as the former is newer.
  3. It doesn't make sense to use glmLRT with glmQLFit, use glmQLFTest instead.
ADD COMMENT
0
Entering edit mode

Thanks Aaron for your great help!
I have taken your suggestions and rewrite the R code, however, There is a problem with the new design :  

design <- model.matrix(~0+Group+batch)

It shows: Error in glmFit.default(y, design, offset = log(lib.size), dispersion = dispersion,  : Design matrix not of full rank.  The following coefficients not estimable:
 batch6

I suspected the problem is due to the batch set, therefore, I changed the 

batch <- as.factor(c(1,2,3,1,2,3,1,2,3,1,2,3))

Then the problem was solved. But I do not understand why. If I want to keep the batch set as before, how should I make the right design then? 

Bellow is my modified R code: 

#read csv file containing rawcounts

counts <- read.csv (file="rawcounts.csv",sep = ",")
colnames( counts ) <- paste(c("C_1","C_2","C_3","C_Dox_1","C_Dox_2","C_Dox_3","T_1","T_2","T_3","T_Dox_1","T_ Dox_2","T_Dox_3"),sep=",")

#building the edgeR Object

Group <- c(rep("C",3),rep("C_Dox",3),rep("T",3),rep("T_Dox",3))
cds <- DGEList(counts, group = Group)
names( cds )
dim(cds)

#Filtering

keep <- rowSums(cpm(cds) >1) >= 3
cdsfilter <- cds[keep,,keep.lib.sizes= FALSE]
cdsfilter$samples
table(keep)
cdsNorm <- calcNormFactors(cdsfilter)
cdsNorm$samples

#Batch effect removal

batch = as.factor(c(1,2,3,1,2,3,4,5,6,4,5,6))
logFC <- predFC(cdsNorm,design,prior.count = 1, dispersion = 0.05)
cor(logFC[,1:6])
logCPM <- cpm(cdsNorm,log =TRUE, prior.count =5)
logCPM <- removeBatchEffect(logCPM,batch =batch)

# MDS plot

plotMDS(logCPM,lables= cdsNorm$Group, ylim = c(-1,1), col=c(1,1,1,2,2,2,3,3,3,4,4,4),main = "MDS plot")
legend("bottomleft", legend = c("shC-Dox","shC+Dox","shT-Dox","shT+Dox"),col= 1:4, pch=16,cex=1)

#The design matrix before fit GLMs

design <- model.matrix(~0+Group+batch)
rownames(design) <- colnames(cdsNorm)
design

# estimate the dispersion

cdsGLM <- estimateDisp(cdsNorm,design)
cdsGLM$common.dispersion

plotBCV(cdsTag,main="BCV plot")
plotMeanVar(cdsTag,show.raw =TRUE,show.tagwise= TRUE, show.binned = TRUE)

#Differentiatl expression(GLM model for multiple factor)

fit<-glmQLFit(cdsTag,design,robust =TRUE)
head(fit$coefficients)
plotQLDisp(fit)

contrast <- makeContrasts ((GroupT_Dox -GroupT) - (GroupC_Dox-GroupC), levels =design)

lrt <- glmLRT(fit, contrast)

topTags (lrt)
ADD REPLY
0
Entering edit mode

Oops - you have to remove one of the batch4-6 coefficients to get to full rank, otherwise they are linearly dependent with the GroupT* coefficients. I've modified my response above (arbitrarily removing batch4).

ADD REPLY
0
Entering edit mode

Thank you very much Aaron! 

ADD REPLY

Login before adding your answer.

Traffic: 707 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6