Question

Differential expression analysis: comparison of DESeq2 and edgeR robust results

1

Entering edit mode

Jane Merlevede ▴ 90

@jane-merlevede-5019

Last seen 7.1 years ago

Hello everybody,

I have analysed an experiment of ribodepleted samples using both DESeq2 and edgeR robust. I read that one can expect a concordance of 70-80% between both tools. Here, this is not the case.

In this experiment, there are 3 conditions : cell lines of patient with one specific mutation (6 samples), cell lines of patient with another specific mutation (6 samples) and cell lines of controls (3 samples).

I attached the comparison of edgeR and DESeq2 for:
1/ cell lines of patient (12 samples) vs cell lines of controls (3 samples): http://i.imgur.com/gYWJ26b.png?1
2/ cell lines of patient with one specific mutation (6 samples) vs cell lines of patient with another specific mutation (6 samples): http://i.imgur.com/7JClh40.png?1

In both comparisons, I compare all the deregulated genes (top panel), genes with padj<=0.001 (middle panel) and genes with at least an average of 100 reads in one of the conditions (bottom panel). I wanted to see if deregulated genes found by only one tool were enriched in low significant or lowly expressed genes, which does not seem to be the case.

In the first comparison, DEseq2 finds more up-regulated genes, whereas edgeR finds more down-regulated genes.
In the second comparison, barely one third of the deregulated genes are found by both methods.

Any idea for these discrepancies? What do you think about these results?
I can provide the code I used for more details.

Thank you in advance for your help,
Jane

deseq2 edgeR differential gene expression • 6.2k views

ADD COMMENT • link updated 9.5 years ago by Aaron Lun ★ 29k • written 9.5 years ago by Jane Merlevede ▴ 90

0

Entering edit mode

Put some code in your post to show how you ran each of the two analyses.

ADD REPLY • link 9.5 years ago Aaron Lun ★ 29k

0

Entering edit mode

Sure, here is the DESeq2 code for the first analysis:

library("DESeq2")

#####################################
### Creating CountDataSet object  ###
#####################################
Nom=c("PatientCellLine_ARN_COM","PatientCellLine_ARN_GAN","PatientCellLine_ARN_GEN","PatientCellLine_ARN_GRA","PatientCellLine_ARN_KOR","PatientCellLine_ARN_LEV","PatientCellLine_ARN_PAR","PatientCellLine_ARN_POP","PatientCellLine_ARN_RIP","PatientCellLine_ARN_SAU","PatientCellLine_ARN_SEG","PatientCellLine_ARN_SEV","Ctrl3_ARN_3","Ctr_ARN_2","Ctr_ARN_3")
Status=c("PatientCellLine", "PatientCellLine","PatientCellLine", "PatientCellLine", "PatientCellLine","PatientCellLine", "PatientCellLine", "PatientCellLine", "PatientCellLine","PatientCellLine", "PatientCellLine", "PatientCellLine", "Ctrl", "Ctrl", "Ctrl")

Fichier = paste(Nom,"gencod.htseqCount",sep=".")
DataFrame=data.frame(Nom,Fichier,Status)

################################################
###              Model                       ###
### Estimation of sizeFactors and Dispersion ###
################################################

dds_raw=DESeqDataSetFromHTSeqCount(DataFrame,"/home/PatientCellLine_ARN/Results/Data/17052016", ~Status) # 63568
colData(dds_raw)$Status = factor(colData(dds_raw)$Status, levels=c("Ctrl","PatientCellLine"))
colData(dds_raw)$Status = relevel(colData(dds_raw)$Status, "Ctrl")

write.table(counts(dds_raw), file="PatientCellLine_vs_3Ctrl_DESeq2_RawTableFull",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")

length( which( rowSums(counts(dds_raw)) ==0 ) ) #22671
dds <- dds_raw[ rowSums(counts(dds_raw)) > 1, ] #37227
dds=estimateSizeFactors(dds)
dds=estimateDispersions(dds)

#####################################
### Differential expression Tests ###
#####################################
dds=nbinomWaldTest(dds)

ddsFinal=dds
resultsNames(ddsFinal)
"Intercept"     "StatusCtrl"    "StatusPatientCellLine"

res1=results(ddsFinal,contrast=c(0,-1,1),pAdjustMethod="BH", cooksCutoff=FALSE, independentFiltering=TRUE,alpha=0.01)
sum(res1$padj < 0.01, na.rm=TRUE)
summary(res1)

resNoFilt=results(ddsFinal,contrast=c(0,-1,1),pAdjustMethod="BH", cooksCutoff=FALSE, independentFiltering=FALSE,alpha=0.01)
table(filtering=(res1$padj<0.01), noFiltering=(resNoFilt$padj<0.01))

res1$FoldChange=2^(res1$log2FoldChange)
res1$MeanPatientCellLine <- rowMeans(counts(ddsFinal,normalized=TRUE)[,c(1:12)])
res1$MeanCtrl <- rowMeans(counts(ddsFinal,normalized=TRUE)[,c(13:15)])
Max=pmax(res1$MeanCtrl,res1$MeanPatientCellLine)
ChosenThreshold=100
ind_lowexpressed=which(as.vector(Max)<ChosenThreshold)

### Differential genes ###
res_final=res1
padj=0.01

indSigIncrease=order(res1$padj)
Sig= which(res1$padj[indSigIncrease] <= padj)
indSig=indSigIncrease[Sig]
resSig=res1[indSig,]
dim(resSig) #695

#most strongly down regulated genes
ind_down=which(resSig$log2FoldChange<=0)
resSigDown=resSig[ind_down,]
dim(resSigDown) #252
MaxDown=pmax(resSigDown$MeanCtrl,resSigDown$MeanPatientCellLine)
ind_down_below100=which(as.vector(MaxDown)<ChosenThreshold)
length(ind_down_below100) #79

#most strongly up regulated genes
ind_up=which(resSig$log2FoldChange>=0)
resSigUp=resSig[ind_up,]
dim(resSigUp) #443
MaxUp=pmax(resSigUp$MeanCtrl,resSigUp$MeanPatientCellLine)
ind_up_below100=which(as.vector(MaxUp)<ChosenThreshold)
length(ind_up_below100) #150

ADD REPLY • link 9.5 years ago Jane Merlevede ▴ 90

0

Entering edit mode

And the edgeR code for the first analysis:

library("edgeR")
library("gplots")

rawdata <- read.delim("PatientCellLine_vs_3Ctrl_DESeq2_RawTableFull",header=TRUE, stringsAsFactors=FALSE,row.names="Gene" )
Status=c("PatientCellLine", "PatientCellLine","PatientCellLine", "PatientCellLine", "PatientCellLine","PatientCellLine", "PatientCellLine", "PatientCellLine", "PatientCellLine","PatientCellLine", "PatientCellLine", "PatientCellLine", "Ctrl", "Ctrl", "Ctrl")

#################
# design matrix #
#################
design <- model.matrix(~Status)
rownames(design) <- colnames(rawdata)
y <- DGEList(counts=rawdata, group=Status)
keep <- rowSums(cpm(y)>1) >= 1
y <- y[keep,]
dim(y) #17904

###################################################
# Differential expression & Dispersion estimation #
###################################################
y <- calcNormFactors(y)
scale <- y$samples$lib.size* y$samples$norm.factors
normCounts <- round(t(t(y$counts)/scale)*mean(scale)) #Normalized counts
y$normCounts=normCounts

y <- estimateGLMRobustDisp(y,design)

fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=ncol(fit$design))

summary(de <- decideTestsDGE(lrt,p.value=0.01,lfc=0))
detags <- rownames(y)[as.logical(de)]

table= topTags(lrt,n=nrow(y),p.value=1,sort.by="none")$table
table$baseMean=rowMeans(normCounts)
table$MeanPatientCellLine=rowMeans(normCounts[,1:12])
table$MeanCtrl=rowMeans(normCounts[,13:15])       

padj=0.01
ChosenThreshold=100

indSigIncrease=order(table$FDR)
Sig= which(table$FDR[indSigIncrease] <= padj)
indSig=indSigIncrease[Sig]
resSig=table[indSig,]
dim(resSig) #702

#most strongly down regulated genes
ind_down=which(resSig$logFC<=0)
resSigDown=resSig[ind_down,]
dim(resSigDown) #445
MaxDown=pmax(resSigDown$MeanPatientCellLine,resSigDown$MeanCtrl)
ind_down_below100=which(as.vector(MaxDown)<ChosenThreshold)
length(ind_down_below100) #95

#most strongly up regulated genes
ind_up=which(resSig$logFC>=0)
resSigUp=resSig[ind_up,]
dim(resSigUp) #257
MaxUp=pmax(resSigUp$MeanPatientCellLine,resSigUp$MeanCtrl)
ind_up_below100=which(as.vector(MaxUp)<ChosenThreshold)
length(ind_up_below100) #54

ADD REPLY • link 9.5 years ago Jane Merlevede ▴ 90

0

Entering edit mode

Here is the DESeq2 code for analysis 2:

library("DESeq2")

#####################################
### Creating CountDataSet object  ###
#####################################
Nom=c("PatientCellLine_ARN_COM","PatientCellLine_ARN_GAN","PatientCellLine_ARN_GEN","PatientCellLine_ARN_LEV","PatientCellLine_ARN_PAR","PatientCellLine_ARN_SAU","PatientCellLine_ARN_GRA","PatientCellLine_ARN_KOR","PatientCellLine_ARN_POP","PatientCellLine_ARN_RIP",
       "PatientCellLine_ARN_SEG","PatientCellLine_ARN_SEV")
MUT=c("Mut1", "Mut1","Mut1", "Mut1", "Mut1","Mut1", "Mut2", "Mut2","Mut2", "Mut2","Mut2", "Mut2")

Fichier = paste(Nom,"gencod.htseqCount",sep=".")
DataFrame=data.frame(Nom,Fichier,MUT)

################################################
###              Modèle                      ###
### Estimation of sizeFactors and Dispersion ###
################################################

dds_raw=DESeqDataSetFromHTSeqCount(DataFrame,"/home/PatientCellLine_ARN/Results/Data/17052016", ~MUT)
colData(dds_raw)$MUT = factor(colData(dds_raw)$MUT, levels=c("Mut2", "Mut1"))
colData(dds_raw)$MUT = relevel(colData(dds_raw)$MUT, "Mut2")

write.table(counts(dds_raw),file="PatientCellLine_Mut_DESeq2_RawTableFull",quote=FALSE,row.names=TRUE, col.names=TRUE,sep="\t")

sum( rowSums(counts(dds_raw)) ==0) #23622
dds <- dds_raw[ rowSums(counts(dds_raw)) > 1, ] #36179
dds=DESeq(dds,  minReplicatesForReplace=6)

#####################################
### Differential expression Tests ###
#####################################
ddsFinal=dds
resultsNames(ddsFinal)
"Intercept"    "Mut2" "Mut1"   

res1=results(ddsFinal,contrast=c(0,1,-1),pAdjustMethod="BH", cooksCutoff=TRUE, independentFiltering=TRUE,alpha=0.01)
sum(res1$padj < 0.01, na.rm=TRUE)

resNoFilt=results(ddsFinal,contrast=c(0,-1,1),pAdjustMethod="BH", cooksCutoff=TRUE, independentFiltering=FALSE,alpha=0.01)
table(filtering=(res1$padj<0.01), noFiltering=(resNoFilt$padj<0.01))

res1$FoldChange=2^(res1$log2FoldChange)
res1$MeanMut1 <- rowMeans(counts(ddsFinal,normalized=TRUE)[,1:6])
res1$MeanMut2 <- rowMeans(counts(ddsFinal,normalized=TRUE)[,6:12])
Max=pmax(res1$MeanMut1,res1$MeanMut2)
ChosenThreshold=100

### Differential genes ###
res_final=res1
padj=0.01
indSigIncrease=order(res1$padj)
Sig= which(res1$padj[indSigIncrease] <= padj)
indSig=indSigIncrease[Sig]
resSig=res1[indSig,]
dim(resSig) #145

#most strongly down regulated genes
ind_down=which(resSig$log2FoldChange<=0)
resSigDown=resSig[ind_down,]
dim(resSigDown) #65
MaxDown=pmax(resSigDown$MeanMut1,resSigDown$MeanMut2)
ind_down_below100=which(as.vector(MaxDown)<ChosenThreshold)
length(ind_down_below100) #21

#most strongly up regulated genes
ind_up=which(resSig$log2FoldChange>=0)
resSigUp=resSig[ind_up,]
dim(resSigUp) #80
MaxUp=pmax(resSigUp$MeanMut1,resSigUp$MeanMut2)
ind_up_below100=which(as.vector(MaxUp)<ChosenThreshold)
length(ind_up_below100) #22

ADD REPLY • link 9.5 years ago Jane Merlevede ▴ 90

0

Entering edit mode

Sorry for the length...

Finally, the edgeR code for analysis 2:

library("edgeR")

rawdata <- read.delim("PatientCellLine_MUT_DESeq2_RawTableFull",row.names="Gene",header=TRUE, stringsAsFactors=FALSE)
MUT= factor(c("Mut1", "Mut1","Mut1","Mut1", "Mut1","Mut1", "Mut2", "Mut2", "Mut2", "Mut2", "Mut2", "Mut2"))

#################
# design matrix #
#################
design <- model.matrix(~MUT)
rownames(design) <- colnames(rawdata)
y <- DGEList(counts=rawdata, group=MUT)
keep <- rowSums(cpm(y)>1) >= 1
y <- y[keep,]
dim(y) # 17556

##########################
# Differential expression & Dispersion estimation
##########################
y <- calcNormFactors(y)
scale <- y$samples$lib.size* y$samples$norm.factors
normCounts <- round(t(t(y$counts)/scale)*mean(scale))
y$normCounts=normCounts

y <- estimateGLMRobustDisp(y,design)

fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)

summary(de <- decideTestsDGE(lrt,p.value=0.01, lfc=0))
detags <- rownames(y)[as.logical(de)]
table= topTags(lrt,n=nrow(y),p.value=1,sort.by="none")$table
table$baseMean=rowMeans(normCounts)
table$MeanMut1=rowMeans(normCounts[,1:6])       
table$MeanMut2=rowMeans(normCounts[,7:12])

padj=0.01
ChosenThreshold=100
indSigIncrease=order(table$FDR)
Sig= which(table$FDR[indSigIncrease] <= padj)
indSig=indSigIncrease[Sig]
resSig=table[indSig,]
dim(resSig) #210

#most strongly down regulated genes
ind_down=which(resSig$logFC<=0)
resSigDown=resSig[ind_down,]
dim(resSigDown) #73
MaxDown=pmax(resSigDown$MeanMut1,resSigDown$MeanMut2)
ind_down_below100=which(as.vector(MaxDown)<ChosenThreshold)
length(ind_down_below100) #19

#most strongly up regulated genes
ind_up=which(resSig$logFC>=0)
resSigUp=resSig[ind_up,]
dim(resSigUp) #137
MaxUp=pmax(resSigUp$MeanMut1,resSigUp$MeanMut2)
ind_up_below100=which(as.vector(MaxUp)<ChosenThreshold)
length(ind_up_below100) #20

ADD REPLY • link 9.5 years ago Jane Merlevede ▴ 90

score 3 · Answer 1 · 2016-05-19

3

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 4 hours ago

The city by the bay

For starters, the filtering schemes are different. For DESeq2, there's over 35000 genes going into the DE analysis, whereas for edgeR, there's less than half that number. Obviously, if your inputs are different, then the results are going to be different as well. Use the same genes for both analyses, and for the sake of comparison, turn off independent filtering in results and just get the DE statistics for all genes.

Differences may also be due to the fact that the outlier protection strategies are different in DESeq2 with cook=TRUE and edgeR's estimateGLMRobustDisp. I don't use them much, but I would suggest doing more standard dispersion estimation in both pipelines and checking whether the differences are still present.

Also, in both edgeR and DESeq2, you're fitting separate models for each DE comparison. This is suboptimal at best, because you're not using all of the information to estimate the dispersion; and silly at worst, because you're treating the two sets of patient samples as replicates in the first analysis (which clearly they aren't, otherwise it wouldn't make much sense to compare between them). Just fit one model with three groups - patient 1, patient 2, and control - and then test for DE between groups. For example, for your first analysis, you can do something similar to the last contrast in section 3.2.3 of the edgeR user's guide.

Finally, try to cut down your code to a minimal working example. It's hard for people to figure out what's going on in big blobs of code.

P.S. Your calculation of "normalized counts" is something we don't recommend or support; better to use cpm to get normalized expression values. Or, if you want to identify lowly-expressed genes, just use the average log-CPM reported by topTable.

ADD COMMENT • link 9.5 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you a lot Aaron for your suggestions.

I planned to use the same filtering for both (removing "0 lines" and performing independent filtering with geneFilter in edgeR as well), but I did not succeed in applying independent filtering in edgeR. Well, I will simply remove "0 lines" in both for comparison, without independent filtering as you suggested.

In fact, I started to use edgeR, as suggested by Mike Love, because one known deregulated genes had high dispersion and was not found by DESeq2: DESEq2 - Why is "my gene" not differentially expressed?

I will use contrasts for testing the three groups. I had difficulties to write contrasts in edgeR in lrt <- glmLRT(fit,coef=2). I will take a closer look at the GLM approach described. Thank you.

ADD REPLY • link 9.5 years ago Jane Merlevede ▴ 90

0

Entering edit mode

I re-analysed my two comparisons using a single GLM with edgeR. It looks fine for "PatientWithMut1 - PatientWithMut2" but something is wrong for "Patient vs ctrl". I am not sure how to write the contrast.

I included the design matrix and results of glmLRT for the tested contrasts:

library("edgeR")
rawdata <- read.delim("RawTableFull",header=TRUE, stringsAsFactors=FALSE,row.names="Gene" )
Status=c("Ctrl", "Ctrl", "Ctrl", "Mut2", "Mut2","Mut2", "Mut2", "Mut2","Mut2", "Mut1", "Mut1", "Mut1","Mut1", "Mut1", "Mut1")

design <- model.matrix(~Status)
rownames(design) <- colnames(rawdata)
design
(Intercept) StatusMut2 StatusMut1
PatientCellLine_ARN_ctr1           1           0              0
PatientCellLine_ARN_ctr2           1           0              0
PatientCellLine_ARN_ctr3           1           0              0
PatientCellLine_ARN_COM           1           1              0
PatientCellLine_ARN_GAN           1           1              0
PatientCellLine_ARN_GEN           1           1              0
PatientCellLine_ARN_LEV           1           1              0
PatientCellLine_ARN_PAR           1           1              0
PatientCellLine_ARN_SAU           1           1              0
PatientCellLine_ARN_GRA           1           0              1
PatientCellLine_ARN_KOR           1           0              1
PatientCellLine_ARN_POP           1           0              1
PatientCellLine_ARN_RIP           1           0              1
PatientCellLine_ARN_SEG           1           0              1
PatientCellLine_ARN_SEV           1           0              1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$Status
[1] "contr.treatment"

y <- DGEList(counts=rawdata, group=Status)
keep <- rowSums(cpm(y)>1) >= 1
y <- y[keep,]
dim(y) #17904

# Differential expression & Dispersion estimation
y <- calcNormFactors(y)
y <- estimateGLMRobustDisp(y,design)
fit <- glmFit(y,design) #,robust=TRUE

# Comparison 1: Mut1 - Mut2
lrt1 <- glmLRT(fit,contrast=c(0,-1,1))
summary(de <- decideTestsDGE(lrt1,p.value=0.01,lfc=0))
-1    63
0  17727
1    114

# Comparison 2: Mut1 and Mut2 - Ctrl
lrt2 <- glmLRT(fit, contrast=c(-1,1,1))
summary(de <- decideTestsDGE(lrt2,p.value=0.01,lfc=0))
-1     4
0    420
1  17480

# Comparison 2bis: Average of Mut1 and Mut2 - Ctrl
lrt2bis <- glmLRT(fit, contrast=c(-1,0.5,0.5))
summary(de <- decideTestsDGE(lrt2bis,p.value=0.01,lfc=0))
-1     0
0     91
1  17813

# Comparison 2ter: Mut1 - Ctrl
lrt2ter <- glmLRT(fit, contrast=c(-1,0,1))
summary(de <- decideTestsDGE(lrt2ter,p.value=0.01,lfc=0))
-1     0
0     93
1  17811

There is something wrong even when simply testing PatientWithMut1 - ctrl... Could you please tell me what I am doing wrong in this comparison?

ADD REPLY • link 9.5 years ago Jane Merlevede ▴ 90

0

Entering edit mode

You've parametrized your design matrix with an intercept. The first coefficient represents the average expression in the control, while the second and third coefficients represent the log-fold changes of mutants 1 and 2, respectively, over the control. Thus, to test each mutant to the control you should be just dropping each coefficient:

contrast=c(0,1,0) # mutant 1 vs control
contrast=c(0,0,1) # mutant 2 vs control

To test the average of the mutants against the control, you need to average the terms corresponding to log-fold changes:

contrast=c(0,1/2,1/2) # average of mutant 1 and mutant2 vs control

ADD REPLY • link 9.5 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you Aaron, I did not understood well. To me, the version with no intercept seems easier to use.

I got much more plausible results:

lrt2 <- glmLRT(fit, contrast=c(0,1,1))

summary(de <- decideTestsDGE(lrt2,p.value=0.01,lfc=0))
-1   517
0  17089
1    298

Then, I tried the GLM without intercept. No problem for the first comparison, same contrast as previously, but I do not figure out what is the contrast for "Patient1+Patient2 vs Ctrl". I thought the contrast would be the following, but no:

design <- model.matrix(~0+Status)

> design
StatusCtrl StatusMut2 StatusMut1
ctrl1_ARN_FET          1           0              0
ctrl2_ARN_HNS          1           0              0
ctrl3_ARN_MID          1           0              0
PatientCellLine_ARN_COM          0           1              0
PatientCellLine_ARN_GAN          0           1              0
PatientCellLine_ARN_GEN          0           1              0
PatientCellLine_ARN_LEV          0           1              0
PatientCellLine_ARN_PAR          0           1              0
PatientCellLine_ARN_SAU          0           1              0
PatientCellLine_ARN_GRA          0           0              1
PatientCellLine_ARN_KOR          0           0              1
PatientCellLine_ARN_POP          0           0              1
PatientCellLine_ARN_RIP          0           0              1
PatientCellLine_ARN_SEG          0           0              1
PatientCellLine_ARN_SEV          0           0              1

lrt2nointercept <- glmLRT(fit, contrast=c(-1,1,1))
> summary(de <- decideTestsDGE(lrt2nointercept,p.value=0.01,lfc=0))
-1 17780
0    124
1      0

ADD REPLY • link 9.5 years ago Jane Merlevede ▴ 90

1

Entering edit mode

Remember, you want the average of the mutant groups. So you should be doing:

contrast=c(-1, 1/2, 1/2)

ADD REPLY • link 9.5 years ago Aaron Lun ★ 29k

0

Entering edit mode

Ok, I see.

lrt2nointercept <- glmLRT(fit, contrast=c(-1,0.5,0.5))
> summary(de <- decideTestsDGE(lrt2nointercept,p.value=0.01,lfc=0))
   [,1]
-1   517
0  17089
1    298

I will try your other suggestions and check if now the edgeR and DESeq2 results are more concordant.

Thank you a lot.

ADD REPLY • link 9.5 years ago Jane Merlevede ▴ 90