Question

DESeq2 VS edgeR DEG results different in analysis same data

0

Entering edit mode

15958021290 ▴ 10

@15958021290-21573

Last seen 4.4 years ago

Hi. I have a problem about different DEG result from analysis same data using DESeq2 and edgeR. Here is two raw all DEG result without fiter based on |logFC| and padj/FDR

# DESeq2  23000 gene
    baseMean    log2FoldChange  lfcSE   stat    pvalue  padj
AT1G61780   849.0853127 -8.903695371    0.466024423 -19.10564109    2.27E-81    4.93E-77
AT5G37130   4294.715279 3.273271654 0.182620448 17.92390553 7.67E-72    8.34E-68
AT2G22330   1462.260281 -2.042049359    0.190228279 -10.73473076    6.99E-27    5.07E-23
AT4G39950   1767.094459 -1.715318859    0.189821306 -9.036492751    1.62E-19    8.79E-16
AT4G07850.1 157.2221618 -3.419929682    0.384430341 -8.896097194    5.78E-19    2.52E-15
AT4G33467   164.1142641 -3.886305717    0.469515739 -8.277263985    1.26E-16    4.57E-13
AT1G29025   139.9963086 3.812475165 0.494958859 7.702610213 1.33E-14    4.13E-11
AT5G39100   5286.22104  -1.419454643    0.184684843 -7.685820976    1.52E-14    4.13E-11
AT5G40780   308.4253424 1.599004982 0.21056858  7.593749184 3.11E-14    7.51E-11
AT4G37060   453.042214  -1.607152702    0.222616086 -7.219391609    5.22E-13    1.14E-09


# edgeR  19000gene
    logFC   logCPM  F   PValue  FDR
AT1G61780   -8.845660521    5.249802313 1871.270675 7.25E-11    1.38E-06
AT5G37130   3.283848639 7.594636928 310.2048595 9.66E-08    0.000922328
AT2G22330   -2.030610109    6.033101299 138.9722858 2.23E-06    0.014165433
AT4G39950   -1.70625125 6.305968363 104.290019  6.66E-06    0.031780368
AT5G40780   1.608654845 3.798079714 86.26828826 1.36E-05    0.051882335
AT4G07850.1 -3.404115495    2.81708925  78.44634595 1.94E-05    0.061573085
AT4G27410   -1.360683624    3.788481722 73.59416568 2.45E-05    0.06682167
AT4G37060   -1.596523553    4.346113775 70.67441336 2.84E-05    0.067857601
AT5G39100   -1.410359242    7.886167204 65.86720875 3.68E-05    0.078056
AT3G42670   -1.25508234 3.044920624 62.2533565  4.52E-05    0.086254526

As you can see. The logFC result is very similar for every gene. But the padj and FDR are very different! If I use |logFC|>1 & padj/FDR < 0.05,then DESeq2 will remain about 400 gene, but edgeR only leave 4 gene. I dont'know why ? Here is the process code :

# DESeq2
suppressPackageStartupMessages(library(DESeq2))
count <- read.csv("demo.csv",row.names = 1, stringsAsFactors =F)
head(count)
#           WT1  WT2  WT3  Al1  Al2  Al3
# AT1G01010 1301 1648 1121 1941 2307 1980
# AT1G01020  784  987  730  699  779 1033
# AT1G01030   36   54   33   30   28   39
# AT1G01040 1796 1950 1531 1433 2068 2463
# AT1G01046   13   10    6    8    9   27
# AT1G01050 1549 2030 1305 1332 1386 2593
condition <- factor(c(rep("wt",3), rep("Al", 3)), levels = c("wt", "Al"))
coldata <- data.frame(row.names = colnames(count), condition)
coldata
#     condition
# WT1        wt
# WT2        wt
# WT3        wt
# Al1        Al
# Al2        Al
# Al3        Al
dds <- DESeqDataSetFromMatrix(count,coldata,design = ~ condition)
dds
# class: DESeqDataSet 
# dim: 33610 6 
# metadata(1): version
# assays(1): counts
# rownames(33610): AT1G01010 AT1G01020 ... ATMG01400 ATMG01410
# rowData names(0):
#   colnames(6): WT1 WT2 ... Al2 Al3
# colData names(1): condition
dds <- dds[rowSums(counts(dds)) >= 6,] # filter rowSum of count <= 6
dds2 <- DESeq(dds)
res <- results(dds2)
resOrder <- res[order(res$padj),]
dim(resOrder)
#[1] 23086     6
write.csv(resOrder, "DESeq2_result_all.csv")
deg <- subset(resOrder, padj <= 0.05 & abs(log2FoldChange) >= 1) # get most significant deg
# dim(deg)
# [1] 443   6
write.csv(deg,"DESeq2_result_significant.csv")

# edgeR prcess code
suppressPackageStartupMessages(library(edgeR))
data <- read.csv("demo.csv",row.names = 1,stringsAsFactors = F)
head(data)
#           WT1  WT2  WT3  Al1  Al2  Al3
# AT1G01010 1301 1648 1121 1941 2307 1980
# AT1G01020  784  987  730  699  779 1033
# AT1G01030   36   54   33   30   28   39
# AT1G01040 1796 1950 1531 1433 2068 2463
# AT1G01046   13   10    6    8    9   27
# AT1G01050 1549 2030 1305 1332 1386 2593
group <- factor(c(rep("wt",3),rep("Al",3)),levels = c("wt","Al"))
group
# [1] wt wt wt Al Al Al
# Levels: wt Al
y <- DGEList(data,group)
keep <- rowSums(cpm(y$counts)>1) >=2
y <- y[keep,,keep.lib.sizes=F]
y <- calcNormFactors(y)
design2 <- model.matrix(~group)
design2
# (Intercept) groupAl
# 1           1       0
# 2           1       0
# 3           1       0
# 4           1       1
# 5           1       1
# 6           1       1
# attr(,"assign")
# [1] 0 1
# attr(,"contrasts")
# attr(,"contrasts")$group
# [1] "contr.treatment"
y <- estimateDisp(y,design2)
fit <- glmQLFit(y,design2)
qlf <- glmQLFTest(fit,coef = 2)
b <- topTags(qlf,n=nrow(fit$counts))
dim(b$table)
#[1] 19477     5
head(b$table)
write.csv(b,file = "edgeR_allDiff.csv")
edgeR_sigDiff.gene <- subset(b$table,FDR <0.05 & abs(logFC)>1)
dim(edgeR_sigDiff.gene)
#[1] 4 5
write.csv(edgeR_sigDiff.gene,file = "edgeR_sigDiff.csv")

Can someone give valueable suggestion or ideas! Thanks a lot!

DESeq2 edgeR • 817 views

ADD COMMENT • link updated 4.6 years ago by Gordon Smyth 50k • written 4.6 years ago by 15958021290 ▴ 10

0

Entering edit mode

You say that these are results without filtering, but the edgeR pipelines all assume that you have filtered on expression. You can't omit the filtering step.
No one will be able to help you unless provide the code that produced the output you show.

ADD REPLY • link 4.6 years ago Gordon Smyth 50k

0

Entering edit mode

Thank your for suggestion! I have edit my question! And I know your No.1 point. What I want to say is "without filter raw deg result based on |logFC| and padj/FDR. Not "without filter count" .I know both edgeR and DESeq2 would filter low count in the process.

ADD REPLY • link 4.6 years ago 15958021290 ▴ 10

score 3 · Accepted Answer · 2019-10-07

The DESeq2 statistical tests are analogous to the edgeR glmLRT tests in that they treat the estimated dispersion parameters as if known. Both DESeq2 and edgeR-LRT can give very small p-values and are (in my experience) slightly anti-conservative.

The edgeR quasi-likelihood pipeline is intrinsically more conservative that DESeq2 or edgeR-LRT and will almost always give larger p-values. It takes into account the uncertainty with which the dispersion is estimated and hence gives more rigorous error rate control. I myself use edgeR-QL rather than edgeR-LRT in most cases for this reason.

Having said that, you could get a little more power from edgeR-QL by adding robust=TRUE to the glmQLFTest call and by using keep <- filterByExpr(y) in place of your current filtering step.

BTW, I not recommend the use of a FC filter for the DE results. It probably has relatively little effect here anyway.