Question

Why DESeq2 and edgeR give different results?

0

Entering edit mode

Biologist ▴ 120

@biologist-9801

Last seen 5.8 years ago

Dear All,

I have a matrix "U2" with 44 samples (41 Disease + 3 Normal) as columns and approx. 15k genes as rows. "coldata" is where samples as rows and columns "Subtypes" and "Type". Column "Type" is with Disease and Normal.

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = U2, colData = coldata, design = ~ Type)
dds

class: DESeqDataSet 
dim: 15754 44 
metadata(1): version
assays(1): counts
rownames(15754): A1BG-AS1 A2M-AS1 ... ZSWIM8-AS1 hsa-mir-1253
rowData names(0):
colnames(44): AT PT ... CT OT
colData names(2): Subtypes Type

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

nrow(dds)
[1] 11755

dds$Type <- factor(dds$Type, levels = c("Normal","Disease"))
dds <- DESeq(dds)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 1206 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

res <- results(dds)
res

res <- results(dds, name="Type_Disease_vs_Normal")
res <- results(dds, contrast=c("Type","Disease","Normal"))
resultsNames(dds)

[1] "Intercept"           "Type_Disease_vs_Normal"

Differential analysis:

res05 <- results(dds, lfcThreshold = 1.2)
table(res05$padj < 0.05)
summary(res05)

out of 11755 with nonzero total read count
adjusted p-value < 0.1
LFC > 1.20 (up)    : 2, 0.016%
LFC < -1.20 (down) : 0, 0%
outliers [1]       : 72, 0.57%
low counts [2]     : 33, 0.26%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

With DeSeq2 I see only two upregulated and no downregulated genes using lofFC 1.2 and FDR < 0.05.

I also used same data with edgeR.

library(edgeR)
group <- factor(paste0(samples2$Type))
y <- DGEList(U2,group = group)

design2 <- model.matrix(~ 0 + group)
colnames(design2) <- c("Normal","Disease")
design2
keep <- filterByExpr(y, design2)
keep <- rowSums(cpm(y) > 0.5) >= 1

table(keep)
keep
FALSE  TRUE 
 1573 14181

summary(keep)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y,method = "TMM")

y <- estimateDisp(y, design2, robust=TRUE)

y$common.dispersion
[1] 1.428964

fit <- glmQLFit(y, design2, robust=TRUE)
head(fit$coefficients)

contrast.matrix <- makeContrasts(Disease-Normal, levels=design2)
contrast.matrix

qlf <- glmQLFTest(fit, contrast=contrast.matrix)
topTags(qlf)
summary(decideTestsDGE(qlf))

tr <- glmTreat(fit, contrast=contrast.matrix, lfc=log2(1.2))
topTags(tr)
tab2 <- topTags(tr,n=Inf)
keep <- tab2$table$FDR <= 0.05

summary(decideTestsDGE(tr))
       -1*Normal 1*Disease
Down                203
NotSig            13974
Up                    4

Questions:

1) Is my analysis with DeSeq2 right? I feel so because it doesn't show any down regulated genes.

2) with deseq2 there are two upregulated and with edgeR there are four. And none of them are same.

3) My analysis is with mainly lncRNAs. These have very very low expression values. So, Is there a better way to do differential analysis?

Thank you

deseq2 rnaseq differential gene expression edger • 3.4k views

ADD COMMENT • link updated 4.6 years ago by Gordon Smyth 53k • written 7.6 years ago by Biologist ▴ 120

score 3 · Answer 1 · 2018-05-25

Dear Biologist,

I and my colleagues have answered a few questions from you over the past year or so, so I'm going to take the liberty of giving some advice about DE, data quality and data exploration.

Filtering

First a comment on filtering. I replied to a series of questions from you on filtering for edgeR, so I am surprised that you haven't followed the advice that we gave you. Instead you've kept thousands of genes in your edgeR analysis that have only one non-zero count or have fewer than 10 read counts in total. What do you hope to learn from genes with such sparse counts? If you publish results for genes with data like that, it will cast doubt on the scientific accuracy of your results.

Sequencing depth

You haven't said anything about sequencing depth in any of your posts. I am guessing that the library sizes for your data are relatively small. If you want to learn about genes or lncRNAs at low expression levels, then obviously you need to sequence at sufficient depth to allow those RNAs to be reliably detected and to generate sufficient read counts for meaningful analysis.

Variability

Your replicate libraries are tremendously heterogeneous and inconsistent. edgeR tells you that the biological coefficient of variation of your replicates is well over 100%. You have to be cautious making conclusions from data that is as variable as this, when samples in the same disease group are so different from one another.

Have you made any plots of your data to examine the variability? plotMDS(y) would an obvious place to start. Do the disease and normal groups separate from one another? Are there any recognizable outlier samples that could be removed or down-weighted? Are there any batch effects that you have not accounted for? Are there disease subtypes that you have not accounted for? Do you need to look for hidden batch effects (by searching for surrogate variables or running one of the RUV algorithms)? Or is your data perhaps just poor quality? You don't mention having examined any of these things.

It's not the DE method

The problems with your data are not to do with the statistical test you are using nor with the package you are using (limma or edgeR or DESeq2). Nor is the fact that you have only 3 normal samples the main problem. The problem is basic data quality, which is something you need to take responsibility for.

score 2 · Answer 2 · 2018-05-25

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 5 days ago

United States

This has been asked and answered before, even recently. The methods are different, and when there is little power then you may not find the same genes at a given FDR. You shouldn’t run multiple statistical tests but just stick to one. There is no alternate analysis for low expression genes.

ADD COMMENT • link 7.6 years ago Michael Love 43k

0

Entering edit mode

Thank you very much

ADD REPLY • link 7.6 years ago Biologist ▴ 120

0

Entering edit mode

Hi Michael,

I have a very small cohort 28 cancer vs 3 Normal only. For this type of cohort for differential analysis do I need keep any special arguments while using deseq2?

ADD REPLY • link 7.6 years ago Biologist ▴ 120