Search
Question: Why DESeq2 and edgeR give different results?
0
gravatar for Biologist
12 weeks ago by
Biologist70
Biologist70 wrote:

Dear All,

I have a matrix "U2" with 44 samples (41 Disease + 3 Normal) as columns and approx. 15k genes as rows. "coldata" is where samples as rows and columns "Subtypes" and  "Type". Column "Type" is with Disease and Normal.

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = U2, colData = coldata, design = ~ Type)
dds

class: DESeqDataSet 
dim: 15754 44 
metadata(1): version
assays(1): counts
rownames(15754): A1BG-AS1 A2M-AS1 ... ZSWIM8-AS1 hsa-mir-1253
rowData names(0):
colnames(44): AT PT ... CT OT
colData names(2): Subtypes Type

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

nrow(dds)
[1] 11755

dds$Type <- factor(dds$Type, levels = c("Normal","Disease"))
dds <- DESeq(dds)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 1206 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

res <- results(dds)
res

res <- results(dds, name="Type_Disease_vs_Normal")
res <- results(dds, contrast=c("Type","Disease","Normal"))
resultsNames(dds)

[1] "Intercept"           "Type_Disease_vs_Normal"

Differential analysis:

res05 <- results(dds, lfcThreshold = 1.2)
table(res05$padj < 0.05)
summary(res05)

out of 11755 with nonzero total read count
adjusted p-value < 0.1
LFC > 1.20 (up)    : 2, 0.016%
LFC < -1.20 (down) : 0, 0%
outliers [1]       : 72, 0.57%
low counts [2]     : 33, 0.26%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

With DeSeq2 I see only two upregulated and no downregulated genes using lofFC 1.2 and FDR < 0.05.

I also used same data with edgeR.

library(edgeR)
group <- factor(paste0(samples2$Type))
y <- DGEList(U2,group = group)

design2 <- model.matrix(~ 0 + group)
colnames(design2) <- c("Normal","Disease")
design2
keep <- filterByExpr(y, design2)
keep <- rowSums(cpm(y) > 0.5) >= 1

table(keep)
keep
FALSE  TRUE 
 1573 14181

summary(keep)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y,method = "TMM")

y <- estimateDisp(y, design2, robust=TRUE)

y$common.dispersion
[1] 1.428964

fit <- glmQLFit(y, design2, robust=TRUE)
head(fit$coefficients)

contrast.matrix <- makeContrasts(Disease-Normal, levels=design2)
contrast.matrix

qlf <- glmQLFTest(fit, contrast=contrast.matrix)
topTags(qlf)
summary(decideTestsDGE(qlf))

tr <- glmTreat(fit, contrast=contrast.matrix, lfc=log2(1.2))
topTags(tr)
tab2 <- topTags(tr,n=Inf)
keep <- tab2$table$FDR <= 0.05

summary(decideTestsDGE(tr))
       -1*Normal 1*Disease
Down                203
NotSig            13974
Up                    4

Questions:

1) Is my analysis with DeSeq2 right? I feel so because it doesn't show any down regulated genes.

2) with deseq2 there are two upregulated and with edgeR there are four. And none of them are same.

3) My analysis is with mainly lncRNAs. These have very very low expression values. So, Is there a better way to do differential analysis?

Thank you

 

ADD COMMENTlink modified 12 weeks ago by Gordon Smyth34k • written 12 weeks ago by Biologist70
3
gravatar for Gordon Smyth
12 weeks ago by
Gordon Smyth34k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth34k wrote:

Dear Biologist,

I and my colleagues have answered a few questions from you over the past year or so, so I'm going to take the liberty of giving some advice about DE, data quality and data exploration.

Filtering

First a comment on filtering. I replied to a series of questions from you on filtering for edgeR, so I am surprised that you haven't followed the advice that we gave you. Instead you've kept thousands of genes in your edgeR analysis that have only one non-zero count or have fewer than 10 read counts in total. What do you hope to learn from genes with such sparse counts? If you publish results for genes with data like that, it will cast doubt on your scientific accuracy.

Sequencing depth

You haven't said anything about sequencing depth in any of your posts. I am guessing that the library sizes for your data are relatively small. If you want to learn about genes or lncRNAs at low expression levels, then obviously you need to sequence at sufficient depth to allow those RNAs to be reliably detected and to generate sufficient read counts for meaningful analysis.

Variability

Your replicate libraries are tremendously heterogeneous and inconsistent. edgeR tells you that the biological coefficient of variation of your replicates is well over 100%. You have to be cautious making conclusions from data that is as variable as this, when samples in the same disease group are so different from one another.

Have you made any plots of your data to examine the variability? plotMDS(y) would an obvious place to start. Do the disease and normal groups separate from one another? Are there any recognizable outlier samples that could be removed or down-weighted? Are there any batch effects that you have not accounted for? Are there disease subtypes that you have not accounted for? Do you need to look for hidden batch effects (by searching for surrogate variables or running one of the RUV algorithms)? Or is your data perhaps just poor quality? You don't mention having examined any of these things.

It's not the DE method

The problems with your data are not to do with the statistical test you are using nor with the package you are using (limma or edgeR or DESeq2). Nor is the fact that you have only 3 normal samples the main problem. The problem is basic data quality, which is something you need to take responsibility for.

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by Gordon Smyth34k
2
gravatar for Michael Love
12 weeks ago by
Michael Love19k
United States
Michael Love19k wrote:

This has been asked and answered before, even recently. The methods are different, and when there is little power then you may not find the same genes at a given FDR. You shouldn’t run multiple statistical tests but just stick to one. There is no alternate analysis for low expression genes.

 

ADD COMMENTlink written 12 weeks ago by Michael Love19k

Thank you very much

ADD REPLYlink written 12 weeks ago by Biologist70

Hi Michael,

I have a very small cohort 28 cancer vs 3 Normal only. For this type of cohort for differential analysis do I need keep any special arguments while using deseq2?

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Biologist70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 210 users visited in the last hour