Question

Why Differential expression results differ with edgeR and limma ?

0

Entering edit mode

Biologist ▴ 110

@biologist-9801

Last seen 4.1 years ago

Hello,

I have counts data with 366 samples as columns and more than 50k ensembl ids as rownames. The data is stored in a dataframe "U".
Data looks like following:

S1 S2 S3
ENSG00000000003.13 5745 8742 7257
ENSG00000000005.5 4 1 0
ENSG00000000419.11 1394 2211 615
ENSG00000000457.12 344 1166 692
ENSG00000000460.15 405 480 258
ENSG00000000938.11 144 132 110

table(df$Status)

F2 F3 F4 NE
125 220 20 1

Initially I have used limma, Glimma, edgeR for differential analysis which is mentioned in this link (http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html) I followed the same steps with my data.

library(limma)
library(Glimma)
library(edgeR)

mycpm <- cpm(U)
head(mycpm)
thresh <- mycpm > 0.5
head(thresh)
keep <- rowSums(thresh) >= 2

table(keep)
summary(keep)
counts.keep <- U[keep,]
dim(counts.keep)
y <- DGEList(counts.keep)
y

y <- calcNormFactors(y,method = "TMM")

group <- factor(paste0(df$Status))

design2 <- model.matrix(~ 0 + group)

contrast.matrix <- makeContrasts(F2-F3, levels=design2)
contrast.matrix

Contrasts
Levels F2 - F3
F2 1
F3 -1
F4 0
NE 0

v <- voom(y, design2, plot=TRUE)

fit <- lmFit(v, design2)

fitC <- contrasts.fit(fit, contrast.matrix)
fitC <- eBayes(fitC)
dim(fitC)
tab <- topTable(fitC,adjust="BH",n=Inf)

head(tab)
logFC AveExpr t P.Value adj.P.Val B
ENSG00000129749.3 0.4809976 -2.100567 5.158856 4.079325e-07 0.01019219 4.776599
ENSG00000121274.11 -0.3031990 3.152421 -4.356722 1.716605e-05 0.10013460 2.675503
ENSG00000185973.9 -0.2884105 2.409274 -4.247311 2.746711e-05 0.10013460 2.212433
ENSG00000176170.12 1.0325463 1.930914 4.233827 2.908560e-05 0.10013460 2.157469
ENSG00000132744.6 -0.8300652 3.152324 -4.218580 3.102526e-05 0.10013460 2.146414
ENSG00000172943.17 -0.3884991 4.481074 -4.205040 3.285071e-05 0.10013460 2.125056

summary(decideTests(fitC))

F2 - F3
-1 0
0 24984
1 1

So with this I found only one differentially expressed gene (based on adj.P.Val < 0.05)

################################################################################################

I also tried using edgeR with the same data. I used the same code mentioned in edgeR tutorial for differential expression with featurecounts data.

library(edgeR)

group <- factor(paste0(df$Status))
y <- DGEList(U, group=group)

keep <- rowSums(cpm(y) > 0.5) >= 2

table(keep)
summary(keep)

Mode FALSE TRUE NA's

logical 35503 24985 0

y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y, method = "TMM")

design2 <- model.matrix(~ 0 + group)
colnames(design2) <- c("F2","F3","F4","NE")
design2

y <- estimateDisp(y, design2, robust=TRUE)

y$common.dispersion

[1] 1.030704

fit <- glmQLFit(y, design2, robust=TRUE)
head(fit$coefficients)

F2 F3 F4 NE
ENSG00000000003.13 -9.614857 -9.514155 -9.219875 -9.388728
ENSG00000000005.5 -17.008674 -16.908788 -16.166986 -16.630297
ENSG00000000419.11 -10.940426 -10.914407 -10.898501 -10.804819
ENSG00000000457.12 -11.582309 -11.493746 -11.423292 -12.203859
ENSG00000000460.15 -12.362215 -12.205290 -11.947569 -12.040664
ENSG00000000938.11 -12.514476 -12.640490 -13.160899 -13.074225

contrast.matrix <- makeContrasts(F2-F3, levels=design2)
contrast.matrix

Contrasts
Levels F2 - F3
F2 1
F3 -1
F4 0
NE 0

qlf <- glmQLFTest(fit, contrast=contrast.matrix)

topTags(qlf)

Coefficient: 1*F2 -1*F3

logFC logCPM F PValue FDR
ENSG00000124232.9 7.351170 0.86857907 278.9784 6.083121e-47 1.519868e-42
ENSG00000102195.8 7.077038 -0.11326609 236.7369 1.532441e-41 1.914402e-37
ENSG00000165066.12 6.034396 -1.42387186 224.2053 7.269869e-40 6.054590e-36
ENSG00000138161.11 3.629804 0.03880203 210.2262 5.937722e-38 3.242534e-34
ENSG00000164434.10 5.533205 -0.92141498 209.9531 6.488962e-38 3.242534e-34
ENSG00000256463.7 5.278677 -2.52333685 201.0082 1.152294e-36 4.798346e-33
ENSG00000233887.1 4.054302 -4.49118571 200.4813 3.120264e-36 1.113711e-32
ENSG00000196431.3 4.767828 -1.79054718 192.6636 1.759004e-35 5.173259e-32
ENSG00000280109.1 5.872673 0.25250281 192.4883 1.863491e-35 5.173259e-32
ENSG00000180053.7 3.349894 -4.85962856 193.4720 2.948655e-35 7.367214e-32

summary(decideTestsDGE(qlf))
[,1]
-1 949
0 22517
1 1519

tr <- glmTreat(fit, contrast=contrast.matrix, lfc=log2(1.2))

tab2 <- topTags(tr,n=Inf)
keep <- tab2$table$FDR <= 0.05
summary(decideTestsDGE(tr))

[,1]
-1 537
0 23545
1 903

So from this 1440 (537 downregulated and 903 upregulated) genes are detected as Differentially expressed genes with fold-change significantly above 1.2 at an FDR cut-off of 5%.

I would like to know why there is lot of difference between these two ways for differential analysis?

In the first way (limma,Glimma,edgeR) I found only one differentially expressed gene. In this process data will be transformed into Voom. Is it because of that? From few days back I'm trying in this way but couldn't find more than 1 differentially expressed gene.

But now I used edgeR which is mentioned in tutorial from feature counts. I used above code and got around 1440 DE genes.

Can anyone clear my doubt? Really appreciate your kind help.

Thank you

Venkatesh

limma edgeR differential gene expression r • 2.3k views

ADD COMMENT • link 6.7 years ago Biologist ▴ 110

1

Entering edit mode

I'd say this is definitely an unexpected result. limma and edgeR generally have pretty good agreement on which genes are significant, and even when they don't, they should still give similar logFC and logCPM/AveExpr values. The fact that edgeR is reporting logFC values as high as 7 and limma is only reporting 0.48 makes me think that something is wrong in one or the other. I don't see any obvious mistakes in your code, so I'm not sure where the discrepancy has occurred, but I would advise you to run through your limma and edgeR code and check each step to make sure it matches your expectations. Perhaps The log transform was somehow run twice on the limma data, or something like that.

ADD REPLY • link 6.7 years ago Ryan C. Thompson ★ 7.9k

score 0 · Answer 1 · 2017-08-22

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 3 hours ago

The city by the bay

Echoing what Ryan has said, there is almost certainly a discrepancy somewhere in your underlying inputs to the two analyses. My guess would be that the design matrix is not the same between the two analyses. For example, your contrasts.fit output doesn't make sense; there's no F2 or F3 in your levels, only EG2 and EG3. I would suggest you clear your R session and re-run both analyses from scratch.

ADD COMMENT • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Sorry its not EG2 and EG3....Its F2 and F3. I edited just now. The results I got is correct.

ADD REPLY • link 6.7 years ago Biologist ▴ 110

0

Entering edit mode

Frankly, I find this hard to believe. Can you make a minimum working example that can be shared?

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

I have done the same two way analysis with some other data.

With the first way (limma, Glimma, edgeR) there are no differential expressed genes. In this way the differential expressed genes with topTable function are based on adj.P.Val <= 0.05. Am I right?

With the second way (edgeR) there are around 1021 differential expressed genes.

ADD REPLY • link 6.7 years ago Biologist ▴ 110

0

Entering edit mode

Do you have an example that I can run on my own computer?

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Please try with this data.

	S3	S1	S4	S2	S6	S7	S5
ENSG00000000003.13	5578	4300	4587	12247	3296	2532	6436
ENSG00000000005.5	0	3	5	1	0	1	0
ENSG00000000419.11	1681	669	1187	2149	600	1610	1391
ENSG00000000457.12	958	1579	394	1895	698	618	458
ENSG00000000460.15	480	363	215	188	162	237	253
ENSG00000000938.11	267	157	179	274	69	286	55
ENSG00000000971.14	148451	38530	38052	2165	20173	11477	3080
ENSG00000001036.12	5978	2427	2451	4097	3319	4301	10070
ENSG00000001084.9	5884	12598	2608	1426	1380	2127	8224

Sample.ID	Grade
S1	3
S2	2
S3	3
S4	2
S5	3
S6	3
S7	3

Please try with both the ways. In the above mentioned comments I said I found differential expressed genes with edgeR, but those genes are unexpected one. All the expected genes are not found in the results.

Please run with this data. Thank you

ADD REPLY • link 6.7 years ago Biologist ▴ 110

0

Entering edit mode

Let's load in the counts in counts.txt, and the annotation table in meta.txt:

U <- read.table("counts.txt", header=TRUE, row.names=1)
annot <- read.table("meta.txt", header=TRUE)
annot <- annot[match(colnames(U), annot$Sample.ID),]
design <- model.matrix(~factor(annot$Grade))
options(digits=3) # to fit onto the page

Using edgeR (skipping the filtering because there's not a lot of genes):

library(edgeR)
y <- DGEList(U)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
efit <- glmQLFit(y, design, robust=TRUE)
eres <- glmQLFTest(efit)
topTags(eres)

... gives me:

Coefficient:  factor(annot$Grade)3
                    logFC logCPM      F PValue   FDR
ENSG00000001084.9   1.784  16.92 3.5615 0.0656 0.590
ENSG00000000460.15  0.672  12.63 1.6239 0.2091 0.671
ENSG00000001036.12  0.926  16.79 1.1237 0.2948 0.671
ENSG00000000005.5  -1.661   6.32 1.1062 0.2985 0.671
ENSG00000000938.11 -0.406  11.97 0.6230 0.4341 0.671
ENSG00000000003.13 -0.643  16.98 0.5879 0.4472 0.671
ENSG00000000419.11 -0.345  14.86 0.3339 0.5662 0.728
ENSG00000000971.14  0.857  19.53 0.2095 0.6494 0.731
ENSG00000000457.12 -0.107  14.42 0.0349 0.8527 0.853

Using limma and voom, recycling y from the edgeR analysis above:

v <- voom(y, design)
vfit <- lmFit(v, design)
vfit <- eBayes(vfit)
topTable(vfit)[,1:5] # to fit on page

... gives me:

Removing intercept from test coefficients
                     logFC AveExpr       t P.Value adj.P.Val
ENSG00000000460.15  0.6459    12.5  1.2614   0.214     0.641
ENSG00000001084.9   1.0711    16.3  1.3655   0.179     0.641
ENSG00000000005.5  -1.4523     4.9 -1.2735   0.209     0.641
ENSG00000000003.13 -0.8300    16.8 -0.9151   0.365     0.657
ENSG00000000971.14  0.9345    18.6  0.6736   0.504     0.659
ENSG00000000938.11 -0.4965    11.8 -0.9596   0.342     0.657
ENSG00000001036.12  0.6233    16.6  0.6434   0.523     0.659
ENSG00000000419.11 -0.3520    14.8 -0.5491   0.586     0.659
ENSG00000000457.12  0.0125    14.2  0.0222   0.982     0.982

So the two are pretty consistent on my computer (edgeR 3.18.1); no genes are detected as DE at a FDR of 5% in either case, and the log-fold changes are generally pretty similar.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

In limma and voom.....the "y" you used is after Normalization or estimateDip?

ADD REPLY • link 6.7 years ago Biologist ▴ 110

0

Entering edit mode

After estimateDisp, but it shouldn't matter. voom will ignore the fields that estimateDisp adds.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Ok

And how should I give "design matrix" if it looks like following? I'm every time confused in creating design matrix.

Sample.ID	Grade
S1	3
S2	2
S3	3
S4	2
S5	4
S6	3
S7	NE

ADD REPLY • link 6.7 years ago Biologist ▴ 110

0

Entering edit mode

It is still a one-way layout, using ~Grade or ~0+Grade.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

Thanks a lot for helping me out. I found the bug in my code. "counts" and "annot" samples were same but they are not in the same order. I fixed that now.

I found differential expressed genes in both the ways (edgeR and limma-voom)

Thanks again !!

ADD REPLY • link 6.7 years ago Biologist ▴ 110