Hello,
I have a question about intercept and LogFC in edgeR. I am exactly following the article. From reads to genes to pathways...the edgeR quasi-likelihood pipeline. I also read the article A guide to creating design matrices for gene expression experiments but I am still confused about how intercept is associated with LogFC values and other statistics.
## model without intercept
y0 <- DGEList(GenewiseCounts[,-1], group=group,
genes=GenewiseCounts[,1,drop=FALSE]
)
y0$genes$Symbol <- mapIds(org.Mm.eg.db, rownames(y0),
keytype="ENTREZID", column="SYMBOL")
y0 <- y0[!is.na(y$genes$Symbol), ]
keep <- rowSums(cpm(y0) > 0.5) >= 2
y0 <- y0[keep, , keep.lib.sizes=FALSE]
y0 <- calcNormFactors(y0)
design_0 <- model.matrix(~0+group)
colnames(design_0) <- levels(group)
design_0
y0 <- estimateDisp(y0, design_0, robust=TRUE)
fit0 <- glmQLFit(y0, design_0, robust=TRUE)
B.LvsP0 <- makeContrasts(B.lactating-B.pregnant, levels=design_0)
res0 <- glmQLFTest(fit0, contrast=B.LvsP0)
## model with intercept
y1 <- DGEList(GenewiseCounts[,-1], group=group,
genes=GenewiseCounts[,1,drop=FALSE]
)
options(digits = 3)
y1$genes$Symbol <- mapIds(org.Mm.eg.db, rownames(y1),
keytype="ENTREZID", column="SYMBOL")
y1 <- y1[!is.na(y1$genes$Symbol), ]
keep <- rowSums(cpm(y1) > 0.5) >= 2
y1 <- y1[keep, , keep.lib.sizes=FALSE]
y1 <- calcNormFactors(y1)
design1 <- model.matrix(~ + group)
colnames(design1) <- levels(group)
y1 <- estimateDisp(y1, design1, robust=TRUE)
fit1 <- glmQLFit(y1, design1, robust=TRUE)
B.LvsP1 <- makeContrasts(B.lactating-B.pregnant, levels=design1)
res1 <- glmQLFTest(fit1, contrast=B.LvsP1)
My design matrix without intercept
> design_0
B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
1 0 0 1 0 0 0
2 0 0 1 0 0 0
3 0 1 0 0 0 0
4 0 1 0 0 0 0
5 1 0 0 0 0 0
6 1 0 0 0 0 0
7 0 0 0 0 0 1
8 0 0 0 0 0 1
9 0 0 0 0 1 0
10 0 0 0 0 1 0
11 0 0 0 1 0 0
12 0 0 0 1 0 0
My design matrix with intercept
> design1
B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
1 1 0 1 0 0 0
2 1 0 1 0 0 0
3 1 1 0 0 0 0
4 1 1 0 0 0 0
5 1 0 0 0 0 0
6 1 0 0 0 0 0
7 1 0 0 0 0 1
8 1 0 0 0 0 1
9 1 0 0 0 1 0
10 1 0 0 0 1 0
11 1 0 0 1 0 0
12 1 0 0 1 0 0
When I print out top DE genes using topTags, they are different.
## model without intercept
> topTags(res0)
Coefficient: 1*B.lactating -1*B.pregnant
Length Symbol logFC logCPM F PValue FDR
12992 765 Csn1s2b 6.09 10.18 421 4.86e-11 7.68e-07
211577 2006 Mrgprf 5.15 2.74 345 1.30e-10 8.05e-07
226101 7094 Myof 2.32 6.44 322 1.97e-10 8.05e-07
381290 8292 Atp2b4 2.14 6.14 320 2.04e-10 8.05e-07
140474 11281 Muc4 -7.17 6.05 308 2.64e-10 8.34e-07
231830 3346 Micall2 -2.25 5.18 282 4.47e-10 1.18e-06
24117 2242 Wif1 -1.82 6.76 260 7.30e-10 1.65e-06
12740 1812 Cldn4 -5.32 9.87 298 8.88e-10 1.71e-06
21953 667 Tnni2 5.75 3.86 313 9.76e-10 1.71e-06
231991 2873 Creb5 2.57 4.87 241 1.16e-09 1.83e-06
## model with intercept
> topTags(res1)
Coefficient: 1*B.lactating -1*B.pregnant
Length Symbol logFC logCPM F PValue FDR
328451 3224 Gm5088 -15.9 6.11 5179 5.47e-18 1.2e-15
107951 3389 Cdk9 -15.4 5.83 5048 6.43e-18 1.2e-15
433806 462 Chchd2-ps -14.6 5.87 4971 7.09e-18 1.2e-15
18570 1524 Pdcd6 -15.0 5.95 4900 7.77e-18 1.2e-15
67397 1243 Erp29 -14.4 6.26 4859 8.20e-18 1.2e-15
66046 999 Ndufb5 -15.1 6.14 4799 8.86e-18 1.2e-15
72047 4012 Ddx42 -14.3 6.19 4796 8.90e-18 1.2e-15
18141 4749 Nup50 -15.4 6.00 4790 8.97e-18 1.2e-15
66286 770 Sec11c -16.3 5.96 4730 9.72e-18 1.2e-15
21376 1565 Tbrg1 -14.2 6.08 4711 9.97e-18 1.2e-15
My understanding of the model with intercept is, in this case, the expression of B.lactating is the intercept term and each estimated coefficient is differences between the reference level (B.lactating) and each level (e.g. B.pregnant ... L.virgin). Whereas, the model without intercept is each estimated coefficient is the mean expression of each level.
I am not sure but I think LogFC seems to be log2(B.lactating / B.pregnant) in the model without intercept. But I can't even guess how how LogFC is calculated in the model with intercept. So could anyone explain me why the two models print out different top DE genes in terms of their statistics? (e.g. FDR, logFC...)
Sorry about my English. Have a great day!
The link you've entered for "A guide to creating design matrices for gene expression experiments" doesn't work. Could you please correct the URL.
Oops. I just corrected the URL. Thank you Dr. Smyth.