Question

LogFC values in edgeR - w/wo intercept

1

Entering edit mode

JK Kim ▴ 20

@jk-kim-14168

Last seen 12 months ago

United States

Hello,

I have a question about intercept and LogFC in edgeR. I am exactly following the article. From reads to genes to pathways...the edgeR quasi-likelihood pipeline. I also read the article A guide to creating design matrices for gene expression experiments but I am still confused about how intercept is associated with LogFC values and other statistics.


## model without intercept
y0 <- DGEList(GenewiseCounts[,-1], group=group,
               genes=GenewiseCounts[,1,drop=FALSE]
             )

y0$genes$Symbol <- mapIds(org.Mm.eg.db, rownames(y0),
                             keytype="ENTREZID", column="SYMBOL")

y0 <- y0[!is.na(y$genes$Symbol), ]
keep <- rowSums(cpm(y0) > 0.5) >= 2
y0 <- y0[keep, , keep.lib.sizes=FALSE]
y0 <- calcNormFactors(y0)

design_0 <- model.matrix(~0+group)
colnames(design_0) <- levels(group)
design_0
y0 <- estimateDisp(y0, design_0, robust=TRUE)
fit0 <- glmQLFit(y0, design_0, robust=TRUE)
B.LvsP0 <- makeContrasts(B.lactating-B.pregnant, levels=design_0)
res0 <- glmQLFTest(fit0, contrast=B.LvsP0)



## model with intercept
y1 <- DGEList(GenewiseCounts[,-1], group=group,
               genes=GenewiseCounts[,1,drop=FALSE]
             )
options(digits = 3)

y1$genes$Symbol <- mapIds(org.Mm.eg.db, rownames(y1),
                             keytype="ENTREZID", column="SYMBOL")
y1 <- y1[!is.na(y1$genes$Symbol), ]
keep <- rowSums(cpm(y1) > 0.5) >= 2

y1 <- y1[keep, , keep.lib.sizes=FALSE]
y1 <- calcNormFactors(y1)

design1 <- model.matrix(~ + group)
colnames(design1) <- levels(group)

y1 <- estimateDisp(y1, design1, robust=TRUE)
fit1 <- glmQLFit(y1, design1, robust=TRUE)

B.LvsP1 <- makeContrasts(B.lactating-B.pregnant, levels=design1)
res1 <- glmQLFTest(fit1, contrast=B.LvsP1)

My design matrix without intercept

> design_0
   B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
1            0          0        1           0          0        0
2            0          0        1           0          0        0
3            0          1        0           0          0        0
4            0          1        0           0          0        0
5            1          0        0           0          0        0
6            1          0        0           0          0        0
7            0          0        0           0          0        1
8            0          0        0           0          0        1
9            0          0        0           0          1        0
10           0          0        0           0          1        0
11           0          0        0           1          0        0
12           0          0        0           1          0        0

My design matrix with intercept

> design1
   B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
1            1          0        1           0          0        0
2            1          0        1           0          0        0
3            1          1        0           0          0        0
4            1          1        0           0          0        0
5            1          0        0           0          0        0
6            1          0        0           0          0        0
7            1          0        0           0          0        1
8            1          0        0           0          0        1
9            1          0        0           0          1        0
10           1          0        0           0          1        0
11           1          0        0           1          0        0
12           1          0        0           1          0        0

When I print out top DE genes using topTags, they are different.

## model without intercept
> topTags(res0)
Coefficient:  1*B.lactating -1*B.pregnant 
       Length  Symbol logFC logCPM   F   PValue      FDR
12992     765 Csn1s2b  6.09  10.18 421 4.86e-11 7.68e-07
211577   2006  Mrgprf  5.15   2.74 345 1.30e-10 8.05e-07
226101   7094    Myof  2.32   6.44 322 1.97e-10 8.05e-07
381290   8292  Atp2b4  2.14   6.14 320 2.04e-10 8.05e-07
140474  11281    Muc4 -7.17   6.05 308 2.64e-10 8.34e-07
231830   3346 Micall2 -2.25   5.18 282 4.47e-10 1.18e-06
24117    2242    Wif1 -1.82   6.76 260 7.30e-10 1.65e-06
12740    1812   Cldn4 -5.32   9.87 298 8.88e-10 1.71e-06
21953     667   Tnni2  5.75   3.86 313 9.76e-10 1.71e-06
231991   2873   Creb5  2.57   4.87 241 1.16e-09 1.83e-06

## model with intercept
> topTags(res1)
Coefficient:  1*B.lactating -1*B.pregnant 
       Length    Symbol logFC logCPM    F   PValue     FDR
328451   3224    Gm5088 -15.9   6.11 5179 5.47e-18 1.2e-15
107951   3389      Cdk9 -15.4   5.83 5048 6.43e-18 1.2e-15
433806    462 Chchd2-ps -14.6   5.87 4971 7.09e-18 1.2e-15
18570    1524     Pdcd6 -15.0   5.95 4900 7.77e-18 1.2e-15
67397    1243     Erp29 -14.4   6.26 4859 8.20e-18 1.2e-15
66046     999    Ndufb5 -15.1   6.14 4799 8.86e-18 1.2e-15
72047    4012     Ddx42 -14.3   6.19 4796 8.90e-18 1.2e-15
18141    4749     Nup50 -15.4   6.00 4790 8.97e-18 1.2e-15
66286     770    Sec11c -16.3   5.96 4730 9.72e-18 1.2e-15
21376    1565     Tbrg1 -14.2   6.08 4711 9.97e-18 1.2e-15

My understanding of the model with intercept is, in this case, the expression of B.lactating is the intercept term and each estimated coefficient is differences between the reference level (B.lactating) and each level (e.g. B.pregnant ... L.virgin). Whereas, the model without intercept is each estimated coefficient is the mean expression of each level.

I am not sure but I think LogFC seems to be log2(B.lactating / B.pregnant) in the model without intercept. But I can't even guess how how LogFC is calculated in the model with intercept. So could anyone explain me why the two models print out different top DE genes in terms of their statistics? (e.g. FDR, logFC...)

Sorry about my English. Have a great day!

design.matrix logFC intercept edgeR model.matrix • 2.0k views

ADD COMMENT • link 2.2 years ago JK Kim ▴ 20

0

Entering edit mode

The link you've entered for "A guide to creating design matrices for gene expression experiments" doesn't work. Could you please correct the URL.

ADD REPLY • link 2.2 years ago Gordon Smyth 51k

0

Entering edit mode

Oops. I just corrected the URL. Thank you Dr. Smyth.

ADD REPLY • link 2.2 years ago JK Kim ▴ 20

score 3 · Accepted Answer · 2022-05-25

3

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 25 minutes ago

WEHI, Melbourne, Australia

edgeR gives identical results whether you use an intercept or not. The two approaches between the intercept model and the no-intercept (group mean) models are explained in the edgeR and limma User's Guides.There is also a long tutorial here https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html, although you say you have already read that.

The basic idea is you use the group-means parametrization and then form a contrast, or you use the intercept model and don't need to form a contrast. You can't just plug in a different design matrix, keep all the other code the same, and expect to get the same results. Obviously you have to use the code appropriate to the design matrix you use. Amongst other things, it makes no sense to create two different design matrices but with the same column names. That could never be right.

ADD COMMENT • link 2.2 years ago Gordon Smyth 51k

0

Entering edit mode

Thank you, Dr. Smyth. So it doesn't make sense to form a contrast for the intercept model, does it?

ADD REPLY • link 2.2 years ago JK Kim ▴ 20

1

Entering edit mode

You can still form contrasts in some circumstances, but not the contrast you used. In your case there is no need for a contrast. For the intercept model, you just use a coefficient instead of a contrast:

res1 <- glmQLFTest(fit1, coef=2)

This is explained in some length in the design matrices guide. The creation of design matrices from factors is basic R, not unique to edgeR or limma.