Question

single continuous factor

ashley.lu • 0

@ashleylu-15735

Last seen 6.6 years ago

Dear all,

I have a question about how to use a design matrix in edgeR to perform differential expression analysis on a continuous factor.

Here is the description of my data. I have one continuous factor measuring the level of protein A by immunostaining.

The question I would like to ask is:

Are there any genes that are differentially expressed (strongly affected by protein A) according to the level of protein A measured by immunostaining?

Here is what the data looks like,

> protein=c(rnorm(n = 10,mean = 10,1),rnorm(n = 10,mean = 5,1.5))
> design1=model.matrix(~protein)
> head(design1)
  (Intercept)   protein
1           1 11.165201
2           1  9.504538
3           1 10.516862
4           1 11.914443
5           1 10.842974
6           1 10.311306
> design2=model.matrix(~0+protein)
> head(design2)
    protein
1 11.165201
2  9.504538
3 10.516862
4 11.914443
5 10.842974
6 10.311306

I would like to understand the differences and underlying assumptions between these two design matrices.

Does design2 assume that when the expression of protein A is 0, the gene expression level is also zero?

Whereas design1 makes the more correct assumption that, for each gene, there will be one estimated expression level at zero protein A?

In addition, I will run glmQLFTest(fit_design1a, coef=2) to conduct the differential expression analysis, but I am not sure how to interpret the logFC it reports. Since the factor here is continuous, do we still interpret it as a logFC?

For example, for a gene Matal1 that is significant with logFC 0.40409621, do I interpret this as the expression level of Matal1 increasing by 0.4 log-fold change for every unit increase in protein A?

edger design matrix continuous • 2.1k views

ADD COMMENT • link updated 3.2 years ago by Tongjun • 0 • written 6.9 years ago by ashley.lu • 0

score 2 · Accepted Answer · 2018-05-15

Aaron Lun ★ 28k

@alun

Last seen 4 hours ago

The city by the bay

Does design2 assume that when the expression of protein A is 0, the gene expression level is also zero?

Yes, design2 assumes that when your staining intensity for protein A is zero, the log-average count is also zero, i.e., the expected count is 1.

Whereas design1 makes the more correct assumption that, for each gene, there will be one estimated expression level at zero protein A?

Yes, this is handled by the intercept, which accommodates some non-zero log-average count at zero staining.
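To make the role of the intercept concrete, here is a minimal base-R sketch. It is an illustration under assumptions, not edgeR itself: a single simulated gene with Poisson counts, a made-up true log2 slope of 0.4 and a made-up baseline of 6 on the log2 scale, fitted with glm() rather than glmQLFit(). The variable names are hypothetical.

```r
# Simulated illustration (all numbers are hypothetical): one gene whose
# log2 expected count is 6 + 0.4 * protein.
set.seed(1)
protein <- c(rnorm(10, mean = 10, sd = 1), rnorm(10, mean = 5, sd = 1.5))
mu <- 2^(6 + 0.4 * protein)
counts <- rpois(length(protein), lambda = mu)

# Same structure as design1: intercept plus slope.
fit1 <- glm(counts ~ protein, family = poisson)

# Same structure as design2: slope only, so the log expected count is
# forced to be 0 (expected count 1) when protein is 0.
fit2 <- glm(counts ~ 0 + protein, family = poisson)

# glm() reports natural-log coefficients; divide by log(2) for log2 units.
slope1 <- unname(coef(fit1)["protein"]) / log(2)
slope2 <- unname(coef(fit2)["protein"]) / log(2)

slope1  # close to the simulated slope of 0.4
slope2  # much larger, because the slope has to absorb the missing baseline
```

In this simulation the no-intercept fit does not estimate the association at all; it mostly reflects the baseline expression level divided by the typical protein value.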

Since the factor here is continuous, do we still interpret it as a logFC?

Yes, it's the log-fold change in expression for every unit of increase in protein A staining.

ADD COMMENT • link 6.9 years ago Aaron Lun ★ 28k

I have a similar question about the logFC for a continuous variable. Is the logFC the log2(coefficient of the variable)? If the logFC is negative, does it mean that gene expression decreases with each unit increase of the variable? How do I get the original coefficient? My goal is to get the negative/positive association between gene expression and the variable. Thanks!

ADD REPLY • link 3.2 years ago Tongjun • 0

The logFC is the original coefficient. It's just a regular regression coefficient and negative values do mean inverse relationships. So it is what you want, the negative/positive association between the (log) gene expression and the variable.

ADD REPLY • link 3.2 years ago Gordon Smyth 52k

Thank you so much! So the logFC is the original coefficient. If I want the coefficient on the original expression scale (not log gene expression), it would be 2^logFC. Am I right?

ADD REPLY • link 3.2 years ago Tongjun • 0

Exponentiating the regression coefficient, i.e., computing 2^logFC, gives you the fold change in expression (instead of the log2-fold change) corresponding to a unit change in your continuous covariate; see:

https://stats.stackexchange.com/questions/487563/term-for-expbeta-from-a-gamma-glm/507795#507795

If you convert to 2^logFC then obviously there will no longer be any negative values and the direction of change may become harder to interpret.
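As a small worked example of that conversion, the sketch below uses the logFC value quoted in the original question. The 5-unit difference is an arbitrary choice for illustration, and the variable names are hypothetical.

```r
# logFC per unit of protein A staining, as quoted in the question.
logFC <- 0.40409621

# Fold change in expression for a one-unit increase in the covariate.
fc_per_unit <- 2^logFC

# Log-fold changes add across units, so fold changes multiply:
# a difference of k units corresponds to 2^(k * logFC).
k <- 5
fc_k_units <- 2^(k * logFC)

# A negative logFC of the same size gives a fold change between 0 and 1,
# i.e. a decrease, which is why the sign is easier to read on the log scale.
fc_decrease <- 2^(-logFC)

round(c(fc_per_unit, fc_k_units, fc_decrease), 3)
```

So a logFC of about 0.4 per unit means roughly a 1.3-fold increase in expression for each additional unit of staining, under the log-linear model being fitted.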

ADD REPLY • link 3.2 years ago Gordon Smyth 52k

I see. Thank you for your explanation in detail!

ADD REPLY • link 3.2 years ago Tongjun • 0

score 1 · Accepted Answer · 2018-05-16

Gordon Smyth 52k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

Ashley,

design1 is the standard design matrix for simple linear regression, so it is the one to use. design2 is the design matrix for "regression through the origin", which (as you have guessed) is not likely to be what you want.
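For reference, a minimal sketch of how design1 might be used end to end with glmQLFTest(), as mentioned in the question. Everything about the data here is an assumption made for illustration: the counts are simulated from a negative binomial distribution, the first 10 of 200 hypothetical genes are given a log2 slope of 0.4, and object names such as `counts`, `fit` and `res` are made up. The block only runs if edgeR is installed.

```r
if (requireNamespace("edgeR", quietly = TRUE)) {
  library(edgeR)
  set.seed(1)

  # Continuous covariate, simulated as in the question.
  protein <- c(rnorm(10, mean = 10, sd = 1), rnorm(10, mean = 5, sd = 1.5))

  # Hypothetical counts: 200 genes x 20 samples; genes 1-10 respond to
  # protein with a log2 slope of 0.4, the rest do not respond.
  ngenes <- 200
  slope <- c(rep(0.4, 10), rep(0, ngenes - 10))
  mu <- 2^(6 + outer(slope, protein))
  counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = ngenes)
  rownames(counts) <- paste0("gene", seq_len(ngenes))

  # Standard regression design: intercept plus slope for protein.
  design1 <- model.matrix(~protein)

  y <- DGEList(counts = counts)
  y <- calcNormFactors(y)
  y <- estimateDisp(y, design1)
  fit <- glmQLFit(y, design1)

  # coef = 2 tests the protein column; logFC is the log2-fold change in
  # expression per unit increase in protein.
  res <- glmQLFTest(fit, coef = 2)
  print(topTags(res))
}
```

With real data, only the `counts` matrix and `protein` vector would change; the design and the `coef = 2` test stay the same.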

ADD COMMENT • link 6.9 years ago Gordon Smyth 52k