the design matrix in edgeR users Guide
4
0
Entering edit mode
yifan.mo • 0
@yifanmo-9773
Last seen 8.2 years ago

I am reading the latest version of edgeR user guide (https://www.bioconductor.org/packages/3.3/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)

My question is about the the design matrix in 3.2.5. Here it said "Now the first coefficient will measure the baseline logCPM expression level in the first treatment condition (here group A)".

However, I think the intercept is not for the treatment A if you set all of the elements to 1.  In ANOVA treatment effect model, if you use the intercept item it is for the overall mean not for a specific treatment. And also the design matrix is not like that in 3.2.5. 

From my understanding, the design matrix here is just the linear regression model while using the dummy variable. So the coef of groupB is not for the comparison of B and the baseline.  

Did I misunderstand anything. Please comments.

edgeR • 4.6k views
ADD COMMENT
0
Entering edit mode

The intercept item should not bet he average of one treatment. It should be the overall mean of all the treatment. There are two traditional model in ANOVA, one is called cell mean model and the other is treatment effect model.....

ADD REPLY
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 3 hours ago
The city by the bay

There's nothing wrong with the user's guide. The intercept represents the average log-expression in group A, as it is the only non-zero term in the linear system for samples in group A. groupB then represents the log-fold change of B over A, such that the sum of the intercept and groupB gives the average log-expression in group B. The same applies for groupC and the lone sample in group C.

I don't know what you're referring to with an "ANOVA treatment effect model", but if you have a different design matrix, then the interpretation of the coefficients will obviously be different.

ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 23 minutes ago
WEHI, Melbourne, Australia

You have misunderstood how linear models are parametrized in R. If you try fitting a linear model yourself in R, using lm(y~group) say, you will see that the claims you are making are incorrect.

Here is a little example:

> y <- c(1,2,3,5,6,7)
> group <- factor(c(1,1,1,2,2,2))
> fit <- lm(y~group)
> coefficients(summary(fit))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        2      0.577    3.46  0.02572
group2             4      0.816    4.90  0.00805

You see that the intercept is the mean of the first group (mean of 1,2,3) and the second coefficient is the difference between the means of the two groups.

We can change the contrast parametrization to get the classical parametrization where the intercept is the grand mean:

> contrasts(group) <- contr.sum(levels(group))
> fit <- lm(y~group)
> coefficients(summary(fit))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        4      0.408     9.8 0.000608
group1            -2      0.408    -4.9 0.008050

On the other hand, we can go back to the default parametrization in R for which the intercept is the mean of the first group:

> contrasts(group) <- contr.treatment(levels(group))
> fit <- lm(y~group)
> coefficients(summary(fit))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)        2      0.577    3.46  0.02572
group2             4      0.816    4.90  0.00805

To get the so-called "cell means" parameterization, we would remove the intercept:

> fit <- lm(y~0+group)
> coefficients(summary(fit))
       Estimate Std. Error t value Pr(>|t|)
group1        2      0.577    3.46 0.025721
group2        6      0.577   10.39 0.000484

There are other possible parametrizations available in R: contr.sum and contr.treatment are not the only possibilities. Apart from the cell means model, the parametrizations are all equivalent in that they give the same anova table.

It is not clear what you mean by a "treatment effect model" because that is not a standard term in statistics. In R, the "treatment" contrast parametrization is the default whereby the intercept is the fitted value for the first group and the other coefficients are relative to the first group. In Stata software, a "treatment effects model" is a special purpose model for binary data, see:

  http://www.stata.com/products/stb/journals/stb55.pdf

Obviously that doesn't have a lot of relevance to oneway anova.

ADD COMMENT
0
Entering edit mode
yifan.mo • 0
@yifanmo-9773
Last seen 8.2 years ago

Thank you very much Prof.Smyth! I am sorry a mistake I made here. The term should be "factor effects model" rather than "treatment effects model". In factor effects model the intercept is for the grand mean and the other covariates are for the factor effects compared to the overall mean. Here in R, obviously is different definition. 

And another question is :

How to add a continuous "batch effect" in edgeR, such as "age". If I just include "age" in the design matrix, considering we have 3 levels of the treatment,  the intercept should no longer be the ref treatment (treatment1) mean. Thus, the coef of treatment2 and treatment3 are no longer their comparisons to the treatment1. Then how can I add this "batch effect" in edgeR? 

Thank you very much!

ADD COMMENT
0
Entering edit mode

OK, I see. Except for cell means, all the R models are "factor effects models". The different parametrizations simply correspond to different constraints applied to the factor effects.

Regarding the batch effect, you can in fact simply add "age" as a column of the design matrix. The coefficients for treatment2 and treatment3 will still estimate the logFC of between those treatments and treatment1. This is analogous to analysis of covariance.

ADD REPLY
0
Entering edit mode
yifan.mo • 0
@yifanmo-9773
Last seen 8.2 years ago

Thanks for instruction. One more comment: If we refer to it as ANCOVA, I think we need to subtract the age with mean(age in treatment1) otherwise the intercept item has different meaning. Recall if we refer intercept to overall mean, when implement the ANCOVA ,we need minus the continuous covariate with its overall mean of all the samples...

ADD COMMENT
0
Entering edit mode

The treatment2 or treatment3 coefficients will remain the same whether or not the 'age' covariate is mean corrected.

What you say regarding the intercept is correct, but it doesn't make any difference in practice to the analysis because one should not be testing hypotheses about the intercept anyway.

ADD REPLY

Login before adding your answer.

Traffic: 676 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6