Question

Limma when to use zero intercept

0

Entering edit mode

robert.chen • 0

@robertchen-14677

Last seen 4.1 years ago

Hi everyone -

I was hoping to get your help understanding when it is appropriate to use a 0-intercept model. I have log2-transformed, quantile-normalized fluorescence values off of a series of microarrays. I would like to determine the effects of treatment X vs treatment Y on each gene measured in a patient at compared to baseline. I have 50 patients undergoing treatment X, and a different set of 50 patients undergoing treatment Y (n=50 in each arm, n=100 in the experiment). I have measurements for each patient at baseline, and two subsequent timepoints (I will call these timepoints 1 [baseline], 2, and 3), with timepoint 3 being the end of the study.

My mapping data looks like the following:

                PID Treatment Timepoint Treat_Time
P01C5553301 P01C555         Y         1        Y_1
P01C5553303 P01C555         Y         3        Y_3
P01C6663301 P01C666         X         1        X_1
P01C6663303 P01C666         X         3        X_3
P01C7773301 P01C777         X         1        X_1
P01C7773303 P01C777         X         3        X_3

To set up my linear model design, I read through several answers on Bioconductor and came up with the following:

design1 <- model.matrix(~ 0 + Treat_Time + PID, 
                        data = map.samples1)
colnames(design1)[1:4] <- levels(Treat_Time)
fit1 <- lmFit(eset1, design1)
cont.dif <- makeContrasts(
  DeltaDelta = (X_2 - X_1) - (Y_2 - Y_1), 
  levels = design1)

My goal was to eventually develop a contrast matrix that would look at the difference between the two treatment groups, accounting for the random effects of patient. I have two questions:

First question: Does using a 0-intercept model make sense here? There are no probes that have 0 as the mean value, even in the absence of any covariates or main effects. Thus, to me it doesn't make sense to use a 0-intercept. However, so many of the answers that I have seen on Bioconductor regarding time-series or interaction terms include the 0 intercept, so I was starting to have doubts. In addition, if I don't use a 0-intercept, X_1 does not make it into the design matrix, i.e.

 design1 <- model.matrix(~ 0 + Treat_Time + PID, 
                            data = map.samples1)
head(colnames(design1))
"Treat_TimeX_1" "Treat_TimeX_2" "Treat_TimeY_1" "Treat_TimeY_2" "PIDP01C666"    "PIDP01C777"   

design2 <- model.matrix(~ + Treat_Time + PID)
head(colnames(design2))
"(Intercept)"   "Treat_TimeX_2" "Treat_TimeY_1" "Treat_TimeY_2" "PIDP01C666"    "PIDP01C777"

Therefore, I wouldn't be able to make the interaction contrast if I didn't have the 0 intercept. The more general question is: when is it appropriate to utilize a 0-intercept model? Does one need to center-scale/standardize the expression data first?

Second question: Am I modeling the interaction between treatment and time appropriately, while accounting for random effects? I have seen two methods of modeling the random effects of PID. The first method adds PID to the model as a covariate ( https://support.bioconductor.org/p/125149 ) while the second method uses duplicateCorrelation() ( https://support.bioconductor.org/p/70070 ). When is each method appropriate?

Thank you so much for all of your help. If there are any references that you feel I should read based on where you think my understanding is at with linear models and utilizing limma in general, please let me know!

R

limma regression differential expression linear model • 6.4k views

ADD COMMENT • link updated 5.2 years ago by Gordon Smyth 52k • written 5.2 years ago by robert.chen • 0

score 4 · Accepted Answer · 2020-02-23

4

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

I appreciate the effort you've made here, but you are jumping to a few incorrect conclusions.

The choice of whether or not to include an intercept is the model is purely a matter of mathematical convenient in how you setup your analysis. You will get the same DE results in the end regardless of whether you include the intercept or not. The choice has nothing to do with whether any probes have zero log-expression or whether the data is standardized (it shouldn't be). You can easily form interaction effects whether you estimate an intercept or not.

Your experiment is a paired comparison (Time 3 vs Time 1) with two treatments. My personal preference for a design like this is to estimate the X and Y effects separately. I would proceed like this. First, define the baseline effect for the patients:

design <- model.matrix(~PID)

Then add the Time 3 vs Time 1 effect for Treatment X:

design <- cbind(design, X= Treatment=="X" & Timepoint==3)

Then add the Time 3 vs Time 1 effect for Treatment Y:

design <- cbind(design, Y= Treatment=="Y" & Timepoint==3)

With this formulation, coefficient "X" estimates the logFC for treatment X and coefficient "Y" estimates the logFC for treatment Y, adjusted for patient baselines. To compare Y to X, just form the Y-X contrast.

My formulation gives the same DE results as other formulations, I just find it clearer.

If there are any references that you feel I should read based on where you think my understanding is at with linear models and utilizing limma in general, please let me know!

The limma User's Guide is the obvious reference. Perhaps you have read that already, although you haven't mentioned doing so. Unfortunately, traditional statistics textbooks are not very useful for these sort of questions.

ADD COMMENT • link 5.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks Gordon, I really appreciate that you respond to these threads. It means a lot to me (and I'm sure the entire community) that the original creator of these statistical frameworks is still willing to help clarify some fundamental concepts.

The choice of whether or not to include an intercept is the model is purely a matter of mathematical convenient in how you setup your analysis. You will get the same DE results in the end regardless of whether you include the intercept or not. The choice has nothing to do with whether any probes have zero log-expression or whether the data is standardized (it shouldn't be). You can easily form interaction effects whether you estimate an intercept or not.

Thank you SO much for posting this. I had read through the Limma guide already and am now doubly convinced I would have never figured this out, given that every "stats" reference/textbook I had read explicitly describes the 0-intercept model as having an assumption that when x=0, y=0 (which isn't true for my data). This clarifies a lot, and I will always include it so that the model.matrix will output the necessary columns.

Your experiment is a paired comparison (Time 3 vs Time 1) with two treatments. My personal preference for a design like this is to estimate the X and Y effects separately.

Your design is very clear. If I'm understanding this correctly, by including PID and only the 3rd timepoint (the end timepoint), the differences in baseline between individuals is already modeled such that a timepoint 1 isn't necessary. But I do have one question: in your case, you don't include a 0-intercept, which makes me think that the first patient won't be included in the model (Intercept, Patient 2, Patient 3... etc). Understanding that the 0-intercept notation allows one to capture all patients without dropping out the first patient, is the appropriate set of steps:

design <- model.matrix(~ 0 + PID)
design <- cbind(design, X = Treatment=="X" & Timepoint==3)
design <- cbind(design, Y = Treatment=="Y" & Timepoint == 3)

If not, then I believe I'm still misunderstanding something fundamental about the purpose of including " ~ 0 + Cov1 .." vs "~ Cov 1 + Cov2 ..."

Thanks for all that you do.

R

ADD REPLY • link 5.2 years ago robert.chen • 0

3

Entering edit mode

It makes absolutely no difference whether you use ~PID or 0+PID. Both versions adjust for all the patients, including the first patient.

What I think may be confusing you is the following:

If x is a numerical covariate, then ~x and ~0+x are different models. The first model has 2 coefficients (intercept and slope) whereas the second only has 1 coefficient (slope). The second model does indeed assume that the expected response is zero when x is zero.
If A is a factor, then ~A and ~0+A are the same model, just parametrized slightly different. Both versions have the same number of parameters. The first version does not remove the first level of A, it simply renames the first level of A to be the Intercept. The second version does not actually remove the intercept, it just renames it to be the first level of A. I tried to explain this in Section 9.2 of the limma User's Guide. Is the section not clear?

The statistics textbook discussion of intercepts that you have read is for the regression, not for linear models or anova. The textbooks are assuming a context in which you only have numeric covariates and no factors.

ADD REPLY • link 5.2 years ago Gordon Smyth 52k

0

Entering edit mode

Ah yes, I was definitely confused by the difference you mentioned, namely that for categorical variables, the models are the same but for continuous variables, the two models are different. Thanks for clarifying this discrepancy!

R

ADD REPLY • link 5.2 years ago robert.chen • 0