Hi everyone -
I was hoping to get your help understanding when it is appropriate to use a 0-intercept model. I have log2-transformed, quantile-normalized fluorescence values off of a series of microarrays. I would like to determine the effects of treatment X vs treatment Y on each gene measured in a patient at compared to baseline. I have 50 patients undergoing treatment X, and a different set of 50 patients undergoing treatment Y (n=50 in each arm, n=100 in the experiment). I have measurements for each patient at baseline, and two subsequent timepoints (I will call these timepoints 1 [baseline], 2, and 3), with timepoint 3 being the end of the study.
My mapping data looks like the following:
PID Treatment Timepoint Treat_Time
P01C5553301 P01C555 Y 1 Y_1
P01C5553303 P01C555 Y 3 Y_3
P01C6663301 P01C666 X 1 X_1
P01C6663303 P01C666 X 3 X_3
P01C7773301 P01C777 X 1 X_1
P01C7773303 P01C777 X 3 X_3
To set up my linear model design, I read through several answers on Bioconductor and came up with the following:
design1 <- model.matrix(~ 0 + Treat_Time + PID,
data = map.samples1)
colnames(design1)[1:4] <- levels(Treat_Time)
fit1 <- lmFit(eset1, design1)
cont.dif <- makeContrasts(
DeltaDelta = (X_2 - X_1) - (Y_2 - Y_1),
levels = design1)
My goal was to eventually develop a contrast matrix that would look at the difference between the two treatment groups, accounting for the random effects of patient. I have two questions:
First question: Does using a 0-intercept model make sense here? There are no probes that have 0 as the mean value, even in the absence of any covariates or main effects. Thus, to me it doesn't make sense to use a 0-intercept. However, so many of the answers that I have seen on Bioconductor regarding time-series or interaction terms include the 0 intercept, so I was starting to have doubts. In addition, if I don't use a 0-intercept, X_1 does not make it into the design matrix, i.e.
design1 <- model.matrix(~ 0 + Treat_Time + PID,
data = map.samples1)
head(colnames(design1))
"Treat_TimeX_1" "Treat_TimeX_2" "Treat_TimeY_1" "Treat_TimeY_2" "PIDP01C666" "PIDP01C777"
design2 <- model.matrix(~ + Treat_Time + PID)
head(colnames(design2))
"(Intercept)" "Treat_TimeX_2" "Treat_TimeY_1" "Treat_TimeY_2" "PIDP01C666" "PIDP01C777"
Therefore, I wouldn't be able to make the interaction contrast if I didn't have the 0 intercept. The more general question is: when is it appropriate to utilize a 0-intercept model? Does one need to center-scale/standardize the expression data first?
Second question: Am I modeling the interaction between treatment and time appropriately, while accounting for random effects? I have seen two methods of modeling the random effects of PID. The first method adds PID to the model as a covariate ( https://support.bioconductor.org/p/125149 ) while the second method uses duplicateCorrelation() ( https://support.bioconductor.org/p/70070 ). When is each method appropriate?
Thank you so much for all of your help. If there are any references that you feel I should read based on where you think my understanding is at with linear models and utilizing limma in general, please let me know!
R
Thanks Gordon, I really appreciate that you respond to these threads. It means a lot to me (and I'm sure the entire community) that the original creator of these statistical frameworks is still willing to help clarify some fundamental concepts.
Thank you SO much for posting this. I had read through the Limma guide already and am now doubly convinced I would have never figured this out, given that every "stats" reference/textbook I had read explicitly describes the 0-intercept model as having an assumption that when x=0, y=0 (which isn't true for my data). This clarifies a lot, and I will always include it so that the model.matrix will output the necessary columns.
Your design is very clear. If I'm understanding this correctly, by including PID and only the 3rd timepoint (the end timepoint), the differences in baseline between individuals is already modeled such that a timepoint 1 isn't necessary. But I do have one question: in your case, you don't include a 0-intercept, which makes me think that the first patient won't be included in the model (Intercept, Patient 2, Patient 3... etc). Understanding that the 0-intercept notation allows one to capture all patients without dropping out the first patient, is the appropriate set of steps:
If not, then I believe I'm still misunderstanding something fundamental about the purpose of including " ~ 0 + Cov1 .." vs "~ Cov 1 + Cov2 ..."
Thanks for all that you do.
R
It makes absolutely no difference whether you use
~PID
or0+PID
. Both versions adjust for all the patients, including the first patient.What I think may be confusing you is the following:
If
x
is a numerical covariate, then~x
and~0+x
are different models. The first model has 2 coefficients (intercept and slope) whereas the second only has 1 coefficient (slope). The second model does indeed assume that the expected response is zero when x is zero.If
A
is a factor, then~A
and~0+A
are the same model, just parametrized slightly different. Both versions have the same number of parameters. The first version does not remove the first level of A, it simply renames the first level of A to be the Intercept. The second version does not actually remove the intercept, it just renames it to be the first level of A. I tried to explain this in Section 9.2 of the limma User's Guide. Is the section not clear?The statistics textbook discussion of intercepts that you have read is for the regression, not for linear models or anova. The textbooks are assuming a context in which you only have numeric covariates and no factors.
Ah yes, I was definitely confused by the difference you mentioned, namely that for categorical variables, the models are the same but for continuous variables, the two models are different. Thanks for clarifying this discrepancy!
R