Question: Modeling with 'discontinuous' continuous variables, or continuous variables with gaps
smurray10 wrote:

Hello,

I'm curious about the best way to work with a "spotty" continuous variable, i.e., a variable that is technically continuous but has gaps in its values, such that you might worry a model won't fit well.

For example, say I have fifteen samples in three groups, with a continuous variable that has gaps as below.

library(ggplot2)

# Fifteen samples in three groups of five.
DF <- data.frame(Group = c(rep("A",5), rep("B",5), rep("C",5)),
                 Variable = c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928, 1176, 1177, 1193, 1234, 1240))

# Plot each sample's value, colored by group, to show the gaps.
ggplot(DF, aes(x=Variable, y=Variable, color=Group)) +
  geom_point(shape=2, stroke=1, size=3)

One part of the analysis will be comparing gene expression between the three groups. Since the continuous variable differs by group, a between-group comparison will tell me a bit about gene expression that co-varies with my variable. However, I wonder if I'd get more information by modelling gene expression based on the continuous variable (especially since my group size is small and the continuous variable is very biologically interesting), either by including it in the model in limma or by using splines.

My question is, will either of those options (covariate or splines) be overly negatively affected by the gaps in the continuous variable?

Thanks for your help!

Answer: Modeling with 'discontinuous' continuous variables, or continuous variables with
Aaron Lun wrote:

I find your to-spline-or-not-to-spline question irrelevant in the face of this (partially) confounded design. A much more pressing concern is how you are proposing to reconcile Group and Variable in the same model. Given that both of these things are likely to have an effect and/or be interesting, it seems to me that the only reasonable approach is:

# Don't really need the 0 +, I just like no-intercept models.
design <- model.matrix(~0 + Group + Variable, data=DF)

You have greatly reduced power to detect differences between groups that follow the same pattern as the differences in Variable, and vice versa. That's just life; there's no way to distinguish between the two. It's not all bad because (i) differences between groups can still be detected if they aren't linear, and (ii) the effect of Variable can still be quantified within each group. So you should still be able to get something from this analysis, though obviously not with as much power as one would hope for.
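As a rough check of how much Group and Variable overlap in the example data, the sketch below confirms that the design is still full rank and then measures how much of Variable is explained by Group alone. The variance inflation factor is an illustrative diagnostic added here, not something limma reports, and the names `design_rank`, `r2` and `vif` are hypothetical.

```r
# Example data from the question.
DF <- data.frame(
  Group = rep(c("A", "B", "C"), each = 5),
  Variable = c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928,
               1176, 1177, 1193, 1234, 1240)
)

design <- model.matrix(~0 + Group + Variable, data = DF)

# Full column rank means every coefficient is technically estimable.
design_rank <- qr(design)$rank

# Fraction of the variance in Variable that is explained by Group alone.
r2 <- summary(lm(Variable ~ Group, data = DF))$r.squared

# Variance inflation for the Variable coefficient caused by including Group
# (illustrative diagnostic, not a limma output).
vif <- 1 / (1 - r2)

print(c(rank = design_rank, r2 = r2, vif = vif))
```

A rank equal to the number of columns with an R-squared well above 0.9 is what "partially confounded" looks like in practice: the model can be fitted, but the Variable coefficient is estimated only from the small within-group spread.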

I should mention that (i) is seriously abrogated if you use splines, because non-linear changes (e.g., A increases to B and decreases to C) will be fitted by the spline without problems. This puts you back in the position where you can't distinguish the Group effect from the Variable effect. Having said that, if a spline is necessary (e.g., non-linear changes with respect to Variable within each group, or gradients for linear changes that differ across groups), you've not much choice.
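The point about splines absorbing non-linear group differences can be illustrated with a toy response. In the sketch below, `y` is a hypothetical expression pattern that is elevated only in group B (so it rises from A to B and falls from B to C), and `ns(Variable, df = 3)` is an assumed choice of spline; neither comes from the original post.

```r
library(splines)

DF <- data.frame(
  Group = rep(c("A", "B", "C"), each = 5),
  Variable = c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928,
               1176, 1177, 1193, 1234, 1240)
)

# Hypothetical pattern: a pure Group effect, high in B and low in A and C.
y <- as.numeric(DF$Group == "B")

# A linear term in Variable cannot follow an up-then-down pattern...
r2_linear <- summary(lm(y ~ Variable, data = DF))$r.squared

# ...but a natural spline in Variable has enough flexibility to track it.
r2_spline <- summary(lm(y ~ ns(Variable, df = 3), data = DF))$r.squared

print(c(linear = r2_linear, spline = r2_spline))
```

The linear fit explains almost none of this pattern, so a Group term would still have something left to detect. The spline should explain far more of it, which is why a Group effect of this shape becomes hard to separate from a Variable effect once a spline is in the model.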


Hi Aaron-

Whoops! Let me re-ask my question in a way that doesn’t create a bigger problem than the one I’m intending to ask about! (I’ll leave my original question too, in case it helps anyone with that problem.)

Let’s say I have 15 samples from one population and have a continuous variable measurement for each - same data as above, but without the groups. The continuous variable has some natural gaps that make it easy to bin the samples into three bins - groups A, B, and C above are now bins based on Variable.

Would I be correct that modeling gene expression on Variable (as a continuous variable) across all samples would give me more power to detect associations than binning Variable and testing for DEG between bins? Is this true even if Variable has bigger gaps? Is there a rule of thumb for when you should bin a spotty continuous variable and do between-group testing and when you should use splines / include your continuous variable in the model? Will limma ever throw an error or warning if your continuous variable is not “continuous” enough to use splines?

Thanks for your thoughts on this!

There are two considerations here.

Power: Fitting a spline with more than 2 degrees of freedom should fit better than three groups, as you would be able to account for trends within each group. This should result in a decreased dispersion estimate and increased power. If you have enough spline parameters, you would also be able to detect trends within each group that would not be seen in a group-based analysis if the group-wise averages were the same.
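A small simulation can make the power argument concrete. This is only a sketch under stated assumptions: the response `expr` is simulated with a linear trend in Variable, the bin boundaries in `cut()` are chosen by eye, and plain `lm()` on a single simulated gene stands in for a limma fit across many genes.

```r
library(splines)
set.seed(1)

DF <- data.frame(
  Variable = c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928,
               1176, 1177, 1193, 1234, 1240)
)

# Bin on the natural gaps (break points are an assumption for this example).
DF$Bin <- cut(DF$Variable, breaks = c(0, 500, 1000, 1500),
              labels = c("A", "B", "C"))

# Simulated expression for one gene: linear trend in Variable plus noise.
DF$expr <- 5 + 0.005 * DF$Variable + rnorm(nrow(DF), sd = 0.1)

fit_bin    <- lm(expr ~ Bin, data = DF)
fit_spline <- lm(expr ~ ns(Variable, df = 3), data = DF)

# Residual standard deviation: the within-bin trend inflates it for fit_bin.
sigma_bin    <- summary(fit_bin)$sigma
sigma_spline <- summary(fit_spline)$sigma

print(c(bin = sigma_bin, spline = sigma_spline))
```

In this simulated case the binned model leaves the within-bin trend in the residuals, so its residual standard deviation is larger. With real data the size of that gap depends on how strong the within-bin trends actually are.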

Interpretation: Splines are a real pain to interpret, while groups are easy. You can compare one group to another easily, compute log-fold changes, etc. You can't do that with splines - all you will get is "does the continuous variable have a significant effect?" This is something that you need to think about; there's no point doing an analysis with lots of power if it becomes so much harder to interpret scientifically.

> Is this true even if Variable has bigger gaps?

It's probably okay. It depends on the algorithm used for knot placement; if your knots land in intervals without points, they probably won't contribute much to the fit.
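To see where knots end up for the example data, one can inspect the basis directly. The sketch assumes `splines::ns()` with `df = 3`, which by default places interior knots at quantiles of the data; other spline functions may place knots differently.

```r
library(splines)

Variable <- c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928,
              1176, 1177, 1193, 1234, 1240)

basis <- ns(Variable, df = 3)

# Interior knots chosen by ns() (quantiles of Variable by default).
interior_knots <- as.numeric(attr(basis, "knots"))
print(interior_knots)

# Nearest observed values on either side of each knot.
for (k in interior_knots) {
  below <- max(Variable[Variable <= k])
  above <- min(Variable[Variable >= k])
  cat(sprintf("knot %.1f lies between observations %g and %g\n", k, below, above))
}
```

With these values both interior knots fall inside the gaps rather than inside a cluster of points. If that is undesirable, `ns()` also accepts an explicit `knots` argument so the positions can be set by hand.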

> Is there a rule of thumb for when you should bin a spotty continuous variable and do between-group testing and when you should use splines / include your continuous variable in the model?

No hard number comes to mind. One could imagine doing cross-validation to determine the ideal number of spline d.f., and then checking whether the residual variance from a group-based analysis is much greater than this... that's a pain. Probably the easier thing is to ask, "Do I care about trends within groups?" If you do, then use a spline. If you don't, then use the groups.
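For anyone who does want to try the cross-validation route, here is a minimal leave-one-out sketch for a single gene. It relies on several assumptions: the response is simulated, `loocv_mse` is a hypothetical helper, and the shortcut formula keeps the spline basis (including its knots) fixed across folds, so it is an approximation to refitting the spline from scratch each time.

```r
library(splines)
set.seed(1)

DF <- data.frame(
  Variable = c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928,
               1176, 1177, 1193, 1234, 1240)
)
DF$Bin <- cut(DF$Variable, breaks = c(0, 500, 1000, 1500),
              labels = c("A", "B", "C"))

# Simulated expression for one gene (assumption for illustration only).
DF$expr <- 5 + 0.005 * DF$Variable + rnorm(nrow(DF), sd = 0.1)

# Leave-one-out mean squared error for a linear model, using the identity
# e_i / (1 - h_ii) for the held-out residual.
loocv_mse <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# Candidate spline degrees of freedom versus the binned model.
cv_spline <- sapply(2:4, function(k) {
  loocv_mse(lm(expr ~ ns(Variable, df = k), data = DF))
})
names(cv_spline) <- paste0("df", 2:4)
cv_bin <- loocv_mse(lm(expr ~ Bin, data = DF))

print(c(cv_spline, bin = cv_bin))
```

Doing this per gene across a whole dataset is the "pain" mentioned above, which is why the simpler question about within-group trends is usually the more practical guide.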

> Will limma ever throw an error or warning if your continuous variable is not “continuous” enough to use splines?

If you have fewer unique values than spline d.f., limma will warn about unestimable coefficients.
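The underlying problem can be checked before fitting by comparing the rank of the design matrix with its number of columns. The sketch below uses base R and the `splines` package only; the variable `x` with three unique values and the hand-picked knots are assumptions made for the example.

```r
library(splines)

# A "continuous" variable with only three unique values.
x <- rep(c(1, 2, 3), each = 5)

# Spline with 4 d.f. (three interior knots, set by hand for this example).
basis  <- ns(x, knots = c(1.5, 2, 2.5))
design <- model.matrix(~basis)

n_coef      <- ncol(design)      # intercept + 4 spline columns
design_rank <- qr(design)$rank   # cannot exceed the number of unique x values

print(c(coefficients = n_coef, rank = design_rank))

if (design_rank < n_coef) {
  message("Design is rank deficient: some coefficients cannot be estimated.")
}
```

A rank below the number of columns is the condition that leads to the warning described above. limma also provides `nonEstimable()` for inspecting a design matrix, which should flag the same columns.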