Hello,
I'm curious about the best way to work with a "spotty" continuous variable, i.e., a variable that is technically continuous but has gaps in its observed values, such that you might worry a model won't fit well.
For example, say I have fifteen samples in three groups, with a continuous variable that has gaps as below.
library(ggplot2)
# Use '=' (not '<-') inside data.frame() so the column is actually named "Variable"
DF <- data.frame(Group = c(rep("A",5), rep("B",5), rep("C",5)),
                 Variable = c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928, 1176, 1177, 1193, 1234, 1240))
ggplot(DF, aes(x=Variable, y=Variable, color=Group))+
  geom_point(shape=2, stroke=1, size=3)
One part of the analysis will be comparing gene expression between the three groups. Since the continuous variable differs by group, a between-group comparison will tell me a bit about gene expression that co-varies with my variable. However, I wonder if I'd get more information by modelling gene expression based on the continuous variable (especially since my group size is small and the continuous variable is very biologically interesting), either by including it in the model in limma or by using splines.
My question is, will either of those options (covariate or splines) be overly negatively affected by the gaps in the continuous variable?
Thanks for your help!
Hi Aaron-
Whoops! Let me re-ask my question in a way that doesn’t create a bigger problem than the one I’m intending to ask about! (I’ll leave my original question too, in case it helps anyone with that problem.)
Let’s say I have 15 samples from one population and have a continuous variable measurement for each - same data as above, but without the groups. The continuous variable has some natural gaps that make it easy to bin the samples into three bins - groups A, B, and C above are now bins based on Variable.
Would I be correct that modeling gene expression on Variable (as a continuous variable) across all samples would give me more power to detect associations than binning Variable and testing for DEG between bins? Is this true even if Variable has bigger gaps? Is there a rule of thumb for when you should bin a spotty continuous variable and do between-group testing and when you should use splines / include your continuous variable in the model? Will limma ever throw an error or warning if your continuous variable is not “continuous” enough to use splines?
Thanks for your thoughts on this!
There are two considerations here.
Power: A spline with more than 2 degrees of freedom should fit better than three groups (which spend 2 d.f. on the variable), as you would be able to account for trends within each group. This should result in a smaller residual variance (dispersion) estimate and increased power. If you have enough spline parameters, you would also be able to detect trends within each group that would not be seen in a group-based analysis if the group-wise averages were the same.
Interpretation: Splines are a real pain to interpret, while groups are easy. You can compare one group to another easily, compute log-fold changes, etc. You can't do that with splines - all you will get is "does the continuous variable have a significant effect?" This is something that you need to think about; there's no point doing an analysis with lots of power if it becomes so much harder to interpret scientifically.
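To make the two options concrete, here is a minimal sketch of the two design matrices, using the values from the question. The expression matrix `expr` (genes by 15 samples) is hypothetical and not defined here, so the limma calls only run if you supply one; `df = 3` is just an illustrative choice, not a recommendation.

```r
library(splines)  # ships with base R

Variable <- c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928,
              1176, 1177, 1193, 1234, 1240)
Bin <- factor(rep(c("A", "B", "C"), each = 5))

# Bin-based design: intercept + 2 bin effects (2 d.f. spent on the variable)
design_bin <- model.matrix(~ Bin)

# Spline design: intercept + natural cubic spline with 3 d.f. (illustrative)
design_spline <- model.matrix(~ ns(Variable, df = 3))

# 'expr' is a hypothetical log-expression matrix; skipped if it does not exist
if (requireNamespace("limma", quietly = TRUE) && exists("expr")) {
  fit <- limma::eBayes(limma::lmFit(expr, design_spline))
  # Test all spline coefficients jointly: "does Variable have any effect?"
  top <- limma::topTable(fit, coef = 2:ncol(design_spline), number = Inf)
}
```

With the bin design you can pull out individual contrasts (B vs A, C vs A) and their log-fold changes; with the spline design the natural test is the joint one over all spline columns, which is where the interpretation cost comes from.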
On whether this still holds with bigger gaps: it's probably okay. It depends on the algorithm used for knot placement; if your knots land in intervals without points, they probably won't contribute much to the fit.
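You can check where the knots end up for your own data. The sketch below assumes `splines::ns()` with its default quantile-based knot placement and an illustrative `df = 3`; other spline functions or a different `df` will place knots differently.

```r
library(splines)

Variable <- c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928,
              1176, 1177, 1193, 1234, 1240)

basis <- ns(Variable, df = 3)

# Interior knots chosen by ns() (quantiles of Variable by default)
knots <- attr(basis, "knots")
print(knots)

# TRUE for each knot that does not coincide with an observed value
off_data <- !(knots %in% Variable)

# The design is still usable as long as it has full column rank
design <- model.matrix(~ basis)
full_rank <- qr(design)$rank == ncol(design)
```

For these particular values both interior knots fall inside the gaps between the bins, and the design is still full rank. If you would rather control this yourself, `ns()` also accepts an explicit `knots` argument.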
On a rule of thumb for binning versus splines: no hard number comes to mind. One could imagine doing cross-validation to determine the ideal number of spline d.f., and then checking whether the residual variance from a group-based analysis is much greater than this... that's a pain. Probably the easier thing is to ask, "Do I care about trends within groups?" If you do, then use a spline. If you don't, then use the groups.
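For what it's worth, the cross-validation idea might look something like the sketch below for a single gene. The response `y` is simulated purely for illustration (it is not real expression data), the helper `loocv()` is a made-up name, and this uses plain `lm()` with leave-one-out prediction error rather than anything limma-specific; it also compares prediction error directly instead of residual variances.

```r
library(splines)

Variable <- c(196, 272, 284, 395, 407, 631, 683, 715, 784, 928,
              1176, 1177, 1193, 1234, 1240)
Bin <- factor(rep(c("A", "B", "C"), each = 5))

# Simulated log-expression for ONE hypothetical gene (illustration only)
set.seed(1)
y <- 5 + 0.002 * Variable + rnorm(length(Variable), sd = 0.3)
dat <- data.frame(y, Variable, Bin)

# Leave-one-out mean squared prediction error for a model formula
loocv <- function(formula, data) {
  errs <- vapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    data$y[i] - predict(fit, newdata = data[i, , drop = FALSE])
  }, numeric(1))
  mean(errs^2)
}

# Spline models with 1 to 4 d.f., and the bin-based model
cv_spline <- sapply(1:4, function(d) {
  loocv(as.formula(paste0("y ~ ns(Variable, df = ", d, ")")), dat)
})
names(cv_spline) <- paste0("df", 1:4)
cv_bin <- loocv(y ~ Bin, dat)

print(cv_spline)
print(cv_bin)
```

Doing this gene by gene across a whole dataset is exactly the "pain" mentioned above, so treat it as a diagnostic for a handful of genes rather than a routine step.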
On errors or warnings: if you have fewer unique values than spline d.f., limma will warn about unestimable coefficients.
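You can spot this situation before fitting by checking the rank of the design matrix. The sketch below uses a made-up variable with only three unique values and hand-picked interior knots (both are assumptions for illustration); the `limma::nonEstimable()` call only runs if limma is installed.

```r
library(splines)

# Hypothetical "spotty" variable with only 3 unique values
x <- rep(c(300, 750, 1200), each = 5)

# Two hand-picked interior knots give 3 spline columns; with the intercept
# that is 4 coefficients, more than the 3 distinct x values can support
design <- model.matrix(~ ns(x, knots = c(500, 1000)))

design_rank <- qr(design)$rank
rank_deficient <- design_rank < ncol(design)
print(c(columns = ncol(design), rank = design_rank))

# limma can name the coefficients that cannot be estimated
if (requireNamespace("limma", quietly = TRUE)) {
  print(limma::nonEstimable(design))
}
```

If `rank_deficient` is `TRUE`, reduce the spline d.f. (or fall back to bins) rather than relying on the fit dropping coefficients for you.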