Question

Design matrix for multiple interacting continuous variables in limma


alexandria.andrayas ▴ 10


I am looking into how duration of smoking, intensity of smoking and years since last smoked impact DNA methylation. What I want to do is create a model or design matrix that takes into account all three of these interacting factors. I was thinking of something like the following:

design <- model.matrix(~0 + duration + intensity + years_since_quitting)

However, I thought this might be an oversimplification, and that it won't help if there is a non-linear trend. Any help would be gladly appreciated.

limma limma design matrix • 1.1k views

updated 7.4 years ago by James W. MacDonald 65k • written 7.4 years ago by alexandria.andrayas ▴ 10

score 1 · Answer 1 · 2016-11-14

The model you are specifying drops the intercept term (that is what the ~0 does), which forces the fit through the origin. In other words, you are saying that you expect DNA methylation to be zero when duration, intensity and years since quitting are all zero, which isn't likely to be a valid assumption. Even though there probably isn't a subject for whom all of the measurements are zero, the fitted model still has a value at that point, and there isn't a strong biological or statistical rationale for thinking that DNA methylation there should be zero.
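To make the difference concrete, here is a minimal sketch in base R comparing the two design matrices. The pheno data frame, its simulated values and the underscore column name years_since_quitting are assumptions made up for illustration; they are not the poster's data.

```r
# Simulated covariates (hypothetical values, for illustration only)
set.seed(1)
n <- 12
pheno <- data.frame(
  duration = runif(n, 1, 40),              # years smoked
  intensity = runif(n, 1, 40),             # cigarettes per day
  years_since_quitting = runif(n, 0, 30)   # years since last smoked
)

# ~0 removes the intercept, so the fit is forced through the origin
design_no_int <- model.matrix(~0 + duration + intensity + years_since_quitting,
                              data = pheno)

# Default formula: an "(Intercept)" column is added, so the baseline
# methylation level is estimated from the data instead of fixed at zero
design_int <- model.matrix(~duration + intensity + years_since_quitting,
                           data = pheno)

colnames(design_no_int)
colnames(design_int)
```

Either matrix could then be passed to limma's lmFit() as the design argument; the only difference is whether a baseline level is estimated.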

In addition, you may be assuming that there are interactions, but your model contains no interaction terms between any of the covariates. For example, you could imagine that intensity and duration of smoking interact, so that someone who barely smoked for 20 years has a different level of methylation than someone who really went at it for five years and then quit. But you have to specify that explicitly in the formula for the design matrix (see ?formula). Higher-order interactions (like duration:intensity:years_since_quitting) get pretty hard to interpret, so people usually restrict themselves to two-way interactions for a subset of covariates. Which one(s) to use is part of the art of statistical analysis.
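As a rough sketch of what specifying an interaction looks like, the formula below adds a single two-way interaction between duration and intensity. The choice of that particular pair, the pheno data frame and its simulated values are assumptions for illustration only.

```r
# Simulated covariates (hypothetical values, for illustration only)
set.seed(1)
n <- 12
pheno <- data.frame(
  duration = runif(n, 1, 40),
  intensity = runif(n, 1, 40),
  years_since_quitting = runif(n, 0, 30)
)

# duration * intensity expands to duration + intensity + duration:intensity
design_2way <- model.matrix(~duration * intensity + years_since_quitting,
                            data = pheno)
colnames(design_2way)

# For continuous covariates the interaction column is just the product
# of the two main-effect columns
all.equal(unname(design_2way[, "duration:intensity"]),
          pheno$duration * pheno$intensity)

# With limma, the interaction could then be tested along these lines
# (not run here; 'meth' is a hypothetical matrix of methylation values):
# fit <- limma::eBayes(limma::lmFit(meth, design_2way))
# limma::topTable(fit, coef = "duration:intensity")
```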

Adding a non-linear trend makes the model very difficult to interpret, especially in light of any interactions. If you are worried about non-linearity, you might consider simplifying by grouping your continuous variables into quartiles or something like that. You lose some information by 'chotomizing' your data, but it's no help to have an 'optimal' model that you struggle to interpret.
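One possible way to do the quartile grouping in base R is sketched below. The duration vector is simulated and the Q1 to Q4 labels are arbitrary; both are assumptions for illustration.

```r
# Simulated smoking duration in years (hypothetical values)
set.seed(1)
duration <- runif(40, 1, 40)

# Cut points at the 0th, 25th, 50th, 75th and 100th percentiles
breaks <- quantile(duration, probs = seq(0, 1, 0.25))

# Convert the continuous variable into a four-level factor;
# include.lowest = TRUE keeps the minimum value in the first group
duration_q <- cut(duration, breaks = breaks, include.lowest = TRUE,
                  labels = c("Q1", "Q2", "Q3", "Q4"))
table(duration_q)

# The factor enters the design matrix as an intercept (Q1 baseline)
# plus one indicator column for each of the other quartiles
design_q <- model.matrix(~duration_q)
colnames(design_q)
```

Each non-intercept coefficient is then the difference in methylation between that quartile and the lowest one, which is easier to explain than a curve, at the cost of the information lost by binning.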

You might also consider using one of the packages that allow you to smooth data over genomic regions (minfi/bumphunter, DMRcate, DSS, etc.), as individual CpG measurements tend to be pretty noisy, and a regional aggregate signal is often a better measure.

Do note that this support site is primarily intended to provide technical help for people who don't understand the software, rather than providing statistical analysis advice. The former is much easier than the latter, so any analysis advice is usually pretty vague. If you need real help, you should try to find a local statistician.