Question: Design that evaluates response to continuous and categorical variables
0
4.0 years ago by
European Union
Moritz E. Beber10 wrote:

Dear all,

I'm trying to use limma in order to analyze a large number of Affymetrix microarrays. The experimental setup includes 2-3 replicates, 2-4 different time points and 1-5 different dosage levels.

So a typical targets file could look like the following where the time and dose can be non-integer numbers in reality:

> my.targets = data.frame(group = c(rep("untreated", times = 6),
rep("treated", times = 18)),
time = rep(1:3, each = 2, times = 4),
dose = rep(0:3, each = 6),
replicate = rep(c("A", "B"), times = 12))
> my.targets
group time dose replicate
1  untreated    1    0         A
2  untreated    1    0         B
3  untreated    2    0         A
4  untreated    2    0         B
5  untreated    3    0         A
6  untreated    3    0         B
7    treated    1    1         A
8    treated    1    1         B
9    treated    2    1         A
10   treated    2    1         B
11   treated    3    1         A
12   treated    3    1         B
13   treated    1    2         A
14   treated    1    2         B
15   treated    2    2         A
16   treated    2    2         B
17   treated    3    2         A
18   treated    3    2         B
19   treated    1    3         A
20   treated    1    3         B
21   treated    2    3         A
22   treated    2    3         B
23   treated    3    3         A
24   treated    3    3         B


as you can see the "control" group is synonymous with no dosage.

I would like to evaluate the overall genetic response to treatment and time. So usually I would accomplish this with:

> my.design = model.matrix(~ time * dose, data = my.targets)

since I'm not sure whether the response is to dosage, time or their interaction.

However, I don't know what the baseline expression would be with this design and whether limma is aware of the within replicates variability. So is it enough to use the following design:

> my.design = model.matrix(~ group + time * dose, data = my.targets)

with which limma will then use the untreated samples to establish a baseline expression? Or do I need to explicitly include the replicates as well?

modified 4.0 years ago by James W. MacDonald50k • written 4.0 years ago by Moritz E. Beber10
2

1.) Are time and dose integers in that data.frame? If so, that is almost surely not what you want to do (unless e.g., time is say hours, or dose is say, 0, 1, 2, 3 mg/kg or something like that).

2.) What does replicate signify? If I were to naively look at the data you present, I would have to assume you had just two biological replicates, and you treated these two animals with lots of different levels at various times.

You don't want to use group and dose in the model matrix (and probably don't want to use group at all). Group is partially aliased with dose, so you have semi-redundant information in your design matrix.

To clarify James' first point; do you want to consider each time point as a separate group, or do you want to fit the model such that expression changes linearly with the time covariate (e.g., expression of 2 at the first time point, 3 at the second, 4 at the third, and so on)? The former is more flexible as it doesn't force genes to follow a linear trend. You end up with a larger model, but that's fine, as you've got plenty of residual d.f. The same argument applies to dose.

1. As mentioned time and dose can be non-integer numbers. Time is in the range of hours to a maximum of 4 weeks. Doses vary widely and can be anywhere from a fraction to several thousand. The units are different per separate compound considered. The shown targets are for one compound only.
2. I apologize for the replicate column being confusing. I meant to express that I have, in this example, two replicates per unique time and dose combination. Consider each row number referring to a separate CEL file and there being true biological replicates.

I agree that group contains redundant information but if I am interested in the change (differentially expressed genes and coefficients) as compared to the untreated samples, is it the most reasonable choice to simply fit a linear model that includes time and dose or should I rather be forming a contrast?

In essence this touches upon Aaron Lun's comment. I believe that it might be sufficient to model a linear response of gene expression with time and dose. I have also been considering, however, as described in the userguide, to treat each dosage as a different group and fit splines to the time points per dosage group. Or I could, as Aaron Lun suggested, form groups given by unique combinations of time and dose.

My problem with the last two approaches is that I would like to end up with an overall response per gene to one compound. That means, how strongly does it respond (absolute values of coefficients in the linear model) and certainty (p-value). With the last two approaches I would have to maybe take the mean of the coefficients and aggregate the p-values.

Answer: Design that evaluates response to continuous and categorical variables
1
4.0 years ago by
United States
James W. MacDonald50k wrote:

How you specify a design is dependent on what you are trying to show. If you want to be able to say that the expression of gene X increases some amount as the dosage increases by some amount, then you have to fit a linear model with dosage as a continuous variable. As a related point, if you have many different dosages and times, then collapsing to groups is probably not what you want to do.

If you condense your dose and time values into 'buckets' then you are losing information as well as degrees of freedom. And if you only have two replicates per time/dose combination, then you almost surely don't want to collapse to factors. But this question has much more to do with 'how should I analyze my data' rather than 'how do I get limma to analyze my data in the way I have chosen?' This support site is primarily dedicated to the latter, and we as a rule at least attempt to refrain from answering the former, as that requires much more knowledge of the experiment and its goals. And if you are asking random people on the intertubes how you should be analyzing your data, you will likely get what you pay for.

Fair point, I suppose. Thank you for your time and effort.