Question

Design that evaluates response to continuous and categorical variables

0

Entering edit mode

Moritz E. Beber ▴ 10

@moritz-e-beber-5727

Last seen 8.5 years ago

European Union

Dear all,

I'm trying to use limma in order to analyze a large number of Affymetrix microarrays. The experimental setup includes 2-3 replicates, 2-4 different time points and 1-5 different dosage levels.

So a typical targets file could look like the following where the time and dose can be non-integer numbers in reality:

> my.targets = data.frame(group = c(rep("untreated", times = 6),
                                  rep("treated", times = 18)),
                        time = rep(1:3, each = 2, times = 4),
                        dose = rep(0:3, each = 6),
                        replicate = rep(c("A", "B"), times = 12))

> my.targets
       group time dose replicate
1  untreated    1    0         A
2  untreated    1    0         B
3  untreated    2    0         A
4  untreated    2    0         B
5  untreated    3    0         A
6  untreated    3    0         B
7    treated    1    1         A
8    treated    1    1         B
9    treated    2    1         A
10   treated    2    1         B
11   treated    3    1         A
12   treated    3    1         B
13   treated    1    2         A
14   treated    1    2         B
15   treated    2    2         A
16   treated    2    2         B
17   treated    3    2         A
18   treated    3    2         B
19   treated    1    3         A
20   treated    1    3         B
21   treated    2    3         A
22   treated    2    3         B
23   treated    3    3         A
24   treated    3    3         B

as you can see the "control" group is synonymous with no dosage.

I would like to evaluate the overall genetic response to treatment and time. So usually I would accomplish this with:

> my.design = model.matrix(~ time * dose, data = my.targets)

since I'm not sure whether the response is to dosage, time or their interaction.

However, I don't know what the baseline expression would be with this design and whether limma is aware of the within replicates variability. So is it enough to use the following design:

> my.design = model.matrix(~ group + time * dose, data = my.targets)

with which limma will then use the untreated samples to establish a baseline expression? Or do I need to explicitly include the replicates as well?

Thank you for your insights.

limma design and contrast matrix design matrix limma design matrix • 1.7k views

ADD COMMENT • link updated 8.8 years ago by James W. MacDonald 65k • written 8.8 years ago by Moritz E. Beber ▴ 10

2

Entering edit mode

You will probably have to give more information.

1.) Are time and dose integers in that data.frame? If so, that is almost surely not what you want to do (unless e.g., time is say hours, or dose is say, 0, 1, 2, 3 mg/kg or something like that).

2.) What does replicate signify? If I were to naively look at the data you present, I would have to assume you had just two biological replicates, and you treated these two animals with lots of different levels at various times.

You don't want to use group and dose in the model matrix (and probably don't want to use group at all). Group is partially aliased with dose, so you have semi-redundant information in your design matrix.

ADD REPLY • link 8.8 years ago James W. MacDonald 65k

0

Entering edit mode

To clarify James' first point; do you want to consider each time point as a separate group, or do you want to fit the model such that expression changes linearly with the time covariate (e.g., expression of 2 at the first time point, 3 at the second, 4 at the third, and so on)? The former is more flexible as it doesn't force genes to follow a linear trend. You end up with a larger model, but that's fine, as you've got plenty of residual d.f. The same argument applies to dose.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for your comments.

As mentioned time and dose can be non-integer numbers. Time is in the range of hours to a maximum of 4 weeks. Doses vary widely and can be anywhere from a fraction to several thousand. The units are different per separate compound considered. The shown targets are for one compound only.
I apologize for the replicate column being confusing. I meant to express that I have, in this example, two replicates per unique time and dose combination. Consider each row number referring to a separate CEL file and there being true biological replicates.

I agree that group contains redundant information but if I am interested in the change (differentially expressed genes and coefficients) as compared to the untreated samples, is it the most reasonable choice to simply fit a linear model that includes time and dose or should I rather be forming a contrast?

In essence this touches upon Aaron Lun's comment. I believe that it might be sufficient to model a linear response of gene expression with time and dose. I have also been considering, however, as described in the userguide, to treat each dosage as a different group and fit splines to the time points per dosage group. Or I could, as Aaron Lun suggested, form groups given by unique combinations of time and dose.

My problem with the last two approaches is that I would like to end up with an overall response per gene to one compound. That means, how strongly does it respond (absolute values of coefficients in the linear model) and certainty (p-value). With the last two approaches I would have to maybe take the mean of the coefficients and aggregate the p-values.

ADD REPLY • link 8.8 years ago Moritz E. Beber ▴ 10

score 1 · Answer 1 · 2015-06-30

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 54 minutes ago

United States

How you specify a design is dependent on what you are trying to show. If you want to be able to say that the expression of gene X increases some amount as the dosage increases by some amount, then you have to fit a linear model with dosage as a continuous variable. As a related point, if you have many different dosages and times, then collapsing to groups is probably not what you want to do.

If you condense your dose and time values into 'buckets' then you are losing information as well as degrees of freedom. And if you only have two replicates per time/dose combination, then you almost surely don't want to collapse to factors. But this question has much more to do with 'how should I analyze my data' rather than 'how do I get limma to analyze my data in the way I have chosen?' This support site is primarily dedicated to the latter, and we as a rule at least attempt to refrain from answering the former, as that requires much more knowledge of the experiment and its goals. And if you are asking random people on the intertubes how you should be analyzing your data, you will likely get what you pay for.

ADD COMMENT • link 8.8 years ago James W. MacDonald 65k

0

Entering edit mode

Fair point, I suppose. Thank you for your time and effort.

ADD REPLY • link 8.8 years ago Moritz E. Beber ▴ 10

0

Entering edit mode

Just to provide a different perspective; we often handle such experimental designs by collapsing into factors, so that each time/dose combination represents a different group with several replicates. There's some advantages in ease of interpretation for the model coefficients, flexibility in setting up contrasts between any groups of interest, and accuracy in modelling by not constraining the response of time or dose to some trend. The model ends up being larger, but we've got plenty of residual d.f. to burn in a large experiment, so that's not too problematic. We tend to use continuous covariates in the design for smaller set-ups with limited residual d.f., or in cases where the time or dose are not discrete entities.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k