Question

Paired samples, two time points, two groups and adjust for a continuous variable in edgeR

0

Entering edit mode

Sindre ▴ 70

@sindre-6693

Last seen 9.1 years ago

I want to find correlations between change in log2(gene expression) and change in log2(VO2max) from time A1 -> B1, adjusted for group and including the fact that the samples are paired. I get this error:

"Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, :
Design matrix not of full rank. The following coefficients not estimable:
ID79"

Is this due to some linear dependencies? How get I get the results I want?

VO2max <- as.numeric(pData.myoglu[ 1:52, 6]) # (Random numbers as I can't post the original here)

Time <- factor(pData.myoglu[ 1:52, 4], levels=c("A1", "B1"))

Status <- factor(pData.myoglu[ 1:52, 2], levels=c("control", "IGT"))

ID <- factor(pData.myoglu[ 1:52, 1])

design <- model.matrix(~ 0 + log2(VO2max) + Status + Time + ID)

Here is the design (not enough space to paste):

http://imgur.com/4H9TUS2

edger limma • 2.5k views

ADD COMMENT • link updated 10.0 years ago by Gordon Smyth 53k • written 10.0 years ago by Sindre ▴ 70

1

Entering edit mode

It would be more concise and helpful to show us the three factors rather than the design matrix:

data.frame(Status,Time,ID)

ADD REPLY • link 10.0 years ago Gordon Smyth 53k

0

Entering edit mode

Here is data.frame(Status,Time,ID):

Status	Time	ID
IGT	A1	2
IGT	A1	4
IGT	A1	17
IGT	A1	21
IGT	A1	53
IGT	A1	55
IGT	A1	61
IGT	A1	62
IGT	A1	67
IGT	A1	73
IGT	A1	76
IGT	A1	77
IGT	A1	79
control	A1	1
control	A1	8
control	A1	13
control	A1	70
control	A1	71
control	A1	72
control	A1	75
control	A1	81
control	A1	82
control	A1	83
control	A1	86
control	A1	87
control	A1	88
IGT	B1	2
IGT	B1	4
IGT	B1	17
IGT	B1	21
IGT	B1	53
IGT	B1	55
IGT	B1	61
IGT	B1	62
IGT	B1	67
IGT	B1	73
IGT	B1	76
IGT	B1	77
IGT	B1	79
control	B1	1
control	B1	8
control	B1	13
control	B1	70
control	B1	71
control	B1	72
control	B1	75
control	B1	81
control	B1	82
control	B1	83
control	B1	86
control	B1	87
control	B1	88

ADD REPLY • link 10.0 years ago Sindre ▴ 70

score 2 · Answer 1 · 2016-02-28

2

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 41 minutes ago

The city by the bay

Yes, it is because of linear dependencies. There's no way to tell what exactly the problem is, because you don't show what the experimental design actually is. I'd suspect that it is something to do with the three factors, i.e., time or status or ID, and how they correspond to each other. As a hypothetical example, if all samples at a certain time also had the same status, then there'd be no point putting both in the design matrix.

ADD COMMENT • link 10.0 years ago Aaron Lun ★ 29k

0

Entering edit mode

I included the three factors!

ADD REPLY • link 10.0 years ago Sindre ▴ 70

Gordon Smyth · Answer 2 · 2016-02-29

1

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

You have what is called a multi-level design in the limma and edgeR User Guides. You haven't told us what 'ID" means, but let's assume that ID corresponds to patient, for the sake of argument. The study has one within-patient factor (Time) and one between-patient factor (Status). You can't just add everything together in a linear model because ID is confounded with Status. Have a read of the User Guides.

All my comments about multi-level designs remain true whether or not the continuous variable log2(VO2max) is included in the model. But let me add that it is unusual for a continuous variable to be at all sensible in the paired design. You are basically trying to fit too many things at once to the same data. I suspect that you would to do better to remove either or VO2max or ID from the model.

Whenever you add a factor or covariate to the linear model, it means that all other coefficients in the model are adjusted for that factor. When you add ID to the model, it means that all the other coefficients can only be estimated from the within-patient comparisons. Since the same patient always has the same status, it becomes impossible to test any difference between IGT vs controls.

ADD COMMENT • link 10.0 years ago Gordon Smyth 53k

0

Entering edit mode

Ok, thank you. I want to get the genes correlating with VO2max. Specifically, what change in gene expression correlate with the change in VO2max from time 1 to 2.

You are correct about the ID. I will re-read the user Guides once more.

Thank you!

ADD REPLY • link updated 10.0 years ago by Gordon Smyth 53k • written 10.0 years ago by Sindre ▴ 70

1

Entering edit mode

If you want the overall correlation between VO2max and expression, then just use

design <- model.matrix(~log2(VO2max))

If you want to measure only the correlation within the status groups (this usually will be smaller) then:

design <- model.matrix(~Status+log2(VO2max))

If you want to measure the correlation separately for the two groups then:

design <- model.matrix(~Status+Status:log2(VO2max))

If you want to check that the VO2 and expression change in the same direction between the two times for individual patients, then:

design <- model.matrix(~ID+log2(VO2max))

Finally, if you want to correlate VO2 and expression for individual patients, and check whether this is different for the two groups of patients, then:

design <- model.matrix(~ID+Status:log2(VO2max))

ADD REPLY • link 10.0 years ago Gordon Smyth 53k

0

Entering edit mode

Thank you very much! This is indeed what I was looking for.

ADD REPLY • link 10.0 years ago Sindre ▴ 70

0

Entering edit mode

Hello,

Just hoping to ask a clarification question- your fourth model matrix above checks whether VO2max and expression change in the same direction- does it also check for genes whose expression changes in the opposite direction? I.e., if expression of a gene decreases between the two time points while VO2max goes up, will that gene show up as significant, just with a negative fold change/estimate?

Thank you, and thanks for the original question and answer!

ADD REPLY • link 7.0 years ago smurray ▴ 10