Question

EdgeR model not of full rank

0

Entering edit mode

philip.freda • 0

@philipfreda-13499

Last seen 5.6 years ago

I have data that has genetic line (line) and phenotype (pheno) as my factors and my data looks something like this:

sample_id	line	rep	pheno
1	358	1	LL
3	358	2	LL
14	380	1	LL
15	380	2	LL
25	441	1	HL
26	441	2	HL
27	441	3	HL
37	486	1	LL
38	486	2	LL
39	486	3	LL
49	832	1	HL
50	832	2	HL
51	832	3	HL
61	913	1	HL
62	913	2	HL
63	913	3	HL

The model I want to run is to look at both phenotype and line nested within phenotype which I believe is expressed syntactically as:

~ pheno+pheno:line

And the model matrix is:

	(Intercept)	phenoL0LL	phenoL0HL:lineL0380	phenoL0LL:lineL0380	phenoL0HL:lineL0441	phenoL0LL:lineL0441	phenoL0HL:lineL0486	phenoL0LL:lineL0486	phenoL0HL:lineL0832	phenoL0LL:lineL0832	phenoL0HL:lineL0913	phenoL0LL:lineL0913
droso_1_exp_count	1	1	0	0	0	0	0	0	0	0	0	0
droso_14_exp_count	1	1	0	1	0	0	0	0	0	0	0	0
droso_15_exp_count	1	1	0	1	0	0	0	0	0	0	0	0
droso_25_exp_count	1	0	0	0	1	0	0	0	0	0	0	0
droso_26_exp_count	1	0	0	0	1	0	0	0	0	0	0	0
droso_27_exp_count	1	0	0	0	1	0	0	0	0	0	0	0
droso_3_exp_count	1	1	0	0	0	0	0	0	0	0	0	0
droso_37_exp_count	1	1	0	0	0	0	0	1	0	0	0	0
droso_38_exp_count	1	1	0	0	0	0	0	1	0	0	0	0
droso_39_exp_count	1	1	0	0	0	0	0	1	0	0	0	0
droso_49_exp_count	1	0	0	0	0	0	0	0	1	0	0	0
droso_50_exp_count	1	0	0	0	0	0	0	0	1	0	0	0
droso_51_exp_count	1	0	0	0	0	0	0	0	1	0	0	0
droso_61_exp_count	1	0	0	0	0	0	0	0	0	0	1	0
droso_62_exp_count	1	0	0	0	0	0	0	0	0	0	1	0
droso_63_exp_count	1	0	0	0	0	0	0	0	0	0	1	0

However, I get this error when attempting to estimate dispersion:

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, :
Design matrix not of full rank. The following coefficients not estimable:
phenoL0HL:lineL0380 phenoL0LL:lineL0441 phenoL0HL:lineL0486 phenoL0LL:lineL0832 phenoL0HL:lineL0913 phenoL0LL:lineL0913

Which i think is saying that since both phenotypes are not since within a line, there is not enough representation. But a line is either one phenotype or the other.

Any help would be appreciated.

EdgeR rank model • 1.5k views

ADD COMMENT • link 6.8 years ago philip.freda • 0

score 1 · Answer 1 · 2017-07-18

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 minute ago

WEHI, Melbourne, Australia

The problem here is an idiosyncracy of the model.matrix() function in R. When you try to nest one factor within another, it assumes that the levels of the nested factor start from scratch within each level of the higher factor. In your case, model.matrix() is creating extra column of 0s for each line that doesn't belong to the relevant phenotype.

Option 1. The correct way to fix this is to relabel the lines so that the numbering restarts within each phenotype:

sample_id line pheno
        1    1    LL
        3    1    LL
       14    2    LL
       15    2    LL
       25    1    HL
       26    1    HL
       27    1    HL
       37    3    LL
       38    3    LL
       39    3    LL
       49    2    HL
       50    2    HL
       51    2    HL
       61    3    HL
       62    3    HL
       63    3    HL

Then ~ pheno + pheno:line will work fine. Note that, using this formula, the LL and HL lines are treated as different even though they have the same numbering. The design matrix will have exactly 6 columns, because you have 6 lines in total.

You can see another example of this type of factor re-numbering on page 41 of the edgeR User's Guide.

Option 2. The alternative is to just use design <- model.matrix(~0+line), and then use appropriate contrasts to compare phenotypes and lines. As Aaron has pointed out, you can quite easily get any comparison from this flat model that you could have got from the nested model.

PS. Sorry, my original suggestion to remove the all-zero columns from the design matrix was too simplistic, as I should have known. It doesn't give a full rank matrix immediately and gives confusing columns, as you discovered and Aaron has explained.

ADD COMMENT • link 6.8 years ago Gordon Smyth 50k

0

Entering edit mode

I tried what you proposed but got this error. Any idea what the problem may be? The error is much shorter than previously with only one line (913) listed within the error

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, :
Design matrix not of full rank. The following coefficients not estimable:
phenoL0HL:lineL0913

ADD REPLY • link 6.8 years ago philip.freda • 0

0

Entering edit mode

Here is the new model matrix:

	(Intercept)	phenoL0LL	phenoL0LL:lineL0380	phenoL0HL:lineL0441	phenoL0LL:lineL0486	phenoL0HL:lineL0832	phenoL0HL:lineL0913
droso_1_exp_count	1	1	0	0	0	0	0
droso_14_exp_count	1	1	1	0	0	0	0
droso_15_exp_count	1	1	1	0	0	0	0
droso_25_exp_count	1	0	0	1	0	0	0
droso_26_exp_count	1	0	0	1	0	0	0
droso_27_exp_count	1	0	0	1	0	0	0
droso_3_exp_count	1	1	0	0	0	0	0
droso_37_exp_count	1	1	0	0	1	0	0
droso_38_exp_count	1	1	0	0	1	0	0
droso_39_exp_count	1	1	0	0	1	0	0
droso_49_exp_count	1	0	0	0	0	1	0
droso_50_exp_count	1	0	0	0	0	1	0
droso_51_exp_count	1	0	0	0	0	1	0
droso_61_exp_count	1	0	0	0	0	0	1
droso_62_exp_count	1	0	0	0	0	0	1
droso_63_exp_count	1	0	0	0	0	0	1

ADD REPLY • link 6.8 years ago philip.freda • 0

0

Entering edit mode

As the error message says, the design matrix is not of full rank. You need to discard one of the coefficients corresponding to the HL lines. Let's say you discard phenoHL:line441 (though any one is permissible). The remaining coefficients are:

"(Intercept)"    
"phenoLL"        
"phenoLL:line380"
"phenoLL:line486"
"phenoHL:line832"
"phenoHL:line913"

The correct interpretation of these terms requires some contemplation:

The intercept represents the log-average expression of the HL 441 line.
Despite its name, phenoLL does not represent the main effect of LL vs HL (surprise!). Rather, it represents the log-fold change in expression of the LL 358 line over the HL 441 line.
phenoLL:line380 represents the log-fold change of LL 380 over LL 358.
phenoLL:line486 represents the log-fold change of LL 486 over LL 358.
phenoHL:line832 represents the log-fold change of HL 832 over HL 441.
phenoHL:line913 represents the log-fold change of HL 913 over HL 441.

Figuring out the meaning of each coefficient is often an enjoyable puzzle; but if you don't want to deal with that, just using ~line would give the same model but with coefficients that are much easier to interpret.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Sorry, you're right. My suggestion to remove the all-zero columns was too simple, as it removes only 5 columns instead of 6. As Aaron has pointed out, you can still fix the design matrix yourself, but this is overly complicated and prone to error. I have edited my answer to correct it.

ADD REPLY • link 6.8 years ago Gordon Smyth 50k

0

Entering edit mode

Another question: does renumbering them still keep them separate? In other words line 1 in LL and line 1 in HL are two totally different genetic lines

ADD REPLY • link 6.8 years ago philip.freda • 0

score 0 · Answer 2 · 2017-07-14

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 15 hours ago

The city by the bay

This model is impossible to fit uniquely as line is nested within pheno. Any differences between phenotypes will be absorbed into the line terms, so there is no unique solution for the model coefficients. You can only do:

~line

... or if you want to account for the replicate, then:

~rep + line

To determine the phenotype effect, you can then compare the average of HL lines with the average of the LL lines, see examples in Section 3.2.3 and 3.2.4 in the edgeR user's guide.

ADD COMMENT • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

One can still represent the lines in two stages: phenotype first, then line within phenotype. The model.matrix() function doesn't make that easy though -- see my answer.

ADD REPLY • link 6.8 years ago Gordon Smyth 50k