Question: EdgeR model not of full rank
0
2.3 years ago by
philip.freda0 wrote:

I have data that has genetic line (line) and phenotype (pheno) as my factors and my data looks something like this:

 sample_id line rep pheno 1 358 1 LL 3 358 2 LL 14 380 1 LL 15 380 2 LL 25 441 1 HL 26 441 2 HL 27 441 3 HL 37 486 1 LL 38 486 2 LL 39 486 3 LL 49 832 1 HL 50 832 2 HL 51 832 3 HL 61 913 1 HL 62 913 2 HL 63 913 3 HL

The model I want to run is to look at both phenotype and line nested within phenotype which I believe is expressed syntactically as:

~ pheno+pheno:line

And the model matrix is:

 (Intercept) phenoL0LL phenoL0HL:lineL0380 phenoL0LL:lineL0380 phenoL0HL:lineL0441 phenoL0LL:lineL0441 phenoL0HL:lineL0486 phenoL0LL:lineL0486 phenoL0HL:lineL0832 phenoL0LL:lineL0832 phenoL0HL:lineL0913 phenoL0LL:lineL0913 droso_1_exp_count 1 1 0 0 0 0 0 0 0 0 0 0 droso_14_exp_count 1 1 0 1 0 0 0 0 0 0 0 0 droso_15_exp_count 1 1 0 1 0 0 0 0 0 0 0 0 droso_25_exp_count 1 0 0 0 1 0 0 0 0 0 0 0 droso_26_exp_count 1 0 0 0 1 0 0 0 0 0 0 0 droso_27_exp_count 1 0 0 0 1 0 0 0 0 0 0 0 droso_3_exp_count 1 1 0 0 0 0 0 0 0 0 0 0 droso_37_exp_count 1 1 0 0 0 0 0 1 0 0 0 0 droso_38_exp_count 1 1 0 0 0 0 0 1 0 0 0 0 droso_39_exp_count 1 1 0 0 0 0 0 1 0 0 0 0 droso_49_exp_count 1 0 0 0 0 0 0 0 1 0 0 0 droso_50_exp_count 1 0 0 0 0 0 0 0 1 0 0 0 droso_51_exp_count 1 0 0 0 0 0 0 0 1 0 0 0 droso_61_exp_count 1 0 0 0 0 0 0 0 0 0 1 0 droso_62_exp_count 1 0 0 0 0 0 0 0 0 0 1 0 droso_63_exp_count 1 0 0 0 0 0 0 0 0 0 1 0

However, I get this error when attempting to estimate dispersion:

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  :
Design matrix not of full rank.  The following coefficients not estimable:
phenoL0HL:lineL0380 phenoL0LL:lineL0441 phenoL0HL:lineL0486 phenoL0LL:lineL0832 phenoL0HL:lineL0913 phenoL0LL:lineL0913

Which i think is saying that since both phenotypes are not since within a line, there is not enough representation. But a line is either one phenotype or the other.

Any help would be appreciated.

edger model rank • 487 views
modified 2.2 years ago • written 2.3 years ago by philip.freda0
Answer: EdgeR model not of full tank
1
2.3 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

The problem here is an idiosyncracy of the model.matrix() function in R. When you try to nest one factor within another, it assumes that the levels of the nested factor start from scratch within each level of the higher factor. In your case, model.matrix() is creating extra column of 0s for each line that doesn't belong to the relevant phenotype.

Option 1. The correct way to fix this is to relabel the lines so that the numbering restarts within each phenotype:

sample_id line pheno
1    1    LL
3    1    LL
14    2    LL
15    2    LL
25    1    HL
26    1    HL
27    1    HL
37    3    LL
38    3    LL
39    3    LL
49    2    HL
50    2    HL
51    2    HL
61    3    HL
62    3    HL
63    3    HL


Then ~ pheno + pheno:line will work fine. Note that, using this formula, the LL and HL lines are treated as different even though they have the same numbering. The design matrix will have exactly 6 columns, because you have 6 lines in total.

You can see another example of this type of factor re-numbering on page 41 of the edgeR User's Guide.

Option 2. The alternative is to just use design <- model.matrix(~0+line), and then use appropriate contrasts to compare phenotypes and lines. As Aaron has pointed out, you can quite easily get any comparison from this flat model that you could have got from the nested model.

PS. Sorry, my original suggestion to remove the all-zero columns from the design matrix was too simplistic, as I should have known. It doesn't give a full rank matrix immediately and gives confusing columns, as you discovered and Aaron has explained.

I tried what you proposed but got this error. Any idea what the problem may be? The error is much shorter than previously with only one line (913) listed within the error

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  :
Design matrix not of full rank.  The following coefficients not estimable:
phenoL0HL:lineL0913

Here is the new model matrix:

 (Intercept) phenoL0LL phenoL0LL:lineL0380 phenoL0HL:lineL0441 phenoL0LL:lineL0486 phenoL0HL:lineL0832 phenoL0HL:lineL0913 droso_1_exp_count 1 1 0 0 0 0 0 droso_14_exp_count 1 1 1 0 0 0 0 droso_15_exp_count 1 1 1 0 0 0 0 droso_25_exp_count 1 0 0 1 0 0 0 droso_26_exp_count 1 0 0 1 0 0 0 droso_27_exp_count 1 0 0 1 0 0 0 droso_3_exp_count 1 1 0 0 0 0 0 droso_37_exp_count 1 1 0 0 1 0 0 droso_38_exp_count 1 1 0 0 1 0 0 droso_39_exp_count 1 1 0 0 1 0 0 droso_49_exp_count 1 0 0 0 0 1 0 droso_50_exp_count 1 0 0 0 0 1 0 droso_51_exp_count 1 0 0 0 0 1 0 droso_61_exp_count 1 0 0 0 0 0 1 droso_62_exp_count 1 0 0 0 0 0 1 droso_63_exp_count 1 0 0 0 0 0 1

As the error message says, the design matrix is not of full rank. You need to discard one of the coefficients corresponding to the HL lines. Let's say you discard phenoHL:line441 (though any one is permissible). The remaining coefficients are:

"(Intercept)"
"phenoLL"
"phenoLL:line380"
"phenoLL:line486"
"phenoHL:line832"
"phenoHL:line913"

The correct interpretation of these terms requires some contemplation:

• The intercept represents the log-average expression of the HL 441 line.
• Despite its name, phenoLL does not represent the main effect of LL vs HL (surprise!). Rather, it represents the log-fold change in expression of the LL 358 line over the HL 441 line.
• phenoLL:line380 represents the log-fold change of LL 380 over LL 358.
• phenoLL:line486 represents the log-fold change of LL 486 over LL 358.
• phenoHL:line832 represents the log-fold change of HL 832 over HL 441.
• phenoHL:line913 represents the log-fold change of HL 913 over HL 441.

Figuring out the meaning of each coefficient is often an enjoyable puzzle; but if you don't want to deal with that, just using ~line would give the same model but with coefficients that are much easier to interpret.

Sorry, you're right. My suggestion to remove the all-zero columns was too simple, as it removes only 5 columns instead of 6. As Aaron has pointed out, you can still fix the design matrix yourself, but this is overly complicated and prone to error. I have edited my answer to correct it.

Another question: does renumbering them still keep them separate? In other words line 1 in LL and line 1 in HL are two totally different genetic lines

Answer: EdgeR model not of full tank
0
2.3 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

This model is impossible to fit uniquely as line is nested within pheno. Any differences between phenotypes will be absorbed into the line terms, so there is no unique solution for the model coefficients. You can only do:

~line

... or if you want to account for the replicate, then:

~rep + line

To determine the phenotype effect, you can then compare the average of HL lines with the average of the LL lines, see examples in Section 3.2.3 and 3.2.4 in the edgeR user's guide.