Thanks for using edgeR and for reading the User's Guide. We are planning a major revision of the User's Guide this year and, as part of that, we will expand Section 3.5 to cover the case where the two top-level groups have different numbers of subjects.
My preference for unbalanced multilevel models is not to fit the interaction model recommended in Section 3.5 but instead to build the design matrix up in a way that redundant columns are avoided in the first place. I have given advice about how to do this on this forum in the past but the relevant posts are hard to find given that search engine support for this forum is so poor.
It is true, in the context of Section 3.5, that you probably can safely remove the design columns that are all zero. You should appreciate though that this is a very special phenomenon that is specific to this particular experimental design and to the nested factorial models suggested in that section of the User's Guide. In the vast majority of cases where people get the
Design matrix is not full rank error, they have fitted an overparametrized model that is inappropriate for their experiment. Removing columns allows edgeR to run but yields results that might not be meaningful.
In equivalent situations, the limma package automatically removes design columns to make the design of full rank and issues a warning that it has done so. We could have taken the same approach in edgeR, but we decided to make it an error in order to force users to rethink their model. It's not clear which is the better approach. In my own work, I always make sure the design matrix is of full rank in the first place and do not rely on automatic fixes that might lead to unexpected effects.
A rewrite of Section 3.5
Here is a reanalysis of example in Section 3.5 of the User's Guide.
This new approach is equivalent to that in the User's when the disease groups are balanced but will work regardless of the number of patients in each disease group.
First, setup the relevant factors.
Note that in this approach we do not need to renumber the patients:
Patient <- factor(rep(1:9,each=2))
Treatment <- rep(c("None","Hormone"),9)
Disease <- rep(c("Healthy","Disease1","Disease2"),each=6)
Assemble the design matrix:
design <- model.matrix(~Patient)
Healthy.Hormone <- Disease=="Healthy" & Treatment=="Hormone"
Disease1.Hormone <- Disease=="Disease1" & Treatment=="Hormone"
Disease2.Hormone <- Disease=="Disease2" & Treatment=="Hormone"
design <- cbind(design, Healthy.Hormone, Disease1.Hormone, Disease2.Hormone)
After estimating the dispersions (code not shown), we can fit a linear model:
fit <- glmQLFit(y, design)
To find genes responding to the hormone in Healthy patients:
qlf <- glmQLFTest(fit, coef="Healthy.Hormone")
To find genes responding to the hormone in Disease1 patients:
qlf <- glmQLFTest(fit, coef="Disease1.Hormone")
To find genes responding to the hormone in Disease2 patients:
qlf <- glmQLFTest(fit, coef="Disease2.Hormone")
To find genes that respond to the hormone in any disease group:
qlf <- glmQLFTest(fit, coef=10:12)
To find genes that respond differently to the hormone in Disease1 vs Healthy patients:
qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,-1,1,0))
To find genes that respond differently to the hormone in Disease2 vs Healthy patients:
qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,-1,0,1))
To find genes that respond differently to the hormone in Disease2 vs Disease1 patients:
qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,-1,1))