Dear Peter,
Thanks for using edgeR and for reading the User's Guide. We are planning a major revision of the User's Guide this year and, as part of that, we will expand Section 3.5 to cover the case where the two top-level groups have different numbers of subjects.
My preference for unbalanced multilevel models is not to fit the interaction model recommended in Section 3.5 but instead to build the design matrix up in a way that redundant columns are avoided in the first place. I have given advice about how to do this on this forum in the past but the relevant posts are hard to find given that search engine support for this forum is so poor.
It is true, in the context of Section 3.5, that you probably can safely remove the design columns that are all zero. You should appreciate though that this is a very special phenomenon that is specific to this particular experimental design and to the nested factorial models suggested in that section of the User's Guide. In the vast majority of cases where people get the Design matrix is not full rank
error, they have fitted an overparametrized model that is inappropriate for their experiment. Removing columns allows edgeR to run but yields results that might not be meaningful.
In equivalent situations, the limma package automatically removes design columns to make the design of full rank and issues a warning that it has done so. We could have taken the same approach in edgeR, but we decided to make it an error in order to force users to rethink their model. It's not clear which is the better approach. In my own work, I always make sure the design matrix is of full rank in the first place and do not rely on automatic fixes that might lead to unexpected effects.
A rewrite of Section 3.5
Here is a reanalysis of example in Section 3.5 of the User's Guide.
This new approach is equivalent to that in the User's when the disease groups are balanced but will work regardless of the number of patients in each disease group.
First, setup the relevant factors.
Note that in this approach we do not need to renumber the patients:
Patient <- factor(rep(1:9,each=2))
Treatment <- rep(c("None","Hormone"),9)
Disease <- rep(c("Healthy","Disease1","Disease2"),each=6)
Assemble the design matrix:
design <- model.matrix(~Patient)
Healthy.Hormone <- Disease=="Healthy" & Treatment=="Hormone"
Disease1.Hormone <- Disease=="Disease1" & Treatment=="Hormone"
Disease2.Hormone <- Disease=="Disease2" & Treatment=="Hormone"
design <- cbind(design, Healthy.Hormone, Disease1.Hormone, Disease2.Hormone)
After estimating the dispersions (code not shown), we can fit a linear model:
fit <- glmQLFit(y, design)
To find genes responding to the hormone in Healthy patients:
qlf <- glmQLFTest(fit, coef="Healthy.Hormone")
topTags(qlf)
To find genes responding to the hormone in Disease1 patients:
qlf <- glmQLFTest(fit, coef="Disease1.Hormone")
topTags(qlf)
To find genes responding to the hormone in Disease2 patients:
qlf <- glmQLFTest(fit, coef="Disease2.Hormone")
topTags(qlf)
To find genes that respond to the hormone in any disease group:
qlf <- glmQLFTest(fit, coef=10:12)
topTags(qlf)
To find genes that respond differently to the hormone in Disease1 vs Healthy patients:
qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,-1,1,0))
topTags(qlf)
To find genes that respond differently to the hormone in Disease2 vs Healthy patients:
qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,-1,0,1))
topTags(qlf)
To find genes that respond differently to the hormone in Disease2 vs Disease1 patients:
qlf <- glmQLFTest(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,-1,1))
topTags(qlf)
I admit that it is annoying to find out the right answer in the internet. I already faced the same error with limma, but found quite quickly the answer. EdgeR uses the same logic as limma for specifying the design matrix, so it's a good idea to include limma in your research. This forum is open to everyone who already searched a solution on his/her own. Don't waste your time, post sooner next time. Best.
The error you experience is most likely due to linear dependencies that you created with your design, not due to odd numbers in sample sizes. I suggest reading https://support.bioconductor.org/p/68092/ as a starting point. If you post your group information and which comparisons you aim to make then people can help you putting together a design that is probably much simpler than a complicated interaction model, e.g. a factorial design while still giving the desired results.
Indeed, the design formula via the SeqAnswers link is overly complex and will result in problems irrespective of the program used:
I think the OP is talking about this situation, which does indeed result in (in this case) a single column with all zeros.