Let's define the grouping for your dataset as below; one replicate for each of A, B and C, and two replicates for each of D and E. The design matrix can then be defined using a one-way layout.
grouping <- c("A", "B", "C", "D", "D", "E", "E")
design <- model.matrix(~0+factor(grouping))
This approach is probably the best way of handling the replicates. I wouldn't treat the libraries in each group as replicates of each other, as I would expect systematic differences in gene expression to be present between species. Considering the variation between species would inflate the dispersion estimate (e.g., you'd get a large dispersion even if your experimental technique was perfectly reproducible). The
design described above only considers the variation between replicates for the same species.
The contrast between groups 1 and 2 can then be designed in several ways. One definition of the null hypothesis states that the average expression across all species in group 1 is equal to the average expression across all species in group 2 (note; species, not libraries). This can be performed for the specified
design after obtaining a
fit object from
glmFit, by setting:
contrast <- c(1/3, 1/3, 1/3, -1/2, -1/2)
result <- glmLRT(fit, contrast=contrast)
The log-fold changes from
glmLRT will represent the average log-fold-change of group 1 over group 2. Obviously, this only works with the averages within each group, and it won't guarantee that the expression of all species in group 1 is greater than that of group 2 (or vice versa). If you want the latter, you'll have to do comparisons for all pairs of species between groups, e.g., for D versus A:
contrast.DA <- c(-1, 0, 0, 1, 0)
result.DA <- glmLRT(fit, contrast=contrast.DA)
and so on, for D - B, D - C, E - A, E - B, and E - C. You can then find genes that are significantly up (or down) in all of these comparisons.