Question: Help, Statistical test for unbalanced data set
4.9 years ago by
China
wenming1260 wrote:

Hi, guys,

I need to test whether a gene is differentially expressed between two groups. For example:

In group 1, we have species A, species B, species C. In group 2, we have species D, with individual D1 and D2, and species E, with individual E1 and E2.

So there are 2 samples in group 2, each with 2 biological replicates, while in group 1 we have 3 samples but no replicates.

We would like to test whether a gene is significantly differentially expressed between group 1 (species A, B, C) and group 2 (species D, E).

I have used edgeR to fit generalized linear models (GLMs), but a reviewer says that this may not be an appropriate way to handle the biological replicates in group 2.

My questions are: What is the most appropriate way to handle the biological replicates? Is a mixed-effects model suitable in this case?

Many thanks, any suggestion would be greatly appreciated.

Answer: Help, Statistical test for unbalanced data set
4.9 years ago by
Aaron Lun
Cambridge, United Kingdom
Aaron Lun wrote:

Let's define the grouping for your dataset as below: one replicate for each of A, B and C, and two replicates for each of D and E. The design matrix can then be defined using a one-way layout, with one coefficient per species.

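# One label per library: A, B and C are unreplicated; D and E have two replicates each
grouping <- c("A", "B", "C", "D", "D", "E", "E")
# One-way layout without an intercept: one column (coefficient) per species
design <- model.matrix(~0+factor(grouping))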


This approach is probably the best way of handling the replicates. I wouldn't treat the libraries in each group as replicates of each other, as I would expect systematic differences in gene expression to be present between species. Treating them as replicates would fold the variation between species into the dispersion estimate and inflate it (e.g., you'd get a large dispersion even if your experimental technique was perfectly reproducible). The design described above only considers the variation between replicates of the same species.
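The effect on the variance estimate can be illustrated with an ordinary linear model on made-up log-expression values for a single gene. This is only an analogy to the negative binomial GLM that edgeR fits, not edgeR itself; the species means, the noise level and the seed are all assumptions chosen for illustration.

```r
# Toy log-expression values for one gene (all numbers are hypothetical)
set.seed(1)
species <- factor(c("A", "B", "C", "D", "D", "E", "E"))
group   <- factor(c(1, 1, 1, 2, 2, 2, 2))
species_mean <- c(A = 5, B = 8, C = 11, D = 3, E = 6)
y <- unname(species_mean[as.character(species)]) + rnorm(7, sd = 0.1)

# Two-group design: libraries within a group are treated as replicates
fit_group <- lm(y ~ 0 + group)
# One-way layout: only libraries of the same species are replicates
fit_species <- lm(y ~ 0 + species)

# Residual variance = residual sum of squares / residual degrees of freedom
var_group   <- deviance(fit_group)   / df.residual(fit_group)
var_species <- deviance(fit_species) / df.residual(fit_species)

print(c(two_group = var_group, one_way = var_species))
```

In this toy setting the two-group residual variance is dominated by the differences between species, whereas the one-way layout only reflects the scatter between the D and E replicates.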

The contrast between groups 1 and 2 can then be defined in several ways. One definition of the null hypothesis states that the average expression across all species in group 1 is equal to the average expression across all species in group 2 (note: species, not libraries). This can be tested for the specified design after obtaining a fit object from glmFit, by setting:

# Coefficients are ordered A, B, C, D, E: mean of group 1 minus mean of group 2
contrast <- c(1/3, 1/3, 1/3, -1/2, -1/2)
result <- glmLRT(fit, contrast=contrast)
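As a quick sanity check on what this contrast computes, the following base R sketch applies it to hypothetical per-species log-expression coefficients. The values in `coefs` are invented, and the names on the contrast vector are only there for readability; the sketch assumes the contrast entries are in the same order as the columns of the design matrix.

```r
grouping <- c("A", "B", "C", "D", "D", "E", "E")
design <- model.matrix(~ 0 + factor(grouping))
colnames(design) <- levels(factor(grouping))  # "A" "B" "C" "D" "E"

# Same contrast as above, with names added for readability
contrast <- c(A = 1/3, B = 1/3, C = 1/3, D = -1/2, E = -1/2)

# The entries must line up with the design columns and sum to zero
stopifnot(identical(names(contrast), colnames(design)))
stopifnot(isTRUE(all.equal(sum(contrast), 0)))

# Hypothetical per-species log-expression coefficients for one gene
coefs <- c(A = 5, B = 6, C = 7, D = 4, E = 2)

# Average of group 1 species minus average of group 2 species
avg_logfc <- sum(contrast * coefs)
print(avg_logfc)
```

Each species contributes equally to its group average, regardless of how many libraries it has.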


The log-fold changes from glmLRT will represent the average log-fold-change of group 1 over group 2. Obviously, this only works with the averages within each group, and it won't guarantee that the expression of all species in group 1 is greater than that of group 2 (or vice versa). If you want the latter, you'll have to do comparisons for all pairs of species between groups, e.g., for D versus A:

# D minus A: a positive log-fold change means higher expression in D
contrast.DA <- c(-1, 0, 0, 1, 0)
result.DA <- glmLRT(fit, contrast=contrast.DA)

and so on, for D - B, D - C, E - A, E - B, and E - C. You can then find genes that are significantly up (or down) in all of these comparisons.
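The six pairwise contrasts could be built programmatically rather than typed by hand. The sketch below uses base R only; the object names (`pairs`, `contrasts`) are hypothetical, and the commented-out lines at the end assume a `fit` object from glmFit for the design above.

```r
species <- c("A", "B", "C", "D", "E")

# All (group 2 species, group 1 species) pairs: D-A, E-A, D-B, E-B, D-C, E-C
pairs <- expand.grid(g2 = c("D", "E"), g1 = c("A", "B", "C"),
                     stringsAsFactors = FALSE)

# One column per comparison: +1 for the group 2 species, -1 for the group 1 species
contrasts <- sapply(seq_len(nrow(pairs)), function(i) {
  v <- setNames(numeric(length(species)), species)
  v[pairs$g2[i]] <- 1
  v[pairs$g1[i]] <- -1
  v
})
colnames(contrasts) <- paste(pairs$g2, pairs$g1, sep = "-")
print(contrasts)

# With a fit object from glmFit, each column could then be tested in turn:
# results <- lapply(colnames(contrasts), function(k)
#   glmLRT(fit, contrast = contrasts[, k]))
```

A gene would then be retained only if it is significant, with the same sign of log-fold change, in every one of the six results.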


There is one important thing to note for this design, however: the dispersion estimation will only use information from species D and E, since there are no replicates for A, B, and C.
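This can be seen from the design matrix alone. The base R sketch below counts the residual degrees of freedom and computes the leverage of each library under an ordinary least-squares fit; this is a linear-model analogy rather than edgeR's own dispersion estimation, and is offered only as an illustration.

```r
grouping <- c("A", "B", "C", "D", "D", "E", "E")
design <- model.matrix(~ 0 + factor(grouping))

# 7 libraries minus 5 species coefficients leaves 2 residual degrees of freedom
resid_df <- nrow(design) - qr(design)$rank

# Leverage (diagonal of the hat matrix): 1 means the library is fitted exactly
# and so carries no information about variability
leverage <- diag(design %*% solve(crossprod(design)) %*% t(design))
names(leverage) <- grouping

print(resid_df)
print(round(leverage, 2))
```

The unreplicated libraries for A, B and C have a leverage of 1, so the two residual degrees of freedom come entirely from the D and E replicate pairs.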

Comment written 4.9 years ago by Ryan C. Thompson
Answer: Help, Statistical test for unbalanced data set
4.9 years ago by
China
wenming1260 wrote:

Thank you very much

This best answered my question.

Thank you for your patience.