Help, Statistical test for unbalanced data set
wenming126

Hi, guys,

I need to test whether a gene is differentially expressed between two groups. For example:

In group 1, we have species A, species B and species C. In group 2, we have species D, with individuals D1 and D2, and species E, with individuals E1 and E2.

So there are 2 species in group 2, each with 2 biological replicates, while in group 1 we have 3 species but no replicates.

We would like to test whether a gene is significantly differentially expressed between group1 (including species A, B, C) and group2 (including species D, E).

I have used edgeR to fit a generalized linear model (GLM), but a reviewer says that it may not be appropriate to handle the biological replicates in group 2 in this way.

My questions are: What is the most appropriate way to handle the biological replicates? Is a mixed-effects model suitable in this case?

Many thanks, any suggestion would be greatly appreciated.

Aaron Lun

Let's define the grouping for your dataset as below; one replicate for each of A, B and C, and two replicates for each of D and E. The design matrix can then be defined using a one-way layout.

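# One label per library: A, B and C are unreplicated; D and E have two replicates each.
grouping <- c("A", "B", "C", "D", "D", "E", "E")
# No-intercept one-way layout: one coefficient (column) per species.
design <- model.matrix(~0+factor(grouping))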
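To see what this layout looks like, here is a minimal base-R sketch that builds the same design and prints it. The `species` variable and the shortened column names are my own additions for readability, not something edgeR requires.

```r
# Species label for each of the seven libraries (as in the post above).
grouping <- c("A", "B", "C", "D", "D", "E", "E")
species <- factor(grouping)

# One-way layout without an intercept: one coefficient per species.
design <- model.matrix(~0 + species)

# Assumption for readability only: rename the columns to A, B, C, D, E.
colnames(design) <- levels(species)

# Each row (library) has a single 1 in the column for its species.
print(design)

# Number of libraries contributing to each species coefficient.
print(colSums(design))
```

The column order (A, B, C, D, E) is the order that any contrast vector has to follow.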


This approach is probably the best way of handling the replicates. I wouldn't treat the libraries in each group as replicates of each other, as I would expect systematic differences in gene expression to be present between species. Considering the variation between species would inflate the dispersion estimate (e.g., you'd get a large dispersion even if your experimental technique was perfectly reproducible). The design described above only considers the variation between replicates for the same species.

The contrast between groups 1 and 2 can then be defined in several ways. One definition of the null hypothesis states that the average expression across all species in group 1 is equal to the average expression across all species in group 2 (note: species, not libraries). This can be tested for the specified design, after obtaining a fit object from glmFit, by setting:

# Weights follow the design columns A, B, C, D, E: mean of group 1 minus mean of group 2.
contrast <- c(1/3, 1/3, 1/3, -1/2, -1/2)
result <- glmLRT(fit, contrast=contrast)
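As a rough illustration of what this contrast computes, the sketch below applies it to made-up per-species values for a single gene, then runs the whole pipeline on simulated counts if edgeR is installed. The `beta` values and the `counts` matrix are hypothetical and only stand in for real data; the calls before `glmFit` (`DGEList`, `calcNormFactors`, `estimateDisp`) are one common way to get to a fit object, not necessarily what was used in the original analysis.

```r
# Contrast from the post: mean of species A, B, C minus mean of species D, E.
contrast <- c(A = 1/3, B = 1/3, C = 1/3, D = -1/2, E = -1/2)

# Hypothetical per-species log-expression values for one gene (made up).
beta <- c(A = 5, B = 6, C = 7, D = 3, E = 4)

# The contrast picks out the difference between the two species averages.
logFC <- sum(contrast * beta)
print(logFC)

# Optional: the same contrast in an edgeR pipeline, on simulated counts.
if (requireNamespace("edgeR", quietly = TRUE)) {
  library(edgeR)
  set.seed(1)
  grouping <- c("A", "B", "C", "D", "D", "E", "E")
  design <- model.matrix(~0 + factor(grouping))

  # Simulated negative binomial counts: 200 genes by 7 libraries (placeholder data).
  counts <- matrix(rnbinom(200 * 7, mu = 50, size = 10), ncol = 7)
  rownames(counts) <- paste0("gene", seq_len(nrow(counts)))

  y <- DGEList(counts = counts)
  y <- calcNormFactors(y)
  y <- estimateDisp(y, design)   # dispersion from the replicated species only
  fit <- glmFit(y, design)
  result <- glmLRT(fit, contrast = unname(contrast))
  print(topTags(result))
}
```

Because the weights sum to zero and each group's weights sum to 1 in absolute value, the result is a difference of two averages rather than a difference of sums.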


The log-fold changes from glmLRT will represent the average log-fold-change of group 1 over group 2. Obviously, this only works with the averages within each group, and it won't guarantee that the expression of all species in group 1 is greater than that of group 2 (or vice versa). If you want the latter, you'll have to do comparisons for all pairs of species between groups, e.g., for D versus A:

contrast.DA <- c(-1, 0, 0, 1, 0)
result.DA <- glmLRT(fit, contrast=contrast.DA)

and so on, for D - B, D - C, E - A, E - B, and E - C. You can then find genes that are significantly up (or down) in all of these comparisons.
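Writing the six pairwise contrasts by hand is error-prone, so here is a small base-R sketch that generates them. The helper names (`group1`, `group2`, `pairs`, `contrasts`) are my own, and the column order assumes the design columns are A, B, C, D, E.

```r
species <- c("A", "B", "C", "D", "E")   # order of the design columns
group1 <- c("A", "B", "C")
group2 <- c("D", "E")

# All (group 2 species, group 1 species) pairs: 2 x 3 = 6 comparisons.
pairs <- expand.grid(g2 = group2, g1 = group1, stringsAsFactors = FALSE)

# One contrast vector per pair: +1 for the group 2 species, -1 for the group 1 species.
contrasts <- sapply(seq_len(nrow(pairs)), function(i) {
  con <- setNames(numeric(length(species)), species)
  con[pairs$g2[i]] <- 1
  con[pairs$g1[i]] <- -1
  con
})
colnames(contrasts) <- paste0(pairs$g2, "-", pairs$g1)

print(contrasts)
```

Each column can then be passed as the `contrast` argument to `glmLRT`, and the genes of interest are those that are significant with the same sign of log-fold change in all six results.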


There is one important thing to note for this design, however: the dispersion estimation will only use information from species D and E, since there are no replicates for A, B, and C.
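A quick way to check this is to count the residual degrees of freedom of the design, as in this base-R sketch (the variable name `resid.df` is my own):

```r
grouping <- c("A", "B", "C", "D", "D", "E", "E")
design <- model.matrix(~0 + factor(grouping))

# Residual degrees of freedom = number of libraries minus number of
# estimable coefficients (here, 7 libraries and 5 species).
resid.df <- nrow(design) - qr(design)$rank
print(resid.df)

# Contribution of each species: (number of replicates - 1).
# A, B and C contribute nothing; D and E contribute one each.
print(table(grouping) - 1)
```

With only two residual degrees of freedom, the gene-wise dispersion estimates will be quite variable, so sharing information across genes matters a lot for this design.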

wenming126

Thank you very much
