Question

Help, Statistical test for unbalanced data set

wenming126 ▴ 10

Hi, guys,

I need to test whether a gene is differentially expressed between two groups. For example:

In group 1, we have species A, species B and species C. In group 2, we have species D, with individuals D1 and D2, and species E, with individuals E1 and E2.

So there are 2 species in group 2, each with 2 biological replicates, while in group 1 we have 3 species but no replicates.

We would like to test whether a gene is significantly differentially expressed between group1 (including species A, B, C) and group2 (including species D, E).

I have used edgeR to fit generalized linear models (GLMs), but a reviewer says that this may not be an appropriate way to handle the biological replicates in group 2.

My questions are: What is the most appropriate way to handle the biological replicates? Is a mixed-effects model suitable in this case?

Many thanks, any suggestion would be greatly appreciated.

linear model • 2.7k views

Updated 9.5 years ago by Gordon Smyth • written 9.5 years ago by wenming126

Thank you very much

This best answered my question.

Thank you for your patience.

Comment by wenming126 • 9.5 years ago

I'm glad to hear that. Please accept the answer to resolve this thread.

Reply by Aaron Lun • 9.5 years ago

score 2 · Accepted Answer · 2014-10-09

Let's define the grouping for your dataset as below; one replicate for each of A, B and C, and two replicates for each of D and E. The design matrix can then be defined using a one-way layout.

grouping <- c("A", "B", "C", "D", "D", "E", "E")
design <- model.matrix(~0+factor(grouping))
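
To make the one-way layout concrete, here is a minimal base-R sketch that prints the design matrix and counts the residual degrees of freedom. The `species` variable and the shortened column names are my own additions for readability, not part of the original code.

```r
# One library each for A, B, C; two libraries each for D and E.
grouping <- c("A", "B", "C", "D", "D", "E", "E")
species <- factor(grouping)

# One coefficient per species (no intercept).
design <- model.matrix(~0 + species)
colnames(design) <- levels(species)  # renamed for readability (assumption)
print(design)

# Residual degrees of freedom: 7 libraries minus 5 coefficients.
# Only the replicated species (D and E) contribute to this.
residual_df <- nrow(design) - ncol(design)
print(residual_df)  # 2
```

With only 2 residual degrees of freedom, the dispersion estimate rests entirely on the D and E replicates.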

This is probably the best way of handling the replicates. I wouldn't treat the libraries within each group as replicates of each other, because I would expect systematic differences in gene expression between species. Treating them as replicates would fold the between-species variation into the dispersion estimate and inflate it (e.g., you'd get a large dispersion even if your experimental technique were perfectly reproducible). The design described above estimates the dispersion only from the variation between replicates of the same species, i.e., from D and E, since A, B and C have no replicates.
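
As a rough illustration of the inflation argument, the sketch below compares residual variances under the two designs. It uses made-up log-expression values for a single gene and an ordinary linear model, not edgeR's negative binomial dispersion, so treat it as an analogy only.

```r
# Hypothetical log2-expression values for one gene (illustration only).
logexpr <- c(A = 5.0, B = 7.0, C = 9.0, D1 = 6.0, D2 = 6.2, E1 = 8.0, E2 = 8.2)
species <- factor(c("A", "B", "C", "D", "D", "E", "E"))
group   <- factor(c(1, 1, 1, 2, 2, 2, 2))

# Residual variance = residual sum of squares / residual degrees of freedom.
resid_var <- function(fit) sum(residuals(fit)^2) / df.residual(fit)

# Species-level design: only D1 vs D2 and E1 vs E2 contribute.
var_species <- resid_var(lm(logexpr ~ 0 + species))

# Group-level design: between-species differences count as "noise".
var_group <- resid_var(lm(logexpr ~ 0 + group))

print(var_species)  # approximately 0.02
print(var_group)    # approximately 2.408
```

The replicates in this toy example agree closely, yet the group-level design reports a variance over a hundred times larger because it absorbs the differences between species.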

The contrast between groups 1 and 2 can then be defined in several ways. One null hypothesis is that the average expression across all species in group 1 is equal to the average expression across all species in group 2 (note: species, not libraries). For the specified design, this can be tested after obtaining a fit object from glmFit, by setting:

# Coefficient order follows the design columns: A, B, C, D, E
contrast <- c(1/3, 1/3, 1/3, -1/2, -1/2)
result <- glmLRT(fit, contrast=contrast)
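
For context, here is one way the `fit` object might be obtained. This is a sketch only: the counts are simulated, and the names `counts`, `y` and `n_genes` are placeholders I've introduced. Filtering of low-count genes is omitted for brevity.

```r
library(edgeR)

set.seed(1)  # simulated counts only; substitute your own count matrix
grouping <- c("A", "B", "C", "D", "D", "E", "E")
species <- factor(grouping)
n_genes <- 200
counts <- matrix(rnbinom(n_genes * length(grouping), mu = 50, size = 10),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 c("A", "B", "C", "D1", "D2", "E1", "E2")))

# One-way layout: one coefficient per species.
design <- model.matrix(~0 + species)
colnames(design) <- levels(species)

y <- DGEList(counts = counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)  # dispersion comes from the D and E replicates
fit <- glmFit(y, design)

# Average of A, B, C minus average of D, E.
contrast <- c(1/3, 1/3, 1/3, -1/2, -1/2)
result <- glmLRT(fit, contrast = contrast)
print(topTags(result, n = 10))
```

Since the simulated counts contain no true differences, few if any genes should come out as significant here.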

The log-fold changes from glmLRT will represent the average log-fold change of group 1 over group 2. Note that this only compares the averages within each group; it won't guarantee that the expression in every species of group 1 is greater than that in every species of group 2 (or vice versa). If you want the latter, you'll have to do comparisons for all pairs of species between groups, e.g., for D versus A:

# D minus A: -1 on the A coefficient, +1 on the D coefficient
contrast.DA <- c(-1, 0, 0, 1, 0)
result.DA <- glmLRT(fit, contrast=contrast.DA)

and so on, for D - B, D - C, E - A, E - B, and E - C. You can then find genes that are significantly up (or down) in all of these comparisons.
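
The six pairwise contrasts can also be built programmatically rather than typed by hand. The sketch below does that in base R and then defines a hypothetical helper, `consistent_genes`, that would run glmLRT for each contrast and keep the genes significant in the same direction in all of them. The helper name, the FDR threshold of 0.05 and the intersection rule are my assumptions. It expects a `fit` from glmFit with coefficients ordered A, B, C, D, E.

```r
species <- c("A", "B", "C", "D", "E")
group1 <- c("A", "B", "C")
group2 <- c("D", "E")

# All (group 2 species, group 1 species) pairs.
pairs <- expand.grid(g2 = group2, g1 = group1, stringsAsFactors = FALSE)

# One column per contrast: +1 for the group 2 species, -1 for the group 1 species.
contrasts <- sapply(seq_len(nrow(pairs)), function(i) {
  v <- setNames(numeric(length(species)), species)
  v[pairs$g2[i]] <- 1
  v[pairs$g1[i]] <- -1
  v
})
colnames(contrasts) <- paste(pairs$g2, pairs$g1, sep = "-")
print(contrasts)

# Hypothetical helper: genes significant in the same direction in every contrast.
consistent_genes <- function(fit, contrasts, fdr = 0.05) {
  tabs <- lapply(colnames(contrasts), function(nm) {
    res <- edgeR::glmLRT(fit, contrast = contrasts[, nm])
    edgeR::topTags(res, n = Inf, sort.by = "none")$table
  })
  sig  <- sapply(tabs, function(tab) tab$FDR < fdr)
  dirs <- sapply(tabs, function(tab) sign(tab$logFC))
  genes <- rownames(tabs[[1]])
  list(up   = genes[rowSums(sig & dirs > 0) == ncol(contrasts)],
       down = genes[rowSums(sig & dirs < 0) == ncol(contrasts)])
}
```

Calling `consistent_genes(fit, contrasts)` would return the genes up in group 2 (`up`) or down in group 2 (`down`) across all six comparisons. Requiring significance in every comparison is conservative, so expect a shorter list than from the averaged contrast.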