Question

edgeR uneven group sizes

Charles Determan Jr ▴ 140

Hello,

I recently had a question regarding repeated measures RNA-seq analysis, which was thoroughly answered through an extension of Section 3.5 of the edgeR manual. However, this has led me to another question as I attempted to extend those concepts to another experiment in which the sample size in each group is different. For example, here is a data frame modified from the edgeR user manual section on between- and within-subject comparisons (Section 3.5), and another containing specific time points to illustrate my point. Both data frames are re-numbered as recommended by the manual.

> targets
    Disease Patient Treatment
1   Healthy       1      None
2   Healthy       1   Hormone
3   Healthy       2      None
4   Healthy       2   Hormone
5   Healthy       3      None
6   Healthy       3   Hormone
7  Disease1       1      None
8  Disease1       1   Hormone
9  Disease1       2      None
10 Disease1       2   Hormone
11 Disease2       1      None
12 Disease2       1   Hormone
13 Disease2       2      None
14 Disease2       2   Hormone
15 Disease2       3      None
16 Disease2       3   Hormone

> sample_data
   Condition Subject Time
1    control       1  0hr
2    control       1  1hr
3    control       1  2hr
4    control       2  0hr
5    control       2  1hr
6    control       2  2hr
7    control       3  0hr
8    control       3  1hr
9    control       3  2hr
10   control       4  0hr
11   control       4  1hr
12   control       4  2hr
13   Disease       1  0hr
14   Disease       1  1hr
15   Disease       1  2hr
16   Disease       2  0hr
17   Disease       2  1hr
18   Disease       2  2hr

I have read the initial posting that led to this section of the manual and it said to drop the samples that don't have equal numbers. This doesn't seem to be a big deal if only a sample or two is dropped from one group, but it could be a problem in a case like the one above, where dropping four or six samples seems more of a sacrifice. I am thinking of experiments (assuming repeated/dependent samples) in which group sizes vary more substantially because samples are difficult to acquire. Are there any recommendations from the community regarding such a situation? Everything I have found assumes that the number of samples within each group is equal.

Regards,
--
Charles Determan
Integrated Biosciences PhD Candidate
University of Minnesota


Gordon Smyth · Answer 1 · 2013-07-05

Dear Charles,

There is no requirement in edgeR for equal group sizes, and never has been. I am puzzled why you might think there is such an assumption. edgeR always allows you to use all the available data that is scientifically meaningful.

You say that you read "the initial posting that led to this section of the manual and it said to drop the samples that don't have equal numbers" but I do not know what you are referring to. I have never seen such advice.

Best wishes
Gordon

Gordon,

The reason I ask is that when I build a design matrix from the formula (~group + group:subject + group:time) and then run estimateGLMCommonDisp(dge, design), I get the error:

Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset,  :
  Design matrix not of full rank.  The following coefficients not estimable:

The mailing list post I am referring to, with the same error, is at the following link:

https://stat.ethz.ch/pipermail/bioconductor/2012-November/049055.html

Am I simply writing the design formula incorrectly to still account for the subject variation?

Regards,
Charles

Dear Charles,

The link you give is to a user question. I replied to that post explaining how to solve the problem without removing samples:

https://stat.ethz.ch/pipermail/bioconductor/2012-November/049087.html

The advice that I gave there applies also to your data.

The problem is that the model.matrix() function in R adds superfluous columns to the design matrix that have to be removed manually. In your case you have to remove the design columns for disease patients 3 and 4, because there are no such patients. It is beyond the scope of the edgeR package to rewrite the model.matrix() function, which is maintained by R core, so I can only advise on work-arounds.

Best wishes
Gordon
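
For readers following along, the work-around described above can be sketched in base R roughly as follows. This is only an illustrative sketch, not code from the thread: it rebuilds the sample_data sheet from the question, assumes the formula terms map onto the Condition, Subject and Time columns, and uses the hypothetical names design_full and empty_cols. It drops the all-zero columns that model.matrix() creates for subjects that do not exist in a group, then checks that the remaining matrix is of full rank.

```r
# Sample sheet from the question: 4 control subjects and 2 disease subjects,
# each measured at 3 time points (subjects re-numbered within each group).
sample_data <- data.frame(
  Condition = factor(rep(c("control", "Disease"), times = c(12, 6)),
                     levels = c("control", "Disease")),
  Subject   = factor(c(rep(1:4, each = 3), rep(1:2, each = 3))),
  Time      = factor(rep(c("0hr", "1hr", "2hr"), times = 6))
)

# Nested design: subject effects within condition, time effects within condition.
design_full <- model.matrix(~ Condition + Condition:Subject + Condition:Time,
                            data = sample_data)

# model.matrix() also creates columns for Disease subjects 3 and 4, which do
# not exist, so those columns are entirely zero and the matrix is rank deficient.
empty_cols <- colSums(design_full != 0) == 0
print(colnames(design_full)[empty_cols])

# Remove the superfluous columns; no samples are removed from the data.
design <- design_full[, !empty_cols, drop = FALSE]

# The cleaned matrix should now be of full column rank.
stopifnot(qr(design)$rank == ncol(design))

# The cleaned matrix can then be passed on as in the question, e.g.
#   dge <- estimateGLMCommonDisp(dge, design)
# where dge is the DGEList holding all 18 samples (not constructed here).
```

The key point is that only columns of the design matrix are dropped; all samples stay in the analysis.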

My apologies, I feel rather silly that I misinterpreted your answer. I mistakenly read it as removing samples from the dataset and not from the design matrix. Thank you for clearing up that matter. You have answered my question completely.

Regards,
Charles
