Hello,

I recently had a question regarding repeated measures RNA-seq analysis. This has been thoroughly answered through an extension of the edgeR manual section 3.5. However, this has led me to another question, as I attempted to extend these concepts to another experiment in which the sample size in each group is different. For example, here are a data frame modified from the edgeR user manual section on between- and within-subject comparisons (Section 3.5) and another containing specific time points to illustrate my point, both re-numbered as recommended by the manual.

> targets
    Disease Patient Treatment
1   Healthy       1      None
2   Healthy       1   Hormone
3   Healthy       2      None
4   Healthy       2   Hormone
5   Healthy       3      None
6   Healthy       3   Hormone
7  Disease1       1      None
8  Disease1       1   Hormone
9  Disease1       2      None
10 Disease1       2   Hormone
11 Disease2       1      None
12 Disease2       1   Hormone
13 Disease2       2      None
14 Disease2       2   Hormone
15 Disease2       3      None
16 Disease2       3   Hormone

> sample_data
   Condition Subject Time
1    control       1  0hr
2    control       1  1hr
3    control       1  2hr
4    control       2  0hr
5    control       2  1hr
6    control       2  2hr
7    control       3  0hr
8    control       3  1hr
9    control       3  2hr
10   control       4  0hr
11   control       4  1hr
12   control       4  2hr
13   Disease       1  0hr
14   Disease       1  1hr
15   Disease       1  2hr
16   Disease       2  0hr
17   Disease       2  1hr
18   Disease       2  2hr

I have read the initial posting that led to this section of the manual, and it said to drop the samples that don't have equal numbers. This doesn't seem to be a big deal when dropping only a sample or two from one group, but it could be a problem in a case such as the one above, where dropping four or six samples seems more of a sacrifice. I begin to think of experiments in which (assuming repeated/dependent samples) group numbers vary more significantly as a result of difficulty acquiring samples. Are there any recommendations from the community regarding such a situation? Everything I have found assumes that the number of samples within each group is equal.

Regards,

--

Charles Determan

Integrated Biosciences PhD Candidate

University of Minnesota

Gordon,

The reason I ask is that when I build a design matrix from the formula (~group + group:subject + group:time) and then run estimateGLMCommonDisp(dge, design), I get an error.

The mailing list post I am referring to, with the same error, is at the following link:

https://stat.ethz.ch/pipermail/bioconductor/2012-November/049055.html

Am I simply writing the design formula incorrectly to still account for the subject variation?

Regards,

Charles

Dear Charles,

The link you give is to a user question. I replied to that post explaining how to solve the problem without removing samples:

https://stat.ethz.ch/pipermail/bioconductor/2012-November/049087.html

The advice that I gave there applies also to your data.

The problem is that the model.matrix() function in R adds superfluous columns to the design matrix that have to be removed manually. In your case you have to remove the design columns for disease patients 3 and 4, because there are no such patients. It is beyond the scope of the edgeR package to rewrite the model.matrix() function, which is maintained by R core, so I can only advise on work-arounds.
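A minimal sketch of this kind of work-around in base R, assuming a sample sheet laid out like the sample_data table in the question. The object names design_full, empty_cols and design are made up for illustration, and detecting the superfluous columns as all-zero columns is one possible way to find them, not necessarily the exact code from the earlier post:

```r
# Sample sheet mirroring sample_data from the question:
# 4 control subjects and 2 disease subjects, each measured at 3 time points.
sample_data <- data.frame(
  Condition = factor(rep(c("control", "Disease"), times = c(12, 6)),
                     levels = c("control", "Disease")),
  Subject   = factor(c(rep(1:4, each = 3), rep(1:2, each = 3))),
  Time      = factor(rep(c("0hr", "1hr", "2hr"), times = 6))
)

# model.matrix() creates a column for every Condition:Subject combination,
# including combinations that never occur in the data.
design_full <- model.matrix(~ Condition + Condition:Subject + Condition:Time,
                            data = sample_data)

# Columns for non-existent subjects contain only zeros; flag them.
empty_cols <- colSums(design_full != 0) == 0
print(colnames(design_full)[empty_cols])

# Drop the all-zero columns, keeping the result a matrix.
design <- design_full[, !empty_cols, drop = FALSE]

# Sanity check: the trimmed design should have full column rank.
print(qr(design)$rank == ncol(design))

# The trimmed matrix is what would then be passed on, for example:
# estimateGLMCommonDisp(dge, design)   # dge is the DGEList from the question
```

All samples stay in the analysis; only design columns are dropped. If other columns turn out to be linearly dependent without being all zero, this simple check will not catch them, so the rank comparison at the end is worth keeping.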

Best wishes

Gordon

My apologies, I feel rather silly that I misinterpreted your answer. I mistakenly read it as removing samples from the dataset and not from the design matrix. Thank you for clearing up that matter. You have answered my question completely.

Regards,

Charles