Question: edgeR design matrix for 2 variables + batch effect
4 weeks ago by
ahsindoomilup0 wrote:

I am new to edgeR and am trying to create a design matrix for my dataset. I have read the manual and many discussion threads, but can't find a good match for my setup and am still unsure whether I am using the correct design.

I have a disease variable (Control vs Patient), a developmental timepoint variable (Diff vs Undiff) and 2 unequal batches. I want to compare Patient vs Control in both Undiff and Diff states, but I must remove batch effects (MDS plot showed batch 1 and batch 2 clusters).

See below for the factors I created and the layout of the different groups.

Disease <- rep(factor(c("Ctrl", "Patient")), each=4)

Dev <- rep(factor(c("NPC","Differentiated")),each=2, times=2)

Batch <- factor(c("set1",rep("set2",times=3),"set1",rep("set2",times=3)))

Disease  Dev     Batch
Control  Undiff  1
Control  Undiff  2
Control  Diff    2
Control  Diff    2
Patient  Undiff  1
Patient  Undiff  2
Patient  Diff    2
Patient  Diff    2

Should I be using design1 or design2 below?

design1 <- model.matrix(~Disease + Disease:Batch + Disease:Dev)

design2 <- model.matrix(~Batch + Dev + Disease)

modified 4 weeks ago by Gordon Smyth39k • written 4 weeks ago by ahsindoomilup0

Does each row correspond to a biologically independent sample? By that I mean, do you have four different patients and four different controls, or did you make more than one measurement on the same patient?

Answer: edgeR design matrix for 2 variables + batch effect
4 weeks ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

Subject to my question about replication (posted as a comment above), and apart from the batch effect, this appears to be a standard 2x2 factorial design. Why don't you do what everyone else does, which is to combine the treatments into one factor and take contrasts:

Disease.Dev <- factor(paste(Disease,Dev,sep="."))
design <- model.matrix(~0+Disease.Dev+Batch)
colnames(design)[1:4] <- levels(Disease.Dev)
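
A minimal sketch of the "take contrasts" step for the design above, reusing the factors exactly as defined in the question. The contrast names PvsC.Undiff and PvsC.Diff and the object name my.contrasts are illustrative choices, not from the original post; makeContrasts() comes from limma, which edgeR depends on.

```r
library(limma)  # provides makeContrasts()

# Factors as defined in the question
Disease <- rep(factor(c("Ctrl", "Patient")), each = 4)
Dev <- rep(factor(c("NPC", "Differentiated")), each = 2, times = 2)
Batch <- factor(c("set1", rep("set2", times = 3), "set1", rep("set2", times = 3)))

# Combined factor and design, as in the answer
Disease.Dev <- factor(paste(Disease, Dev, sep = "."))
design <- model.matrix(~0 + Disease.Dev + Batch)
colnames(design)[1:4] <- levels(Disease.Dev)

# Patient vs Control within each developmental state
# (contrast names are illustrative)
my.contrasts <- makeContrasts(
  PvsC.Undiff = Patient.NPC - Ctrl.NPC,
  PvsC.Diff   = Patient.Differentiated - Ctrl.Differentiated,
  levels = design
)
print(my.contrasts)
```

Assuming a fit obtained from glmQLFit() with this design, each column could then be tested with glmQLFTest(fit, contrast = my.contrasts[, "PvsC.Undiff"]) and likewise for "PvsC.Diff".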


The designs you propose in your question are not correct, unless you want to assume that the time-point effect is the same for both controls and patients (which would give design2).
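
One way to see the point about design2 is to compare its columns with those of a model that includes a Disease:Dev interaction. This is only a sketch using the factors from the question; the name design.int is introduced here for illustration.

```r
# Factors as defined in the question
Disease <- rep(factor(c("Ctrl", "Patient")), each = 4)
Dev <- rep(factor(c("NPC", "Differentiated")), each = 2, times = 2)
Batch <- factor(c("set1", rep("set2", times = 3), "set1", rep("set2", times = 3)))

# design2 from the question: purely additive, so one Dev coefficient
# is shared by controls and patients
design2 <- model.matrix(~Batch + Dev + Disease)
print(colnames(design2))

# Adding the Disease:Dev interaction lets the Dev effect differ
# between controls and patients (design.int is an illustrative name)
design.int <- model.matrix(~Batch + Disease * Dev)
print(colnames(design.int))
```

The interaction model has one more column than design2 and spans the same fitted values as the combined-factor design with a Batch term; the combined-factor form is simply easier to take contrasts from.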

In reality, these samples are stem cells generated from 1 Control and 1 Patient, and harvested at an undifferentiated or at a differentiated state, in order to determine the effect of the disease on cell development. Therefore, in this dataset, there are no true biological replicates, but we are just considering them as such for now. Combining the factors into a single-factor design and comparing individual contrasts (which is what you said everyone else does too) is what I had done myself, because it made the most intuitive sense to me. In such a setup, the design matrix is easiest to interpret. The design (~0+Disease.Dev) allowed me to compare the disease effect at each developmental timepoint independently (i.e., Patient vs. Control in Differentiated cells and Patient vs. Control in Undifferentiated cells).

However, I want to see if the disease has an effect on the development of the cells in terms of the DEGs expressed at the Undiff and Diff sample timepoints, i.e., is there an interaction between disease and development? To do so, I could manually compare the DEGs generated at each time point to assess similarities/differences, but I thought there would be a way to code this into the design matrix. I hope I am explaining what I want clearly here! In addition, I was not sure if the "+batch" term would appropriately correct for batch effects since the batches are not "equal": Batch 1 contains only 2 samples (Control-Undiff and Patient-Undiff) and Batch 2 contains 6 samples (Control-Undiff, Patient-Undiff, 2x Control-Diff, and 2x Patient-Diff).
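
The concern about unequal batches can be partly checked in base R by confirming that the batch term is not confounded with the groups, i.e. that the design matrix has full column rank. The sketch below uses the factors from the question; resid.df is an illustrative name.

```r
# Factors as defined in the question
Disease <- rep(factor(c("Ctrl", "Patient")), each = 4)
Dev <- rep(factor(c("NPC", "Differentiated")), each = 2, times = 2)
Batch <- factor(c("set1", rep("set2", times = 3), "set1", rep("set2", times = 3)))

Disease.Dev <- factor(paste(Disease, Dev, sep = "."))
design <- model.matrix(~0 + Disease.Dev + Batch)
colnames(design)[1:4] <- levels(Disease.Dev)

# Which groups were observed in which batch
print(table(Disease.Dev, Batch))

# Full column rank means every coefficient, including Batch, is estimable
full.rank <- qr(design)$rank == ncol(design)
print(full.rank)

# Residual degrees of freedom left for estimating dispersion
resid.df <- nrow(design) - qr(design)$rank
print(resid.df)
```

In this layout only the two undifferentiated (NPC) groups appear in both batches, so the batch coefficient is informed by those samples alone. A full-rank design shows the batch effect is estimable; it does not by itself show that the estimate is precise.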

I was hoping you would point out what was wrong with the designs I proposed in my original question and why (which you did for design2), because I am very shaky on how to interpret the different design matrices (except the single-factor design with separate contrasts!) and what exactly they are comparing. For instance, your response that my design2 assumes that the developmental timepoint effect is the same for both control and patient was very helpful because that is not what I thought was going on, and that is definitely not what I want in this case!

Thank you for your confirmation that the (~0+Disease.Dev+Batch) design will appropriately account for batch effects in my data setup. I also just want to confirm that I can assess an "interaction" between disease and development using this design?
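
With the combined-factor design, an interaction is usually expressed as a difference of differences: the Patient vs Control effect in differentiated cells minus the same effect in undifferentiated cells. Below is a sketch that builds that contrast by hand in base R, using the factors from the question; the name int.contrast is illustrative.

```r
# Factors as defined in the question
Disease <- rep(factor(c("Ctrl", "Patient")), each = 4)
Dev <- rep(factor(c("NPC", "Differentiated")), each = 2, times = 2)
Batch <- factor(c("set1", rep("set2", times = 3), "set1", rep("set2", times = 3)))

Disease.Dev <- factor(paste(Disease, Dev, sep = "."))
design <- model.matrix(~0 + Disease.Dev + Batch)
colnames(design)[1:4] <- levels(Disease.Dev)

# (Patient.Differentiated - Ctrl.Differentiated) - (Patient.NPC - Ctrl.NPC)
# One weight per design column; the Batch column gets weight 0
int.contrast <- setNames(numeric(ncol(design)), colnames(design))
int.contrast[c("Patient.Differentiated", "Ctrl.NPC")] <- 1
int.contrast[c("Ctrl.Differentiated", "Patient.NPC")] <- -1
print(int.contrast)
```

Assuming a fit from glmQLFit() with this design, the vector could be passed as glmQLFTest(fit, contrast = int.contrast). Genes significant for this contrast are those whose disease effect differs between the two developmental states, which is a more direct test than comparing the two DEG lists by hand.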