I have 5 individuals with a paired design - before treatment and after treatment. The RNA-seq was done in two batches, so I need to also correct for the batch effect. The data looks like this:
Subject
Batch
Treatment
1
2
0
2
1
0
3
2
0
4
1
0
5
2
0
1
2
1
2
1
1
3
2
1
4
1
1
5
2
1
I need to test the effect of treatment, adjusting for differences between the individuals and the batch effect. I am struggling with the design matrix. I had tried: design <- model.matrix(~grp+patient+batch). However, I am getting the error "Design matrix not of full rank. " . I would be really grateful if someone could help me with the design matrix.
Thank you for the quick response. In your opinion what would be a better approach - analyze both batches together or analyze batches 1 and 2 separately and take the common DE genes between both batches?
You want to analyze the data all together. Paired analyses are more powerful, because you are accounting for subject-level variability by virtue of the design.
Thank you. That makes sense. Just to make sure I understand the design matrix correctly - if I have the following data (I changed the batch for individual 3).
Subject
Batch
Treatment
1
2
0
2
1
0
3
1
0
4
1
0
5
2
0
1
2
1
2
1
1
3
2
1
4
1
1
5
2
1
Would the following test for the differences between the treatments while correcting for batches and patient-specific effects:
Patient 3 will contribute no useful information in this altered design. Consider this; if the "after" sample for patient 3 is sequenced in a different batch from the "before" sample, then the difference between the two samples for this patient will have contributions from both the treatment and batch effects. The latter will confound any attempt to extract information about the former.
Instead, the patient 3 samples will only provide information about the batch effect. The treatment effect can be estimated from the differences between the before and after samples of the other patients. This means that the batch effect can be estimated by subtracting the estimated treatment effect from the before/after difference of patient 3. I don't think that knowing the size of the batch effect is particularly useful, given that you're interested in looking at the effect of the treatment.
It's important to note that the final estimate of the batch effect will have no effect on the estimate of the treatment effect. The former depends on the latter, but not the other way around. Thus, you're not really "correcting" for the batch effect when you estimate the treatment in this altered design. In fact, you could drop the patient 3 samples altogether and get the same treatment estimates.
You can't put both batch and patient into your design at the same time, since they are confounded with each other. This means that you cannot separate out the patient effects from the batch effects. (Never mind, I didn't realize you had changed the batch variable.)
You want to analyze the data all together. Paired analyses are more powerful, because you are accounting for subject-level variability by virtue of the design.