I have seen quite a few different ways for constructing a design matrix with limma. My problem concerns RNAseq data, for which I want to find the top DE genes between "good" and "poor" responders. This meta-variable is stored in a meta data file, which has information about each cell. These cells are all present in the expression data which has counts for each gene. I also want to account for the several batches, as well as the patient origin of these cells.
I have created factors for all three variables, but I am not sure if these are necessary.
responder_group <- factor(meta_data$Responder_status)
patients <- factor(meta_data$Patient_id)
batches <- factor(meta_data$processing_date)
Here's the two design matrices I saw most frequently to model the differences between my responders, while correcting for any differences between batches and patients.
design <- model.matrix(~0 + responder_group + batches + patients, meta_data)
design <- model.matrix(~0 + responder_group + processing_date + Patient_id, meta_data)
So option 1 uses the factors I created from the meta_data file, while option 2 does not.
Finally, I fit the model
fit <- lmFit(data_filtered, design, correlation = NULL)
cont_matrix <- makeContrasts("responder_grouppoor-responder_groupgood", levels=design)
fit2 <- contrasts.fit(fit, cont_matrix)
fit3 <- eBayes(fit2)
When I look at the toptable results, they differ for these two options. Can someone explain the difference and which one I should use?