Question

Should I use factors for covariates in a limma design matrix or not?

msveldhuis96 • 0

I have seen quite a few different ways of constructing a design matrix with limma. My problem concerns RNA-seq data, for which I want to find the top DE genes between "good" and "poor" responders. This variable is stored in a metadata file, which has information about each cell. These cells are all present in the expression data, which has counts for each gene. I also want to account for the different batches, as well as the patient each cell came from.

I have created factors for all three variables, but I am not sure if these are necessary.

responder_group <- factor(meta_data$Responder_status)
patients <- factor(meta_data$Patient_id)
batches <- factor(meta_data$processing_date)

Here are the two design matrices I saw most frequently for modelling the differences between my responders while correcting for differences between batches and patients.

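# Option 1: use the factor objects created above
design <- model.matrix(~0 + responder_group + batches + patients, meta_data)
# Option 2: use the meta_data columns directly (not converted to factors)
design <- model.matrix(~0 + responder_group + processing_date + Patient_id, meta_data)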

So option 1 uses the factors I created from the meta_data file, while option 2 does not.

Finally, I fit the model

fit <- lmFit(data_filtered, design, correlation = NULL)
cont_matrix <- makeContrasts("responder_grouppoor-responder_groupgood",  levels=design)
fit2 <- contrasts.fit(fit, cont_matrix)
fit3 <- eBayes(fit2)

When I look at the topTable results, they differ between these two options. Can someone explain the difference and which one I should use?

limma • 841 views

updated 4.5 years ago by James W. MacDonald 67k • written 4.5 years ago by msveldhuis96 • 0

score 2 · Answer 1 · 2020-03-27

You want both processing_date and Patient_id to be factors. If they are numeric, you will fit them as continuous predictors, which doesn't make sense in this context.
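One way to see which case applies is to look at the column classes before building the design. The following is only a sketch on a made-up data frame: `meta_toy` and all of its values are hypothetical stand-ins for the poster's `meta_data`, and only the column names are taken from the question.

```r
# Hypothetical metadata; all values are invented for illustration only.
meta_toy <- data.frame(
  Responder_status = c("good", "good", "poor", "poor", "good", "poor"),
  Patient_id       = c(101, 102, 103, 104, 101, 103),
  processing_date  = c(20200101, 20200101, 20200108, 20200108, 20200115, 20200115)
)

# Columns reported as "numeric" (or "integer") would be treated as
# continuous covariates by model.matrix().
col_classes <- sapply(meta_toy, class)
print(col_classes)

# Convert the grouping variables to factors so that each level is
# modelled as its own category rather than as a point on a numeric scale.
meta_toy$Patient_id      <- factor(meta_toy$Patient_id)
meta_toy$processing_date <- factor(meta_toy$processing_date)

# Number of distinct patients and batches after conversion.
n_levels <- c(patients = nlevels(meta_toy$Patient_id),
              batches  = nlevels(meta_toy$processing_date))
print(n_levels)
```

If the real `Patient_id` or `processing_date` columns are stored as character strings, `model.matrix()` already treats them as categorical, so the conversion matters mainly when they are stored as numbers.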

Put a different way, the design matrix should have N - 1 columns for the batches and N - 1 columns for the patients (where N is the number of batches or the number of patients, respectively), each containing just 1s and 0s. If you instead have a single column of numbers for a variable, then R is fitting it as a continuous variable.
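The column counts described above can be checked directly with base R. This is a minimal sketch on invented data (the `toy` data frame, its column names and its values are all assumptions made for illustration). It only builds the two design matrices and does not fit any model.

```r
# Hypothetical metadata; all values are invented for illustration only.
toy <- data.frame(
  responder = factor(c("good", "good", "poor", "poor", "good", "poor")),
  batch     = c(1, 1, 2, 2, 3, 3),
  patient   = c(101, 102, 103, 104, 101, 103)
)

# batch and patient left as numbers: each contributes a single column
# holding the raw values, i.e. a slope for a continuous covariate.
design_numeric <- model.matrix(~0 + responder + batch + patient, data = toy)

# batch and patient as factors: each contributes N - 1 indicator columns.
design_factor <- model.matrix(~0 + responder + factor(batch) + factor(patient),
                              data = toy)

print(colnames(design_numeric))
print(colnames(design_factor))

# TRUE when every entry of the factor-based design is 0 or 1.
only_indicators <- all(design_factor %in% c(0, 1))
print(only_indicators)
```

With 3 batches and 4 patients in this toy data, the numeric version has 4 columns (2 responder groups, 1 batch, 1 patient), while the factor version has 7 (2 responder groups, 2 batch indicators, 3 patient indicators).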