Hi Tonja,
given your example, I understand that you only have two replicates per class: two treated and two controls. If that's the case, then you cannot include both the known batch effect and a factor of RUV because you don't have enough degrees of freedom in the model.
In fact, what do you mean here by batch? The design that you are describing seems a paired design to me, i.e., you have two _pairs_ of treated, control experiments. You may want to take advantage of that in your analysis, by specifying a paired design in edgeR (see chapter 3.4 of the edgeR User's guide).
A minimal example would be:
expt <- factor(c(1, 1, 2, 2))
treatment <- factor(c("t", "c", "t", "c"))
design <- model.matrix(~expt+treatment)
Note that if you do that, you need to estimate three regression coefficients: for the intercept, for experiment (batch), and for treatment. Then you need to estimate the variance as well. Hence, you don't have any degrees of freedom left (4 observations, 4 parameters). And it is not possible to add any additional factor in the design, i.e., you cannot use RUV in this case.
Obviously, you could drop the expt variable, and use one factor from RUV instead, but it seems to be a better strategy in this case to make use of the information that the samples are paired.
I guess I did not express myself well enough. The minimal example for edgeR analysis is clear for me and you got it right: Ctl1 and Trt1 were processed together, and Ctl2 and Trt2 together: so the bach effect would be between the first pair of treated/control and the second pair.
If that's the case, then you cannot include both the known batch effect and a factor of RUV because you don't have enough degrees of freedom in the model.
I do not want to do this. In RUVg one performs the first pass (step A) of edgeR analysis to identify the most non-differential genes. The design matrix for this first pass is (as suggested in RUVseq manual)
one perform first pass of edgeR analysis , then the most non-differential genes are introduced to RUVg then one obtains the factos from RUVseq and inputs them into a new design matrix for the downstream analysis with edgeR (step B).
My question was if it makes sense to modify design matrix to account for batch effect already in step A: in the first pass of edgeR in RUVseq to identify the most non-differential genes.
Meaning, following:
step A:
Perform DE analysis with edgeR to find most non differential genes and input them into RUVg. Get factors (W_1) from RUVg and now the design matrix would look like following for the downstream analysis with edgeR (step B):
The answer to your question depends on the design of the experiment and on what you want the factors of unwanted variation to capture.
*In general* (i.e., with an enough number of observations) you could have a step A that looks like this:
and a step B that looks like this:
This would mean that W captures _additional_ variation that is not captured by batch and treatment. With only two observations per class you cannot fit this model, hence you have to remove either W or expt.
In this case, I'm not sure it would help to use expt in the design of step A, because you want W to capture the batch effect as well. So I would just use the simpler step A: