Hello, I'm performing a diifferential expression analysis on RNAseq data and I'd like to adjust my design to remove the effect of unwanted variation by RUVSeq package. The problem is that I have two important phenotypes in my data, let's say: Pheno_1, Pheno_2; Condition_A and Condition_B, so that my samples can be labelled (by aggregating the two variables) as: Pheno_1-Condition_A, Pheno_1-Condition_B, Pheno_2-Condition_A and Pheno_2-Condition_B.

Now, I'm wondering what is the best way to estimate the "latent variables". Should I aggregate the two variables and perform RUVg or RUVr methods for this aggregated variable with four levels or there is something else I can do?

Thank you very much in advance.

In general, the way to handle nuisance effects like batch for DE analysis is

notto alter the counts, but to include the batch information as an element in the design, so that that information is included in the GLM.Hi,

I'm not sure that I completely understand the question, but note that

`RUVg`

does not require you to specify your design matrix to estimate the factors of unwanted variation. E.g., you will run something like:where

`x`

is your gene expression matrix,`neg_controls`

is a list of negative control genes and`k`

is the number of factors that you need to estimate.Then you will need to pass these k factors to a differential expression model (such as those implemented in

`edgeR`

or`DESeq2`

) which usually support a multi-factor design, e.g., by specifying:where W contains the factors estimated by RUVg.

For RUVr, you need to first run a first-pass DE model and then run RUV on the residuals, but again the design matrix of your DE model is where you would specify the factors of interest.

Dear Davide, for the sake of the synthesis, I was not probably very clear. The problem is that I don’t have a a priori known negative control genes but I need to obtain a set of empirical control genes as described in the paragraph 2.4 of RUVSeq vignette. In this case, but please correct me if I’m wrong, I need to define the variable of interest for model matrix designing and, then, identify the “

least significantly DE genes based on a first-pass DE analysis performed prior to RUVg normalization.” So, this is the problem with the fact that I have two main factor variables in my experiment, which I defined above as “Phenotype” and “Condition”. To identify the least significantly DE genes I have to set a contrast. In my opinion, the option of comparison (i.e. considering either the “Phenotype” or the “Condition” as the primary variable of interest or the aggregated variables “Phenotype/Condition”) during thisfirst-pass DE analysiswill substantially alter certain least significantly DE genes that can be used to infer the factors of unwanted variation (i.e. W). From this consideration comes my uncertainty about how to obtain the empirical control genes to estimate the Ws by RUVg. Similarly, RUVr requires the definition of the variable of interest into the design matrix and, again, I have uncertainty when using Phenotype, Condition or the aggregate variable Phenotype-Condition to estimate the residuals. I’m not sure if I was clear enough now but your feedback will be very much appreciated. Thank you very much again for your attention.Hi Luca,

I understand better your question now, thanks!

My suggestion would be to fit the model with both factors, e.g.

against the null model with only the intercept. You can do that with a likelihood ratio test in both edgeR and DESeq2.

Looking at the least DE genes for this test will allow you to exclude genes that are DE for either effect. I guess this is in fact the same as concatenating the two variables.

So I guess my suggestion is to use both phenotype and condition in the first pass DE analysis to define the negative control genes, as this seems safer than the alternatives, for which you would risk to use as negative control genes that are DE between conditions (if you use phenotype in the first pass model) or between phenotypes (if you use condition). Does this make sense?

Hi Davide, first of all, thank you for your kind reply. I think your suggestion makes sense, and if I correctly understood, I should do something like this:

then select the least DE genes from “res” and, finally, run the RUVg function with the above identified empirical control genes. Is it right? Thanks again.

Sorry for the late reaction, but yes, that's what I meant!

Yes, I agree with your your general consideration about how to handle nuisance effects and, indeed, it was not my intention to alter the counts but include such latent variables (i.e. W) into the design.