Dear Lorena,

You are right to recognise that this design needs special treatment.

You can analyse all the data together, but you will need a statistical procedure that can account for the fact that the control samples in the two batches are biologically paired, whereas all the other samples are biologically independent. This requires a method that can fit random effects for the individuals or can estimate within individual correlations.

You cannot do this in edgeR, DESeq, DESeq2 or in any of the other RNA-seq packages based on the negative binomial distribution. These packages assume that all samples are statistically independent.

The only valid approach that I know of would be to use voom and the duplicate correlation approach of the limma package. First, read in your data (for both batches) and normalize your data as in the edgeR package. If y is your DGEList object, then

y <- calcNormFactors(y)

Then setup you design matrix as you would do for limma or edgeR. This includes effects for the batch and treatment, but not effects for patients. Something like

design <- model.matrix(~batch+condition)

where batch takes two levels (A and B) and condition takes three levels (control, preclinical, clinical).

Then transform to logCPM using the voom function.

v <- voom(y, design, plot=TRUE)

Then estimate the correlation between the repeat samples for the same individual:

dupcor <- duplicateCorrelation(v, design, block=individual)

Here 'individual' is a vector or factor indicating which individual each sample comes from. It has length equal to the number of samples and takes 21 levels, one for each individual in your data (7 controls, 7 preclinic patients, 7 clinical patients).

Now it it best to repeat the two previous steps so that the voom weights and correlation converge:

v <- voom(y, design, block=individual, correlation=dupcor$consensus)
dupcor <- duplicateCorrelation(v, design, block=individual)

Then conduct a differential expression analysis:

fit <- lmFit(v, design, block=individual, correlation=dupcor$consensus)
fit <- eBayes(fit)

Be sure to check that the value of dupcor$consensus is positive. This analysis keeps track of the fact that the two samples from each control individual are correlated.

Then you can use the topTable() function in the limma package to extract genes that follow any pattern of differential expression that you are interested in.

Best wishes

Gordon

Thanks a lot!

I think that is clear for me now!

Lo