Hi all,
I have a problem in figuring out how to include the batch effects in the design formula of DESeq. In particular I have to test the effect of one condition (Low protein diet vs Control diet) on gene expression over 48 mouse liver samples (24 for each condition). The problem is that I sequenced the samples in two separate batches (always keeping Treated vs Control in each batch) and different sequencer. So I have 2 big sources of batch effect:
- Different biological replicate (different mothers)
- Different sequencing run (this is the biggest batch effect)
I wanted to run DESeq with the following design formula: ~Replicate+Sequencing_run+condition
but I got the error of "Not full rank model matrix". Here is the chunk of code that I run:
##CREATE THE METADATA FILE##
sample_names <- colnames(cts)
condition <- factor(rep(c(rep("Control",12), rep("Treated", 12)), 2), levels = c("Control", "Treated"))
Replicate <- factor(c(rep(1:3, 4), rep(4:6, 4), rep(7:9, 4), rep(10:12, 4)), levels=c(1,2,3,4,5,6,7,8,9,10,11,12))
Sequencing_run <- factor(c(rep("A", 24), rep("B",24)), levels=c("A","B"))
coldata <- data.frame(condition, Replicate, sex, Sequencing_run, row.names = sample_names)
coldata
condition Replicate Sequencing_run
CTRL1_Rep1 Control 1 A
CTRL1_Rep2 Control 2 A
CTRL1_Rep3 Control 3 A
CTRL2_Rep1 Control 1 A
CTRL2_Rep2 Control 2 A
CTRL2_Rep3 Control 3 A
CTRL3_Rep1 Control 1 A
CTRL3_Rep2 Control 2 A
CTRL3_Rep3 Control 3 A
CTRL4_Rep1 Control 1 A
CTRL4_Rep3 Control 3 A
tRFGG1_Rep1 Treated 4 A
tRFGG1_Rep2 Treated 5 A
tRFGG1_Rep3 Treated 6 A
tRFGG2_Rep1 Treated 4 A
tRFGG2_Rep2 Treated 5 A
tRFGG2_Rep3 Treated 6 A
tRFGG3_Rep1 Treated 4 A
tRFGG3_Rep2 Treated 5 A
tRFGG3_Rep3 Treated 6 A
tRFGG4_Rep1 Treated 4 A
tRFGG4_Rep2 Treated 5 A
tRFGG4_Rep3 Treated 6 A
CTRL1_Rep4 Control 7 B
CTRL1_Rep5 Control 8 B
CTRL1_Rep6 Control 9 B
CTRL2_Rep4 Control 7 B
CTRL2_Rep5 Control 8 B
CTRL2_Rep6 Control 9 B
CTRL3_Rep4 Control 7 B
CTRL3_Rep5 Control 8 B
CTRL3_Rep6 Control 9 B
CTRL4_Rep4 Control 7 B
CTRL4_Rep5 Control 8 B
CTRL4_Rep6 Control 9 B
tRFGG1_Rep4 Treated 10 B
tRFGG1_Rep5 Treated 11 B
tRFGG1_Rep6 Treated 12 B
tRFGG2_Rep4 Treated 10 B
tRFGG2_Rep5 Treated 11 B
tRFGG2_Rep6 Treated 12 B
tRFGG3_Rep4 Treated 10 B
tRFGG3_Rep5 Treated 11 B
tRFGG3_Rep6 Treated 12 B
tRFGG4_Rep4 Treated 10 B
tRFGG4_Rep5 Treated 11 B
tRFGG4_Rep6 Treated 12 B
##RUN DESeq##
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~Replicate+Sequencing_run+condition)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
I think this is the linear dependency problem between the variables that want to include in the design formula, discussed in the DESeq2 vignette.
- Is there a way to solve this problem and tell DESeq to take into account the batch effects when fitting the model?
- Regardless whether the problem can be solved, would you analyze all the samples together or would you suggest me to analyze the samples sequenced in the 2 runs as independent experiments and at the end try to correlate the results observed?
Thank you in advance for your help!
Best regards,
Flavio
Drop the replicate information in the design, as that is automatically inferred from the columns "condition" and "Sequencing_run". Samples that have the same level in these columns are treated as replicates.