I'm trying to decide which variables to include in the design formula for my DESeq2 dataset, and I am confused by how "controlling for" different variables affects my results.
For example, I am wondering whether it makes sense to include "batch" in my design. Before doing any analysis, I made PCA plots of my data, and the samples do not appear to cluster by batch. If I run DESeq to compare across disease state without batch in my design and then generate a heatmap with hierarchical clustering, samples from the same batch do not cluster together either. However, if I add batch to the design, run DESeq again, and create the same clustered heatmap, samples from the same batch now cluster together.
Why would samples from the same batch start to cluster together after "controlling for" batch? I see similar changes in clustering behavior between the samples if I add other variables to my design such as subject age or sex. Does this mean it is not appropriate for me to control for these variables, or am I fundamentally not understanding what it means to "control" for variables by adding them to the design formula?
My design formula when comparing across disease state and controlling for batch is:
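library(data.table)  # fread
library(DESeq2)      # DESeqDataSetFromTximport
meta <- fread("variables.csv")
meta <- as.data.frame(meta)
# txi is the tximport output; Batch and Disease are columns of meta
ddsMat <- DESeqDataSetFromTximport(txi,
                                   colData = meta,
                                   design = ~ Batch + Disease)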
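A minimal sketch of the with/without-batch comparison described above, under stated assumptions. It is not the exact code behind the heatmaps in this post. It uses simulated counts from `makeExampleDESeqDataSet` in place of `txi`, a made-up `Batch` column, and assumes the clustering is done on variance-stabilized values with `blind = FALSE`. The helper `cluster_samples` is hypothetical.

```r
library(DESeq2)

set.seed(1)

# Simulated stand-in for the real data: 1000 genes, 12 samples.
# makeExampleDESeqDataSet() provides a two-level `condition` factor.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
colData(dds)$Disease <- dds$condition
# Hypothetical batch labels, balanced across the two disease groups
colData(dds)$Batch <- factor(rep(c("b1", "b2", "b3"), times = 4))

# Hypothetical helper: fit DESeq2 under a given design, transform the
# counts, and hierarchically cluster the samples.
cluster_samples <- function(dds, design_formula) {
  design(dds) <- design_formula
  dds <- DESeq(dds, quiet = TRUE)
  # blind = FALSE lets the transformation use dispersions fitted under the design
  vsd <- varianceStabilizingTransformation(dds, blind = FALSE)
  hclust(dist(t(assay(vsd))))
}

hc_without <- cluster_samples(dds, ~ Disease)
hc_with    <- cluster_samples(dds, ~ Batch + Disease)

# Cross-tabulate batch against cluster membership under each design
print(table(Batch = dds$Batch, cluster = cutree(hc_without, k = 3)))
print(table(Batch = dds$Batch, cluster = cutree(hc_with, k = 3)))
```

Comparing the two tables shows whether the sample grouping shifts toward batch once `Batch` is in the design. With real data, the same two calls would take the dataset built from `txi` and `meta` instead of the simulated one.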
I am new to RNA-seq analysis and the only one in my lab working on it. I have been googling to try to answer my own questions, but I am still confused by this phenomenon.
Thank you so much!