Our lab routinely uses DESeq2 to analyze RNA-seq data without major issues. One thing we often discuss, though: how can we tell whether collinearity is a problem when we adjust for multiple variables from 'colData' (e.g. sex, age, batch) by including them in the 'design' argument of 'DESeqDataSetFromMatrix'?
Also, I've come across the DESeq2 warning: "Including numeric variables with large mean can induce collinearity with the intercept. Users should center and scale numeric variables in the design to improve GLM convergence." The warning goes away if the flagged variable(s) are centered and scaled in 'colData' before the object is passed to 'DESeqDataSetFromMatrix'. But how are we supposed to tell which design is closest to the 'ground truth'?
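On the warning itself, here is a small base-R sketch of what it seems to be describing. The ages are made up for illustration (not our data), and the helper `cosine` is a hypothetical name introduced only for this example. It compares how parallel the age column is to the intercept column, and how well conditioned the design matrix is, before and after centering and scaling.

```r
# Made-up ages for illustration only (not real data)
age <- c(34, 41, 47, 52, 58, 63, 69, 75)

# Design matrices with an intercept: raw age vs centered and scaled age
X_raw    <- model.matrix(~ age)
X_scaled <- model.matrix(~ age_z,
                         data = data.frame(age_z = as.numeric(scale(age))))

# Cosine of the angle between two columns:
# close to 1 means the columns are almost parallel (near collinear)
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_raw    <- cosine(X_raw[, 1], X_raw[, 2])
cos_scaled <- cosine(X_scaled[, 1], X_scaled[, 2])

# Condition number of each design matrix (larger = worse conditioned)
kappa_raw    <- kappa(X_raw, exact = TRUE)
kappa_scaled <- kappa(X_scaled, exact = TRUE)

print(c(cos_raw = cos_raw, cos_scaled = cos_scaled))
print(c(kappa_raw = kappa_raw, kappa_scaled = kappa_scaled))
```

With a large-mean variable such as age in years, the raw column points in nearly the same direction as the column of ones, which is the sense in which it is collinear with the intercept. Centering makes the two columns orthogonal.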
Lastly, is it expected or typical for the number of genes called significantly differentially expressed to increase as more variables are added to the 'design' formula?
Example setup that centers and scales the 'age' variable, which produced the DESeq2 warning mentioned above:
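library(DESeq2)
# Center and scale age; as.numeric() drops the one-column matrix that scale() returns
ColData$age <- as.numeric(scale(ColData$age, center = TRUE, scale = TRUE))
dds <- DESeqDataSetFromMatrix(countData = CountData,
                              colData = ColData,
                              design = ~ sex + age + condition)
dds <- DESeq(dds)
# Shrink log2 fold changes for the condition coefficient (needs the apeglm package installed)
res <- lfcShrink(dds, coef = "condition_CASE_vs_CONTROL", type = "apeglm")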
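
For the collinearity question, one way a check might look is sketched below, using only base R on the model matrix that the design formula implies. This is an assumption about a reasonable approach, not something taken from the DESeq2 documentation. The sample table is hypothetical (made-up values that only mirror our variable names), and the variance inflation factor cutoff mentioned in the comments is a common rule of thumb rather than a DESeq2 recommendation.

```r
# Hypothetical sample table standing in for colData (all values are made up)
ColData <- data.frame(
  sex       = factor(c("F", "M", "F", "M", "F", "M", "F", "M")),
  age       = c(34, 41, 47, 52, 58, 63, 69, 75),
  condition = factor(rep(c("CONTROL", "CASE"), each = 4),
                     levels = c("CONTROL", "CASE"))
)
ColData$age <- as.numeric(scale(ColData$age))

# Build the model matrix implied by the design formula
mm <- model.matrix(~ sex + age + condition, data = ColData)

# 1. Exact collinearity: the rank is lower than the number of columns
full_rank <- qr(mm)$rank == ncol(mm)

# 2. Near collinearity: pairwise correlations between non-intercept columns
cors <- cor(mm[, -1])

# 3. Variance inflation factors, taken from the diagonal of the inverse
#    correlation matrix (values above roughly 5-10 are often treated as a flag)
vif <- diag(solve(cors))

print(full_rank)
print(round(cors, 2))
print(round(vif, 2))
```

In this made-up table the older samples are all cases, so age and condition come out strongly correlated even though the matrix is still full rank. That is the situation we are unsure how to detect and handle in practice.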