Question

DESeq2 collinearity, adjusting for covariates, and the ground truth


Anonymous Genetics Gnome • 0


Our lab routinely uses DESeq2 to analyze our RNA-seq data without major issues, but one thing we often discuss is this: how can we tell whether collinearity is occurring when we adjust for multiple variables from the 'colData' (e.g. sex, age, batch) by including them in the 'design' argument of 'DESeqDataSetFromMatrix'?

Also, I've come across the DESeq2 warning: "Including numeric variables with large mean can induce collinearity with the intercept. Users should center and scale numeric variables in the design to improve GLM convergence." The warning goes away if we center and scale the flagged variable(s) in the 'colData' object before using them in the 'design' formula. But how are we supposed to tell which design is closest to the 'ground truth'?

Lastly, is it expected or typical that the number of genes called as significantly differentially expressed increases as the number of variables included in the 'design' formula also increases?

Example setup to fix the 'age' variable that produced the DESeq2 warning mentioned above:


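  library(DESeq2)

  # Center and scale age; as.numeric() drops the one-column matrix that scale() returns.
  # The column name must match the name used in the design formula ('age').
  ColData$age <- as.numeric(scale(ColData$age, center = TRUE, scale = TRUE))

  dds <- DESeqDataSetFromMatrix(countData = CountData,
                                colData = ColData,
                                design = ~ sex + age + condition)

  dds <- DESeq(dds)

  # The coefficient name should be one of those listed by resultsNames(dds).
  res <- lfcShrink(dds, coef = "condition_CASE_vs_CONTROL", type = "apeglm")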


updated 4 months ago by Michael Love 41k • written 4 months ago by Anonymous Genetics Gnome • 0

score 0 · Answer 1 · 2023-12-20

A design is neither close to nor far from the ground truth. It's just a way to specify covariates. In your case, age is a nuisance variable whose coefficient you won't be interpreting, so centering and scaling it is the way to go.
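To see why centering and scaling helps numerically, here is a minimal base-R sketch (no DESeq2 needed). The sample table is made up for illustration, and the column name `age_scaled` is hypothetical. It compares the condition number of the design matrix before and after scaling age; a large condition number indicates near-collinearity, here between raw age and the intercept.

```r
# Toy sample table; all values are made up for illustration.
set.seed(1)
coldata <- data.frame(
  sex       = factor(rep(c("F", "M"), times = 6)),
  age       = round(rnorm(12, mean = 60, sd = 3)),
  condition = factor(rep(c("CONTROL", "CASE"), each = 6),
                     levels = c("CONTROL", "CASE"))
)

# Condition number (largest / smallest singular value) with raw age.
mm_raw <- model.matrix(~ sex + age + condition, data = coldata)
kappa_raw <- kappa(mm_raw, exact = TRUE)

# Center and scale age; as.numeric() drops the matrix returned by scale().
coldata$age_scaled <- as.numeric(scale(coldata$age))
mm_scaled <- model.matrix(~ sex + age_scaled + condition, data = coldata)
kappa_scaled <- kappa(mm_scaled, exact = TRUE)

# The scaled design should have a much smaller condition number.
print(c(raw = kappa_raw, scaled = kappa_scaled))
```

Rescaling age changes the units of the age coefficient but leaves the column space of the design unchanged, so it is a numerical fix rather than a different model.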

It's not necessarily true that the number of significant genes increases as you add covariates. It may or may not happen, depending on the covariates and on whether they capture excess variance that would otherwise affect inference on the variable of interest.

score 0 · Answer 2 · 2023-12-20

Collinearity means correlated variables in the design matrix, and it is best avoided (you can do so in the experimental design and/or by computing PCs or other orthogonal factors for nuisance covariates).
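As a rough sketch of the "compute PCs for nuisance covariates" idea, the base-R example below replaces two correlated nuisance covariates with their principal components. The covariates (`age`, `bmi`) and all values are simulated and purely hypothetical; this is one possible way to do it, not a prescribed workflow.

```r
# Simulated nuisance covariates; bmi is a hypothetical variable built to track age.
set.seed(2)
n   <- 12
age <- rnorm(n, mean = 60, sd = 8)
bmi <- 20 + 0.1 * age + rnorm(n, sd = 0.3)
nuisance <- data.frame(age = age, bmi = bmi)

# The raw covariates are strongly correlated with each other.
print(cor(nuisance$age, nuisance$bmi))

# Principal component scores are mutually uncorrelated by construction.
pcs <- prcomp(nuisance, center = TRUE, scale. = TRUE)$x
print(round(cor(pcs), 10))

# Use the PC scores in place of the original nuisance covariates.
coldata <- data.frame(
  condition = factor(rep(c("CONTROL", "CASE"), each = n / 2),
                     levels = c("CONTROL", "CASE")),
  PC1 = pcs[, "PC1"],
  PC2 = pcs[, "PC2"]
)
design <- ~ PC1 + PC2 + condition
```

The PCs are orthogonal to each other, but not necessarily to the condition, so it is still worth checking the full design matrix afterwards.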

You can compute

cor(model.matrix(design, data))

where data is the colData and design is the formula you are planning.
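Here is a small self-contained sketch of that check on a made-up sample table (all column names and values are assumptions for illustration). The intercept column is constant, so its correlation with anything is undefined (NA), and the sketch drops it before calling `cor()`. It also adds a rank check, since pairwise correlations can miss a linear dependence that involves three or more columns.

```r
# Toy colData in which batch and condition are partially confounded.
set.seed(3)
coldata <- data.frame(
  sex       = factor(rep(c("F", "M"), times = 6)),
  age       = as.numeric(scale(rnorm(12, mean = 60, sd = 8))),
  batch     = factor(rep(c("b1", "b2"), each = 6)),
  condition = factor(c(rep("CONTROL", 5), "CASE", "CONTROL", rep("CASE", 5)),
                     levels = c("CONTROL", "CASE"))
)

design <- ~ sex + age + batch + condition
mm <- model.matrix(design, data = coldata)

# Drop the constant intercept column, then look at pairwise correlations.
cors <- cor(mm[, colnames(mm) != "(Intercept)"])
print(round(cors, 2))
# batchb2 vs conditionCASE is about 0.67 here by construction.

# A design that is not full rank cannot be fit at all.
full_rank <- qr(mm)$rank == ncol(mm)
print(full_rank)
```

High off-diagonal values (here batch versus condition) flag pairs of terms whose effects will be hard to separate. DESeq2 itself stops with an error when the model matrix is not full rank, so this check mainly helps with the "nearly collinear" cases.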

I try to avoid collinearity in these cases by estimating global factors with RUV and then exploring the correlation between these factors and the batches. I would include age and sex separately (and estimate RUV factors orthogonal to age, sex, and other biological variables). It is really important, though, to include covariates that explain variation not due to the condition; otherwise the model will be mis-specified and the results can be misleading or spurious. I find that ignoring batch is the most common mistake people make when using DE tools.