Centering and scaling numeric variables
MiKappa ▴ 30
@mikappa-23113

What does this message mean?

"the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). it is generally a good idea to center and scale numeric variables in the design to improve GLM convergence."

What if the mean or the standard deviation is higher than 5? Why would you have to scale and center your numeric variables? I am including age and BMI in my design as continuous variables (sex and group are categorical). I also cut the continuous variables into small bins, as recommended in the FAQ "How can I include a continuous covariate in the design formula?". My design is ~ InsulinResistance + sex + bmi + age, and I want to perform differential gene expression analysis comparing insulin-resistant and insulin-sensitive phenotypes, corrected for sex, BMI and age.

Any help is much appreciated!

deseq2 • 3.8k views

Hi, I am facing the same problem. Category and gender are categorical variables, whereas the rest are continuous. This is how I set up my design:

design = ~ category + gender + scale(age, center = TRUE) + scale(fatmass, center = TRUE) + scale(bmi, center = TRUE) + scale(fastingGlucose, center = TRUE)

But the message still appears: "...the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). it is generally a good idea to center and scale numeric variables in the design to improve GLM convergence."

Could you please help with the following questions? 1. What is wrong with my design? 2. How and where exactly can I use the cut() function? Could you use the design above to show how cut() would be applied? Thanks!


Can you create new scaled variables rather than calling scale() directly in the design? This helps DESeq2's warning and error code, which checks the design for issues.
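A minimal sketch of that suggestion, in base R only so it runs without DESeq2. The data frame `coldata` and its values are made up for illustration and stand in for the sample table (colData) of the dataset; the column names `age_scaled` and `bmi_scaled` are likewise hypothetical.

```r
# Hypothetical sample table standing in for the colData of the dataset
coldata <- data.frame(
  category = factor(c("A", "A", "B", "B", "A", "B")),
  gender   = factor(c("F", "M", "F", "M", "F", "M")),
  age      = c(34, 51, 46, 29, 62, 40),
  bmi      = c(22.1, 31.4, 27.8, 24.5, 35.2, 29.0)
)

# scale() centers and scales by default and returns a one-column matrix;
# as.numeric() turns it back into a plain numeric vector
coldata$age_scaled <- as.numeric(scale(coldata$age))
coldata$bmi_scaled <- as.numeric(scale(coldata$bmi))

# The design then refers to the new columns instead of calling scale() inline
design_formula <- ~ category + gender + age_scaled + bmi_scaled
```

With a DESeq2 object the same idea would presumably be to store the new columns in its colData (for example `dds$age_scaled <- ...`) before setting the design.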


Thanks Michael! The message no longer appears.

However, I did not see any difference in the results. The number and identity of the genes reported are the same as in my DESeq analysis with the same colData but unscaled variables, so scaling did not change which genes come out.


The warning is there to help with model fitting. With scaled covariates you may obtain the same fit, only faster, for example; with badly scaled covariates the fit may fail.
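To illustrate why the results can stay the same, here is a small sketch on simulated data. It uses a Poisson GLM from base R as a stand-in for the negative binomial GLM that DESeq2 fits, so it only demonstrates the general point: centering and scaling a covariate re-parameterizes the model without changing the fitted values or the coefficient of the group term. All variable names and simulation settings are assumptions.

```r
set.seed(1)

# Simulated data: one two-level factor and one covariate on a large scale
n      <- 40
group  <- factor(rep(c("A", "B"), each = n / 2))
age    <- runif(n, min = 20, max = 70)
counts <- rpois(n, lambda = exp(2 + 0.5 * (group == "B") + 0.01 * age))

# Centered and scaled copy of the covariate
age_scaled <- as.numeric(scale(age))

# Same model, fitted once with the raw and once with the scaled covariate
fit_raw    <- glm(counts ~ group + age,        family = poisson())
fit_scaled <- glm(counts ~ group + age_scaled, family = poisson())

# Fitted values and the group effect agree up to numerical tolerance
same_fitted <- isTRUE(all.equal(unname(fitted(fit_raw)),
                                unname(fitted(fit_scaled)),
                                tolerance = 1e-6))
same_group_effect <- isTRUE(all.equal(unname(coef(fit_raw)["groupB"]),
                                      unname(coef(fit_scaled)["groupB"]),
                                      tolerance = 1e-6))
```

Only the intercept and the age coefficient change their units; the comparison between groups is unaffected.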


Hi there :)

I think the main problem comes from not creating new scaled variables. That is the issue I had with the cut variables. Even though I was cutting them, I wasn't creating a new variable. Make sure to add the new variables to your colData! Hope it works!

@mikelove

It means I recommend keeping numeric covariates on a scale similar to that of the factor variables in the design, not too small and not too large, to improve the model fitting process.

If you discretize the continuous variables then you won't see this message.


Thank you for your response Michael!

I have one more question: is discretizing a numeric variable the same as cutting it?

I have tested several designs. In one of them I do center and scale the numeric variables and indeed the message is not there anymore. In the one I mentioned in my previous post, I cut the numerical values into 3 bins and the message is still there. I have also tried with 5 bins and the message remains.


Yes, cut() discretizes.

The message will not appear if you cut() all the numeric variables. You may be accidentally leaving one in?
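One way to look for a variable that was left in is to check which design variables are still numeric and exceed the threshold named in the message. The sketch below mimics the rule as the message states it (mean or standard deviation larger than 5); it is not DESeq2's internal code, and the sample table is invented for illustration.

```r
# Hypothetical sample table: age has been cut, bmi has not
coldata <- data.frame(
  InsulinResistance = factor(c("IR", "IS", "IR", "IS", "IR", "IS")),
  sex               = factor(c("F", "M", "F", "M", "F", "M")),
  bmi               = c(22.1, 31.4, 27.8, 24.5, 35.2, 29.0),
  age               = c(34, 51, 46, 29, 62, 40)
)
coldata$cutAge <- cut(coldata$age, breaks = 3)

design_formula <- ~ InsulinResistance + sex + bmi + cutAge
design_vars    <- all.vars(design_formula)

# Which design variables are still numeric?
is_num       <- vapply(coldata[design_vars], is.numeric, logical(1))
numeric_vars <- design_vars[is_num]

# Which of those have mean or standard deviation larger than 5?
over_threshold <- vapply(coldata[numeric_vars],
                         function(x) mean(x) > 5 || sd(x) > 5,
                         logical(1))
flagged <- numeric_vars[over_threshold]
```

Here `flagged` contains `bmi`, the one covariate that was neither cut nor scaled.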


Good news: cut() does work. The problem was elsewhere, and I will try my best to explain in case someone else faces the same problem. Two things are very important:

1. Create a new cut variable (in my case cutAge) and make sure you include this variable in your colData.
2. Because cut() turns numeric variables into categorical factors, make sure to change the labels of this cut variable to meaningful factor names.

Thank you again Michael!
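The two steps described in this post can be sketched as follows, again in base R. The ages, break points and label names are assumptions chosen for illustration; only the variable name cutAge comes from the thread.

```r
# Hypothetical sample table standing in for colData
coldata <- data.frame(age = c(23, 31, 38, 44, 52, 59, 65, 71))

# Step 1: create the cut variable as a new column of the sample table.
# Step 2: give the bins readable labels; without `labels`, cut() names
# the levels after the intervals, e.g. "(20,40]".
coldata$cutAge <- cut(coldata$age,
                      breaks = c(20, 40, 60, 80),
                      labels = c("young", "middle", "older"))

# The design uses the new factor, not the original numeric column
design_formula <- ~ cutAge
```

The intervals are closed on the right by default, so an age of exactly 40 would fall in the first bin.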
