Let’s use as an example the simple model for mpg below. When cyl = 0 and hp = 0, the expected value of mpg is 36.9 (the intercept). For realistic values of cyl = 6 and hp = 125, the expected value of mpg is \(36.90833 - 2.26469 \times 6 - 0.01912 \times 125 \approx 20.93\).
summary(fit <- lm(mpg ~ cyl + hp, mtcars))
Call:
lm(formula = mpg ~ cyl + hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.4948 -2.4901 -0.1828 1.9777 7.2934
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.90833 2.19080 16.847 < 2e-16 ***
cyl -2.26469 0.57589 -3.933 0.00048 ***
hp -0.01912 0.01500 -1.275 0.21253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.173 on 29 degrees of freedom
Multiple R-squared: 0.7407, Adjusted R-squared: 0.7228
F-statistic: 41.42 on 2 and 29 DF, p-value: 3.162e-09
predict(fit, data.frame(cyl = 6, hp = 125))
1
20.92996
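As a quick check (not part of the original output), the same number can be reconstructed by hand from the coefficient vector:
sum(coef(fit) * c(1, 6, 125))   # intercept + cyl coefficient * 6 + hp coefficient * 125
# [1] 20.92996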
When we scale, the mean of each scaled variable is 0. Predicting from the model with scaled predictors, with all predictors set to zero, is therefore equivalent to predicting from the model with unscaled predictors set to their means.
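A minimal sketch of that equivalence, assuming the scaled data are built with scale() (the mtcars_s used further down is presumably constructed along these lines; the object names here are illustrative):
mtcars_sc <- mtcars
mtcars_sc[c("cyl", "hp")] <- scale(mtcars_sc[c("cyl", "hp")])   # centre and scale the two predictors
fit_sc <- lm(mpg ~ cyl + hp, mtcars_sc)
predict(fit, data.frame(cyl = mean(mtcars$cyl), hp = mean(mtcars$hp)))   # unscaled model at the means
predict(fit_sc, data.frame(cyl = 0, hp = 0))                             # scaled model at zero
# both return the mean of mpg, 20.09062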
At cyl = 0 and hp = 0, p(vs = 1) = 0.9998871 ≈ 0.9999. Of course, this is a nonsensical prediction (a car can’t have 0 cylinders or 0 hp). With cyl = 4 and hp = 97, p(vs = 1) = 0.9105248 ≈ 0.91.
Call:
glm(formula = vs ~ cyl + hp, family = "binomial", data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2986 -0.2086 -0.0446 0.4041 1.2349
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.08901 3.29417 2.759 0.0058 **
cyl -0.68950 0.78939 -0.873 0.3824
hp -0.04135 0.03630 -1.139 0.2546
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.860 on 31 degrees of freedom
Residual deviance: 16.024 on 29 degrees of freedom
AIC: 22.024
Number of Fisher Scoring iterations: 7
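To check the probabilities quoted above, the model from the Call can be refit and fed the two sets of predictor values (a reconstruction, not part of the original post; the object name fit matches the one used in the coefficient comparison further down):
fit <- glm(vs ~ cyl + hp, family = "binomial", data = mtcars)
predict(fit, data.frame(cyl = c(0, 4), hp = c(0, 97)), type = "response")   # predicted probabilities
#         1         2
# 0.9998871 0.9105248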
Call:
glm(formula = vs ~ cyl + hp, family = "binomial", data = mtcars_s)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2986 -0.2086 -0.0446 0.4041 1.2349
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.243 1.065 -1.167 0.243
cyl -1.231 1.410 -0.873 0.382
hp -2.835 2.489 -1.139 0.255
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.860 on 31 degrees of freedom
Residual deviance: 16.024 on 29 degrees of freedom
AIC: 22.024
Number of Fisher Scoring iterations: 7
The coefficient on a scaled predictor is simply the unscaled coefficient multiplied by that predictor’s standard deviation:
c(coef(fit)[2] * sd(mtcars$cyl), coef(fit_s)[2])
cyl cyl
-1.231395 -1.231395
When showing coefficients side by side, scaling lets us compare magnitudes directly. Below: -2.26 vs -0.019 doesn’t tell us anything, while -4.04 vs -1.31 tells us that cyl is the stronger predictor.
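Those scaled coefficients can be obtained by using scale() directly in the formula (a quick illustrative snippet, not from the original post):
coef(lm(mpg ~ scale(cyl) + scale(hp), mtcars))[-1]   # drop the intercept
# roughly -4.04 for cyl and -1.31 for hp, the values quoted above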
Of course, this becomes problematic when one or more of the variables is not continuous:
coef(lm(mpg ~ cyl + hp + vs, mtcars_s))
(Intercept) cyl hp vs
20.674240 -4.501959 -1.416454 -1.333978
Gelman scaling
Gelman suggests scaling by dividing by twice the standard deviation: “We recommend the general practice of scaling numeric inputs by dividing by two standard deviations, which allows the coefficients to be interpreted in the same way as with binary inputs. (We leave binary inputs unscaled because their coefficients can already be interpreted directly.)”
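The output below can be reproduced with something along these lines (the helper and object names are mine; the binary vs is left untouched, and I believe Gelman’s arm package offers a rescale() function that does much the same):
scale2 <- function(x) (x - mean(x)) / (2 * sd(x))   # centre and divide by two standard deviations
mtcars_g <- mtcars
mtcars_g[c("cyl", "hp")] <- lapply(mtcars_g[c("cyl", "hp")], scale2)
coef(lm(mpg ~ cyl + hp + vs, mtcars_g))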
(Intercept) cyl hp vs
20.674240 -9.003918 -2.832907 -1.333978
These are now comparable.
Alternatively, we can scale all variables; I’ve seen this done before. If we use the coefficients as indicators of the strength of each predictor, it’s roughly equivalent (-9.003918/-1.333978 ≈ 6.75 vs -4.5019591/-0.6723465 ≈ 6.70). Note that the two are not exactly equal: Gelman’s suggestion puts the coefficients on exactly the same scale only when the proportion of the binary variable is 0.5, because a binary variable with proportion \(p\) has standard deviation \(\sqrt{p(1-p)}\), which is 0.5 only at \(p = 0.5\). It does, however, have the advantage of keeping the coefficient on the binary input directly interpretable.
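For completeness, a sketch of the all-scaled version behind the second ratio (illustrative code; only the predictors are scaled here, since scaling the outcome as well would shrink every coefficient by the same factor and leave the ratio unchanged):
coef(lm(mpg ~ scale(cyl) + scale(hp) + scale(vs), mtcars))
# cyl comes out around -4.50 and vs around -0.672, the values used in the ratio above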