Scaling in regression

OLS

Let’s use as an example the simple model for mpg below. When cyl = 0 & hp = 0, the expected value of mpg is 36.9 (the intercept). For realistic values of cyl = 6 and hp = 125, the expected values of mpg is \(36.9+-2.26*6-0.01912*125=20.93\).

summary(fit<-lm(mpg~cyl+hp,mtcars))


Call:
lm(formula = mpg ~ cyl + hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4948 -2.4901 -0.1828  1.9777  7.2934 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.90833    2.19080  16.847  < 2e-16 ***
cyl         -2.26469    0.57589  -3.933  0.00048 ***
hp          -0.01912    0.01500  -1.275  0.21253    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.173 on 29 degrees of freedom
Multiple R-squared:  0.7407,    Adjusted R-squared:  0.7228 
F-statistic: 41.42 on 2 and 29 DF,  p-value: 3.162e-09

predict(fit,data.frame(cyl=6,hp=125))

       1 
20.92996

When we scale, the mean of each scaled variable is 0. Predicting with the model with scaled predictors with predictors set at zero, and with the model with unscaled predictors set at their means, is equivalent.

mean(c(scale(mtcars$hp))==(mtcars$hp-mean(mtcars$hp))/sd(mtcars$hp))

[1] 1

mtcars_s <- mtcars
mtcars_s$cyl <- c(scale(mtcars_s$cyl));mtcars_s$hp <- c(scale(mtcars_s$hp));
summary(fit_s<-lm(mpg~cyl+hp,mtcars_s))


Call:
lm(formula = mpg ~ cyl + hp, data = mtcars_s)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4948 -2.4901 -0.1828  1.9777  7.2934 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20.0906     0.5609  35.817  < 2e-16 ***
cyl          -4.0446     1.0285  -3.933  0.00048 ***
hp           -1.3110     1.0285  -1.275  0.21253    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.173 on 29 degrees of freedom
Multiple R-squared:  0.7407,    Adjusted R-squared:  0.7228 
F-statistic: 41.42 on 2 and 29 DF,  p-value: 3.162e-09

c(predict(fit_s,data.frame(cyl=0,hp=0)),
  predict(fit,data.frame(cyl=mean(mtcars$cyl),hp=mean(mtcars$hp))))

       1        1 
20.09062 20.09062

When scaling, the coefficient represents the impact of increasing that predictor by on standard deviation.

c(coef(fit)[2] * sd(mtcars$cyl),
  coef(fit_s)[2])

      cyl       cyl 
-4.044565 -4.044565

Further, let’s say I have a car with 4 cyl and 97 hp, the predicted value for mpg is 26.

predict(fit,data.frame(cyl=4,hp=97))

       1 
25.99475

Equivalently

predict(fit_s,data.frame(cyl=(4-mean(mtcars$cyl))/sd(mtcars$cyl),
                         hp=(97-mean(mtcars$hp))/sd(mtcars$hp)))

       1 
25.99475

And again, increasing cyl by one sd in both models is equivalent.

predict(fit,data.frame(cyl=4+sd(mtcars$cyl),hp=97))

       1 
21.95019

predict(fit_s,
        data.frame(cyl=(4-mean(mtcars$cyl))/sd(mtcars$cyl) + 1,
                   hp=(97-mean(mtcars$hp))/sd(mtcars$hp)))

       1 
21.95019

Logistic regression

At cyl = 0 and hp = 0, p(am=1) = 0.9998871 = 0.9998. Of course, this is a non-sensical prediction (can’t have 0 cylinders/hp). With 4 cyl and 97 hp, 0.9105248 = 0.91.

summary(fit<-glm(vs~cyl+hp,mtcars,family='binomial'))


Call:
glm(formula = vs ~ cyl + hp, family = "binomial", data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2986  -0.2086  -0.0446   0.4041   1.2349  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  9.08901    3.29417   2.759   0.0058 **
cyl         -0.68950    0.78939  -0.873   0.3824   
hp          -0.04135    0.03630  -1.139   0.2546   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.860  on 31  degrees of freedom
Residual deviance: 16.024  on 29  degrees of freedom
AIC: 22.024

Number of Fisher Scoring iterations: 7

predict(fit,data.frame(cyl=4,hp=97),type='response')

        1 
0.9105076

Scaling

mtcars_s <- mtcars
mtcars_s$cyl <- c(scale(mtcars_s$cyl));mtcars_s$hp <- c(scale(mtcars_s$hp));
summary(fit_s<-glm(vs~cyl+hp,mtcars_s,family='binomial'))


Call:
glm(formula = vs ~ cyl + hp, family = "binomial", data = mtcars_s)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2986  -0.2086  -0.0446   0.4041   1.2349  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.243      1.065  -1.167    0.243
cyl           -1.231      1.410  -0.873    0.382
hp            -2.835      2.489  -1.139    0.255

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.860  on 31  degrees of freedom
Residual deviance: 16.024  on 29  degrees of freedom
AIC: 22.024

Number of Fisher Scoring iterations: 7

c(coef(fit)[2] * sd(mtcars$cyl),
  coef(fit_s)[2])

      cyl       cyl 
-1.231395 -1.231395

When showing coefficients side by side, scaling allows to directly compare magnitude. Below: -2.26 vs -0.019 doesn’t tell us anything, -4.04 vs -1.3 tells us cyl is a better predictor.

cbind(coef(lm(mpg~cyl+hp,mtcars)),
      coef(lm(mpg~cyl+hp,mtcars_s)))

                  [,1]      [,2]
(Intercept) 36.9083305 20.090625
cyl         -2.2646936 -4.044565
hp          -0.0191217 -1.311038

Of course, it’s problematic when one or more variable is not continuous

coef(lm(mpg~cyl+hp+vs,mtcars_s))

(Intercept)         cyl          hp          vs 
  20.674240   -4.501959   -1.416454   -1.333978

Gelman scaling

Gelman suggests to scale by dividing by twice the standard deviation. “We recommend the general practice of scaling numeric inputs by dividing by two standard deviations, which allows the coefficients to be interpreted in the same way as with binary inputs. (We leave binary inputs unscaled because their coefficients can already be interpreted directly.)”

mtcars_s <- mtcars
mtcars_s$cyl <- (mtcars_s$cyl-mean(mtcars_s$cyl)) / (2*sd(mtcars_s$cyl))
mtcars_s$hp <- (mtcars_s$hp-mean(mtcars_s$hp)) / (2*sd(mtcars_s$hp))
coef(lm(mpg~cyl+hp+vs,mtcars_s))

(Intercept)         cyl          hp          vs 
  20.674240   -9.003918   -2.832907   -1.333978

These are now comparable.

Alternatively, we can scale all variables. I’ve seen this before. If we use the coeficients as indicators of the strength of each predictor, it’s roughly equivalent (-9.003918/-1.333978 = 6.74 vs -4.5019591/-0.6723465 = 6.69). Note that it’s not exactly equal. Gelman’s suggestion only puts the coefficients exactly on the same scale if the proportion of the binary variable is 0.5. Yet it has the advantage of keeping the interpretability of the coefficient on the binary input.

mtcars_s <- mtcars
mtcars_s[,c("cyl","hp","vs")] <- 
  lapply(mtcars_s[,c("cyl","hp","vs")],scale)
coef(lm(mpg~cyl+hp+vs,mtcars_s))

(Intercept)         cyl          hp          vs 
 20.0906250  -4.5019591  -1.4164536  -0.6723465