Linear regression

PUBPOL 750 Data Analysis for Public Policy I: Week 9

Justin Savoie

MPP-DS McMaster

2023-11-15

Introduction to Linear Regression

  • Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.

  • It’s a fundamental tool for data analysis across the social sciences.

  • It quantifies the average effect of changes in independent variables on a dependent variable.

  • It can be used for prediction, description or causal explanation.

The Simple Linear Regression Model

\[Y = \beta_0 + \beta_1X_1 + \epsilon\]
  • Models the relationship between a dependent variable and one independent variable in a linear way
  • \(Y\) is the dependent variable
  • \(X_1\) is the independent variable
  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the slope coefficient
  • \(\epsilon\) represents the error term
  • Here, it’s a simple linear regression because there is one independent variable, \(X_1\)

For example, for an income of $50,000, the predicted value for life satisfaction is 3.37. Of course, having an income of $50,000 does not mean your life satisfaction is exactly 3.37. That’s why there is an error term \(\epsilon\) in the model.

This is made up data. The true relation would not be as clear.
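A minimal sketch of how such made-up data could be simulated in R (the sample size, coefficients and income scale below are illustrative assumptions, not the actual course data):

# Sketch: simulate made-up data of the form Y = b0 + b1*X + error (all numbers illustrative)
set.seed(123)
n <- 100
income <- runif(n, min = 0, max = 10)                # hypothetical income scale
satisfaction <- 1 + 0.5 * income + rnorm(n, sd = 1)  # true b0 = 1, b1 = 0.5, plus random error
df_sim <- data.frame(income, satisfaction)
head(df_sim)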

Typically, \(Y = \beta_0 + \beta_1X_1 + \epsilon\) will be used for the general (population) model, while \(y = b_0 + b_1x_1 + e\) refers to the estimated regression equation based on sample data, where \(e\) is the residual.

The residual is the distance between the prediction (i.e. the line of best fit) and the true observed value. The residuals are shown with the black lines.

“Error” and “residual” are sometimes used interchangeably, but there is a subtle difference. The residual is the observed difference once you fit the line; the error is its counterpart in the theoretical model, and it is unobservable.
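As a small sketch in R, the residuals reported by resid() are exactly the observed values minus the fitted values (reusing the simulated df_sim from above):

# Sketch: residuals = observed values - fitted (predicted) values
fit_sim <- lm(satisfaction ~ income, data = df_sim)
head(resid(fit_sim))                          # residuals reported by R
head(df_sim$satisfaction - predict(fit_sim))  # the same quantities, computed by hand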

Estimation Methods (FYI)

The most common method for estimating the coefficients of a linear regression model is Ordinary Least Squares (OLS). This method minimizes the sum of the squared differences between the observed values and the values predicted by the model: the sum of the squared residuals.

This is the best model of the form \(y = b_0 + b_1x_1 + e\) because it minimizes the sum of the squared residuals.

In contrast, this is NOT the best model of the form \(y = b_0 + b_1x_1 + e\) because it does not minimize the sum of the squared residuals.
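To see this numerically, we can compare the sum of squared residuals of the OLS fit with that of any other line (a sketch using the simulated df_sim from earlier; the alternative intercept and slope are arbitrary):

# Sketch: the OLS line has a smaller sum of squared residuals than an arbitrary alternative line
fit_sim <- lm(satisfaction ~ income, data = df_sim)
ssr_ols <- sum(resid(fit_sim)^2)

alt_pred <- 1 + 0.8 * df_sim$income                  # an arbitrary, non-OLS line
ssr_alt  <- sum((df_sim$satisfaction - alt_pred)^2)

c(OLS = ssr_ols, alternative = ssr_alt)              # the OLS value is smaller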

In R

lm(satisfaction ~ income, data = df)

Call:
lm(formula = satisfaction ~ income, data = df)

Coefficients:
(Intercept)       income  
     0.9311       0.5247  

On average, when income increases by one unit, satisfaction increases by about 0.52. When income is 0, the predicted satisfaction is about 0.93.
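As a small worked example, the predictions implied by these printed coefficients for a few illustrative income values are:

# Sketch: predictions implied by the printed coefficients (illustrative income values)
0.9311 + 0.5247 * c(0, 1, 2)
# 0.9311 1.4558 1.9805  -- each additional unit of income adds about 0.52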

lm_fit <- lm(satisfaction ~ income, data = df)
summary(lm_fit)

Call:
lm(formula = satisfaction ~ income, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5868 -0.4679  0.0993  0.5708  3.2122 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.93111    0.19094   4.876 4.17e-06 ***
income       0.52470    0.03232  16.232  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9682 on 98 degrees of freedom
Multiple R-squared:  0.7289,    Adjusted R-squared:  0.7261 
F-statistic: 263.5 on 1 and 98 DF,  p-value: < 2.2e-16

The standard error measures the uncertainty around the estimate: the smaller it is, the more confident you can be in the estimate. We usually say an estimate is statistically significant if Pr(>|t|) (the “p-value”) is below 0.05. The p-value is the probability of obtaining an estimate at least as extreme as the one actually observed if there were no true effect. We can get an approximate 95% confidence interval around the estimate by taking the estimate ± 1.96 × the standard error.
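For example, a 95% confidence interval for the income coefficient can be computed with confint(), or by hand from the printed estimate and standard error (confint() uses the t distribution, so the two differ very slightly):

# Sketch: 95% confidence interval for the income coefficient
confint(lm_fit, "income", level = 0.95)   # built-in
0.52470 + c(-1.96, 1.96) * 0.03232        # by hand: estimate +/- 1.96 * standard error, roughly 0.461 to 0.588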

Binary Independent Variable

Key: treat everything as numbers (0 and 1).
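Under the hood, R converts a factor into a 0/1 dummy variable. A minimal sketch (the level names here are assumed for illustration; only “urban” appears in the output below):

# Sketch: how R turns a two-level factor into a 0/1 dummy variable
community <- factor(c("rural", "urban", "urban", "rural"))  # "rural" is an assumed reference level
model.matrix(~ community)
# The 'communityurban' column is 1 for urban and 0 otherwise;
# the reference level is absorbed into the intercept.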

lm_fit <- lm(satisfaction ~ community, data = df)
summary(lm_fit)

Call:
lm(formula = satisfaction ~ community, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-3.397 -1.377  0.023  1.249  4.833 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.0769     0.2799  14.563   <2e-16 ***
communityurban  -0.8178     0.3676  -2.225   0.0284 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.814 on 98 degrees of freedom
Multiple R-squared:  0.04808,   Adjusted R-squared:  0.03837 
F-statistic:  4.95 on 1 and 98 DF,  p-value: 0.02838
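With a single 0/1 predictor, the regression simply reproduces the two group means: the intercept is the mean for the reference category, and the intercept plus the dummy coefficient is the mean for urban respondents.

# Sketch: fitted group means implied by the printed coefficients
c(reference = 4.0769, urban = 4.0769 - 0.8178)
# reference     urban
#    4.0769    3.2591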
lm_fit <- lm(satisfaction ~ education, data = df)
summary(lm_fit)

Call:
lm(formula = satisfaction ~ education, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5709 -1.4955  0.0437  1.3485  5.2117 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     3.0946     0.3086  10.026   <2e-16 ***
educationCollege Trade School   0.6032     0.4630   1.303   0.1957    
educationUniversity             0.9163     0.4306   2.128   0.0359 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.826 on 97 degrees of freedom
Multiple R-squared:  0.0456,    Adjusted R-squared:  0.02592 
F-statistic: 2.317 on 2 and 97 DF,  p-value: 0.104
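Education has three categories rather than two, so R creates two dummy variables, one for each non-reference level; the omitted category is absorbed into the intercept. A minimal sketch (the reference label “High School” is an assumption; the other two labels come from the output above):

# Sketch: a factor with three levels becomes two 0/1 dummy variables
education <- factor(c("High School", "College Trade School", "University"),
                    levels = c("High School", "College Trade School", "University"))
model.matrix(~ education)
# To change which category serves as the reference, relevel the factor before fitting:
# education <- relevel(education, ref = "University")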

The Multiple Linear Regression Model

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + ... +\epsilon\] In practice, linear regression will have multiple predictors. Life satisfaction will be modelled as a function of multiple factors.

Perhaps, we could have: \[life\_satisfaction = \beta_0 + \beta_1*income + \beta_2*physical\_health + \\ \beta_3*mental\_health + \beta_4*quality\_of\_infrastructures + ... +\epsilon\]

In our example, we can have: \[life\_satisfaction = \beta_0 + \beta_1*income + \beta_2*communityurban + \\ \beta_3*educationCollegeTradeSchool + \beta_4*educationUniversity + \epsilon\]

summary(fit <- lm(satisfaction ~ income + community + education, data = df))

Call:
lm(formula = satisfaction ~ income + community + education, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.54264 -0.44395  0.03309  0.54851  3.03890 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    1.17894    0.25357   4.649 1.07e-05 ***
income                         0.52328    0.03445  15.190  < 2e-16 ***
communityurban                -0.28474    0.19988  -1.425    0.158    
educationCollege Trade School -0.06170    0.24983  -0.247    0.805    
educationUniversity           -0.15727    0.23983  -0.656    0.514    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9701 on 95 degrees of freedom
Multiple R-squared:  0.7362,    Adjusted R-squared:  0.7251 
F-statistic: 66.27 on 4 and 95 DF,  p-value: < 2.2e-16

For multiple linear regression, we can plot the marginal effect of a variable, that is, its average effect while holding all other variables constant.

library(ggeffects)
plot(ggeffect(fit, terms = "income"))
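Alternatively, predict() gives fitted values for specific profiles while holding the other variables fixed (a sketch; the income values are illustrative, and “urban” and “University” are levels taken from the output above):

# Sketch: predicted satisfaction for two hypothetical profiles from the multiple regression 'fit'
new_profiles <- data.frame(
  income    = c(3, 6),       # illustrative income values
  community = "urban",
  education = "University"
)
predict(fit, newdata = new_profiles)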

Why Have Multiple Predictors?

Binary Dependent Variable

  • We can still fit a linear regression model if the dependent variable is binary. This is called the linear probability model.
  • Here, everything is interpreted in terms of probabilities: the model gives the probability of being satisfied (where being satisfied means answering 5 or more).
summary(lm(satisfaction_yes ~ income, data = df))

Call:
lm(formula = satisfaction_yes ~ income, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.73789 -0.24289  0.01137  0.29121  0.61695 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.22910    0.06634  -3.454 0.000818 ***
income       0.09803    0.01123   8.729 6.89e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3364 on 98 degrees of freedom
Multiple R-squared:  0.4374,    Adjusted R-squared:  0.4317 
F-statistic:  76.2 on 1 and 98 DF,  p-value: 6.888e-14
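Using the printed coefficients, we can compute the predicted probabilities directly (a small worked example with illustrative income values):

# Sketch: predicted probabilities implied by the printed coefficients
-0.22910 + 0.09803 * c(0, 5, 10)
# -0.229  0.261  0.751
# At income = 0 the prediction is below 0: the linear probability model can
# produce "probabilities" outside the 0-1 range.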

Assumptions of Linear Regression (FYI)

  1. Validity. The data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient.
  2. Representativeness. A regression model is fit to data and is used to make inferences about a larger population, hence the implicit assumption in interpreting regression coefficients is that the sample is representative of the population.
  3. Additivity and linearity. The deterministic component of the model is a linear, additive function of the separate predictors; i.e., it actually makes sense to model the outcome as \(y = \beta_0+\beta_1*X_1+\beta_2*X_2 ...\)
  4. Independence of errors. If you have repeated observations on some individuals, then this is violated and you will have to use other (related) models.
  5. Equal variance of errors (homoscedasticity; unequal variance is called heteroscedasticity). This is violated if, for example, life satisfaction is much more variable for people with high income than for people with low income (see the diagnostic-plot sketch after this list).
  6. Normality of errors. The distribution of the error term is relevant when predicting individual data points.
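A quick, informal way to check assumptions 5 and 6 is R’s built-in diagnostic plots for a fitted model (a sketch, assuming the slides’ df is loaded):

# Sketch: base-R diagnostic plots (assumes the slides' df is available)
lm_fit <- lm(satisfaction ~ income, data = df)
par(mfrow = c(2, 2))   # show the four default plots in one window
plot(lm_fit)           # residuals vs fitted (equal variance), normal Q-Q (normality), etc.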

When to Use Regression?

  • Linear regression can be used for prediction: as a simple machine learning model
  • Linear regression can be used for description: for group summaries or correlation
  • Linear regression can be used for data summary: you have several independent variables and you look at how they each affect the dependent variable
  • Linear regression can sometimes be used for the causal analysis of the effect of x on y: \[y = \beta_0 + \beta_1 x + \text{controls} + ... \]

For More Information