Week 10: Working with weighted data

class: center, middle, inverse, title-slide

.title[
# Week 10: Working with weighted data
]
.subtitle[
## PUBPOL 750 Data Analysis for Public Policy I
]
.author[
### Justin Savoie
]
.institute[
### MPP-DS McMaster
]
.date[
### 2023-11-22
]

---

class: inverse, center, middle

# Working with weighted data

---

# Weighted data

- Often, the data you will be working with is *weighted*
- Any data, in theory, can be weighted: some observations are made to count for more and some for less 
- Policy cases could be weighted, counties could be weighted
- When dealing with geographic data, weighting can adjust for over or under representation of certain areas
- In practice, weighted data is usually data collected on individuals (often public opinion research). Some individuals are less likely to be represented in your data, they may be harder to reach, and you want to increase their weight in your analysis
- Weights allow for a dataset to be corrected so that results more accurately represent the population being studied

---

# Sampling

- Why can we claim that a poll of 1000 respondents represent the views of 31,000,000 adult Canadians ?: **Random sampling**
- Random sampling: each member of the population must have an equal chance of being selected; the sample is likely to be a miniaturized version of the population
- How to get a random sample: random digit dialing + random selection within the household (but even that is not truly random)

- Complex survey designs: strata, clusters
- [Complex Surveys: a guide to analysis using R](https://r-survey.r-forge.r-project.org/svybook/)
- [Applied Survey Data Analysis](https://www.isr.umich.edu/src/smp/asda/)
- Complex surveys are still considered random

---

# Sampling when the sample is non-random

- Often we work with non-random samples, also called non-representative samples. 
- We treat them as random and use weights. Most polls conducted online are non-random samples.
- Recruitment for these polls can be done in various ways: opt-in, social media, ads, phone calls.
- Some polling companies do recruitment only by phone so they can say it's a representative.
- It's often said that we cannot have uncertainty estimates when the survey in non-representative sample.
- Sometimes it will say: "For comparison purposes, a probability sample of this size yields a margin of error no greater than ±2.4%, (19 times out of 20)"

---

# Working with existing weights

- Weights are often calculated by raking: adjusting weights algorithmically to match marginals distributions
- Marginal distributions are simply the general distribution in the population.
- If you weight on education and age and your sample is high on people with university education and old people then people with university education and old people will be down weighted.

- To calculate weights: you pick which variables you want to weight on. Provide the true numbers in the population for those distribution (marginal distribution). Let R algorithmically create weights.

---

# Working with existing weights

- Weights either have a mean of 1, sum to 1 or sum to the population size. 
- It's not very important which scale the weights are one. You can always scale them back (if the weights sum to 1, multiplying them by N has them average to 1, multiplying them by the population has them sum to the population, and vice versa).

---

# Simple weighted mean

```r
library(tidyverse)
library(haven)
library(survey)
df <- read_stata("~/Downloads/2019 Canadian Election Study - Phone Survey v1.1.dta")
# q14 -- How do you feel about the FEDERAL Liberal party?
# Has some negative values for missing which we want to remove to take the mean
df$q14[df$q14<0] <- NA
mean(df$q14,na.rm=TRUE)
```

```
## [1] 47.49962
```

```r
weighted.mean(df$q14,df$weight_CES,na.rm=TRUE)
```

```
## [1] 47.85755
```

```r
df <- df |> mutate(w1=1)
weighted.mean(df$q14,df$w1,na.rm=TRUE)
```

```
## [1] 47.49962
```

---

```r
# q31 -- Thinking about the economy, over the PAST YEAR, has CANADA's economy gott
df |>
  group_by(q31) |>
  summarise(nweighted=sum(weight_CES),n=n()) |>
  ungroup() |>
  mutate(wp=nweighted/sum(nweighted),p=n/sum(n))
```

```
## # A tibble: 5 × 5
##   q31                     nweighted     n      wp       p
##   <dbl+lbl>                   <dbl> <int>   <dbl>   <dbl>
## 1 -9 [(-9) Don't know]       113.     117 0.0282  0.0291 
## 2 -8 [(-8) Refused]            8.22     7 0.00204 0.00174
## 3  1 [(1) Better]            853.     818 0.212   0.203  
## 4  2 [(2) Worse]            1270.    1316 0.316   0.327  
## 5  3 [(3) About the same]   1776.    1763 0.442   0.438
```

```r
mean(df$q31==3)
```

```
## [1] 0.4384481
```

```r
weighted.mean(df$q31==3,df$weight_CES)
```

```
## [1] 0.4417099
```

---

The srvyr package has functions to calculate quantities of interest for you:

```r
# library(srvyr)
df |>
  as_survey_design(ids = 1, weight = weight_CES) |>
  filter(q31>0) |>
  summarise(survey_mean(q31))
```

```
## # A tibble: 1 × 2
##    coef  `_se`
##   <dbl>  <dbl>
## 1  2.24 0.0149
```

```
## # A tibble: 3 × 3
##   q31_factor          coef   `_se`
##   <fct>              <dbl>   <dbl>
## 1 (1) Better         0.219 0.00781
## 2 (2) Worse          0.326 0.00873
## 3 (3) About the same 0.455 0.00936
```

---

```
## # A tibble: 15 × 4
## # Groups:   age_range [5]
##    age_range               q31_factor          coef  `_se`
##    <dbl+lbl>               <fct>              <dbl>  <dbl>
##  1 1 [(1) 18-24 years old] (1) Better         0.151 0.0285
##  2 1 [(1) 18-24 years old] (2) Worse          0.394 0.0379
##  3 1 [(1) 18-24 years old] (3) About the same 0.455 0.0385
##  4 2 [(2) 25-34 years old] (1) Better         0.215 0.0210
##  5 2 [(2) 25-34 years old] (2) Worse          0.354 0.0245
##  6 2 [(2) 25-34 years old] (3) About the same 0.431 0.0254
##  7 3 [(3) 35-44 years old] (1) Better         0.199 0.0176
##  8 3 [(3) 35-44 years old] (2) Worse          0.357 0.0213
##  9 3 [(3) 35-44 years old] (3) About the same 0.444 0.0222
## 10 4 [(4) 45-54 years old] (1) Better         0.213 0.0180
## 11 4 [(4) 45-54 years old] (2) Worse          0.336 0.0203
## 12 4 [(4) 45-54 years old] (3) About the same 0.451 0.0217
## 13 5 [(5) 55+ years old]   (1) Better         0.238 0.0121
## 14 5 [(5) 55+ years old]   (2) Worse          0.294 0.0127
## 15 5 [(5) 55+ years old]   (3) About the same 0.468 0.0141
```

---

# Making your own weights

```r
# library(survey)
df_sub <- df |> select(q31,q4,q3,age_range) 
df_sub <- df_sub |> mutate_all(as_factor) |> mutate_all(as.character)

marginals <- c(`(Intercept)`= 1,
               c(#`q4(1) Newfoundland and Labrador` = 0.0115942028985507,
                 `q4(10) British Columbia` = 0.144927536231884,
                 `q4(2) Prince Edward Island` = 0.00289855072463768, 
                 `q4(3) Nova Scotia` = 0.0289855072463768,
                 `q4(4) New Brunswick` = 0.0289855072463768,
                 `q4(5) Quebec` = 0.231884057971014,
                 `q4(6) Ontario` = 0.376811594202899,
                 `q4(7) Manitoba` = 0.0289855072463768,
                 `q4(8) Saskatchewan` = 0.0289855072463768,
                 `q4(9) Alberta` = 0.115942028985507
),c(#"q3(1) Male"=0.48,
  "q3(2) Female"=0.48, "q3(3) Other"=0.04),
c(#"age_range(1) 18-24 years old"=0.1, 
  "age_range(2) 25-34 years old"=0.2, 
  "age_range(3) 35-44 years old"=0.15,
  "age_range(4) 45-54 years old"=0.2,
  "age_range(5) 55+ years old"=0.35))
```
---

```r
tmp_form <- paste(" ~ 1 + q4 + q3 + age_range")

surveyDesign <- svydesign(id = ~ 1, 
                          weights = ~ 1,
                          data = df_sub)
surveyDesign <- calibrate(design     = surveyDesign,
                          formula    = as.formula(tmp_form),
                          calfun     = "raking",
                          population = marginals,
                          maxit      = 2000)
df_sub$weight <- weights(surveyDesign)
```
---

```r
plot(df_sub$weight*nrow(df_sub),df$weight_CES)
```

![](Slides_files/figure-html/unnamed-chunk-7-1.png)
---

```r
df_sub |> arrange(desc(weight))
```

```
## # A tibble: 4,021 × 5
##    q31                q4                    q3         age_range          weight
##    <chr>              <chr>                 <chr>      <chr>               <dbl>
##  1 (3) About the same (10) British Columbia (3) Other  (2) 25-34 years … 4.00e-2
##  2 (3) About the same (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
##  3 (1) Better         (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
##  4 (2) Worse          (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
##  5 (1) Better         (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
##  6 (-9) Don't know    (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
##  7 (3) About the same (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
##  8 (3) About the same (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
##  9 (2) Worse          (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
## 10 (3) About the same (6) Ontario           (2) Female (1) 18-24 years … 8.88e-4
## # ℹ 4,011 more rows
```

```r
table(df_sub$q3)
```

```
## 
##   (1) Male (2) Female  (3) Other 
##       2272       1748          1
```
---

```r
df_sub$weight[df_sub$weight>quantile(df_sub$weight,0.99)] <- quantile(df_sub$weight,0.99)
plot(df_sub$weight*nrow(df_sub),df$weight_CES)
```

![](Slides_files/figure-html/unnamed-chunk-9-1.png)

---

# When and how to use weights ?

- If you are calculating descriptive statistics: always.
- Less clear when running regressions. See: https://www.jstor.org/stable/24735988#metadata_info_tab_contents
- If you run a regression, usually, you control for potential confounders. Usually control for the weighting variables? 
- In Canadian politics, often weighted by age, gender, mother tongue, region, education and presence of children in the household. Often weighted by a variable about politics.
- Weights aren't magic. If your sample is hopelessly non-representative weights perhaps won't be enough.

- Using weights is a "design-based" approach. There's another approach which is "model based". Popular recently. More complicated, more flexible.

---

```r
set.seed(232)
lm(mpg~hp,mtcars)
```

```
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           hp  
##    30.09886     -0.06823
```

```r
random_number <- rexp(32, rate = 10)
percent_cars <- random_number/sum(random_number)
mtcars$percent_cars <- percent_cars
lm(mpg~hp,mtcars,weights = percent_cars)
```

```
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars, weights = percent_cars)
## 
## Coefficients:
## (Intercept)           hp  
##    28.10076     -0.04982
```

---

```r
ggplot(mtcars,aes(x=hp,y=mpg)) +
  geom_point(size=percent_cars*32) + 
  geom_smooth(method="lm",se=FALSE,color="blue") +
  geom_smooth(method="lm",se=FALSE,mapping = aes(weight = percent_cars),color="red")
```

![](Slides_files/figure-html/unnamed-chunk-11-1.png)

---

## Do my weights (or my approach if I'm using a model based approach) make sense?

- Does my adjustment procedure actually helps me represent the population from my sample?

- At some point it makes sense to compare with some ground truth.

- [Looking at the estimated distribution of the canary variable \[something with a known distribution that was not adjusted for in your model\], and then comparing to external knowledge, is a way of checking your sampling procedure. It’s not an infallible check—your sample, or your adjusted sample, can be representative for one variable but not another—but it’s something you can do.](https://statmodeling.stat.columbia.edu/2023/02/07/checking-survey-representativeness-by-looking-at-canary-variables/)

---

## For more information on weights.

- [Complex Surveys: a guide to analysis using R](https://r-survey.r-forge.r-project.org/svybook/): excellent book to understand survey weights
- [Applied Survey Data Analysis](https://www.isr.umich.edu/src/smp/asda/): excellent book, more applied, to understand survey weights
- [For Weighting Online Opt-In Samples, What Matters Most?](https://www.pewresearch.org/methods/2018/01/26/for-weighting-online-opt-in-samples-what-matters-most/)
- [Struggles with Survey Weighting](http://www.stat.columbia.edu/~gelman/research/published/STS226.pdf): To learn more on model-based approaches
- [MRP Case Studies](https://bookdown.org/jl5522/MRP-case-studies/introduction-to-mrp.html): The model-based approach most used.
- [What Are We Weighting For?](https://www.jstor.org/stable/24735988?seq=1): Three economists on whether weights make sense for causal inference
- [Weights in statistics](https://statmodeling.stat.columbia.edu/2021/01/17/weights-in-statistics/): Andrew Gelman blog post
- [Weights in statistics](https://notstatschat.rbind.io/2020/08/04/weights-in-statistics/): Thomas Lumley blog post

---

- http://www.ces-eec.ca/
- https://odesi.ca/en/results?q=*&l=ANY&f=2010&t=2023&s=rel&c=opinion-polls,cora,Ipsos,ang-reid,gallup,LEGER,forum&page=1
- https://www.worldvaluessurvey.org/WVSDocumentationWV7.jsp
- https://electionstudies.org/data-center/2020-time-series-study/ (need to register)
- https://mccourt.georgetown.edu/master-of-science-in-data-science-for-public-policy/data-science-community/public-policy-data-sets-and-resources/
- https://medium.com/metapolicy/how-do-you-practice-data-science-b4c9df59bc9c