library(haven)
library(tidyverse)
library(srvyr)
library(survey)
df <- read_stata("~/Downloads/dataverse_files_CES2021/2021 Canadian Election Study v1.0.dta")
Using and understanding survey weights
With an example using the Canadian Election study
I use survey data a lot, and survey data always comes with weights. Unfortunately, it’s not always clear how to use them. For further discussion, see here or here.
There is, however, a consensus that for descriptive statistics (using observed data to represent a larger population), one should use weights.
There are different kinds of weights, and the terminology can be confusing. Thomas Lumley, the author of the survey package (the main package for the analysis of surveys in R), distinguishes precision weights, frequency weights, and sampling weights. These different weights lead to the same point estimates, but they differ in their standard errors, and that is where it can get complicated.
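To see why the point estimate is the same, here is a toy illustration (my own, not from the CES): the weighted mean is sum(w * x) / sum(w) however the weights are interpreted.

# Toy example: the same weighted mean regardless of weight type
x <- c(1, 0, 1, 1)
w <- c(2, 1, 1, 2)
sum(w * x) / sum(w)
# [1] 0.8333333
weighted.mean(x, w)  # base R equivalent
# [1] 0.8333333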
Here’s a nice deck by Prillaman. Lumley’s book is the reference.
Here, I focus on sampling weights because those are the ones we use in survey data; the ones that “describe how the sample can be scaled up to the population”. To be clear, the data we use in public opinion research and empirical social science are almost always non-probability samples. So all the theory about design effects and design weights discussed in the Lumley book and elsewhere doesn’t really apply.
That said, in practice we still weight. We usually use post-stratification: “the use of weights to assure that the sample totals equal some external total based on the target population. Essentially, it is “stratifying” after you have already run your sample.”
In practice, when we do post-stratification, we use raking. Post-stratification and raking are both weighting techniques used to adjust survey samples to match known population characteristics, but they differ in how they handle the adjustment. Post-stratification assigns weights based on predefined population cells or strata (e.g., specific age-gender-education groups), which requires the complete joint distribution of these variables in the population. In contrast, raking adjusts weights iteratively to match the marginal distributions of the variables (e.g., age, gender, and education separately), making it more flexible and applicable when joint distribution data are unavailable or sparse. So if you have reason to believe that, e.g., education varies a lot by age or gender, raking might not be appropriate. In practice, raking is used extensively.
I will use the 2021 Canadian Election Study v1.0.dta, available here, which you can load using haven::read_stata().
In descriptive statistics for survey data, we usually do two things: we calculate averages and we calculate proportions. I say “two things” but proportions can be seen as a special case of averages. A proportion represents the mean of a binary variable, where the variable takes a value of 1 for the presence of a characteristic and 0 for its absence. The average (mean) of this binary variable is equivalent to the proportion of respondents with the characteristic. I illustrate everything here using proportions but it’s all the same for means. The packages I use do both easily.
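A quick illustration with a made-up binary vector (my own toy data, not the CES):

# Toy binary variable: its mean is the proportion of 1s
voted <- c(1, 0, 1, 1, 0)
mean(voted)
# [1] 0.6
prop.table(table(voted))
# voted
#   0   1
# 0.4 0.6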
I’ll start without taking the weights into account. Let’s look at:
cps21_pol_eth It is important that politicians behave ethically in office.
- Strongly disagree (1)
- Somewhat disagree (2)
- Somewhat agree (3)
- Strongly agree (4)
- Don’t know / Prefer not to answer (5)
Here are our proportions, unweighted:
prop.table(table(df$cps21_pol_eth))
1 2 3 4 5
0.009489814 0.010099251 0.090806199 0.873846422 0.015758314
In tidyverse style, this is just counting:
df %>%
  select(cps21_pol_eth) %>%
  filter(complete.cases(.)) %>%
  group_by(cps21_pol_eth) %>%
  count() %>%
  ungroup() %>%
  mutate(estimate = n/sum(n))
# A tibble: 5 × 3
cps21_pol_eth n estimate
<dbl+lbl> <int> <dbl>
1 1 [Strongly disagree] 109 0.00949
2 2 [Somewhat disagree] 116 0.0101
3 3 [Somewhat agree] 1043 0.0908
4 4 [Strongly agree] 10037 0.874
5 5 [Don't know/ Prefer not to answer] 181 0.0158
Counting, by the way, is the same as using weights equal to 1 everywhere.
df %>%
  mutate(weight = 1) %>%
  select(cps21_pol_eth, weight) %>%
  filter(complete.cases(.)) %>%
  group_by(cps21_pol_eth) %>%
  summarise(sum_weight = sum(weight)) %>%
  mutate(proportion = sum_weight/sum(sum_weight))
# A tibble: 5 × 3
cps21_pol_eth sum_weight proportion
<dbl+lbl> <dbl> <dbl>
1 1 [Strongly disagree] 109 0.00949
2 2 [Somewhat disagree] 116 0.0101
3 3 [Somewhat agree] 1043 0.0908
4 4 [Strongly agree] 10037 0.874
5 5 [Don't know/ Prefer not to answer] 181 0.0158
The CES provides weights in the cps21_weight_general_all variable.
summary(df$cps21_weight_general_all)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.2000 0.5095 0.8047 1.0015 1.1812 5.0000 47
Weights used in public opinion surveys usually sum to the number of people in the survey (with a mean of 1). Here I have no idea why the mean is 1.0015. Normally it should equal 1.
Sometimes I’ve seen weights summing to 1 instead of summing to the number of observations. I’ve also seen weights summing to the population of Canada. It’s all the same: if weights sum to 1, multiplying them by the number of respondents puts them on the other scale. It makes no difference for proportion calculations. Most common are weights that sum to the number of observations, with an average of 1.
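A minimal sketch of that rescaling (toy weights, not CES data):

# Toy weights summing to 1; rescale so they sum to n (mean 1)
w <- c(0.2, 0.3, 0.5)
w_avg1 <- w / sum(w) * length(w)  # now sums to 3, mean 1
x <- c(1, 0, 1)
sum(w * x) / sum(w)               # 0.7
sum(w_avg1 * x) / sum(w_avg1)     # still 0.7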
Here’s a comparison of no weights (well, weights all equal to 1) and the real weights:
df %>%
  mutate(weight = 1) %>%
  select(cps21_pol_eth, weight, cps21_weight_general_all) %>%
  filter(complete.cases(.)) %>%
  group_by(cps21_pol_eth) %>%
  summarise(sum_weight = sum(weight),
            sum_real_weight = sum(cps21_weight_general_all)) %>%
  mutate(sum_weight/sum(sum_weight),
         sum_real_weight/sum(sum_real_weight))
# A tibble: 5 × 5
cps21_pol_eth sum_weight sum_real_weight sum_weight/sum(sum_w…¹
<dbl+lbl> <dbl> <dbl> <dbl>
1 1 [Strongly disagree] 108 127. 0.00943
2 2 [Somewhat disagree] 116 163. 0.0101
3 3 [Somewhat agree] 1042 1247. 0.0910
4 4 [Strongly agree] 10008 9653. 0.874
5 5 [Don't know/ Prefer not t… 180 276. 0.0157
# ℹ abbreviated name: ¹`sum_weight/sum(sum_weight)`
# ℹ 1 more variable: `sum_real_weight/sum(sum_real_weight)` <dbl>
The package srvyr has some functions to do all this.
to_model <- df %>%
  select(cps21_ResponseId, cps21_pol_eth, cps21_genderid,
         weight = cps21_weight_general_restricted) %>%
  drop_na()

# The only thing to pass is the weights, that's it
to_model %>%
  as_survey_design(weight = weight) %>%
  group_by(cps21_pol_eth) %>%
  summarise(survey_mean(), survey_mean(vartype = "ci"))
# A tibble: 5 × 5
cps21_pol_eth coef `_se` `_low` `_upp`
<dbl+lbl> <dbl> <dbl> <dbl> <dbl>
1 1 [Strongly disagree] 0.0105 0.00136 0.00782 0.0131
2 2 [Somewhat disagree] 0.0140 0.00175 0.0106 0.0175
3 3 [Somewhat agree] 0.110 0.00414 0.102 0.118
4 4 [Strongly agree] 0.841 0.00490 0.832 0.851
5 5 [Don't know/ Prefer not to answer] 0.0238 0.00219 0.0195 0.0281
We can of course imitate “no weights”:
to_model <- df %>%
  select(cps21_ResponseId, cps21_pol_eth, cps21_genderid,
         weight = cps21_weight_general_restricted) %>%
  drop_na()

# The only thing to pass is the weights, that's it
to_model %>%
  mutate(weight = 1) %>%
  as_survey_design(weight = weight) %>%
  group_by(cps21_pol_eth) %>%
  summarise(survey_mean(), survey_mean(vartype = "ci"))
# A tibble: 5 × 5
cps21_pol_eth coef `_se` `_low` `_upp`
<dbl+lbl> <dbl> <dbl> <dbl> <dbl>
1 1 [Strongly disagree] 0.00890 0.000924 0.00709 0.0107
2 2 [Somewhat disagree] 0.00997 0.000977 0.00805 0.0119
3 3 [Somewhat agree] 0.0914 0.00283 0.0858 0.0969
4 4 [Strongly agree] 0.874 0.00326 0.868 0.880
5 5 [Don't know/ Prefer not to answer] 0.0157 0.00122 0.0133 0.0181
We can break this down by another variable, e.g., gender:
to_model <- df %>%
  select(cps21_ResponseId, cps21_pol_eth, cps21_genderid,
         weight = cps21_weight_general_restricted) %>%
  drop_na()

# The only thing to pass is the weights, that's it
to_model %>%
  as_survey_design(weight = weight) %>%
  group_by(cps21_genderid, cps21_pol_eth) %>%
  summarise(survey_mean(), survey_mean(vartype = "ci"))
# A tibble: 15 × 6
# Groups: cps21_genderid [4]
cps21_genderid cps21_pol_eth coef `_se` `_low` `_upp`
<dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl>
1 1 [A man] 1 [Strongly … 0.0120 0.00207 0.00796 0.0161
2 1 [A man] 2 [Somewhat … 0.0171 0.00292 0.0114 0.0228
3 1 [A man] 3 [Somewhat … 0.111 0.00615 0.0986 0.123
4 1 [A man] 4 [Strongly … 0.845 0.00719 0.830 0.859
5 1 [A man] 5 [Don't kno… 0.0157 0.00264 0.0105 0.0208
6 2 [A woman] 1 [Strongly … 0.00898 0.00176 0.00553 0.0124
7 2 [A woman] 2 [Somewhat … 0.0110 0.00191 0.00723 0.0147
8 2 [A woman] 3 [Somewhat … 0.109 0.00555 0.0983 0.120
9 2 [A woman] 4 [Strongly … 0.839 0.00667 0.826 0.852
10 2 [A woman] 5 [Don't kno… 0.0321 0.00353 0.0252 0.0390
11 3 [Non-binary] 3 [Somewhat … 0.191 0.0687 0.0568 0.326
12 3 [Non-binary] 4 [Strongly … 0.759 0.0775 0.607 0.911
13 3 [Non-binary] 5 [Don't kno… 0.0496 0.0478 -0.0441 0.143
14 4 [Another gender, please spec… 3 [Somewhat … 0.111 0.0785 -0.0428 0.265
15 4 [Another gender, please spec… 4 [Strongly … 0.889 0.0785 0.735 1.04
If you actually want Likert averages (here we should remove the 5s, which are DKs):
# The only thing to pass is the weights, that's it
to_model %>%
  filter(cps21_pol_eth != 5) %>%
  as_survey_design(weight = weight) %>%
  group_by(cps21_genderid) %>%
  summarise(survey_mean(cps21_pol_eth))
# A tibble: 4 × 3
cps21_genderid coef `_se`
<dbl+lbl> <dbl> <dbl>
1 1 [A man] 3.82 0.0102
2 2 [A woman] 3.84 0.00854
3 3 [Non-binary] 3.80 0.0714
4 4 [Another gender, please specify:] 3.89 0.0785
How to create weights?
“Raking weights” (or iterative proportional fitting) is a method to create weights.
In short, raking involves:
- Selecting marginals: first, you identify key demographic (or other relevant) characteristics where you know both the distribution in your sample and the true distribution in the population (e.g., age groups, gender, education, region).
- Adjusting weights: for each characteristic in turn, you adjust the sample weights so that the weighted distribution matches the known population distribution.
- Iterating: because adjusting to one characteristic disturbs the others, the adjustments are repeated until the weighted distributions all converge (see the toy sketch below).
After the process, each survey respondent has a weight (or multiplier) that makes the sample data better resemble the actual population across the specified characteristics.
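Here is a minimal raking sketch on made-up data (my own toy example, not the CES and not the survey package’s implementation), matching two marginals by hand:

# Toy raking by hand: alternate adjustments until both marginals match
d <- data.frame(gender = c("M","M","F","F","F","M"),
                agegrp = c("young","old","young","old","young","young"))
d$w <- 1                                  # start with uniform weights
pop_gender <- c(M = 0.49, F = 0.51)       # assumed population marginals
pop_age    <- c(young = 0.40, old = 0.60)

rake_step <- function(w, g, target) {
  cur <- tapply(w, g, sum) / sum(w)       # current weighted marginal
  w * as.numeric(target[g] / cur[g])      # scale by target/current ratio
}
for (i in 1:50) {
  d$w <- rake_step(d$w, d$gender, pop_gender)
  d$w <- rake_step(d$w, d$agegrp, pop_age)
}
tapply(d$w, d$gender, sum) / sum(d$w)     # matches pop_gender
tapply(d$w, d$agegrp, sum) / sum(d$w)     # matches pop_age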
I don’t know what the marginals used for raking were here.
The documentation says: “Marginal values were successively weighted according to province, as well as gender, age group, and education level. All population data were taken from the 2016 Canadian census.”
So I could go back to the 2016 census and get them. But I can also just reverse engineer them from the weights.
Think about it: if I count the number of people (unweighted) in each group, I get each group’s share of the CES sample. If I instead sum the weights in each group, I get each group’s share adjusted with the weights, so adjusted to the census.
By summing the weights over the categories on which the weights were probably created, I can recover the marginals, that is, estimates in line with the census.
df %>%
  group_by(cps21_genderid) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  mutate(sw/sum(sw))
# A tibble: 4 × 3
cps21_genderid sw `sw/sum(sw)`
<dbl+lbl> <dbl> <dbl>
1 1 [A man] 10145. 0.484
2 2 [A woman] 10743. 0.513
3 3 [Non-binary] 49.6 0.00237
4 4 [Another gender, please specify:] 14.5 0.000690
df %>%
  mutate(age = cut(cps21_age, c(0,17,24,34,44,54,64,74,130))) %>%
  group_by(age) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  mutate(sw/sum(sw))
# A tibble: 7 × 3
age sw `sw/sum(sw)`
<fct> <dbl> <dbl>
1 (17,24] 1978. 0.0944
2 (24,34] 3740. 0.179
3 (34,44] 3415. 0.163
4 (44,54] 3288. 0.157
5 (54,64] 4106. 0.196
6 (64,74] 2599. 0.124
7 (74,130] 1827. 0.0872
df %>%
  group_by(cps21_province) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  mutate(sw/sum(sw))
# A tibble: 13 × 3
cps21_province sw `sw/sum(sw)`
<dbl+lbl> <dbl> <dbl>
1 1 [Alberta] 2351. 0.112
2 2 [British Columbia] 2843. 0.136
3 3 [Manitoba] 738. 0.0352
4 4 [New Brunswick] 457. 0.0218
5 5 [Newfoundland and Labrador] 321. 0.0153
6 6 [Northwest Territories] 0 0
7 7 [Nova Scotia] 568. 0.0271
8 8 [Nunavut] 0 0
9 9 [Ontario] 8043. 0.384
10 10 [Prince Edward Island] 85.9 0.00410
11 11 [Quebec] 4915. 0.235
12 12 [Saskatchewan] 631. 0.0301
13 13 [Yukon] 0 0
df %>%
  mutate(education = cut(cps21_education, c(0,5,7,12),
                         labels = c("HS or below","College","University"))) %>%
  group_by(education) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  mutate(sw/sum(sw))
# A tibble: 3 × 3
education sw `sw/sum(sw)`
<fct> <dbl> <dbl>
1 HS or below 7364. 0.351
2 College 6946. 0.332
3 University 6642. 0.317
Let’s use these marginals to create new weights and see if the result makes sense.
We need marginals that fit the categories in our survey exactly. Everything needs to match exactly; that’s the trickiest part.
The marginals also need to be ordered in the SAME WAY as the levels of the variable, which by default is alphabetical. So “College” comes first, before “HS or below”. The first time you do this, it’s a pain.
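A quick sketch of what that ordering looks like (plain base R, my own illustration):

# Factor levels sort alphabetically by default
x <- factor(c("HS or below", "University", "College"))
levels(x)
# [1] "College"     "HS or below" "University"
# The population vector passed to calibrate() needs names of the
# form paste0(variable, level), in that same order:
paste0("education_weight", levels(x))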
Let’s just remove the territories because they are messing up my calculations: you can’t calculate weights with either zero cases in the dataset or population proportions of zero. Usually people in the territories are grouped either together or with the provinces below them (YT->BC, NWT->AB, NU->MB).
temp <- df %>%
  filter(cps21_province %in% c(1:5,7,9:12)) %>%
  mutate(gender_weight = recode(as.numeric(cps21_genderid),
                                !!!c("1"="Male","2"="Female","3"="Other","4"="Other")),
         age_weight = cut(cps21_age, c(17,24,34,44,54,64,74,130)),
         education_weight = cut(cps21_education, c(0,5,7,12),
                                labels = c("HS or below","College","University")),
         province_weight = as_factor(cps21_province)) %>%
  select(cps21_ResponseId, cps21_weight_general_all, gender_weight,
         age_weight, education_weight, province_weight)

temp$province_weight <- droplevels(temp$province_weight)
temp$age_weight <- droplevels(temp$age_weight)
temp$education_weight <- as.character(temp$education_weight)
Let’s calculate the marginals again on this data.
Again, these are reverse engineered from the weights; in real life we would just get the real numbers from the census.
marginals_df <- temp %>%
  gather(key, value, -cps21_weight_general_all, -cps21_ResponseId) %>%
  group_by(key, value) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  group_by(key) %>%
  mutate(p = sw/sum(sw))
Warning: attributes are not identical across measure variables; they will be
dropped
`summarise()` has grouped output by 'key'. You can override using the `.groups`
argument.
Here is how we declare the marginals. The calibration is computed with a regression-like model, which is why we need a somewhat odd 1 as the intercept.
We also need to remove the first item in each category, just as a regression has n-1 coefficients for a categorical variable with n levels.
(marginals <- c(`(Intercept)` = 1,
                setNames(marginals_df$p, paste0(marginals_df$key, marginals_df$value))))
(Intercept)
1.000000000
age_weight(17,24]
0.094422846
age_weight(24,34]
0.178500119
age_weight(34,44]
0.162978329
age_weight(44,54]
0.156916524
age_weight(54,64]
0.195958704
age_weight(64,74]
0.124031229
age_weight(74,130]
0.087192247
education_weightCollege
0.331512600
education_weightHS or below
0.351483159
education_weightUniversity
0.317004241
gender_weightFemale
0.512728336
gender_weightMale
0.484215704
gender_weightOther
0.003055961
province_weightAlberta
0.112200125
province_weightBritish Columbia
0.135699820
province_weightManitoba
0.035199989
province_weightNew Brunswick
0.021799962
province_weightNewfoundland and Labrador
0.015299948
province_weightNova Scotia
0.027100011
province_weightOntario
0.383899990
province_weightPrince Edward Island
0.004100000
province_weightQuebec
0.234600153
province_weightSaskatchewan
0.030100003
<- marginals[-grep("age",names(marginals))[1]]
marginals <- marginals[-grep("education",names(marginals))[1]]
marginals <- marginals[-grep("province",names(marginals))[1]]
marginals <- marginals[-grep("gender",names(marginals))[1]] marginals
When creating weights, we need complete data on the variables we weight on.
Usually polling companies create weights using only the observations with data on all the weighting variables. They then assign w = 1 to the observations with missing data.
Because we pass proportions as population totals (with an intercept of 1), the calibrated weights will sum to 1, not to the sample size. I’m not going to show it here, but say you have 1,000 respondents and 50 have missing observations. The 950 calibrated weights would sum to 1, so I’d multiply them by 950; they’d now sum to 950. With the 50 remaining weights of 1 summing to 50, all 1,000 weights would sum to 1,000.
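A minimal sketch of that bookkeeping (hypothetical numbers, not the CES):

# 950 complete cases with calibrated weights summing to 1,
# 50 incomplete cases assigned weight 1
n_complete <- 950
w_cal <- rep(1 / n_complete, n_complete)  # stand-in for calibrated weights
w_scaled <- w_cal * n_complete            # rescaled: sums to 950, mean 1
w_all <- c(w_scaled, rep(1, 50))          # incomplete cases get weight 1
sum(w_all)                                # 1000, the full sample size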
temp <- temp %>% filter(complete.cases(.))

tmp_form <- paste(" ~ age_weight+education_weight+gender_weight+province_weight")

# Start from a design with uniform weights...
surveyDesign <- svydesign(id = ~ 1,
                          weights = ~ 1,
                          data = temp)

# ...then calibrate (rake) it to the population marginals
surveyDesign <- calibrate(design = surveyDesign,
                          formula = as.formula(tmp_form),
                          calfun = "raking",
                          population = marginals,
                          maxit = 2000)

temp$weight_created_withR <- weights(surveyDesign)
# rescale so the weights average 1 (sum to the sample size)
temp$weight_created_withR_avg1 <- temp$weight_created_withR * nrow(temp)
The weights we created are highly correlated with the real weights (0.95). We see different lines in the plot. This probably means that I am not weighting with exactly the same variables as the CES. Maybe they also weight on income, maybe they weight on six education categories, etc. I don’t know, but this gives an idea. Each line on the plot is one group that the CES weights on that I haven’t weighted on.
cor(temp$weight_created_withR_avg1,temp$cps21_weight_general_all)
[1] 0.9479647
plot(temp$weight_created_withR_avg1,temp$cps21_weight_general_all)
Kish’s Effective Sample Size
One trick I’ve used often is to calculate Kish’s effective sample size, (sum of the weights)^2 / (sum of the squared weights), to get a rough idea of the effective sample. If all weights are equal, the effective sample size equals the sample size.
# sample size
length(temp$weight_created_withR_avg1)
[1] 20921
# effective sample size
sum(temp$weight_created_withR_avg1)^2/sum(temp$weight_created_withR_avg1^2)
[1] 13664.15
We can then use these approximate sample size numbers to calculate standard errors and confidence intervals. For example, for a proportion, we can use the formulas here.
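For example, a rough sketch using the usual normal-approximation formula for a proportion, plugging in the effective sample size instead of n (my own sketch; the 0.84 is just an illustrative value):

w     <- temp$weight_created_withR_avg1
n_eff <- sum(w)^2 / sum(w^2)       # Kish's effective sample size (~13664 above)
p     <- 0.84                      # e.g., a weighted proportion of interest
se    <- sqrt(p * (1 - p) / n_eff) # SE with n_eff in place of n
c(p - 1.96 * se, p + 1.96 * se)    # approximate 95% CI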
In the POR (public opinion research) surveys I’ve used, collected by external providers, the effective sample size is often 60-80% of the sample size. That’s good to know. If you have a sample size of 3,000 and the effective sample size is 2,100, that’s fine. But if you have a weird opt-in survey with a sample size of 10,000 and, after weighting, the effective sample is 1,900 because some weights are huge, that’s not great. Of course, the more categories there are in the marginals you use to weight (either more variables, or more categories within those variables), the smaller the effective sample size will be. In general, the effective sample size is much smaller than the sample size when the sample’s distribution on the weighting variables is very different from the census marginals you are weighting to, and the variance of the weights is high.