library(haven)
library(tidyverse)
library(srvyr)
library(survey)
df <- read_stata("~/Downloads/dataverse_files_CES2021/2021 Canadian Election Study v1.0.dta")
Using and understanding survey weights
With an example using the Canadian Election study
I use survey data a lot, and survey data always comes with weights. Unfortunately, it’s not always clear how to use them. For further discussion, see here or here.
There is, however, a consensus that for descriptive statistics (using observed data to represent a larger population), one should use weights.
There are different kinds of weights, and the terminology can be confusing. Thomas Lumley, the author of the survey package (the main package for the analysis of surveys in R), distinguishes precision weights, frequency weights, and sampling weights. These different weights lead to the same point estimates, but they differ in their standard errors, and that is where it can get complicated.
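To see why the point estimate is the same, here is a toy illustration (my own, not from the CES): the weighted mean is sum(w * x) / sum(w) however the weights are interpreted.

# Toy example: the same weighted mean regardless of weight type
x <- c(1, 0, 1, 1)
w <- c(2, 1, 1, 2)
sum(w * x) / sum(w)
# [1] 0.8333333
weighted.mean(x, w)  # base R equivalent
# [1] 0.8333333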
Here’s a nice deck by Prillaman. Lumley’s book is the reference.
Here, I focus on sampling weights because those are the ones we use in survey data; the ones that “describe how the sample can be scaled up to the population”. To be clear, the data we use in public opinion research and empirical social science are almost always non-probability samples. So all the theory about design effects and design weights discussed in the Lumley book and elsewhere doesn’t really apply.
That said, in practice we still weight. We usually use post-stratification: “the use of weights to assure that the sample totals equal some external total based on the target population. Essentially, it is “stratifying” after you have already run your sample.”
In practice, when we do post-stratification, we use raking. Post-stratification and raking are both weighting techniques used to adjust survey samples to match known population characteristics, but they differ in how they handle the adjustment. Post-stratification assigns weights based on predefined population cells or strata (e.g., specific age-gender-education groups), which requires the complete joint distribution of these variables in the population. In contrast, raking adjusts weights iteratively to match the marginal distributions of the variables (e.g., age, gender, and education separately), making it more flexible and applicable when joint distribution data are unavailable or sparse. So if you have reason to believe that, e.g., education varies a lot by age or gender, raking might not be appropriate. In practice, raking is used extensively.
I will use the 2021 Canadian Election Study v1.0.dta, available here, which you can load using haven::read_stata().
In descriptive statistics for survey data, we usually do two things: we calculate averages and we calculate proportions. I say “two things” but proportions can be seen as a special case of averages. A proportion represents the mean of a binary variable, where the variable takes a value of 1 for the presence of a characteristic and 0 for its absence. The average (mean) of this binary variable is equivalent to the proportion of respondents with the characteristic. I illustrate everything here using proportions but it’s all the same for means. The packages I use do both easily.
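A quick illustration with a made-up binary vector (my own toy data, not the CES):

# Toy binary variable: its mean is the proportion of 1s
voted <- c(1, 0, 1, 1, 0)
mean(voted)
# [1] 0.6
prop.table(table(voted))
# voted
#   0   1
# 0.4 0.6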
I’ll start without taking the weights into account. Let’s look at:
cps21_pol_eth It is important that politicians behave ethically in office.
- Strongly disagree (1)
- Somewhat disagree (2)
- Somewhat agree (3)
- Strongly agree (4)
- Don’t know / Prefer not to answer (5)
Here are our proportions, unweighted:
prop.table(table(df$cps21_pol_eth))
1 2 3 4 5
0.009489814 0.010099251 0.090806199 0.873846422 0.015758314
In tidyverse style, this is just counting:
df %>%
  select(cps21_pol_eth) %>%
  filter(complete.cases(.)) %>%
  group_by(cps21_pol_eth) %>%
  count() %>%
  ungroup() %>%
  mutate(estimate = n/sum(n))
# A tibble: 5 × 3
cps21_pol_eth n estimate
<dbl+lbl> <int> <dbl>
1 1 [Strongly disagree] 109 0.00949
2 2 [Somewhat disagree] 116 0.0101
3 3 [Somewhat agree] 1043 0.0908
4 4 [Strongly agree] 10037 0.874
5 5 [Don't know/ Prefer not to answer] 181 0.0158
Counting, by the way, is the same as using weights equal to 1 everywhere.
df %>%
  mutate(weight = 1) %>%
  select(cps21_pol_eth, weight) %>%
  filter(complete.cases(.)) %>%
  group_by(cps21_pol_eth) %>%
  summarise(sum_weight = sum(weight)) %>%
  mutate(proportion = sum_weight/sum(sum_weight))
# A tibble: 5 × 3
cps21_pol_eth sum_weight proportion
<dbl+lbl> <dbl> <dbl>
1 1 [Strongly disagree] 109 0.00949
2 2 [Somewhat disagree] 116 0.0101
3 3 [Somewhat agree] 1043 0.0908
4 4 [Strongly agree] 10037 0.874
5 5 [Don't know/ Prefer not to answer] 181 0.0158
The CES provides weights in the cps21_weight_general_all variable.
summary(df$cps21_weight_general_all)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.2000 0.5095 0.8047 1.0015 1.1812 5.0000 47
Weights used in public opinion surveys usually sum to the number of people in the survey (with a mean of 1). Here I have no idea why the mean is 1.0015. Normally it should equal 1.
Sometimes I’ve seen weights summing to 1 instead of summing to the number of observations. I’ve also seen weights summing to the population of Canada. It’s all the same: if weights sum to 1, multiplying them by the number of respondents puts them on the other scale. It makes no difference for proportion calculations. Most common are weights that sum to the number of observations, with an average of 1.
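A minimal sketch of that rescaling (toy weights, not CES data):

# Toy weights summing to 1; rescale so they sum to n (mean 1)
w <- c(0.2, 0.3, 0.5)
w_avg1 <- w / sum(w) * length(w)  # now sums to 3, mean 1
x <- c(1, 0, 1)
sum(w * x) / sum(w)               # 0.7
sum(w_avg1 * x) / sum(w_avg1)     # still 0.7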
Here’s a comparison of no weights (well, weights all equal to 1) and the real weights:
df %>%
  mutate(weight = 1) %>%
  select(cps21_pol_eth, weight, cps21_weight_general_all) %>%
  filter(complete.cases(.)) %>%
  group_by(cps21_pol_eth) %>%
  summarise(sum_weight = sum(weight),
            sum_real_weight = sum(cps21_weight_general_all)) %>%
  mutate(sum_weight/sum(sum_weight),
         sum_real_weight/sum(sum_real_weight))
# A tibble: 5 × 5
cps21_pol_eth sum_weight sum_real_weight sum_weight/sum(sum_w…¹
<dbl+lbl> <dbl> <dbl> <dbl>
1 1 [Strongly disagree] 108 127. 0.00943
2 2 [Somewhat disagree] 116 163. 0.0101
3 3 [Somewhat agree] 1042 1247. 0.0910
4 4 [Strongly agree] 10008 9653. 0.874
5 5 [Don't know/ Prefer not t… 180 276. 0.0157
# ℹ abbreviated name: ¹`sum_weight/sum(sum_weight)`
# ℹ 1 more variable: `sum_real_weight/sum(sum_real_weight)` <dbl>
The package srvyr has some functions to do all this.
to_model <- df %>%
  select(cps21_ResponseId, cps21_pol_eth, cps21_genderid,
         weight = cps21_weight_general_restricted) %>%
  drop_na()

# The only thing to pass is the weights, that's it
to_model %>%
  as_survey_design(weight = weight) %>%
  group_by(cps21_pol_eth) %>%
  summarise(survey_mean(), survey_mean(vartype = "ci"))
# A tibble: 5 × 5
cps21_pol_eth coef `_se` `_low` `_upp`
<dbl+lbl> <dbl> <dbl> <dbl> <dbl>
1 1 [Strongly disagree] 0.0105 0.00136 0.00782 0.0131
2 2 [Somewhat disagree] 0.0140 0.00175 0.0106 0.0175
3 3 [Somewhat agree] 0.110 0.00414 0.102 0.118
4 4 [Strongly agree] 0.841 0.00490 0.832 0.851
5 5 [Don't know/ Prefer not to answer] 0.0238 0.00219 0.0195 0.0281
We can of course imitate “no weights”:
to_model <- df %>%
  select(cps21_ResponseId, cps21_pol_eth, cps21_genderid,
         weight = cps21_weight_general_restricted) %>%
  drop_na()

# The only thing to pass is the weights, that's it
to_model %>%
  mutate(weight = 1) %>%
  as_survey_design(weight = weight) %>%
  group_by(cps21_pol_eth) %>%
  summarise(survey_mean(), survey_mean(vartype = "ci"))
# A tibble: 5 × 5
cps21_pol_eth coef `_se` `_low` `_upp`
<dbl+lbl> <dbl> <dbl> <dbl> <dbl>
1 1 [Strongly disagree] 0.00890 0.000924 0.00709 0.0107
2 2 [Somewhat disagree] 0.00997 0.000977 0.00805 0.0119
3 3 [Somewhat agree] 0.0914 0.00283 0.0858 0.0969
4 4 [Strongly agree] 0.874 0.00326 0.868 0.880
5 5 [Don't know/ Prefer not to answer] 0.0157 0.00122 0.0133 0.0181
We can break this down by another variable, e.g., gender:
to_model <- df %>%
  select(cps21_ResponseId, cps21_pol_eth, cps21_genderid,
         weight = cps21_weight_general_restricted) %>%
  drop_na()

# The only thing to pass is the weights, that's it
to_model %>%
  as_survey_design(weight = weight) %>%
  group_by(cps21_genderid, cps21_pol_eth) %>%
  summarise(survey_mean(), survey_mean(vartype = "ci"))
# A tibble: 15 × 6
# Groups: cps21_genderid [4]
cps21_genderid cps21_pol_eth coef `_se` `_low` `_upp`
<dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl>
1 1 [A man] 1 [Strongly … 0.0120 0.00207 0.00796 0.0161
2 1 [A man] 2 [Somewhat … 0.0171 0.00292 0.0114 0.0228
3 1 [A man] 3 [Somewhat … 0.111 0.00615 0.0986 0.123
4 1 [A man] 4 [Strongly … 0.845 0.00719 0.830 0.859
5 1 [A man] 5 [Don't kno… 0.0157 0.00264 0.0105 0.0208
6 2 [A woman] 1 [Strongly … 0.00898 0.00176 0.00553 0.0124
7 2 [A woman] 2 [Somewhat … 0.0110 0.00191 0.00723 0.0147
8 2 [A woman] 3 [Somewhat … 0.109 0.00555 0.0983 0.120
9 2 [A woman] 4 [Strongly … 0.839 0.00667 0.826 0.852
10 2 [A woman] 5 [Don't kno… 0.0321 0.00353 0.0252 0.0390
11 3 [Non-binary] 3 [Somewhat … 0.191 0.0687 0.0568 0.326
12 3 [Non-binary] 4 [Strongly … 0.759 0.0775 0.607 0.911
13 3 [Non-binary] 5 [Don't kno… 0.0496 0.0478 -0.0441 0.143
14 4 [Another gender, please spec… 3 [Somewhat … 0.111 0.0785 -0.0428 0.265
15 4 [Another gender, please spec… 4 [Strongly … 0.889 0.0785 0.735 1.04
If you actually want Likert averages (here we should remove the 5s, which are DKs):
# The only thing to pass is the weights, that's it
to_model %>%
  filter(cps21_pol_eth != 5) %>%
  as_survey_design(weight = weight) %>%
  group_by(cps21_genderid) %>%
  summarise(survey_mean(cps21_pol_eth))
# A tibble: 4 × 3
cps21_genderid coef `_se`
<dbl+lbl> <dbl> <dbl>
1 1 [A man] 3.82 0.0102
2 2 [A woman] 3.84 0.00854
3 3 [Non-binary] 3.80 0.0714
4 4 [Another gender, please specify:] 3.89 0.0785
How to create weights?
“Raking weights” (or iterative proportional fitting) is a method to create weights.
In short, raking involves:
- Selecting marginals: first, you identify key demographic (or other relevant) characteristics where you know both the distribution in your sample and the true distribution in the population (e.g., age groups, gender, education, region).
- Adjusting weights: for each characteristic in turn, you adjust the sample weights so that the weighted distribution matches the known population distribution.
- Iterating: because adjusting to one characteristic disturbs the others, the adjustments are repeated until the weighted distributions all converge (see the toy sketch below).
After the process, each survey respondent has a weight (or multiplier) that makes the sample data better resemble the actual population across the specified characteristics.
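Here is a minimal raking sketch on made-up data (my own toy example, not the CES and not the survey package’s implementation), matching two marginals by hand:

# Toy raking by hand: alternate adjustments until both marginals match
d <- data.frame(gender = c("M","M","F","F","F","M"),
                agegrp = c("young","old","young","old","young","young"))
d$w <- 1                                  # start with uniform weights
pop_gender <- c(M = 0.49, F = 0.51)       # assumed population marginals
pop_age    <- c(young = 0.40, old = 0.60)

rake_step <- function(w, g, target) {
  cur <- tapply(w, g, sum) / sum(w)       # current weighted marginal
  w * as.numeric(target[g] / cur[g])      # scale by target/current ratio
}
for (i in 1:50) {
  d$w <- rake_step(d$w, d$gender, pop_gender)
  d$w <- rake_step(d$w, d$agegrp, pop_age)
}
tapply(d$w, d$gender, sum) / sum(d$w)     # matches pop_gender
tapply(d$w, d$agegrp, sum) / sum(d$w)     # matches pop_age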
I don’t know what the marginals used for raking were here.
The documentation says: “Marginal values were successively weighted according to province, as well as gender, age group, and education level. All population data were taken from the 2016 Canadian census.”
So I could go back to the 2016 census and get them. But I can also just reverse engineer them from the weights.
Think about it: if I count the number of people (unweighted) in each group, I get each group’s share of the CES sample. If I instead sum the weights in each group, I get each group’s share adjusted with the weights, so adjusted to the census.
By summing the weights over the categories on which the weights were probably created, I can recover the marginals, that is, estimates in line with the census.
df %>%
  group_by(cps21_genderid) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  mutate(sw/sum(sw))
# A tibble: 4 × 3
cps21_genderid sw `sw/sum(sw)`
<dbl+lbl> <dbl> <dbl>
1 1 [A man] 10145. 0.484
2 2 [A woman] 10743. 0.513
3 3 [Non-binary] 49.6 0.00237
4 4 [Another gender, please specify:] 14.5 0.000690
df %>%
  mutate(age = cut(cps21_age, c(0,17,24,34,44,54,64,74,130))) %>%
  group_by(age) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  mutate(sw/sum(sw))
# A tibble: 7 × 3
age sw `sw/sum(sw)`
<fct> <dbl> <dbl>
1 (17,24] 1978. 0.0944
2 (24,34] 3740. 0.179
3 (34,44] 3415. 0.163
4 (44,54] 3288. 0.157
5 (54,64] 4106. 0.196
6 (64,74] 2599. 0.124
7 (74,130] 1827. 0.0872
df %>%
  group_by(cps21_province) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  mutate(sw/sum(sw))
# A tibble: 13 × 3
cps21_province sw `sw/sum(sw)`
<dbl+lbl> <dbl> <dbl>
1 1 [Alberta] 2351. 0.112
2 2 [British Columbia] 2843. 0.136
3 3 [Manitoba] 738. 0.0352
4 4 [New Brunswick] 457. 0.0218
5 5 [Newfoundland and Labrador] 321. 0.0153
6 6 [Northwest Territories] 0 0
7 7 [Nova Scotia] 568. 0.0271
8 8 [Nunavut] 0 0
9 9 [Ontario] 8043. 0.384
10 10 [Prince Edward Island] 85.9 0.00410
11 11 [Quebec] 4915. 0.235
12 12 [Saskatchewan] 631. 0.0301
13 13 [Yukon] 0 0
df %>%
  mutate(education = cut(cps21_education, c(0,5,7,12),
                         labels = c("HS or below","College","University"))) %>%
  group_by(education) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  mutate(sw/sum(sw))
# A tibble: 3 × 3
education sw `sw/sum(sw)`
<fct> <dbl> <dbl>
1 HS or below 7364. 0.351
2 College 6946. 0.332
3 University 6642. 0.317
Let’s use these marginals to create new weights and see if the result makes sense.
We need marginals that fit the categories in our survey exactly. Everything needs to match exactly; that’s the trickiest part.
The marginals also need to be ordered in the SAME WAY as the levels of the variable, which by default is alphabetical. So “College” comes first, before “HS or below”. The first time you do this, it’s a pain.
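A quick sketch of what that ordering looks like (plain base R, my own illustration):

# Factor levels sort alphabetically by default
x <- factor(c("HS or below", "University", "College"))
levels(x)
# [1] "College"     "HS or below" "University"
# The population vector passed to calibrate() needs names of the
# form paste0(variable, level), in that same order:
paste0("education_weight", levels(x))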
Let’s just remove the territories because they are messing up my calculations: you can’t calculate weights with either zero cases in the dataset or population proportions of zero. Usually people in the territories are grouped either together or with the provinces below them (YT->BC, NWT->AB, NU->MB).
temp <- df %>%
  filter(cps21_province %in% c(1:5,7,9:12)) %>%
  mutate(gender_weight = recode(as.numeric(cps21_genderid),
                                !!!c("1"="Male","2"="Female","3"="Other","4"="Other")),
         age_weight = cut(cps21_age, c(17,24,34,44,54,64,74,130)),
         education_weight = cut(cps21_education, c(0,5,7,12),
                                labels = c("HS or below","College","University")),
         province_weight = as_factor(cps21_province)) %>%
  select(cps21_ResponseId, cps21_weight_general_all, gender_weight,
         age_weight, education_weight, province_weight)

temp$province_weight <- droplevels(temp$province_weight)
temp$age_weight <- droplevels(temp$age_weight)
temp$education_weight <- as.character(temp$education_weight)
Let’s calculate the marginals again on this data.
Again, these are reverse engineered from the weights; in real life we would just get the real numbers from the census.
marginals_df <- temp %>%
  gather(key, value, -cps21_weight_general_all, -cps21_ResponseId) %>%
  group_by(key, value) %>%
  summarise(sw = sum(cps21_weight_general_all, na.rm = TRUE)) %>%
  group_by(key) %>%
  mutate(p = sw/sum(sw))
Warning: attributes are not identical across measure variables; they will be
dropped
`summarise()` has grouped output by 'key'. You can override using the `.groups`
argument.
Here is how we declare the marginals. The calibration is computed with a regression-like model, which is why we need a somewhat odd 1 as the intercept.
We also need to remove the first item in each category, just as a regression has n-1 coefficients for a categorical variable with n levels.
(marginals <- c(`(Intercept)` = 1,
                setNames(marginals_df$p, paste0(marginals_df$key, marginals_df$value))))
(Intercept)
1.000000000
age_weight(17,24]
0.094422846
age_weight(24,34]
0.178500119
age_weight(34,44]
0.162978329
age_weight(44,54]
0.156916524
age_weight(54,64]
0.195958704
age_weight(64,74]
0.124031229
age_weight(74,130]
0.087192247
education_weightCollege
0.331512600
education_weightHS or below
0.351483159
education_weightUniversity
0.317004241
gender_weightFemale
0.512728336
gender_weightMale
0.484215704
gender_weightOther
0.003055961
province_weightAlberta
0.112200125
province_weightBritish Columbia
0.135699820
province_weightManitoba
0.035199989
province_weightNew Brunswick
0.021799962
province_weightNewfoundland and Labrador
0.015299948
province_weightNova Scotia
0.027100011
province_weightOntario
0.383899990
province_weightPrince Edward Island
0.004100000
province_weightQuebec
0.234600153
province_weightSaskatchewan
0.030100003
<- marginals[-grep("age",names(marginals))[1]]
marginals <- marginals[-grep("education",names(marginals))[1]]
marginals <- marginals[-grep("province",names(marginals))[1]]
marginals <- marginals[-grep("gender",names(marginals))[1]] marginals
When creating weights, we need complete data on the variables we weight on.
Usually polling companies create weights using only the observations with data on all the weighting variables. They then assign w = 1 to the observations with missing data.
Because we pass proportions as population totals (with an intercept of 1), the calibrated weights will sum to 1, not to the sample size. I’m not going to show it here, but say you have 1,000 respondents and 50 have missing observations. The 950 calibrated weights would sum to 1, so I’d multiply them by 950; they’d now sum to 950. With the 50 remaining weights of 1 summing to 50, all 1,000 weights would sum to 1,000.
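A minimal sketch of that bookkeeping (hypothetical numbers, not the CES):

# 950 complete cases with calibrated weights summing to 1,
# 50 incomplete cases assigned weight 1
n_complete <- 950
w_cal <- rep(1 / n_complete, n_complete)  # stand-in for calibrated weights
w_scaled <- w_cal * n_complete            # rescaled: sums to 950, mean 1
w_all <- c(w_scaled, rep(1, 50))          # incomplete cases get weight 1
sum(w_all)                                # 1000, the full sample size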
temp <- temp %>% filter(complete.cases(.))

tmp_form <- paste(" ~ age_weight+education_weight+gender_weight+province_weight")

# Start from a design with uniform weights...
surveyDesign <- svydesign(id = ~ 1,
                          weights = ~ 1,
                          data = temp)

# ...then calibrate (rake) it to the population marginals
surveyDesign <- calibrate(design = surveyDesign,
                          formula = as.formula(tmp_form),
                          calfun = "raking",
                          population = marginals,
                          maxit = 2000)

temp$weight_created_withR <- weights(surveyDesign)
# rescale so the weights average 1 (sum to the sample size)
temp$weight_created_withR_avg1 <- temp$weight_created_withR * nrow(temp)
The weights we created are highly correlated with the real weights (0.95). We see different lines in the plot. This probably means that I am not weighting with exactly the same variables as the CES. Maybe they also weight on income, maybe they weight on six education categories, etc. I don’t know, but this gives an idea. Each line on the plot is one group that the CES weights on that I haven’t weighted on.
cor(temp$weight_created_withR_avg1,temp$cps21_weight_general_all)
[1] 0.9479647
plot(temp$weight_created_withR_avg1,temp$cps21_weight_general_all)
Kish’s Effective Sample Size
One trick I’ve used often is to calculate Kish’s effective sample size, (sum of the weights)^2 / (sum of the squared weights), to get a rough idea of the effective sample. If all weights are equal, the effective sample size equals the sample size.
# sample size
length(temp$weight_created_withR_avg1)
[1] 20921
# effective sample size
sum(temp$weight_created_withR_avg1)^2/sum(temp$weight_created_withR_avg1^2)
[1] 13664.15
We can then use these approximate sample size numbers to calculate standard errors and confidence intervals. For example, for a proportion, we can use the formulas here.
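For example, a rough sketch using the usual normal-approximation formula for a proportion, plugging in the effective sample size instead of n (my own sketch; the 0.84 is just an illustrative value):

w     <- temp$weight_created_withR_avg1
n_eff <- sum(w)^2 / sum(w^2)       # Kish's effective sample size (~13664 above)
p     <- 0.84                      # e.g., a weighted proportion of interest
se    <- sqrt(p * (1 - p) / n_eff) # SE with n_eff in place of n
c(p - 1.96 * se, p + 1.96 * se)    # approximate 95% CI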
In the POR (public opinion research) surveys I’ve used, collected by external providers, the effective sample size is often 60-80% of the sample size. That’s good to know. If you have a sample size of 3,000 and the effective sample size is 2,100, that’s fine. But if you have a weird opt-in survey with a sample size of 10,000 and, after weighting, the effective sample is 1,900 because some weights are huge, that’s not great. Of course, the more categories there are in the marginals you use to weight (either more variables, or more categories within those variables), the smaller the effective sample size will be. In general, the effective sample size is much smaller than the sample size when the sample’s distribution on the weighting variables is very different from the census marginals you are weighting to, and the variance of the weights is high.