Project 1 - Exploring data from the 2019 Canadian Election Study

In this project, we use data from the 2019 Canadian Election Study (CES) to produce an exploratory data analysis. We start with a univariate exploratory data analysis. Then we move to bivariate analysis.

Section 1 is a code along. You just have to run the code. There’s no code to write. However, there are questions (marked QUESTION: in red) to answer. Answer them directly in the Rmd file.

In Section 2, I ask you to run an analysis similar to the one in Section 1, but on other variables of your choice. You can pick any variables from the 2019 CES, or from another dataset if you prefer.

Project 1 is due on July 8. When you are done, knit this R Markdown file to html. Submit both the html file and this .Rmd (R Markdown) file.

Section 1 - Exploring variables (code along)

Loading packages, loading the data

library(tidyverse)
# The CES data provided is in Stata13 format, so we need readstata13
library(readstata13)
# We need e1071 for kurtosis and skewness
library(e1071)
# We need kableExtra to produce nice html data tables
library(kableExtra)
# Read in the data, assign to df
df <- read.dta13("~/Downloads/2019 Canadian Election Study - Phone Survey v1.0.dta")
# Let's make the data frame a tibble
df <- as_tibble(df)

Use the glimpse function on the dataset.

glimpse(df)
## Rows: 4,021
## Columns: 273
## $ sample_id              <int> 18, 32, 39, 59, 61, 69, 157, 158, 165, 167, 185…
## $ survey_end_CES         <chr> "2019-09-23 15:48:29-06", "2019-09-12 18:02:30-…
## $ survey_end_month_CES   <int> 9, 9, 9, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 9, …
## $ survey_end_day_CES     <int> 23, 12, 10, 10, 12, 17, 12, 14, 10, 12, 16, 12,…
## $ num_attempts_CES       <int> 5, 1, 1, 6, 1, 1, 1, 4, 1, 1, 4, 1, 9, 2, 1, 1,…
## $ interviewer_id_CES     <int> 161182, 151152, 161182, 147601, 151152, 2503, 2…
## $ interviewer_gender_CES <chr> "Female", "Male", "Female", "Female", "Male", "…
## $ language_CES           <fct> (2) French, (1) English, (2) French, (2) French…
## $ phonetype_CES          <fct> (2) Wireless, (2) Wireless, (2) Wireless, (2) W…
## $ survey_end_PES         <chr> "2019-11-08 14:24:14-07", "", "2019-11-09 13:08…
## $ survey_end_month_PES   <int> 11, NA, 11, NA, 11, 10, 10, 11, NA, 10, 11, NA,…
## $ survey_end_day_PES     <int> 8, NA, 9, NA, 4, 28, 28, 7, NA, 25, 6, NA, NA, …
## $ num_attempts_PES       <int> 4, NA, 4, NA, 3, 1, 3, 3, NA, 0, 2, NA, NA, 0, …
## $ interviewer_id_PES     <int> 2503, NA, 161182, NA, 164893, 2504, 164893, 164…
## $ interviewer_gender_PES <chr> "Female", "", "Female", "", "Female", "Female",…
## $ language_PES           <fct> (2) French, NA, (2) French, NA, (2) French, (2)…
## $ phonetype_PES          <fct> (2) Wireless, NA, (2) Wireless, NA, (2) Wireles…
## $ mode_PES               <fct> (1) CATI, NA, (1) CATI, NA, (1) CATI, (1) CATI,…
## $ phone_type             <fct> (2) Wireless only, (2) Wireless only, (2) Wirel…
## $ weight_CES             <dbl> 0.9019529, 0.9019529, 0.9019529, 1.2334642, 0.9…
## $ weight_PES             <dbl> 1.030709, NA, 1.030709, NA, 1.030709, 1.030709,…
## $ c1                     <fct> (2) No, (2) No, (2) No, (2) No, (2) No, (2) No,…
## $ c2a                    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ c3                     <fct> "(1) Yes, consent to continue", "(1) Yes, conse…
## $ q1                     <fct> (1) Yes, (1) Yes, (1) Yes, (1) Yes, (1) Yes, (1…
## $ q2                     <int> 1963, 1973, 1994, 2000, 1984, 1939, 1999, 1995,…
## $ q3                     <fct> (1) Male, (1) Male, (1) Male, (1) Male, (1) Mal…
## $ q4                     <fct> (5) Quebec, (5) Quebec, (5) Quebec, (5) Quebec,…
## $ q6                     <fct> (3) Not very satisfied, (2) Fairly satisfied, (…
## $ q7                     <chr> "economie", "Finances", "agriculture", "l'envir…
## $ q8                     <fct> "(1) Liberal (Grits)", "(8) None of these", "(3…
## $ q8_7_                  <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q9                     <fct> (8), (10) Great deal of interest, (10) Great de…
## $ q10                    <fct> (1) Certain, (1) Certain, (1) Certain, (1) Cert…
## $ q11                    <fct> "(-9) Don't know / Undecided", "(-9) Don't know…
## $ q11_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q12                    <fct> "(-9) Don't know", "(-9) Don't know", NA, NA, N…
## $ q12_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q13                    <fct> (2) Fairly satisfied, (2) Fairly satisfied, (2)…
## $ q14                    <int> 60, 70, 70, 75, 10, 0, 50, 65, 50, 70, 15, 40, …
## $ q15                    <int> 40, 55, 60, 40, 10, 30, 20, 25, 80, 10, 50, 25,…
## $ q16                    <int> 40, 40, 55, 85, 90, 0, 70, 75, 10, 40, 20, 20, …
## $ q17                    <int> 40, 10, 50, 80, 49, 100, 40, 80, 50, 70, 0, 60,…
## $ q18                    <int> 30, 40, 50, 75, 10, 30, 70, 75, 0, 0, 0, 50, 25…
## $ q19                    <int> 10, 15, -6, 40, 0, 0, -6, 0, 0, -6, 95, 35, 25,…
## $ q20                    <int> 70, 50, 70, 70, 25, 0, 35, 70, 50, 70, 10, 10, …
## $ q21                    <int> 40, 50, 40, 55, 25, 30, 35, 15, 80, 40, 30, 25,…
## $ q22                    <int> 30, 45, 70, 90, 80, 0, 65, 80, -6, 60, 25, 15, …
## $ q23                    <int> 50, 10, 80, 50, -9, 100, -6, 85, 50, -6, 10, 40…
## $ q24                    <int> 70, 40, 60, 70, 40, 30, 50, 77, -6, 40, 10, 45,…
## $ q25                    <int> 70, 15, 30, 50, 20, 0, -6, 5, 10, 40, 98, 55, 5…
## $ q27_a                  <fct> NA, (3) About the same as now, (1) Spend more, …
## $ q27_b                  <fct> (3) About the same as now, (1) Spend more, (1) …
## $ q27_c                  <fct> (3) About the same as now, (3) About the same a…
## $ q27_d                  <fct> (3) About the same as now, (3) About the same a…
## $ q27_e                  <fct> (2) Spend less, (3) About the same as now, (3) …
## $ q31                    <fct> (2) Worse, (3) About the same, (1) Better, (3) …
## $ q32                    <fct> (3) Not made much difference, (3) Not made much…
## $ q33                    <fct> "(1) Liberal (Grits)", "(2) Conservatives (Tory…
## $ q33_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q34                    <fct> "(5) Green Party (Greens)", "(1) Liberal (Grits…
## $ q34_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q35                    <fct> (1) The Liberal party, (2) The Conservative Par…
## $ q35_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q36                    <fct> (2) The Conservative Party, (1) The Liberal par…
## $ q36_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q37                    <fct> (1) The Liberal party, (1) The Liberal party, (…
## $ q37_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q38                    <fct> (2) The Conservative Party, (2) The Conservativ…
## $ q38_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q39                    <fct> (3) Or about the same number of immigrants as n…
## $ q40                    <fct> (3) Or about the same number of refugees as now…
## $ q75                    <fct> (3) Not very well, (3) Not very well, (3) Not v…
## $ q44                    <fct> (3) 11 to 30 minutes, (3) 11 to 30 minutes, (4)…
## $ q76                    <fct> (1) Duty, (1) Duty, (1) Duty, (1) Duty, (1) Dut…
## $ q45                    <fct> (2) No, (1) Yes, (1) Yes, (1) Yes, (1) Yes, (2)…
## $ q46                    <fct> (3) Somewhat disagree, (3) Somewhat disagree, (…
## $ q47                    <fct> (3) About the same, (3) About the same, (3) Abo…
## $ q48                    <fct> (1) Correct, (1) Correct, (1) Correct, (1) Corr…
## $ q49                    <fct> (4) No / Don't know, (4) No / Don't know, (4) N…
## $ q52                    <fct> "(1) Liberal (Grits)", "(1) Liberal (Grits)", "…
## $ q52_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q53                    <fct> (2) Fairly strongly, (2) Fairly strongly, (2) F…
## $ q54                    <fct> (2) Fairly satisfied, (2) Fairly satisfied, (2)…
## $ q59                    <fct> (1) Yes, (1) Yes, (1) Yes, (3) Not eligible (to…
## $ q60                    <fct> "(1) Liberal (Grits)", "(1) Liberal (Grits)", "…
## $ q60_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q77                    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ q43                    <fct> (2) Somewhat favourable, (4) Very opposed, (3) …
## $ q61                    <fct> "(9) Bachelor's degree", "(8) Some university",…
## $ q62                    <fct> "(6) Catholic/Roman Catholic/RC", "(6) Catholic…
## $ q62_22_                <chr> "", "", "", "", "", "", "Beisme", "", "", "", "…
## $ q63                    <fct> (1) Very important, (3) Not very important, NA,…
## $ q64                    <fct> "(2) Canada", "(2) Canada", "(2) Canada", "(2) …
## $ q64_13_                <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q65                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ q66a_1                 <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_2                 <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_3                 <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_4                 <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_5                 <fct> (0) Not Selected, (0) Not Selected, (1) Selecte…
## $ q66a_6                 <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_7                 <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_8                 <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_9                 <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_10                <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_11                <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_12                <fct> (0) Not Selected, (1) Selected, (0) Not Selecte…
## $ q66a_13                <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_14                <fct> (1) Selected, (0) Not Selected, (0) Not Selecte…
## $ q66a_15                <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_16                <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_17                <fct> (0) Not Selected, (0) Not Selected, (0) Not Sel…
## $ q66a_17_               <chr> "", "", "", "colon français, autochtone, canad…
## $ q66_1                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_3                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_4                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_5                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_6                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_7                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_8                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_9                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_10                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_11                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_12                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_13                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_14                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_15                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_16                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_17                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_18                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, (-9) Don't know…
## $ q66_18_                <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q67                    <fct> "(4) French", "(1) English", "(4) French", "(4)…
## $ q67_31_                <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q68                    <fct> (1) Working for pay full-time, (1) Working for …
## $ q68_12_                <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ q69                    <int> 104000, 75000, 20000, 120000, 95000, 38000, 120…
## $ q70                    <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "(4) $6…
## $ q71                    <int> 2, 4, 1, 5, 1, 1, 1, 2, 2, 4, 3, 5, 3, 2, 4, 2,…
## $ q26a                   <fct> (2) No, (2) No, (2) No, (1) Yes, (2) No, (2) No…
## $ q26b                   <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ r1                     <fct> (1) Phone, (2) Online, (2) Online, (2) Online, …
## $ age                    <int> 56, 46, 25, 19, 35, 80, 20, 24, 56, 49, 41, 20,…
## $ age_range              <fct> (5) 55+ years old, (4) 45-54 years old, (2) 25-…
## $ q71r                   <fct> (2) 2, (4) 4, (1) 1, (5) 5 or more, (1) 1, (1) …
## $ q70r                   <fct> "(5) $90,001 to $110,000", "(4) $60,001 to $90,…
## $ q14r                   <fct> "(3) 41-60", "(4) 61-80", "(4) 61-80", "(4) 61-…
## $ q15r                   <fct> "(2) 21-40", "(3) 41-60", "(3) 41-60", "(2) 21-…
## $ q16r                   <fct> "(2) 21-40", "(2) 21-40", "(3) 41-60", "(5) 81-…
## $ q17r                   <fct> "(2) 21-40", "(1) 0-20", "(3) 41-60", "(4) 61-8…
## $ q18r                   <fct> "(2) 21-40", "(2) 21-40", "(3) 41-60", "(4) 61-…
## $ q19r                   <fct> "(1) 0-20", "(1) 0-20", "(7) Don't know party",…
## $ q20r                   <fct> "(4) 61-80", "(3) 41-60", "(4) 61-80", "(4) 61-…
## $ q21r                   <fct> "(2) 21-40", "(3) 41-60", "(2) 21-40", "(3) 41-…
## $ q22r                   <fct> "(2) 21-40", "(3) 41-60", "(4) 61-80", "(5) 81-…
## $ q23r                   <fct> "(3) 41-60", "(1) 0-20", "(4) 61-80", "(3) 41-6…
## $ q24r                   <fct> "(4) 61-80", "(2) 21-40", "(3) 41-60", "(4) 61-…
## $ q25r                   <fct> "(4) 61-80", "(1) 0-20", "(2) 21-40", "(3) 41-6…
## $ vote                   <fct> "(11) Don't know / Undecided", "(11) Don't know…
## $ q77eng                 <fct> NA, NA, NA, (2) No, NA, NA, NA, NA, NA, NA, NA,…
## $ q77fr                  <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pc1                    <fct> "(1) Yes, consent to continue", NA, "(1) Yes, c…
## $ p1                     <chr> "l'écologie", "", "laicité", "", "L'environne…
## $ p2                     <fct> (1) Yes, NA, (1) Yes, NA, (1) Yes, (1) Yes, (2)…
## $ p3                     <fct> (1) Liberal Party, NA, (4) Bloc Québécois, NA…
## $ p3_7_                  <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ p4                     <fct> (2) Fairly satisfied, NA, (1) Very satisfied, N…
## $ p5                     <fct> (2) A good job, NA, (2) A good job, NA, (3) A b…
## $ p6                     <fct> (2) 2, NA, (6) 6, NA, (0) 0 - Strongly dislike …
## $ p7                     <fct> (9) 9, NA, (7) 7, NA, (1) 1, (4) 4, (5) 5, (5) …
## $ p8                     <fct> (2) 2, NA, (8) 8, NA, (9) 9, (0) 0 - Strongly d…
## $ p9                     <fct> (6) 6, NA, (7) 7, NA, (8) 8, (1) 1, (7) 7, (5) …
## $ p10                    <fct> (7) 7, NA, (8) 8, NA, (7) 7, (10) 10 - Strongly…
## $ p11                    <fct> (1) 1, NA, (4) 4, NA, (0) 0 - Strongly dislike …
## $ p12                    <fct> (2) 2, NA, (5) 5, NA, (0) 0 - Strongly dislike …
## $ p13                    <fct> (9) 9, NA, (6) 6, NA, (0) 0 - Strongly dislike …
## $ p14                    <fct> (7) 7, NA, (7) 7, NA, (9) 9, (5) 5, (7) 7, (9) …
## $ p15                    <fct> (6) 6, NA, (5) 5, NA, (8) 8, (0) 0 - Strongly d…
## $ p16                    <fct> (7) 7, NA, (8) 8, NA, (10) 10 - Strongly like l…
## $ p17                    <fct> (8) 8, NA, (4) 4, NA, (0) 0 - Strongly dislike …
## $ p18                    <fct> (1) Get better, NA, (1) Get better, NA, (3) Sta…
## $ p19                    <fct> (3) Stay about the same, NA, (3) Stay about the…
## $ p20_a                  <fct> (4) Somewhat disagree, NA, (5) Strongly disagre…
## $ p20_b                  <fct> (5) Strongly disagree, NA, (4) Somewhat disagre…
## $ p20_c                  <fct> (1) Strongly agree, NA, (1) Strongly agree, NA,…
## $ p20_d                  <fct> (1) Strongly agree, NA, (1) Strongly agree, NA,…
## $ p20_e                  <fct> (3) Neither agree nor disagree, NA, (5) Strongl…
## $ p20_f                  <fct> (4) Somewhat disagree, NA, (4) Somewhat disagre…
## $ p20_g                  <fct> (2) Somewhat agree, NA, (2) Somewhat agree, NA,…
## $ p20_h                  <fct> (5) Strongly disagree, NA, (4) Somewhat disagre…
## $ p20_i                  <fct> (4) Somewhat disagree, NA, (4) Somewhat disagre…
## $ p20_j                  <fct> (4) Somewhat disagree, NA, (2) Somewhat agree, …
## $ p20_k                  <fct> (4) Somewhat disagree, NA, (4) Somewhat disagre…
## $ p20_l                  <fct> (3) Neither agree nor disagree, NA, (5) Strongl…
## $ p20_m                  <fct> (2) Somewhat agree, NA, (4) Somewhat disagree, …
## $ p20_n                  <fct> (4) Somewhat disagree, NA, (4) Somewhat disagre…
## $ p21_a                  <fct> (1) Strongly agree, NA, (2) Somewhat agree, NA,…
## $ p21_b                  <fct> (5) Strongly disagree, NA, (4) Somewhat disagre…
## $ p22_a                  <fct> (2) Somewhat agree, NA, (1) Strongly agree, NA,…
## $ p22_b                  <fct> (4) Somewhat disagree, NA, (4) Somewhat disagre…
## $ p22_c                  <fct> (4) Somewhat disagree, NA, (4) Somewhat disagre…
## $ p23                    <fct> (1) Yes, NA, (1) Yes, NA, (1) Yes, (1) Yes, (1)…
## $ p24                    <fct> (1) Liberal Party, NA, (3) NDP, NA, (3) NDP, (4…
## $ p24_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ p25_a                  <fct> (2) Fairly important, NA, (4) Not important at …
## $ p25_b                  <fct> (2) Fairly important, NA, (4) Not important at …
## $ p25_c                  <fct> (1) Very important, NA, (1) Very important, NA,…
## $ p25_d                  <fct> (1) Very important, NA, (2) Fairly important, N…
## $ p26                    <fct> (2) Quite widespread, NA, (3) Not very widespre…
## $ p27                    <fct> (8) 8, NA, (9) 9, NA, (10) 10 - A great deal of…
## $ p28                    <fct> (2) Fairly closely, NA, (1) Very closely, NA, (…
## $ p29_a                  <fct> (4) More than five times, NA, (3) A few times, …
## $ p29_b                  <fct> (1) Never, NA, (1) Never, NA, (3) A few times, …
## $ p29_c                  <fct> (2) Just once, NA, (3) A few times, NA, (3) A f…
## $ p30                    <fct> (3) A few times, NA, (3) A few times, NA, (3) A…
## $ p31                    <fct> (2) No, NA, (1) Yes, NA, (1) Yes, (2) No, (2) N…
## $ p32                    <fct> (3) Stayed about the same, NA, (2) Gotten somew…
## $ p33                    <fct> (4) 4, NA, (5) 5 - It makes a big difference wh…
## $ p34                    <fct> (4) 4, NA, (5) 5 - Who people vote for can make…
## $ p35_a                  <fct> (3) About the same as now, NA, (3) About the sa…
## $ p35_b                  <fct> (3) About the same as now, NA, (2) Somewhat mor…
## $ p35_c                  <fct> (3) About the same as now, NA, (2) Somewhat mor…
## $ p36                    <fct> (8) 8, NA, (4) 4, NA, (8) 8, (12) Haven't heard…
## $ p37                    <fct> (7) 7, NA, (7) 7, NA, (9) 9, (12) Haven't heard…
## $ p38                    <fct> (7) 7, NA, (2) 2, NA, (3) 3, (12) Haven't heard…
## $ p39                    <fct> (4) 4, NA, (4) 4, NA, (4) 4, (12) Haven't heard…
## $ p40                    <fct> (3) 3, NA, (2) 2, NA, (4) 4, (12) Haven't heard…
## $ p41                    <fct> (7) 7, NA, (10) 10 - Right, NA, (10) 10 - Right…
## $ p42                    <fct> (7) 7, NA, (4) 4, NA, (0) 0 - Left, (12) Haven'…
## $ p43                    <fct> (3) About the same as now, NA, (3) About the sa…
## $ p44                    <fct> (2) Somewhat more, NA, (1) Much more, NA, (1) M…
## $ p45                    <fct> (2) No, NA, (2) No, NA, (1) Yes, (2) No, (2) No…
## $ p46                    <fct> (2) No, NA, (2) No, NA, NA, (1) Yes, (1) Yes, (…
## $ p47                    <fct> NA, NA, NA, NA, (3) NDP, (4) Bloc Québécois, …
## $ p47_7_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ p48                    <fct> NA, NA, NA, NA, (1) Very close, (1) Very close,…
## $ p49                    <fct> (6) June, NA, (7) July, NA, (12) December, (7) …
## $ p50                    <fct> (1) Married, NA, (6) Never married, NA, (6) Nev…
## $ p51                    <fct> (2) No, NA, (2) No, NA, (1) Yes, (2) No, (1) Ye…
## $ p52                    <chr> "Gestionnaire dans le domaine funéraire", "", …
## $ p53                    <fct> (3) Mixed, NA, NA, NA, (1) Public sector, NA, N…
## $ p54                    <fct> (7) More than once a week, NA, (1) Never, NA, (…
## $ p55                    <fct> (2) No, NA, (2) No, NA, (2) No, (2) No, (2) No,…
## $ p56_1                  <fct> (0) Not Selected, NA, (0) Not Selected, NA, (0)…
## $ p56_2                  <fct> (1) Selected, NA, (1) Selected, NA, (1) Selecte…
## $ p56_3                  <fct> (0) Not Selected, NA, (0) Not Selected, NA, (0)…
## $ p56_4                  <fct> (0) Not Selected, NA, (0) Not Selected, NA, (0)…
## $ p56_5                  <fct> (0) Not Selected, NA, (0) Not Selected, NA, (0)…
## $ p56_6                  <fct> (0) Not Selected, NA, (0) Not Selected, NA, (0)…
## $ p56_7                  <fct> (0) Not Selected, NA, (0) Not Selected, NA, (0)…
## $ p56_8                  <fct> (0) Not Selected, NA, (0) Not Selected, NA, (0)…
## $ p56_9                  <fct> (0) Not Selected, NA, (0) Not Selected, NA, (0)…
## $ p56_9_                 <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
## $ p57                    <fct> (5) A large town or city (more than 50K people)…
## $ p6r                    <int> 2, NA, 6, NA, 0, 3, 2, 2, NA, 4, 6, NA, NA, 0, …
## $ p7r                    <int> 9, NA, 7, NA, 1, 4, 5, 5, NA, 8, 5, NA, NA, 6, …
## $ p8r                    <int> 2, NA, 8, NA, 9, 0, 8, 9, NA, 4, 4, NA, NA, 7, …
## $ p9r                    <int> 6, NA, 7, NA, 8, 1, 7, 5, NA, 0, 0, NA, NA, 7, …
## $ p10r                   <int> 7, NA, 8, NA, 7, 10, 5, 7, NA, 9, 2, NA, NA, 0,…
## $ p11r                   <int> 1, NA, 4, NA, 0, 0, -9, 0, NA, 0, 8, NA, NA, 0,…
## $ p12r                   <int> 2, NA, 5, NA, 0, 4, 2, 0, NA, 5, 4, NA, NA, 0, …
## $ p13r                   <int> 9, NA, 6, NA, 0, 5, 4, 6, NA, 7, 4, NA, NA, 5, …
## $ p14r                   <int> 7, NA, 7, NA, 9, 5, 7, 9, NA, 7, 5, NA, NA, 8, …
## $ p15r                   <int> 6, NA, 5, NA, 8, 0, 9, 3, NA, 0, 1, NA, NA, 7, …
## $ p16r                   <int> 7, NA, 8, NA, 10, 8, -9, 9, NA, 9, 2, NA, NA, 0…
## $ p17r                   <int> 8, NA, 4, NA, 0, 0, -9, 0, NA, 0, 8, NA, NA, 0,…
## $ p36r                   <int> 8, NA, 4, NA, 8, -5, 7, 5, NA, 5, 5, NA, NA, 4,…
## $ p37r                   <int> 7, NA, 7, NA, 9, -5, 9, 7, NA, 7, 6, NA, NA, 8,…
## $ p38r                   <int> 7, NA, 2, NA, 3, -5, 4, 3, NA, 3, 4, NA, NA, 4,…
## $ p39r                   <int> 4, NA, 4, NA, 4, -5, -9, 4, NA, 8, 2, NA, NA, 4…
## $ p40r                   <int> 3, NA, 2, NA, 4, -5, 3, 5, NA, 7, 5, NA, NA, 2,…
## $ p41r                   <int> 7, NA, 10, NA, 10, -5, 5, 9, NA, -9, 8, NA, NA,…
## $ p42r                   <int> 7, NA, 4, NA, 0, -5, 4, 1, NA, 7, 5, NA, NA, 3,…

QUESTION: How many individuals are there in the dataset? How many variables? What are the 4 column types present in the data (they are between “<>” in the output of the glimpse() function)?

A: 4021, 273, (int, chr, fct, dbl)
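
The column types can also be tallied programmatically instead of being read off the glimpse() output. The following is a minimal sketch on a made-up data frame; `toy` and its columns are hypothetical stand-ins, not CES variables.

```r
# Toy data frame standing in for df (hypothetical columns, not CES data)
toy <- data.frame(
  id     = c(1L, 2L, 3L),                 # integer
  weight = c(0.9, 1.2, 1.0),              # double
  name   = c("a", "b", "c"),              # character
  lang   = factor(c("EN", "FR", "EN")),   # factor
  stringsAsFactors = FALSE
)

# Number of rows (individuals) and columns (variables)
dim(toy)

# First class of each column, then a count of columns per class
col_types <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_types)
type_counts
```

Base R reports doubles as "numeric"; glimpse() abbreviates the same type as <dbl>.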

Univariate analysis

Let’s look at the distribution of age.

ggplot(df,aes(x=age)) +
  geom_histogram()

Let’s calculate the number of values for which age is not missing, the mean and the median.

sample_size_age <- sum(!is.na(df$age))
my_mean <- mean(df$age,na.rm=TRUE)
my_median <- median(df$age,na.rm=TRUE)
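
To see why na.rm=TRUE and !is.na() matter in the lines above, here is a small sketch on a made-up vector (the values are invented, not CES ages).

```r
# Made-up ages with one missing value (not CES data)
ages <- c(25, 40, NA, 61, 33)

n_valid <- sum(!is.na(ages))                # counts non-missing values only
mean_with_na <- mean(ages)                  # NA: one missing value propagates
mean_valid <- mean(ages, na.rm = TRUE)      # mean of the four observed values
median_valid <- median(ages, na.rm = TRUE)  # median of the four observed values

c(n = n_valid, mean = mean_valid, median = median_valid)
```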

Let’s redo our histogram, but adding a vertical line where the mean is. We can add a caption to programmatically indicate the sample size.

ggplot(df,aes(x=age)) +
  geom_histogram(binwidth=1,fill="white",color="black") +
  theme_classic() +
  labs(x="Age",y="Count (in survey)",title="Age distribution in Canada",
       caption=paste0("Data from CES 2019; n = ",format(sample_size_age,big.mark=","))) +
  geom_vline(aes(xintercept=my_mean),linetype=2) +
  annotate("text", x = my_mean-2, y = 90, label = "mean",angle = 90)

QUESTION: In your own words, how would you describe the distribution of age?

A: Slight right skew. Truncated to the left at 18. Not normal; no real peak in the middle. Bimodal?

In the data, there’s a variable called age_range. Let’s look at it with the table() function. The useNA = "always" argument includes NAs in the table.

table(df$age_range,useNA = "always")
## 
##     (-9) Don't know        (-8) Refused        (-7) Skipped (1) 18-24 years old 
##                   0                   0                   0                 256 
## (2) 25-34 years old (3) 35-44 years old (4) 45-54 years old   (5) 55+ years old 
##                 561                 694                 728                1782 
##                <NA> 
##                   0
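
The behaviour of table() on a factor can be reproduced on a tiny example. This sketch uses a made-up factor `f` (not a CES variable) with one declared-but-unobserved level and one missing value.

```r
# Made-up factor: level "c" is declared but never observed, and one value is missing
f <- factor(c("a", "b", NA), levels = c("a", "b", "c"))

# Default: every declared level is listed (c has count 0); the NA is dropped
table(f)

# useNA = "always" adds an <NA> entry, even when its count would be 0
counts <- table(f, useNA = "always")
counts
```

A zero count therefore means the level exists in the factor definition but no respondent falls in it.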

QUESTION: How many NAs are there? How many Don’t know’s, Refused, Skipped? Why do you think this is the case?

A: There are 0 NAs and 0 Don’t know, Refused, or Skipped responses. Survey weights were created and age is a weighting variable, so respondents with a missing age were removed. Survey weights (covered next week) are inversely proportional to the probability of inclusion: if a demographic is underrepresented, its survey weight is higher.
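
As a rough illustration of the weighting idea in the answer above, here is a simplified sketch with invented shares. It is not how the CES weights were actually constructed; it only shows why an underrepresented group ends up with a weight above 1.

```r
# Hypothetical shares for two age groups (invented numbers, not CES figures)
pop_share    <- c(young = 0.30, old = 0.70)   # share in the target population
sample_share <- c(young = 0.15, old = 0.85)   # share among survey respondents

# Weight = population share / sample share:
# groups that are underrepresented in the sample get weights above 1
w <- pop_share / sample_share
w

# Weighted sample shares add back up to 1
sum(w * sample_share)
```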

Now, plot age_range.

ggplot(df,aes(x=age_range)) +
  geom_bar()

Now, imagine that these are not the groups you want. Rather, you want 18-34, 35-54, 55+. Recode the groups using the following code. You are using the cut() function.

df <- df %>%
  mutate(age_group=cut(age,breaks=c(-Inf,17,34,54,Inf),
                       labels=c("0-17","18-34","35-54","55+")),
         age_group=droplevels(age_group))
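
To check how cut() treats values that sit exactly on a break, and what droplevels() removes, here is a small sketch on made-up ages at the bin edges (not CES data).

```r
ages <- c(18, 34, 35, 54, 55, 100)  # made-up ages at the bin edges

# cut() intervals are right-closed by default:
# (-Inf,17], (17,34], (34,54], (54,Inf]
grp <- cut(ages, breaks = c(-Inf, 17, 34, 54, Inf),
           labels = c("0-17", "18-34", "35-54", "55+"))
levels(grp)              # "0-17" is still a level although nobody falls in it

grp <- droplevels(grp)   # removes the unused "0-17" level
table(grp)
```

Because the intervals are closed on the right, 34 lands in "18-34" and 54 in "35-54".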

Plot this using a bar graph.

ggplot(df,aes(x=age_group)) +
  geom_bar()

You can add labels with the count number like this.

ggplot(df,aes(x=age_group)) +
  geom_bar() + labs(x="",y="") +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, colour = "white")

Let’s drop the empty levels from the age_range factor. You can use the recode() function if you want to recode (e.g. clean) them.

df <- df %>%
  mutate(age_range = droplevels(age_range))
df <- df %>% 
  mutate(age_range=recode(age_range,
                          "(1) 18-24 years old"="18-24",
                          "(2) 25-34 years old"="25-34",
                          "(3) 35-44 years old"="35-44",
                          "(4) 45-54 years old"="45-54",
                          "(5) 55+ years old"="55+"))

That can be plotted too.

ggplot(df,aes(x=age_range)) +
  geom_bar() +
  labs(x="",y="")
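
The recode() call above lists each label by hand. When many factors share the same "(code) label" pattern, the levels could instead be cleaned with a regular expression. This is a base R sketch on a made-up factor `x`, offered as an alternative rather than the method used in this project.

```r
# Made-up factor using the "(code) label" pattern seen in the CES labels
x <- factor(c("(1) 18-24 years old", "(5) 55+ years old", "(1) 18-24 years old"))

# Strip the leading "(code) " and the trailing " years old" from every level
clean <- sub("^\\(-?[0-9]+\\) ", "", levels(x))
clean <- sub(" years old$", "", clean)
levels(x) <- clean

levels(x)
table(x)
```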

Lastly, instead of visualizing age with a graph, let’s use a table to get all the summary statistics. Use kable() to output these numbers.

age_summary <- df %>%
  summarize(mean_age = mean(age, na.rm = TRUE), 
            sd_age = sd(age, na.rm = TRUE), 
            min_age = min(age, na.rm = TRUE), 
            max_age = max(age, na.rm = TRUE), 
            median_age = median(age, na.rm = TRUE), 
            skew_age = skewness(age, na.rm = TRUE), 
            kurtosis_age = kurtosis(age, na.rm = TRUE), 
            n_age =  sum(!is.na(age)))

age_summary %>%
  kable(format = "simple")

mean_age    sd_age  min_age  max_age  median_age    skew_age  kurtosis_age  n_age
--------  --------  -------  -------  ----------  ----------  ------------  -----
50.89033  16.83581       18      100          51  -0.0053535     -0.871748   4021

age_summary %>%
  t() %>%
  kable(format = "simple")

------------  ------------
mean_age        50.8903258
sd_age          16.8358082
min_age         18.0000000
max_age        100.0000000
median_age      51.0000000
skew_age        -0.0053535
kurtosis_age    -0.8717480
n_age         4021.0000000
------------  ------------

QUESTION: What’s the mean/sd/min/max/median/skewness/kurtosis? Interpret the skewness and kurtosis.

A: See table. Skewness is essentially zero (-0.005), so there is no real skew. Kurtosis is negative (-0.87), so the distribution is flatter than a normal distribution, with lighter tails.
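
To make the interpretation concrete, skewness and excess kurtosis can be computed by hand from the central moments. The helper names below are hypothetical, and this plain moment-based version is only a sketch: e1071's defaults apply a small-sample adjustment, so its numbers can differ slightly.

```r
# Moment-based skewness (hypothetical helper, not the e1071 function)
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis (hypothetical helper, not the e1071 function)
kurt_excess_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3   # subtract 3 so a normal distribution scores 0
}

flat <- 1:100                        # evenly spread values: symmetric and flat
s <- skew_moment(flat)               # essentially 0: no skew
k <- kurt_excess_moment(flat)        # negative: flatter than a normal distribution
c(skewness = s, excess_kurtosis = k)
```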

QUESTION: t() function stands for transpose. What does t() do in practice?

A: It swaps rows and columns. Here the one-row, wide summary becomes a single column with one statistic per row.
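
A quick way to see what t() does is to compare dimensions before and after. This sketch uses a made-up one-row data frame (`one_row` is hypothetical, not the real summary).

```r
one_row <- data.frame(mean_x = 50.9, n_x = 4021)  # made-up one-row summary
dim(one_row)            # 1 row, 2 columns

flipped <- t(one_row)   # returns a matrix; former column names become row names
dim(flipped)            # 2 rows, 1 column
rownames(flipped)
```

Since t() returns a matrix, every cell shares one numeric type and one print format, which is why the transposed tables show all values with the same number of decimals.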

Now, let’s look at the variable household income.

summary(df$q69)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      -9       0   60000   80331  120000 2120000

Looking at the codebook, we see that -8 and -9 should be coded as NA.

df <- df %>%
  mutate(hincome=ifelse(q69 %in% c(-8,-9), NA, q69))
ggplot(df,aes(x=hincome)) +
  geom_histogram(binwidth = 5000)
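
The effect of leaving the missing-data codes in place can be seen on a tiny made-up vector (not CES incomes). The sketch uses base R's replace() as an alternative to the ifelse() call above.

```r
income <- c(-9, 0, 60000, -8, 120000)   # made-up values; -8 and -9 are missing-data codes

mean_raw <- mean(income)                 # pulled down by the two codes
income_clean <- replace(income, income %in% c(-8, -9), NA)
mean_clean <- mean(income_clean, na.rm = TRUE)   # mean of the three real answers
n_missing <- sum(is.na(income_clean))            # number of values set to NA

c(mean_raw = mean_raw, mean_clean = mean_clean, n_missing = n_missing)
```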

It’s sort of annoying that the data is all clustered to the left because of those “1%” rich outliers. Let’s zoom in. Let’s also pick bins of 1000.

ggplot(df,aes(x=hincome)) +
  geom_histogram(binwidth = 1000) +
  coord_cartesian(xlim=c(0,500000)) +
  scale_x_continuous(breaks=seq(0,500000,by=100000),labels=c("0","100k","200k","300k","400k","500k"))

QUESTION: What do you notice? What’s the shape of the distribution? Why the weird spikes?

A: Not normal at all. Truncated at zero. Note that I’ve seen a “negative income” category in the past, but not here. The distribution is also what we call “zero-inflated”. The spikes mean that people report round, approximate values instead of the true value. There clearly are outliers.
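
One way to quantify the rounding behind the spikes is to compute the share of answers that are exact multiples of a round number. This is a sketch on invented incomes, not the CES values.

```r
# Made-up incomes (not CES data)
income <- c(50000, 52300, 60000, 75000, 80000, 81250, 100000, 100000)

# Share of answers that are exact multiples of 10,000 and of 5,000
share_10k <- mean(income %% 10000 == 0)
share_5k  <- mean(income %% 5000 == 0)
c(share_10k = share_10k, share_5k = share_5k)
```

A high share relative to what evenly spread incomes would give suggests that respondents are reporting approximate figures.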

QUESTION: By copying and pasting the code above for the summary table using kable(), try to replicate the statistical analysis on household income instead of age. Comment very briefly.

hincome_summary <- df %>%
  summarize(mean_hincome = mean(hincome, na.rm = TRUE), 
            sd_hincome = sd(hincome, na.rm = TRUE), 
            min_hincome = min(hincome, na.rm = TRUE), 
            max_hincome = max(hincome, na.rm = TRUE), 
            median_hincome = median(hincome, na.rm = TRUE), 
            skew_hincome = skewness(hincome, na.rm = TRUE), 
            kurtosis_hincome = kurtosis(hincome, na.rm = TRUE), 
            n_hincome =  sum(!is.na(hincome)))

hincome_summary %>%
  kable(format = "simple")

mean_hincome  sd_hincome  min_hincome  max_hincome  median_hincome  skew_hincome  kurtosis_hincome  n_hincome
------------  ----------  -----------  -----------  --------------  ------------  ----------------  ---------
    105046.5    116844.9            0      2120000           80000      7.833776          109.1022       3075

hincome_summary %>%
  t() %>%
  kable(format = "simple")

----------------  ------------
mean_hincome      1.050465e+05
sd_hincome        1.168449e+05
min_hincome       0.000000e+00
max_hincome       2.120000e+06
median_hincome    8.000000e+04
skew_hincome      7.833776e+00
kurtosis_hincome  1.091022e+02
n_hincome         3.075000e+03
----------------  ------------

QUESTION: Now on to another variable. Check q11 and q12 in the codebook. What are those variables?

A: Vote intention (q11) and vote leaning (q12).

Take a look at these three frequency tables.

table(df$q11,useNA = "always")
## 
##                                 (-9) Don't know / Undecided 
##                                                         923 
##                                                (-8) Refused 
##                                                         235 
##                                                (-7) Skipped 
##                                                           0 
##                                         (1) Liberal (Grits) 
##                                                         909 
## (2) Conservatives (Tory, PCs, Conservative Party of Canada) 
##                                                         980 
##       (3) NDP (New Democratic Party, New Democrats, NDPers) 
##                                                         405 
##      (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois) 
##                                                          98 
##                                    (5) Green Party (Greens) 
##                                                         287 
##                                          (6) People's Party 
##                                                          49 
##                                                   (7) Other 
##                                                          24 
##                                           (8) Will not vote 
##                                                           1 
##                                           (9) None of these 
##                                                          17 
##                                      (10) Will spoil ballet 
##                                                           9 
##                                                        <NA> 
##                                                          84
table(df$q12,useNA = "always")
## 
##                                             (-9) Don't know 
##                                                         424 
##                                                (-8) Refused 
##                                                           0 
##                                                (-7) Skipped 
##                                                           0 
##                                         (1) Liberal (Grits) 
##                                                         156 
## (2) Conservatives (Tory, PCs, Conservative Party of Canada) 
##                                                          97 
##       (3) NDP (New Democratic Party, New Democrats, NDPers) 
##                                                          81 
##      (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois) 
##                                                          17 
##                                    (5) Green Party (Greens) 
##                                                          70 
##                                          (6) People's Party 
##                                                          10 
##                                                   (7) Other 
##                                                          18 
##                                           (8) Will not vote 
##                                                           1 
##                                           (9) None of these 
##                                                          49 
##                                      (10) Will spoil ballet 
##                                                           0 
##                                                        <NA> 
##                                                        3098
table(df$q12,df$q11,useNA = "always")
##                                                              
##                                                               (-9) Don't know / Undecided
##   (-9) Don't know                                                                     424
##   (-8) Refused                                                                          0
##   (-7) Skipped                                                                          0
##   (1) Liberal (Grits)                                                                 156
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                          97
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                                81
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                               17
##   (5) Green Party (Greens)                                                             70
##   (6) People's Party                                                                   10
##   (7) Other                                                                            18
##   (8) Will not vote                                                                     1
##   (9) None of these                                                                    49
##   (10) Will spoil ballet                                                                0
##   <NA>                                                                                  0
##                                                              
##                                                               (-8) Refused
##   (-9) Don't know                                                        0
##   (-8) Refused                                                           0
##   (-7) Skipped                                                           0
##   (1) Liberal (Grits)                                                    0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)            0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                  0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                 0
##   (5) Green Party (Greens)                                               0
##   (6) People's Party                                                     0
##   (7) Other                                                              0
##   (8) Will not vote                                                      0
##   (9) None of these                                                      0
##   (10) Will spoil ballet                                                 0
##   <NA>                                                                 235
##                                                              
##                                                               (-7) Skipped
##   (-9) Don't know                                                        0
##   (-8) Refused                                                           0
##   (-7) Skipped                                                           0
##   (1) Liberal (Grits)                                                    0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)            0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                  0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                 0
##   (5) Green Party (Greens)                                               0
##   (6) People's Party                                                     0
##   (7) Other                                                              0
##   (8) Will not vote                                                      0
##   (9) None of these                                                      0
##   (10) Will spoil ballet                                                 0
##   <NA>                                                                   0
##                                                              
##                                                               (1) Liberal (Grits)
##   (-9) Don't know                                                               0
##   (-8) Refused                                                                  0
##   (-7) Skipped                                                                  0
##   (1) Liberal (Grits)                                                           0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                   0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                         0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                        0
##   (5) Green Party (Greens)                                                      0
##   (6) People's Party                                                            0
##   (7) Other                                                                     0
##   (8) Will not vote                                                             0
##   (9) None of these                                                             0
##   (10) Will spoil ballet                                                        0
##   <NA>                                                                        909
##                                                              
##                                                               (2) Conservatives (Tory, PCs, Conservative Party of Canada)
##   (-9) Don't know                                                                                                       0
##   (-8) Refused                                                                                                          0
##   (-7) Skipped                                                                                                          0
##   (1) Liberal (Grits)                                                                                                   0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                                                           0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                                                                 0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                                                                0
##   (5) Green Party (Greens)                                                                                              0
##   (6) People's Party                                                                                                    0
##   (7) Other                                                                                                             0
##   (8) Will not vote                                                                                                     0
##   (9) None of these                                                                                                     0
##   (10) Will spoil ballet                                                                                                0
##   <NA>                                                                                                                980
##                                                              
##                                                               (3) NDP (New Democratic Party, New Democrats, NDPers)
##   (-9) Don't know                                                                                                 0
##   (-8) Refused                                                                                                    0
##   (-7) Skipped                                                                                                    0
##   (1) Liberal (Grits)                                                                                             0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                                                     0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                                                           0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                                                          0
##   (5) Green Party (Greens)                                                                                        0
##   (6) People's Party                                                                                              0
##   (7) Other                                                                                                       0
##   (8) Will not vote                                                                                               0
##   (9) None of these                                                                                               0
##   (10) Will spoil ballet                                                                                          0
##   <NA>                                                                                                          405
##                                                              
##                                                               (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)
##   (-9) Don't know                                                                                                  0
##   (-8) Refused                                                                                                     0
##   (-7) Skipped                                                                                                     0
##   (1) Liberal (Grits)                                                                                              0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                                                      0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                                                            0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                                                           0
##   (5) Green Party (Greens)                                                                                         0
##   (6) People's Party                                                                                               0
##   (7) Other                                                                                                        0
##   (8) Will not vote                                                                                                0
##   (9) None of these                                                                                                0
##   (10) Will spoil ballet                                                                                           0
##   <NA>                                                                                                            98
##                                                              
##                                                               (5) Green Party (Greens)
##   (-9) Don't know                                                                    0
##   (-8) Refused                                                                       0
##   (-7) Skipped                                                                       0
##   (1) Liberal (Grits)                                                                0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                        0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                              0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                             0
##   (5) Green Party (Greens)                                                           0
##   (6) People's Party                                                                 0
##   (7) Other                                                                          0
##   (8) Will not vote                                                                  0
##   (9) None of these                                                                  0
##   (10) Will spoil ballet                                                             0
##   <NA>                                                                             287
##                                                              
##                                                               (6) People's Party
##   (-9) Don't know                                                              0
##   (-8) Refused                                                                 0
##   (-7) Skipped                                                                 0
##   (1) Liberal (Grits)                                                          0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                  0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                        0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                       0
##   (5) Green Party (Greens)                                                     0
##   (6) People's Party                                                           0
##   (7) Other                                                                    0
##   (8) Will not vote                                                            0
##   (9) None of these                                                            0
##   (10) Will spoil ballet                                                       0
##   <NA>                                                                        49
##                                                              
##                                                               (7) Other
##   (-9) Don't know                                                     0
##   (-8) Refused                                                        0
##   (-7) Skipped                                                        0
##   (1) Liberal (Grits)                                                 0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)         0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)               0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)              0
##   (5) Green Party (Greens)                                            0
##   (6) People's Party                                                  0
##   (7) Other                                                           0
##   (8) Will not vote                                                   0
##   (9) None of these                                                   0
##   (10) Will spoil ballet                                              0
##   <NA>                                                               24
##                                                              
##                                                               (8) Will not vote
##   (-9) Don't know                                                             0
##   (-8) Refused                                                                0
##   (-7) Skipped                                                                0
##   (1) Liberal (Grits)                                                         0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                 0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                       0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                      0
##   (5) Green Party (Greens)                                                    0
##   (6) People's Party                                                          0
##   (7) Other                                                                   0
##   (8) Will not vote                                                           0
##   (9) None of these                                                           0
##   (10) Will spoil ballet                                                      0
##   <NA>                                                                        1
##                                                              
##                                                               (9) None of these
##   (-9) Don't know                                                             0
##   (-8) Refused                                                                0
##   (-7) Skipped                                                                0
##   (1) Liberal (Grits)                                                         0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                 0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                       0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                      0
##   (5) Green Party (Greens)                                                    0
##   (6) People's Party                                                          0
##   (7) Other                                                                   0
##   (8) Will not vote                                                           0
##   (9) None of these                                                           0
##   (10) Will spoil ballet                                                      0
##   <NA>                                                                       17
##                                                              
##                                                               (10) Will spoil ballet
##   (-9) Don't know                                                                  0
##   (-8) Refused                                                                     0
##   (-7) Skipped                                                                     0
##   (1) Liberal (Grits)                                                              0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)                      0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)                            0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)                           0
##   (5) Green Party (Greens)                                                         0
##   (6) People's Party                                                               0
##   (7) Other                                                                        0
##   (8) Will not vote                                                                0
##   (9) None of these                                                                0
##   (10) Will spoil ballet                                                           0
##   <NA>                                                                             9
##                                                              
##                                                               <NA>
##   (-9) Don't know                                                0
##   (-8) Refused                                                   0
##   (-7) Skipped                                                   0
##   (1) Liberal (Grits)                                            0
##   (2) Conservatives (Tory, PCs, Conservative Party of Canada)    0
##   (3) NDP (New Democratic Party, New Democrats, NDPers)          0
##   (4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)         0
##   (5) Green Party (Greens)                                       0
##   (6) People's Party                                             0
##   (7) Other                                                      0
##   (8) Will not vote                                              0
##   (9) None of these                                              0
##   (10) Will spoil ballet                                         0
##   <NA>                                                          84

So, given this, if we want to merge these two variables, we should start with q12, and if q12 is NA, then put q11. This can be done using the coalesce() function. Think of two columns in Excel, column A and column B. coalesce(a,b) creates a new variable equal to a if a is not missing and equal to b if a is missing.

Specifically, we are coalescing q12 and q11 so that when q12 is missing q11 is used. Equivalently, we could check if q11 = “(-9) Don’t know / Undecided” and put values from q12 in using the ifelse() function.

In other words, for those who said “(-9) Don’t know / Undecided”, we want to plug in their answer to the follow-up question asking “Is there a party you are leaning to?”

df <- df %>%
  mutate(vote_coalesced = coalesce(q12,q11))
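
To make the behaviour of coalesce() concrete, here is a minimal sketch on toy data. The tibble and its columns a and b are made up for illustration and are not part of the CES.

```r
library(dplyr)

# Toy data (hypothetical, not from the CES): two columns with gaps
toy <- tibble(
  a = c("x", NA, "z", NA),
  b = c("p", "q", NA, NA)
)

# Take a when it is present; fall back to b when a is missing
toy <- toy %>%
  mutate(merged = coalesce(a, b))

toy$merged
```

The order of the arguments matters: the first non-missing value wins, and a row stays NA only when both columns are missing.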

Let’s clean this variable. You can use what we call a named vector with the recode function.

Recode is neat. If, for instance, you wanted to remove the Refused and Skipped, you could replace the “OTH” there with NA.

Note that in recode, the syntax is “old value”=“new value”. This differs from the usual convention: elsewhere (e.g. in mutate), the syntax is “new value”=“old value”.

When you use a function for the first time, make sure you read the help file or copy and paste code using that function from a trusted resource (e.g. a Stack Overflow answer with a lot of upvotes).

my_cleaning_mapping <- c("(-9) Don't know"="DK",
  "(-8) Refused"="OTH",
  "(-7) Skipped"="OTH",
  "(1) Liberal (Grits)"="LIB", 
  "(2) Conservatives (Tory, PCs, Conservative Party of Canada)"="CONS", 
  "(3) NDP (New Democratic Party, New Democrats, NDPers)"="NDP",
  "(4) Bloc Québécois (BQ, PQ, Bloc, Parti Québécois)"="BQ", 
  "(5) Green Party (Greens)"="GRN",
  "(6) People's Party"="PPC",
  "(7) Other"="OTH", 
  "(8) Will not vote"="OTH",
  "(9) None of these"="OTH",
  "(10) Will spoil ballet"="OTH", 
  "(-9) Don't know / Undecided"="DK")

df <- df %>%
  mutate(vote_clean=recode(vote_coalesced,!!!my_cleaning_mapping))
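
As a small illustration of recode() with a named vector, the sketch below applies the same pattern to a toy character vector. The answers and the mapping are invented for the example; only the “old value”=“new value” pattern and the !!! splicing mirror the code above.

```r
library(dplyr)

# Toy survey answers (hypothetical labels, not from the CES)
answers <- c("(1) Yes", "(2) No", "(-8) Refused", "(1) Yes")

# Named vector: names are the old values, elements are the new values
mapping <- c("(1) Yes" = "YES",
             "(2) No" = "NO",
             "(-8) Refused" = NA)

# !!! splices the named vector into recode() as old = new pairs
cleaned <- recode(answers, !!!mapping)

cleaned
```

Mapping an old value to NA, as done here for the refusals, is how you would drop a category instead of lumping it into “OTH”.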

When you are going to plot this, it’s nice if OTH and DK are at the end. You can set the factor levels like this.

df <- df %>%
  mutate(vote_clean = factor(vote_clean,levels=c("LIB", "CONS", "NDP", "BQ", "GRN", "PPC","OTH","DK")))
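
To see what setting the levels changes, here is a short sketch on a toy vector (the values are made up). By default, factor() sorts the levels alphabetically; passing levels= fixes the order that tables and plots will follow.

```r
# Toy vote labels (hypothetical values)
votes <- c("DK", "LIB", "OTH", "CONS", "LIB")

# Default: levels are sorted alphabetically
default_order <- factor(votes)
levels(default_order)

# Explicit order: parties first, OTH and DK at the end
custom_order <- factor(votes, levels = c("LIB", "CONS", "OTH", "DK"))
levels(custom_order)

# table() and ggplot2 bars follow the level order
table(custom_order)
```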

Plot this.

party_colors <- c("LIB"="red",
  "CONS"="darkblue",
  "NDP"="orange",
  "BQ"="lightblue",
  "GRN"="darkgreen",
  "PPC"="purple",
  "DK"="darkgrey",
  "OTH"="grey")

ggplot(df,aes(x=vote_clean,fill=vote_clean)) +
  geom_bar() + labs(x="",y="") +
  scale_fill_manual(values=party_colors) +
  theme(legend.position = "none")

QUESTION: Given what you know about the 2019 election, does this distribution appear reasonable?

Somewhat, yes, although the Greens look very high. Whether it’s “reasonable” is debatable, but it’s not way off, as it would be if you polled people in, e.g., a sociology class.

Up to now, we’ve looked at two continuous variables (age and household income) and one nominal categorical variable (vote intention).

Let’s look at another categorical variable, this time an ordered categorical variable.

Q46 asks: Do you strongly agree, somewhat agree, somewhat disagree, or strongly disagree with the following statement: “Justin Trudeau kept the election promises he made in 2015.”

ggplot(df,aes(x=q46)) +
  geom_bar()

Let’s clean it like this:

my_cleaning_mapping <- c(
  "(-9) Don't know"="Don't know",
  "(-8) Refused"=NA,
  "(-7) Skipped"=NA,
  "(1) Strongly agree"="Strongly agree", 
  "(2) Somewhat agree"="Somewhat agree",
  "(3) Somewhat disagree"="Somewhat disagree",
  "(4) Strongly disagree"="Strongly disagree"
)

df <- df %>%
  mutate(q46=recode(q46,!!!my_cleaning_mapping),
         q46=factor(q46,levels=c("Strongly disagree","Somewhat disagree",
                                 "Somewhat agree","Strongly agree","Don't know")),
         q46=droplevels(q46))

Let’s plot it again.

ggplot(df,aes(x=q46)) +
  geom_bar()

Note that in the social sciences, an ordinal categorical variable is often converted to numeric. Here, Strongly disagree becomes 1, Somewhat disagree becomes 2, Somewhat agree becomes 3, and Strongly agree becomes 4. Note that “lowest / disagree” always comes first: we want the numeric scale to run from “less/disagree” to “more/agree”. You also have to decide what to do with the DKs. You could drop them (code them as NA) or code them as 2.5 (the middle of the scale). We are not going to use this further in this project, but this is how you would do it.

# This is very bad: "Don't know" is coded as 5.
# Recall that the numbers follow the order of the factor levels. Check with levels(df$q46)
df <- df %>% 
  mutate(q46_numeric=as.numeric(q46))
# This is better. You could also set DK=2.5
df <- df %>% 
  mutate(q46_numeric=recode(q46,!!!c("Strongly disagree"=1,
                                     "Somewhat disagree"=2,
                                     "Somewhat agree"=3,
                                     "Strongly agree"=4,
                                     "Don't know"=NA)))

Up to now, we did univariate analysis on age, household income, vote intention and q46 (a question asking whether Trudeau kept his promises).
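
To see why the naive numeric conversion of q46 is risky, here is a minimal base-R sketch on a toy factor. The three responses are made up, and the named-vector lookup is an alternative to recode() shown only for illustration.

```r
# Toy ordinal factor (hypothetical responses), with "Don't know" as the last level
agreement <- factor(
  c("Strongly agree", "Don't know", "Somewhat disagree"),
  levels = c("Strongly disagree", "Somewhat disagree",
             "Somewhat agree", "Strongly agree", "Don't know")
)

# Naive conversion: numbers follow the level order, so "Don't know" becomes 5
naive <- as.numeric(agreement)

# Explicit mapping: each label gets the value we choose, and "Don't know" becomes NA
scale_values <- c("Strongly disagree" = 1,
                  "Somewhat disagree" = 2,
                  "Somewhat agree" = 3,
                  "Strongly agree" = 4,
                  "Don't know" = NA)
explicit <- unname(scale_values[as.character(agreement)])

naive
explicit
```

With the naive version, any mean computed on the scale would treat “Don’t know” as stronger agreement than “Strongly agree”.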

Bivariate analysis

Now, let’s do bivariate analysis. When doing bivariate analysis, you look at the distribution of two variables against each other. It’s important to understand that the bivariate analysis will be different depending on variable type. This is discussed in the book. You can have:

1 - two continuous variables

2 - one continuous variable and one categorical variable

3 - two categorical variables

Let’s explore the three possibilities in turn.

1 - two continuous variables

This is, in a way, the simplest case. We’ve all seen scatterplots. In the book, they look at carat and price.

Let’s plot age on x and household income on y.

ggplot(df,aes(x=age,y=hincome)) +
  geom_point()

All the points are clustered on top of each other and we don’t see much. You can use “alpha” to play with the opacity. Where many points overlap the color will be darker. Where not many points overlap the color will be lighter. This way, you see where the density is.

ggplot(df,aes(x=age,y=hincome)) +
  geom_point(alpha=0.1)

In the social sciences, we often ask questions of the type: “who’s richer, the young or the old?” One (extremely naive) way to ask this question is to fit a linear model where income = b0 + b1 * age + error. You’ll see details of this in the sequel to this class. In other words, you ask: when age increases, does household income increase, decrease, or stay the same? You can draw that straight line with geom_smooth() by setting method=“lm” (linear model).

ggplot(df,aes(x=age,y=hincome)) +
  geom_point(alpha=0.1) + geom_smooth(method="lm", se=FALSE)
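
The line drawn by the “lm” smooth is the same one you would get from lm() directly. Here is a minimal sketch on simulated data; the numbers are invented and are not the CES estimates. It shows where b0 and b1 come from.

```r
# Simulated data (hypothetical, not from the CES):
# income falls by about 300 per year of age, plus random noise
set.seed(1)
toy <- data.frame(age = 18:80)
toy$income <- 90000 - 300 * toy$age + rnorm(nrow(toy), sd = 5000)

# Fit income = b0 + b1 * age + error
fit <- lm(income ~ age, data = toy)

# b0 is the intercept, b1 is the slope on age
coef(fit)

# Share of the variation in income explained by age
summary(fit)$r.squared
```

A negative b1 means that predicted income goes down as age goes up; the R-squared gives a rough idea of how well the straight line fits.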

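This suggests that income decreases with age. But notice that this fit is terrible. Looking at the data, it actually seems that income first increases and then decreases with age.

Fitting a straight line is what we call a linear model of degree 1.

We can fit, instead, a local regression. The idea is that instead of having one straight line, you have many local straight lines, so the line “moves”. This captures non-linearities. (When no method is given, geom_smooth() picks the smoother based on sample size: loess for fewer than 1,000 observations and a GAM otherwise. Either way, you get a flexible curve rather than a single straight line; set method=“loess” to force a local regression.)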

ggplot(df,aes(x=age,y=hincome)) +
  geom_point(alpha=0.1) + geom_smooth(se=FALSE)

This seems more reasonable, though this statistical model still seems to be poor in the sense that you probably do not explain a lot of the variation in household income using only one variable (age). For now, that’s enough for two continuous variables. More on this next semester.

2 - one continuous variable and one categorical variable

One good way to visualize the relation between one continuous variable and one categorical variable is the boxplot.

ggplot(data = df, mapping = aes(x = vote_clean, y = hincome)) +
  geom_boxplot()

Above, we see the outliers really well, but the bottom is a bit clustered and hard to read. Let’s try to transform the y axis using a square root transformation.

ggplot(data = df, mapping = aes(x = vote_clean, y = hincome)) +
  geom_boxplot() +
  scale_y_sqrt(breaks=c(0,25000,50000,100000,500000,1000000,2000000))

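Better, but it is still hard to read. Let’s just zoom in below 250,000, leaving the rich out of view, so we clearly see the distribution of the normal folks. Note that coord_cartesian() only zooms in: the boxes are still computed from all respondents.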

ggplot(data = df, mapping = aes(x = vote_clean, y = hincome)) +
  geom_boxplot() + coord_cartesian(ylim = c(0,250000))
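
A boxplot is a picture of group-wise summaries, so it can help to compute a few of them directly. Here is a minimal sketch on toy data; the party labels and incomes are made up. It shows how one very large income pulls the mean but not the median.

```r
library(dplyr)

# Toy data (hypothetical): party A contains one very high income
toy <- tibble(
  party = c("A", "A", "A", "B", "B", "B"),
  income = c(40000, 60000, 500000, 30000, 50000, 70000)
)

# One row per party: the median is the line inside each box
income_by_party <- toy %>%
  group_by(party) %>%
  summarize(median_income = median(income, na.rm = TRUE),
            mean_income = mean(income, na.rm = TRUE),
            n = n())

income_by_party
```

The same group_by() and summarize() pattern should carry over to df with vote_clean and hincome, which gives numbers to read alongside the boxplots.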

QUESTION: How would you interpret the findings from those three boxplots?

A: Liberal and Conservative supporters are a bit richer. No major difference, on average.

3 - two categorical variables

In the book, they use geom_count().

ggplot(data = df) +
  geom_count(mapping = aes(x = vote_clean, y = q46))

That is sort of interesting, in the sense that we see that many Liberals think Trudeau kept his promises and many Conservatives think he did not.

In the social sciences, what we usually want to know is: “what percentage of Liberals think he did”, “what percentage of Conservatives think he did”, etc. The simplest way to do this is as follows:

df %>%
  group_by(vote_clean,q46) %>%
  count() %>%
  group_by(vote_clean) %>%
  mutate(percent=100*n/sum(n)) %>%
  ggplot(aes(x=q46,y=percent,fill=vote_clean)) + 
  geom_bar(position="dodge",stat="identity") +
  geom_text(aes(label=round(percent)),position=position_dodge(width=0.9),vjust=-0.1)
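
The key step in the pipeline above is computing percentages within each party. Here is a minimal sketch of that step on toy data; the parties and answers are made up. It uses count(party, answer) as a shorthand for group_by() followed by count().

```r
library(dplyr)

# Toy data (hypothetical): 4 respondents for party A, 2 for party B
toy <- tibble(
  party = c("A", "A", "A", "A", "B", "B"),
  answer = c("agree", "agree", "agree", "disagree", "agree", "disagree")
)

shares <- toy %>%
  count(party, answer) %>%              # n per party/answer combination
  group_by(party) %>%                   # percentages are computed within party
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()

shares
```

Because sum(n) is evaluated within each party, the percentages add up to 100 for every party, not across the whole table.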

QUESTION: If you focus on the NDP, the Conservatives and the Liberals only (ignore the other parties/choices), for each party, what percentage of respondents agree that Trudeau kept his promises (agree = strongly agree + somewhat agree)?

A: NDP 38%, Conservatives 15%, Liberals 77%

Section 2 - Your turn

Pick two variables of your choice. The variables can be in this dataset, or they can be in another dataset of your choice. The variables can be continuous or categorical.

Describe in a few words the variables you picked.

Make three figures. The first two figures are univariate analyses of your two variables (bar chart if categorical, histogram if continuous). The third figure should depict the two variables together (bivariate analysis); you should be able to adapt the code from above. Very succinctly, comment on each figure.