class: center, middle, inverse, title-slide # Week 4: Importing data and tidy data ## PUBPOL 750 Data Analysis for Public Policy I ### Justin Savoie ### MPP-DS McMaster ### 2023-10-03 --- class: inverse, center, middle # Homework 1 Solution code for HW1 posted on the website tomorrow. We will discuss it next week. --- class: inverse, center, middle # Paths --- class: inverse, center, middle # Code style This is chapter 5 in the book. Optional reading. --- # Names - Use lowercase separated with `_`. - Use names that are easy to understand. - Be consistent. For example don't have: `dataset_ON`, `dataset_QC` and `datas_BC` if these three objects are three datasets for the three provinces. --- # Spaces - Put spaces on either side of mathematical operators apart from `^` (i.e. `+`, `-`, `==`, `<`, ...), and around the assignment operator (`<-`). ```r # Strive for z <- (a + b)^2 / d # Avoid z<-( a + b ) ^ 2/d ``` - Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma. ```r # Strive for mean(x, na.rm = TRUE) # Avoid mean (x ,na.rm=TRUE) ``` --- # Spaces - It’s OK to add extra spaces if it improves alignment. ```r flights |> mutate( speed = distance / air_time, dep_hour = dep_time %/% 100, dep_minute = dep_time %% 100 ) ``` --- # Style One advice: imitate the code "style" that you see in reputable sources for example in modern R texbooks. - https://r4ds.hadley.nz - https://moderndive.com - https://tellingstorieswithdata.com/ Try to be consistent and coherent. --- # The pipe (`|>`) - We use the pipe `|>` to emphasise a sequence of actions - The pipe means "THEN" - `|>` should always have a space before it and should typically ```r # Strive for flights |> filter(!is.na(arr_delay), !is.na(tailnum)) |> count(dest) # Avoid flights|>filter(!is.na(arr_delay),!is.na(tailnum))|>count(dest) ``` --- # The `|>` - If the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line. ```r flights |> group_by(tailnum) |> summarize( delay = mean(arr_delay, na.rm = TRUE), n = n() ) ``` --- # The `|>` - Don't write pipes that are too long. The book says 10-15 lines, use judgment. - Break into smaller tasks. --- # ggplot2 - Same rules apply to ggplot2 plots. ```r flights |> group_by(month) |> summarize( delay = mean(arr_delay, na.rm = TRUE) ) |> ggplot(aes(x = month, y = delay)) + geom_point() + geom_line() ``` --- class: inverse, center, middle # Data importing --- # Reading a csv ```r library(tidyverse) Parliamentarians_subset100 <- read_csv("https://www.justinsavoie.com/pubpol750-2022/homework1/Parliamentarians_subset100.csv") Parliamentarians_subset100 <- read_csv("~/Downloads/Parliamentarians_subset100.csv") Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Downloads/Parliamentarians_subset100.csv") Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Documents/personal_repos/personal_website/pubpol750-2022/Homework1/Parliamentarians_subset100.csv") Parliamentarians_subset100 <- read_csv("C:\\Users\\justinsavoie\\Downloads/Parliamentarians_subset100.csv") ``` # Reading an excel file ```r library(readxl) Parliamentarians_subset100 <- read_excel("~/Downloads/Parliamentarians.xlsx") ``` --- # Reading spss, sas, stata files ```r library(haven) #read_spss() #read_sas() #read_stata() library(readstata13) #readstata13::read.dta13() ``` --- class: inverse, center, middle # Tidy data --- # Tidy data (A) - Each variable is a column; each column is a variable. - Each observation is a row; each row is an observation. - Each value is a cell; each cell is a single value. (B) - Put each dataset in a tibble. - Put each variable in a column. --- <img src="images/fig6.1.png" width="90%" /> --- <img src="images/CES-screencap.png" width="90%" /> --- # Why tidy data? - **picking one consistent way of storing data** (and tidyverse functions work well on tidy data) - works well in R because of vectorization --- .pull-left[ ```r table1 ``` ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` ```r table2 ``` ``` ## # A tibble: 12 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583 ``` ] .pull-right[ ```r table3 ``` ``` ## # A tibble: 6 × 3 ## country year rate ## <chr> <dbl> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` ```r table4a ``` ``` ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ```r table4b ``` ``` ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583 ``` ] --- # Pivoting - Pivoting wider - Pivoting longer --- # Pivoting wider ```r table4a ``` ``` ## # A tibble: 3 × 3 ## country `1999` `2000` ## <chr> <dbl> <dbl> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766 ``` ```r table4a |> pivot_longer( cols=c(`1999`,`2000`), names_to = 'year', values_to = 'cases' ) ``` ``` ## # A tibble: 6 × 3 ## country year cases ## <chr> <chr> <dbl> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766 ``` --- ```r (tidy4a <- table4a |> pivot_longer( c(`1999`, `2000`), names_to = "year", values_to = "cases" )) ``` ``` ## # A tibble: 6 × 3 ## country year cases ## <chr> <chr> <dbl> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766 ``` ```r (tidy4b <- table4b |> pivot_longer( c(`1999`, `2000`), names_to = "year", values_to = "population" )) ``` ``` ## # A tibble: 6 × 3 ## country year population ## <chr> <chr> <dbl> ## 1 Afghanistan 1999 19987071 ## 2 Afghanistan 2000 20595360 ## 3 Brazil 1999 172006362 ## 4 Brazil 2000 174504898 ## 5 China 1999 1272915272 ## 6 China 2000 1280428583 ``` --- # Pivoting wider ```r head(table2,3) ``` ``` ## # A tibble: 3 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ``` ```r table2 |> pivot_wider( names_from = type, values_from = count) ``` ``` ## # A tibble: 6 × 4 ## country year cases population ## <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583 ``` --- # Missing values - Explicit vs Implicit ```r (stocks <- tibble( year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016), qtr = c( 1, 2, 3, 4, 2, 3, 4), return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66) )) ``` ``` ## # A tibble: 7 × 3 ## year qtr return ## <dbl> <dbl> <dbl> ## 1 2015 1 1.88 ## 2 2015 2 0.59 ## 3 2015 3 0.35 ## 4 2015 4 NA ## 5 2016 2 0.92 ## 6 2016 3 0.17 ## 7 2016 4 2.66 ``` --- - Force explicit ```r stocks |> pivot_wider( names_from = year, values_from = return) ``` ``` ## # A tibble: 4 × 3 ## qtr `2015` `2016` ## <dbl> <dbl> <dbl> ## 1 1 1.88 NA ## 2 2 0.59 0.92 ## 3 3 0.35 0.17 ## 4 4 NA 2.66 ``` --- ```r stocks |> pivot_wider( names_from = year, values_from = return ) |> pivot_longer( cols=c(`2015`,`2016`), names_to='year', values_to='return' ) ``` ``` ## # A tibble: 8 × 3 ## qtr year return ## <dbl> <chr> <dbl> ## 1 1 2015 1.88 ## 2 1 2016 NA ## 3 2 2015 0.59 ## 4 2 2016 0.92 ## 5 3 2015 0.35 ## 6 3 2016 0.17 ## 7 4 2015 NA ## 8 4 2016 2.66 ``` --- ```r stocks |> pivot_wider( names_from = year, values_from = return ) |> pivot_longer( cols=c(`2015`,`2016`), names_to='year', values_to='return', values_drop_na = TRUE ) ``` ``` ## # A tibble: 6 × 3 ## qtr year return ## <dbl> <chr> <dbl> ## 1 1 2015 1.88 ## 2 2 2015 0.59 ## 3 2 2016 0.92 ## 4 3 2015 0.35 ## 5 3 2016 0.17 ## 6 4 2016 2.66 ``` --- ```r stocks |> pivot_wider( names_from = year, values_from = return ) |> pivot_longer( cols=c(`2015`,`2016`), names_to='year', values_to='return', values_drop_na = TRUE ) ``` ``` ## # A tibble: 6 × 3 ## qtr year return ## <dbl> <chr> <dbl> ## 1 1 2015 1.88 ## 2 2 2015 0.59 ## 3 2 2016 0.92 ## 4 3 2015 0.35 ## 5 3 2016 0.17 ## 6 4 2016 2.66 ``` --- class: inverse, middle, center # Exercices ### 6.2.1