class: center, middle, inverse, title-slide .title[ # Week 8: Transforming data: more tricks ] .subtitle[ ## PUBPOL 750 Data Analysis for Public Policy I ] .author[ ### Justin Savoie ] .institute[ ### MPP-DS McMaster ] .date[ ### 2023-11-08 ] --- class: inverse, center, middle # Administrative note ### Linear regression next week (Week 9). Functions and loops in Week 11. --- # Data wrangling ? - `dplyr` verbs (e.g., select(), filter(), group(), mutate() etc.) perform specific tasks on data frames - Data wrangling can also involve fixing messy strings, dates; joining data from different sources --- # Plan for today - Binding data frames - Mutating joins - Dealing with character strings - Dealing with dates - Worked example --- # Binding (columsn or rows) means appending new lines or new columns ## Binding rows with `bind_rows()` - Bind any number of data frames by row, making a longer result. ```r df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline")) df2 <- tibble(age=c(55,30,12),name=c("Hillary","Anna","Luis")) (df <- bind_rows(df1,df2)) ``` ``` ## # A tibble: 6 × 2 ## age name ## <dbl> <chr> ## 1 25 Justin ## 2 29 Ian ## 3 22 Rosaline ## 4 55 Hillary ## 5 30 Anna ## 6 12 Luis ``` --- ```r df1 <- read_csv("mydata1.csv") df2 <- read_csv("mydata2.csv") df <- bind_rows(df1,df2) ``` - Sometimes, you can use a loop to read in numerous csvs. We'll cover this in week 11. Think of data for 338 ridings. - For `bind_rows()` columns names have to be the same; or if they are not it's simply to create a new column. ```r df1 <- tibble(age1=c(25,29,22),name1=c("Justin","Ian","Rosaline")) df2 <- tibble(age2=c(55,30,12),name2=c("Hillary","Anna","Luis")) (df <- bind_rows(df1,df2)) ``` ``` ## # A tibble: 6 × 4 ## age1 name1 age2 name2 ## <dbl> <chr> <dbl> <chr> ## 1 25 Justin NA <NA> ## 2 29 Ian NA <NA> ## 3 22 Rosaline NA <NA> ## 4 NA <NA> 55 Hillary ## 5 NA <NA> 30 Anna ## 6 NA <NA> 12 Luis ``` - In this case you would have to rename the columns. --- ## Binding columns with `bind_cols()` - Bind any number of data frames by column, making a wider result. ```r df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline")) df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline")) (df <- bind_cols(df1,df2 |> select(-name))) ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` --- ```r (df <- bind_cols(df1,df2)) ``` ``` ## New names: ## • `name` -> `name...2` ## • `name` -> `name...4` ``` ``` ## # A tibble: 3 × 4 ## age name...2 income name...4 ## <dbl> <chr> <dbl> <chr> ## 1 25 Justin 55 Justin ## 2 29 Ian 30 Ian ## 3 22 Rosaline 12 Rosaline ``` --- # Joins Instead of "stacking", with a bind, joins "joins" on one or more columns. If you have riding names in two different datasets, you can join them. - `inner_join()`: includes all rows in x and y. - `left_join()`: includes all rows in x. - `right_join()`: includes all rows in y. - `full_join()`: includes all rows in x or y. FYI: the notion of 'joins' are a fundamental aspect of SQL (Structured Query Language, is a standardized programming language that is used to manage and manipulate relational databases; SQL is a dominant language for interacting with databases) and are used extensively. They allow you to combine rows from two or more tables based on a related column between them, enabling the construction of complex queries over relational data. --- ```r df1;df2 ``` ``` ## # A tibble: 3 × 2 ## age name ## <dbl> <chr> ## 1 25 Justin ## 2 29 Ian ## 3 22 Rosaline ``` ``` ## # A tibble: 3 × 2 ## income name ## <dbl> <chr> ## 1 55 Justin ## 2 30 Ian ## 3 12 Rosaline ``` ```r left_join(df1,df2,by="name") ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` --- ```r left_join(df1,df2,by="name") ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` ```r left_join(df1,df2) ``` ``` ## Joining with `by = join_by(name)` ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` --- ```r df1 <- tibble(age=c(25,29,22,23),name=c("Justin","Ian","Rosaline","Jack")) df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline")) left_join(df1,df2,by='name') ``` ``` ## # A tibble: 4 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ## 4 23 Jack NA ``` --- ```r inner_join(df1,df2,by='name') ``` ``` ## # A tibble: 3 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ``` --- ```r left_join(df1,df2,by='name') ``` ``` ## # A tibble: 4 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ## 4 23 Jack NA ``` ```r right_join(df2,df1,by='name') ``` ``` ## # A tibble: 4 × 3 ## income name age ## <dbl> <chr> <dbl> ## 1 55 Justin 25 ## 2 30 Ian 29 ## 3 12 Rosaline 22 ## 4 NA Jack 23 ``` --- ```r full_join(df1,df2,by='name') ``` ``` ## # A tibble: 4 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 29 Ian 30 ## 3 22 Rosaline 12 ## 4 23 Jack NA ``` --- ```r df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline")) (df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin"))) ``` ``` ## # A tibble: 4 × 2 ## income name ## <dbl> <chr> ## 1 55 Justin ## 2 30 Ian ## 3 12 Rosaline ## 4 33 Justin ``` ```r left_join(df1,df2,by='name') ``` ``` ## # A tibble: 4 × 3 ## age name income ## <dbl> <chr> <dbl> ## 1 25 Justin 55 ## 2 25 Justin 33 ## 3 29 Ian 30 ## 4 22 Rosaline 12 ``` --- ```r df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"),last=c("Savoie","Smith","Azevedo")) df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin"),last=c("Savoie","Smith","Azevedo","Gajevic")) left_join(df1,df2,c("name"="name","last"="last")) ``` ``` ## # A tibble: 3 × 4 ## age name last income ## <dbl> <chr> <chr> <dbl> ## 1 25 Justin Savoie 55 ## 2 29 Ian Smith 30 ## 3 22 Rosaline Azevedo 12 ``` --- # Dealing with character strings - Very useful when dealing with text - Cryptic at first - replacing, substituting, extracting with `stringr` package --- ```r my_string <- c("Justin Savoie (2019)","Daniel Ortega (2019)", "Natalia Burina (2019)") str_replace(my_string,"\\(2019\\)","") %>% trimws() ``` ``` ## [1] "Justin Savoie" "Daniel Ortega" "Natalia Burina" ``` ```r my_string <- c("POL2517-U","POL2517-V","POL2699-S") str_sub(my_string,4,-1) ``` ``` ## [1] "2517-U" "2517-V" "2699-S" ``` ```r str_sub(my_string,4,7) ``` ``` ## [1] "2517" "2517" "2699" ``` ```r my_string <- c("Justin Savoie (2019)","Daniel Ortega (2017)", "Natalia Burina (2018)") str_extract(my_string,"(?<=\\().+?(?=\\))") ``` ``` ## [1] "2019" "2017" "2018" ``` ```r str_replace(my_string,"\\((?<=\\().+?(?=\\))\\)","") %>% trimws() ``` ``` ## [1] "Justin Savoie" "Daniel Ortega" "Natalia Burina" ``` ```r # https://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r ``` --- ```r my_string <- c("justin savoie","micheal Gillet","daniela Smith-jones") str_to_title(my_string) ``` ``` ## [1] "Justin Savoie" "Micheal Gillet" "Daniela Smith-Jones" ``` ```r str_length(my_string) ``` ``` ## [1] 13 14 19 ``` ```r str_count(my_string,"j") ``` ``` ## [1] 1 0 1 ``` Honestly, this is the kind of thing you "ChatGPT", "Google" or "ask-a-data-analyst-friend" when you need it. At least you know these tools exist: when you can **logically** explain in your mind what you want to do, it's doable. --- # Dates and times ```r library(lubridate) today() ``` ``` ## [1] "2023-11-08" ``` ```r now()+60 ``` ``` ## [1] "2023-11-08 15:47:31 EST" ``` ```r today()+1 ``` ``` ## [1] "2023-11-09" ``` --- ```r adate <- "2022-06-01" ymd(adate) ``` ``` ## [1] "2022-06-01" ``` ```r lubridate::dmy("26-06-19") ``` ``` ## [1] "2019-06-26" ``` ```r adatetime <- "2001-10-10 02:30:30" ymd_hms(adatetime) ``` ``` ## [1] "2001-10-10 02:30:30 UTC" ``` ```r as.numeric(ymd_hms(adatetime)) ``` ``` ## [1] 1002681030 ``` --- ```r (my_data <- tibble(year=c("2022","2023"), month=c("6","1"), day=c("14","22"))) ``` ``` ## # A tibble: 2 × 3 ## year month day ## <chr> <chr> <chr> ## 1 2022 6 14 ## 2 2023 1 22 ``` ```r my_data |> mutate(date_string = paste0(year,"-",month,"_",day), date=lubridate::as_date(date_string)) ``` ``` ## # A tibble: 2 × 5 ## year month day date_string date ## <chr> <chr> <chr> <chr> <date> ## 1 2022 6 14 2022-6_14 2022-06-14 ## 2 2023 1 22 2023-1_22 2023-01-22 ``` --- ```r to_plot <- tibble(date=c("2019-01-01","2019-03-01","2020-01-01","2020-06-01"),value=rnorm(4,10,3)) ggplot(to_plot,aes(x=date,y=value)) + geom_point() ``` ![](Slides_files/figure-html/unnamed-chunk-19-1.png)<!-- --> --- ```r to_plot <- to_plot |> mutate(date=ymd(date)) ggplot(to_plot,aes(x=date,y=value)) + geom_point() + geom_line() + scale_x_date(breaks = as_date(c("2019-01-01","2019-03-01","2020-01-01","2020-06-01")), labels=c("Jan 2019","March 2019","Jan 2020","June 202")) + theme(axis.text.x = element_text(angle=45)) + scale_y_continuous(limits=c(0,30)) ``` ![](Slides_files/figure-html/unnamed-chunk-20-1.png)<!-- --> --- # A worked example See Script Code Example November 8