Week 8: Transforming data: more tricks
PUBPOL 750 Data Analysis for Public Policy I
Justin Savoie
MPP-DS McMaster
2023-11-08
1 / 26

Administrative noteLinear regression next week (Week 9). Functions and loops in Week 11.2 / 26

Data wrangling ?

dplyr verbs (e.g., select(), filter(), group(), mutate() etc.) perform specific tasks on data frames
Data wrangling can also involve fixing messy strings, dates; joining data from different sources

3 / 26

Plan for today

Binding data frames
Mutating joins
Dealing with character strings
Dealing with dates
Worked example

4 / 26

Binding (columsn or rows) means appending new lines or new columns

Binding rows with `bind_rows()`

Bind any number of data frames by row, making a longer result.

df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))
df2 <- tibble(age=c(55,30,12),name=c("Hillary","Anna","Luis"))
(df <- bind_rows(df1,df2))

## # A tibble: 6 × 2
##     age name    
##   <dbl> <chr>   
## 1    25 Justin  
## 2    29 Ian     
## 3    22 Rosaline
## 4    55 Hillary 
## 5    30 Anna    
## 6    12 Luis

5 / 26

df1 <- read_csv("mydata1.csv")
df2 <- read_csv("mydata2.csv")
df <- bind_rows(df1,df2)

Sometimes, you can use a loop to read in numerous csvs. We'll cover this in week 11. Think of data for 338 ridings.
For bind_rows() columns names have to be the same; or if they are not it's simply to create a new column.

df1 <- tibble(age1=c(25,29,22),name1=c("Justin","Ian","Rosaline"))
df2 <- tibble(age2=c(55,30,12),name2=c("Hillary","Anna","Luis"))
(df <- bind_rows(df1,df2))

## # A tibble: 6 × 4
##    age1 name1     age2 name2  
##   <dbl> <chr>    <dbl> <chr>  
## 1    25 Justin      NA <NA>   
## 2    29 Ian         NA <NA>   
## 3    22 Rosaline    NA <NA>   
## 4    NA <NA>        55 Hillary
## 5    NA <NA>        30 Anna   
## 6    NA <NA>        12 Luis

In this case you would have to rename the columns.

6 / 26

Binding columns with `bind_cols()`

Bind any number of data frames by column, making a wider result.

df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))
df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline"))
(df <- bind_cols(df1,df2 |> select(-name)))

## # A tibble: 3 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    29 Ian          30
## 3    22 Rosaline     12

7 / 26

(df <- bind_cols(df1,df2))

## New names:
## • `name` -> `name...2`
## • `name` -> `name...4`

## # A tibble: 3 × 4
##     age name...2 income name...4
##   <dbl> <chr>     <dbl> <chr>   
## 1    25 Justin       55 Justin  
## 2    29 Ian          30 Ian     
## 3    22 Rosaline     12 Rosaline

8 / 26

Joins

Instead of "stacking", with a bind, joins "joins" on one or more columns. If you have riding names in two different datasets, you can join them.

inner_join(): includes all rows in x and y.
left_join(): includes all rows in x.
right_join(): includes all rows in y.
full_join(): includes all rows in x or y.

FYI: the notion of 'joins' are a fundamental aspect of SQL (Structured Query Language, is a standardized programming language that is used to manage and manipulate relational databases; SQL is a dominant language for interacting with databases) and are used extensively. They allow you to combine rows from two or more tables based on a related column between them, enabling the construction of complex queries over relational data.

9 / 26

df1;df2

## # A tibble: 3 × 2
##     age name    
##   <dbl> <chr>   
## 1    25 Justin  
## 2    29 Ian     
## 3    22 Rosaline

## # A tibble: 3 × 2
##   income name    
##    <dbl> <chr>   
## 1     55 Justin  
## 2     30 Ian     
## 3     12 Rosaline

left_join(df1,df2,by="name")

## # A tibble: 3 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    29 Ian          30
## 3    22 Rosaline     12

10 / 26

left_join(df1,df2,by="name")

## # A tibble: 3 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    29 Ian          30
## 3    22 Rosaline     12

left_join(df1,df2)

## Joining with `by = join_by(name)`

## # A tibble: 3 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    29 Ian          30
## 3    22 Rosaline     12

11 / 26

df1 <- tibble(age=c(25,29,22,23),name=c("Justin","Ian","Rosaline","Jack"))
df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline"))
left_join(df1,df2,by='name')

## # A tibble: 4 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    29 Ian          30
## 3    22 Rosaline     12
## 4    23 Jack         NA

12 / 26

inner_join(df1,df2,by='name')

## # A tibble: 3 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    29 Ian          30
## 3    22 Rosaline     12

13 / 26

left_join(df1,df2,by='name')

## # A tibble: 4 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    29 Ian          30
## 3    22 Rosaline     12
## 4    23 Jack         NA

right_join(df2,df1,by='name')

## # A tibble: 4 × 3
##   income name       age
##    <dbl> <chr>    <dbl>
## 1     55 Justin      25
## 2     30 Ian         29
## 3     12 Rosaline    22
## 4     NA Jack        23

14 / 26

full_join(df1,df2,by='name')

## # A tibble: 4 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    29 Ian          30
## 3    22 Rosaline     12
## 4    23 Jack         NA

15 / 26

df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))
(df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin")))

## # A tibble: 4 × 2
##   income name    
##    <dbl> <chr>   
## 1     55 Justin  
## 2     30 Ian     
## 3     12 Rosaline
## 4     33 Justin

left_join(df1,df2,by='name')

## # A tibble: 4 × 3
##     age name     income
##   <dbl> <chr>     <dbl>
## 1    25 Justin       55
## 2    25 Justin       33
## 3    29 Ian          30
## 4    22 Rosaline     12

16 / 26

df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"),last=c("Savoie","Smith","Azevedo"))
df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin"),last=c("Savoie","Smith","Azevedo","Gajevic"))
left_join(df1,df2,c("name"="name","last"="last"))

## # A tibble: 3 × 4
##     age name     last    income
##   <dbl> <chr>    <chr>    <dbl>
## 1    25 Justin   Savoie      55
## 2    29 Ian      Smith       30
## 3    22 Rosaline Azevedo     12

17 / 26

Dealing with character strings

Very useful when dealing with text
Cryptic at first

replacing, substituting, extracting with stringr package

18 / 26

my_string <- c("Justin Savoie (2019)","Daniel Ortega (2019)", "Natalia Burina (2019)")
str_replace(my_string,"\\(2019\\)","") %>% trimws()

## [1] "Justin Savoie"  "Daniel Ortega"  "Natalia Burina"

my_string <- c("POL2517-U","POL2517-V","POL2699-S")
str_sub(my_string,4,-1)

## [1] "2517-U" "2517-V" "2699-S"

str_sub(my_string,4,7)

## [1] "2517" "2517" "2699"

my_string <- c("Justin Savoie (2019)","Daniel Ortega (2017)", "Natalia Burina (2018)")
str_extract(my_string,"(?<=\\().+?(?=\\))")

## [1] "2019" "2017" "2018"

str_replace(my_string,"\\((?<=\\().+?(?=\\))\\)","") %>% trimws()

## [1] "Justin Savoie"  "Daniel Ortega"  "Natalia Burina"

# https://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r

19 / 26

my_string <- c("justin savoie","micheal Gillet","daniela Smith-jones")
str_to_title(my_string)

## [1] "Justin Savoie"       "Micheal Gillet"      "Daniela Smith-Jones"

str_length(my_string)

## [1] 13 14 19

str_count(my_string,"j")

## [1] 1 0 1

Honestly, this is the kind of thing you "ChatGPT", "Google" or "ask-a-data-analyst-friend" when you need it. At least you know these tools exist: when you can logically explain in your mind what you want to do, it's doable.

20 / 26

Dates and times

library(lubridate)
today()

## [1] "2023-11-08"

now()+60

## [1] "2023-11-08 15:47:31 EST"

today()+1

## [1] "2023-11-09"

21 / 26

adate <- "2022-06-01"
ymd(adate)

## [1] "2022-06-01"

lubridate::dmy("26-06-19")

## [1] "2019-06-26"

adatetime <- "2001-10-10 02:30:30"
ymd_hms(adatetime)

## [1] "2001-10-10 02:30:30 UTC"

as.numeric(ymd_hms(adatetime))

## [1] 1002681030

22 / 26

(my_data <- tibble(year=c("2022","2023"),
       month=c("6","1"),
       day=c("14","22")))

## # A tibble: 2 × 3
##   year  month day  
##   <chr> <chr> <chr>
## 1 2022  6     14   
## 2 2023  1     22

my_data |>
  mutate(date_string = paste0(year,"-",month,"_",day),
         date=lubridate::as_date(date_string))

## # A tibble: 2 × 5
##   year  month day   date_string date      
##   <chr> <chr> <chr> <chr>       <date>    
## 1 2022  6     14    2022-6_14   2022-06-14
## 2 2023  1     22    2023-1_22   2023-01-22

23 / 26

to_plot <- tibble(date=c("2019-01-01","2019-03-01","2020-01-01","2020-06-01"),value=rnorm(4,10,3))
ggplot(to_plot,aes(x=date,y=value)) + geom_point()

24 / 26

to_plot <- to_plot |> mutate(date=ymd(date))
ggplot(to_plot,aes(x=date,y=value)) + geom_point() + geom_line() +
  scale_x_date(breaks = as_date(c("2019-01-01","2019-03-01","2020-01-01","2020-06-01")),
               labels=c("Jan 2019","March 2019","Jan 2020","June 202")) +
  theme(axis.text.x = element_text(angle=45)) + scale_y_continuous(limits=c(0,30))

25 / 26

A worked example

See Script Code Example November 8

26 / 26

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Week 8: Transforming data: more tricks

PUBPOL 750 Data Analysis for Public Policy I

Justin Savoie

MPP-DS McMaster

2023-11-08

Administrative note

Linear regression next week (Week 9). Functions and loops in Week 11.