+ - 0:00:00
Notes for current slide
Notes for next slide

Week 8: Transforming data: more tricks

PUBPOL 750 Data Analysis for Public Policy I

Justin Savoie

MPP-DS McMaster

2023-11-08

1 / 26

Administrative note

Linear regression next week (Week 9). Functions and loops in Week 11.

2 / 26

Data wrangling ?

  • dplyr verbs (e.g., select(), filter(), group(), mutate() etc.) perform specific tasks on data frames

  • Data wrangling can also involve fixing messy strings, dates; joining data from different sources

3 / 26

Plan for today

  • Binding data frames

  • Mutating joins

  • Dealing with character strings

  • Dealing with dates

  • Worked example

4 / 26

Binding (columsn or rows) means appending new lines or new columns

Binding rows with bind_rows()

  • Bind any number of data frames by row, making a longer result.
df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))
df2 <- tibble(age=c(55,30,12),name=c("Hillary","Anna","Luis"))
(df <- bind_rows(df1,df2))
## # A tibble: 6 × 2
## age name
## <dbl> <chr>
## 1 25 Justin
## 2 29 Ian
## 3 22 Rosaline
## 4 55 Hillary
## 5 30 Anna
## 6 12 Luis
5 / 26
df1 <- read_csv("mydata1.csv")
df2 <- read_csv("mydata2.csv")
df <- bind_rows(df1,df2)
  • Sometimes, you can use a loop to read in numerous csvs. We'll cover this in week 11. Think of data for 338 ridings.

  • For bind_rows() columns names have to be the same; or if they are not it's simply to create a new column.

df1 <- tibble(age1=c(25,29,22),name1=c("Justin","Ian","Rosaline"))
df2 <- tibble(age2=c(55,30,12),name2=c("Hillary","Anna","Luis"))
(df <- bind_rows(df1,df2))
## # A tibble: 6 × 4
## age1 name1 age2 name2
## <dbl> <chr> <dbl> <chr>
## 1 25 Justin NA <NA>
## 2 29 Ian NA <NA>
## 3 22 Rosaline NA <NA>
## 4 NA <NA> 55 Hillary
## 5 NA <NA> 30 Anna
## 6 NA <NA> 12 Luis
  • In this case you would have to rename the columns.
6 / 26

Binding columns with bind_cols()

  • Bind any number of data frames by column, making a wider result.
df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))
df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline"))
(df <- bind_cols(df1,df2 |> select(-name)))
## # A tibble: 3 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 29 Ian 30
## 3 22 Rosaline 12
7 / 26
(df <- bind_cols(df1,df2))
## New names:
## • `name` -> `name...2`
## • `name` -> `name...4`
## # A tibble: 3 × 4
## age name...2 income name...4
## <dbl> <chr> <dbl> <chr>
## 1 25 Justin 55 Justin
## 2 29 Ian 30 Ian
## 3 22 Rosaline 12 Rosaline
8 / 26

Joins

Instead of "stacking", with a bind, joins "joins" on one or more columns. If you have riding names in two different datasets, you can join them.

  • inner_join(): includes all rows in x and y.

  • left_join(): includes all rows in x.

  • right_join(): includes all rows in y.

  • full_join(): includes all rows in x or y.

FYI: the notion of 'joins' are a fundamental aspect of SQL (Structured Query Language, is a standardized programming language that is used to manage and manipulate relational databases; SQL is a dominant language for interacting with databases) and are used extensively. They allow you to combine rows from two or more tables based on a related column between them, enabling the construction of complex queries over relational data.

9 / 26
df1;df2
## # A tibble: 3 × 2
## age name
## <dbl> <chr>
## 1 25 Justin
## 2 29 Ian
## 3 22 Rosaline
## # A tibble: 3 × 2
## income name
## <dbl> <chr>
## 1 55 Justin
## 2 30 Ian
## 3 12 Rosaline
left_join(df1,df2,by="name")
## # A tibble: 3 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 29 Ian 30
## 3 22 Rosaline 12
10 / 26
left_join(df1,df2,by="name")
## # A tibble: 3 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 29 Ian 30
## 3 22 Rosaline 12
left_join(df1,df2)
## Joining with `by = join_by(name)`
## # A tibble: 3 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 29 Ian 30
## 3 22 Rosaline 12
11 / 26
df1 <- tibble(age=c(25,29,22,23),name=c("Justin","Ian","Rosaline","Jack"))
df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline"))
left_join(df1,df2,by='name')
## # A tibble: 4 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 29 Ian 30
## 3 22 Rosaline 12
## 4 23 Jack NA
12 / 26
inner_join(df1,df2,by='name')
## # A tibble: 3 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 29 Ian 30
## 3 22 Rosaline 12
13 / 26
left_join(df1,df2,by='name')
## # A tibble: 4 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 29 Ian 30
## 3 22 Rosaline 12
## 4 23 Jack NA
right_join(df2,df1,by='name')
## # A tibble: 4 × 3
## income name age
## <dbl> <chr> <dbl>
## 1 55 Justin 25
## 2 30 Ian 29
## 3 12 Rosaline 22
## 4 NA Jack 23
14 / 26
full_join(df1,df2,by='name')
## # A tibble: 4 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 29 Ian 30
## 3 22 Rosaline 12
## 4 23 Jack NA
15 / 26
df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))
(df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin")))
## # A tibble: 4 × 2
## income name
## <dbl> <chr>
## 1 55 Justin
## 2 30 Ian
## 3 12 Rosaline
## 4 33 Justin
left_join(df1,df2,by='name')
## # A tibble: 4 × 3
## age name income
## <dbl> <chr> <dbl>
## 1 25 Justin 55
## 2 25 Justin 33
## 3 29 Ian 30
## 4 22 Rosaline 12
16 / 26
df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"),last=c("Savoie","Smith","Azevedo"))
df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin"),last=c("Savoie","Smith","Azevedo","Gajevic"))
left_join(df1,df2,c("name"="name","last"="last"))
## # A tibble: 3 × 4
## age name last income
## <dbl> <chr> <chr> <dbl>
## 1 25 Justin Savoie 55
## 2 29 Ian Smith 30
## 3 22 Rosaline Azevedo 12
17 / 26

Dealing with character strings

  • Very useful when dealing with text

  • Cryptic at first

  • replacing, substituting, extracting with stringr package
18 / 26
my_string <- c("Justin Savoie (2019)","Daniel Ortega (2019)", "Natalia Burina (2019)")
str_replace(my_string,"\\(2019\\)","") %>% trimws()
## [1] "Justin Savoie" "Daniel Ortega" "Natalia Burina"
my_string <- c("POL2517-U","POL2517-V","POL2699-S")
str_sub(my_string,4,-1)
## [1] "2517-U" "2517-V" "2699-S"
str_sub(my_string,4,7)
## [1] "2517" "2517" "2699"
my_string <- c("Justin Savoie (2019)","Daniel Ortega (2017)", "Natalia Burina (2018)")
str_extract(my_string,"(?<=\\().+?(?=\\))")
## [1] "2019" "2017" "2018"
str_replace(my_string,"\\((?<=\\().+?(?=\\))\\)","") %>% trimws()
## [1] "Justin Savoie" "Daniel Ortega" "Natalia Burina"
# https://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r
19 / 26
my_string <- c("justin savoie","micheal Gillet","daniela Smith-jones")
str_to_title(my_string)
## [1] "Justin Savoie" "Micheal Gillet" "Daniela Smith-Jones"
str_length(my_string)
## [1] 13 14 19
str_count(my_string,"j")
## [1] 1 0 1

Honestly, this is the kind of thing you "ChatGPT", "Google" or "ask-a-data-analyst-friend" when you need it. At least you know these tools exist: when you can logically explain in your mind what you want to do, it's doable.

20 / 26

Dates and times

library(lubridate)
today()
## [1] "2023-11-08"
now()+60
## [1] "2023-11-08 15:47:31 EST"
today()+1
## [1] "2023-11-09"
21 / 26
adate <- "2022-06-01"
ymd(adate)
## [1] "2022-06-01"
lubridate::dmy("26-06-19")
## [1] "2019-06-26"
adatetime <- "2001-10-10 02:30:30"
ymd_hms(adatetime)
## [1] "2001-10-10 02:30:30 UTC"
as.numeric(ymd_hms(adatetime))
## [1] 1002681030
22 / 26
(my_data <- tibble(year=c("2022","2023"),
month=c("6","1"),
day=c("14","22")))
## # A tibble: 2 × 3
## year month day
## <chr> <chr> <chr>
## 1 2022 6 14
## 2 2023 1 22
my_data |>
mutate(date_string = paste0(year,"-",month,"_",day),
date=lubridate::as_date(date_string))
## # A tibble: 2 × 5
## year month day date_string date
## <chr> <chr> <chr> <chr> <date>
## 1 2022 6 14 2022-6_14 2022-06-14
## 2 2023 1 22 2023-1_22 2023-01-22
23 / 26
to_plot <- tibble(date=c("2019-01-01","2019-03-01","2020-01-01","2020-06-01"),value=rnorm(4,10,3))
ggplot(to_plot,aes(x=date,y=value)) + geom_point()

24 / 26
to_plot <- to_plot |> mutate(date=ymd(date))
ggplot(to_plot,aes(x=date,y=value)) + geom_point() + geom_line() +
scale_x_date(breaks = as_date(c("2019-01-01","2019-03-01","2020-01-01","2020-06-01")),
labels=c("Jan 2019","March 2019","Jan 2020","June 202")) +
theme(axis.text.x = element_text(angle=45)) + scale_y_continuous(limits=c(0,30))

25 / 26

A worked example

See Script Code Example November 8

26 / 26

Administrative note

Linear regression next week (Week 9). Functions and loops in Week 11.

2 / 26
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow