dplyr
verbs (e.g., select(), filter(), group(), mutate() etc.) perform specific tasks on data frames
Data wrangling can also involve fixing messy strings, dates; joining data from different sources
Binding data frames
Mutating joins
Dealing with character strings
Dealing with dates
Worked example
bind_rows()
df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))df2 <- tibble(age=c(55,30,12),name=c("Hillary","Anna","Luis"))(df <- bind_rows(df1,df2))
## # A tibble: 6 × 2## age name ## <dbl> <chr> ## 1 25 Justin ## 2 29 Ian ## 3 22 Rosaline## 4 55 Hillary ## 5 30 Anna ## 6 12 Luis
df1 <- read_csv("mydata1.csv")df2 <- read_csv("mydata2.csv")df <- bind_rows(df1,df2)
Sometimes, you can use a loop to read in numerous csvs. We'll cover this in week 11. Think of data for 338 ridings.
For bind_rows()
columns names have to be the same; or if they are not it's simply to create a new column.
df1 <- tibble(age1=c(25,29,22),name1=c("Justin","Ian","Rosaline"))df2 <- tibble(age2=c(55,30,12),name2=c("Hillary","Anna","Luis"))(df <- bind_rows(df1,df2))
## # A tibble: 6 × 4## age1 name1 age2 name2 ## <dbl> <chr> <dbl> <chr> ## 1 25 Justin NA <NA> ## 2 29 Ian NA <NA> ## 3 22 Rosaline NA <NA> ## 4 NA <NA> 55 Hillary## 5 NA <NA> 30 Anna ## 6 NA <NA> 12 Luis
bind_cols()
df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline"))(df <- bind_cols(df1,df2 |> select(-name)))
## # A tibble: 3 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 29 Ian 30## 3 22 Rosaline 12
(df <- bind_cols(df1,df2))
## New names:## • `name` -> `name...2`## • `name` -> `name...4`
## # A tibble: 3 × 4## age name...2 income name...4## <dbl> <chr> <dbl> <chr> ## 1 25 Justin 55 Justin ## 2 29 Ian 30 Ian ## 3 22 Rosaline 12 Rosaline
Instead of "stacking", with a bind, joins "joins" on one or more columns. If you have riding names in two different datasets, you can join them.
inner_join()
: includes all rows in x and y.
left_join()
: includes all rows in x.
right_join()
: includes all rows in y.
full_join()
: includes all rows in x or y.
FYI: the notion of 'joins' are a fundamental aspect of SQL (Structured Query Language, is a standardized programming language that is used to manage and manipulate relational databases; SQL is a dominant language for interacting with databases) and are used extensively. They allow you to combine rows from two or more tables based on a related column between them, enabling the construction of complex queries over relational data.
df1;df2
## # A tibble: 3 × 2## age name ## <dbl> <chr> ## 1 25 Justin ## 2 29 Ian ## 3 22 Rosaline
## # A tibble: 3 × 2## income name ## <dbl> <chr> ## 1 55 Justin ## 2 30 Ian ## 3 12 Rosaline
left_join(df1,df2,by="name")
## # A tibble: 3 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 29 Ian 30## 3 22 Rosaline 12
left_join(df1,df2,by="name")
## # A tibble: 3 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 29 Ian 30## 3 22 Rosaline 12
left_join(df1,df2)
## Joining with `by = join_by(name)`
## # A tibble: 3 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 29 Ian 30## 3 22 Rosaline 12
df1 <- tibble(age=c(25,29,22,23),name=c("Justin","Ian","Rosaline","Jack"))df2 <- tibble(income=c(55,30,12),name=c("Justin","Ian","Rosaline"))left_join(df1,df2,by='name')
## # A tibble: 4 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 29 Ian 30## 3 22 Rosaline 12## 4 23 Jack NA
inner_join(df1,df2,by='name')
## # A tibble: 3 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 29 Ian 30## 3 22 Rosaline 12
left_join(df1,df2,by='name')
## # A tibble: 4 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 29 Ian 30## 3 22 Rosaline 12## 4 23 Jack NA
right_join(df2,df1,by='name')
## # A tibble: 4 × 3## income name age## <dbl> <chr> <dbl>## 1 55 Justin 25## 2 30 Ian 29## 3 12 Rosaline 22## 4 NA Jack 23
full_join(df1,df2,by='name')
## # A tibble: 4 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 29 Ian 30## 3 22 Rosaline 12## 4 23 Jack NA
df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"))(df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin")))
## # A tibble: 4 × 2## income name ## <dbl> <chr> ## 1 55 Justin ## 2 30 Ian ## 3 12 Rosaline## 4 33 Justin
left_join(df1,df2,by='name')
## # A tibble: 4 × 3## age name income## <dbl> <chr> <dbl>## 1 25 Justin 55## 2 25 Justin 33## 3 29 Ian 30## 4 22 Rosaline 12
df1 <- tibble(age=c(25,29,22),name=c("Justin","Ian","Rosaline"),last=c("Savoie","Smith","Azevedo"))df2 <- tibble(income=c(55,30,12,33),name=c("Justin","Ian","Rosaline","Justin"),last=c("Savoie","Smith","Azevedo","Gajevic"))left_join(df1,df2,c("name"="name","last"="last"))
## # A tibble: 3 × 4## age name last income## <dbl> <chr> <chr> <dbl>## 1 25 Justin Savoie 55## 2 29 Ian Smith 30## 3 22 Rosaline Azevedo 12
Very useful when dealing with text
Cryptic at first
stringr
packagemy_string <- c("Justin Savoie (2019)","Daniel Ortega (2019)", "Natalia Burina (2019)")str_replace(my_string,"\\(2019\\)","") %>% trimws()
## [1] "Justin Savoie" "Daniel Ortega" "Natalia Burina"
my_string <- c("POL2517-U","POL2517-V","POL2699-S")str_sub(my_string,4,-1)
## [1] "2517-U" "2517-V" "2699-S"
str_sub(my_string,4,7)
## [1] "2517" "2517" "2699"
my_string <- c("Justin Savoie (2019)","Daniel Ortega (2017)", "Natalia Burina (2018)")str_extract(my_string,"(?<=\\().+?(?=\\))")
## [1] "2019" "2017" "2018"
str_replace(my_string,"\\((?<=\\().+?(?=\\))\\)","") %>% trimws()
## [1] "Justin Savoie" "Daniel Ortega" "Natalia Burina"
# https://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r
my_string <- c("justin savoie","micheal Gillet","daniela Smith-jones")str_to_title(my_string)
## [1] "Justin Savoie" "Micheal Gillet" "Daniela Smith-Jones"
str_length(my_string)
## [1] 13 14 19
str_count(my_string,"j")
## [1] 1 0 1
Honestly, this is the kind of thing you "ChatGPT", "Google" or "ask-a-data-analyst-friend" when you need it. At least you know these tools exist: when you can logically explain in your mind what you want to do, it's doable.
library(lubridate)today()
## [1] "2023-11-08"
now()+60
## [1] "2023-11-08 15:47:31 EST"
today()+1
## [1] "2023-11-09"
adate <- "2022-06-01"ymd(adate)
## [1] "2022-06-01"
lubridate::dmy("26-06-19")
## [1] "2019-06-26"
adatetime <- "2001-10-10 02:30:30"ymd_hms(adatetime)
## [1] "2001-10-10 02:30:30 UTC"
as.numeric(ymd_hms(adatetime))
## [1] 1002681030
(my_data <- tibble(year=c("2022","2023"), month=c("6","1"), day=c("14","22")))
## # A tibble: 2 × 3## year month day ## <chr> <chr> <chr>## 1 2022 6 14 ## 2 2023 1 22
my_data |> mutate(date_string = paste0(year,"-",month,"_",day), date=lubridate::as_date(date_string))
## # A tibble: 2 × 5## year month day date_string date ## <chr> <chr> <chr> <chr> <date> ## 1 2022 6 14 2022-6_14 2022-06-14## 2 2023 1 22 2023-1_22 2023-01-22
to_plot <- tibble(date=c("2019-01-01","2019-03-01","2020-01-01","2020-06-01"),value=rnorm(4,10,3))ggplot(to_plot,aes(x=date,y=value)) + geom_point()
to_plot <- to_plot |> mutate(date=ymd(date))ggplot(to_plot,aes(x=date,y=value)) + geom_point() + geom_line() + scale_x_date(breaks = as_date(c("2019-01-01","2019-03-01","2020-01-01","2020-06-01")), labels=c("Jan 2019","March 2019","Jan 2020","June 202")) + theme(axis.text.x = element_text(angle=45)) + scale_y_continuous(limits=c(0,30))
See Script Code Example November 8
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |