Week 4: Importing data and tidy data

# Week 4: Importing data and tidy data
## PUBPOL 750 Data Analysis for Public Policy I
### Justin Savoie
### MPP-DS McMaster
### 2023-10-03

---

# Homework 1

Solution code for HW1 posted on the website tomorrow.

We will discuss it next week.

---

# Paths

---

# Code style

This is chapter 5 in the book. Optional reading.

---

# Names

- Use lowercase separated with `_`.

- Use names that are easy to understand.

- Be consistent. For example don't have: `dataset_ON`, `dataset_QC` and `datas_BC` if these three objects are three datasets for the three provinces.

---

# Spaces

- Put spaces on either side of mathematical operators apart from `^` (i.e. `+`, `-`, `==`, `<`, ...), and around the assignment operator (`<-`).

```r
# Strive for
z <- (a + b)^2 / d
# Avoid
z<-( a + b ) ^ 2/d
```

- Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma.

```r
# Strive for
mean(x, na.rm = TRUE)
# Avoid
mean (x ,na.rm=TRUE)
```

---

# Spaces

- It’s OK to add extra spaces if it improves alignment.

```r
flights |> 
  mutate(
    speed      = distance / air_time,
    dep_hour   = dep_time %/% 100,
    dep_minute = dep_time %%  100
  )
```

---

# Style

One advice: imitate the code "style" that you see in reputable sources for example in modern R texbooks.

- https://r4ds.hadley.nz

- https://moderndive.com

- https://tellingstorieswithdata.com/

Try to be consistent and coherent.

---

# The pipe (`|>`)

- We use the pipe `|>` to emphasise a sequence of actions

- The pipe means "THEN"

- `|>` should always have a space before it and should typically

```r
# Strive for 
flights |>  
  filter(!is.na(arr_delay), !is.na(tailnum)) |> 
  count(dest)
# Avoid
flights|>filter(!is.na(arr_delay),!is.na(tailnum))|>count(dest)
```
---

# The `|>`

- If the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line.

```r
flights |>  
  group_by(tailnum) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
```

---

# The `|>`

- Don't write pipes that are too long. The book says 10-15 lines, use judgment.

- Break into smaller tasks.

---

# ggplot2

- Same rules apply to ggplot2 plots.

```r
flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(arr_delay, na.rm = TRUE)
  ) |> 
  ggplot(aes(x = month, y = delay)) +
  geom_point() + 
  geom_line()
```

---

# Data importing
---

# Reading a csv

```r
library(tidyverse)
Parliamentarians_subset100 <- read_csv("https://www.justinsavoie.com/pubpol750-2022/homework1/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("~/Downloads/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Downloads/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("/Users/justinsavoie/Documents/personal_repos/personal_website/pubpol750-2022/Homework1/Parliamentarians_subset100.csv")
Parliamentarians_subset100 <- read_csv("C:\\Users\\justinsavoie\\Downloads/Parliamentarians_subset100.csv")
```

# Reading an excel file

```r
library(readxl)
Parliamentarians_subset100 <- read_excel("~/Downloads/Parliamentarians.xlsx")
```
---
# Reading spss, sas, stata files

```r
library(haven)
#read_spss()
#read_sas()
#read_stata()
library(readstata13)
#readstata13::read.dta13()
```

---

# Tidy data
---

# Tidy data

(A)
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.

(B)
- Put each dataset in a tibble.
- Put each variable in a column.

---

---

---

# Why tidy data?

- **picking one consistent way of storing data** (and tidyverse functions work well on tidy data)

- works well in R because of vectorization

---

```r
table1
```

```
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
```

```r
table2
```

```
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
```
]
.pull-right[

```r
table3
```

```
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
```

```r
table4a
```

```
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```

```r
table4b
```

```
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
```
]
---
# Pivoting

- Pivoting wider
- Pivoting longer
---
# Pivoting wider

```r
table4a
```

```
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```

```r
table4a |>
  pivot_longer(
    cols=c(`1999`,`2000`),
    names_to = 'year',
    values_to = 'cases'
  )
```

```
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
```

---

```r
(tidy4a <- table4a |> 
  pivot_longer(
    c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  ))
```

```r
(tidy4b <- table4b |> 
  pivot_longer(
    c(`1999`, `2000`),
    names_to = "year",
    values_to = "population"
  ))
```

```
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <dbl>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583
```
---
# Pivoting wider

```r
head(table2,3)
```

```
## # A tibble: 3 × 4
##   country      year type          count
##   <chr>       <dbl> <chr>         <dbl>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666
```

```r
table2 |>
    pivot_wider(
      names_from = type,
      values_from = count)
```

---

# Missing values

- Explicit vs Implicit

```r
(stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
))
```

```
## # A tibble: 7 × 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     2   0.92
## 6  2016     3   0.17
## 7  2016     4   2.66
```

---

- Force explicit

```r
stocks |> 
  pivot_wider(
    names_from = year,
    values_from = return)
```

```
## # A tibble: 4 × 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88  NA   
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  NA      2.66
```

---

```r
stocks |> 
  pivot_wider(
    names_from = year,
    values_from = return
  ) |>
  pivot_longer(
    cols=c(`2015`,`2016`),
    names_to='year',
    values_to='return'
  )
```

```
## # A tibble: 8 × 3
##     qtr year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     1 2016   NA   
## 3     2 2015    0.59
## 4     2 2016    0.92
## 5     3 2015    0.35
## 6     3 2016    0.17
## 7     4 2015   NA   
## 8     4 2016    2.66
```
---

```r
stocks |> 
  pivot_wider(
    names_from = year,
    values_from = return
  ) |>
  pivot_longer(
    cols=c(`2015`,`2016`),
    names_to='year',
    values_to='return',
    values_drop_na = TRUE
  )
```

```
## # A tibble: 6 × 3
##     qtr year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     2 2016    0.92
## 4     3 2015    0.35
## 5     3 2016    0.17
## 6     4 2016    2.66
```
---

```
## # A tibble: 6 × 3
##     qtr year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     2 2016    0.92
## 4     3 2015    0.35
## 5     3 2016    0.17
## 6     4 2016    2.66
```

---

# Exercices

### 6.2.1