PUBPOL 750 Homework 1

Introduction

This homework mostly involves copying and pasting with some minor tweaks. It should not be too long. It is designed to make sure you can load data, transform data, and visualize it.

In this assigment, we will use data from Canadian Members of Parliament. Data is available here if you want to check it: https://lop.parl.ca/sites/ParlInfo/default/en_CA/People/parliamentarians (you click search and then export all data). But you don’t have to download anything. It’s all available for download on the class website along the slides and other content.

You have two things to hand in.

  1. Create a R script and save it. The file should look like: example_rscript_homework1.R. You will hand in this R script.
  2. In the R script, you will save figures using the ggsave() functions. These will be png figures. Include all figures in a Microsoft Word document. You will hand this in too.

We will take a look at gender representation in the House of Commons, by historical period and by party.

Homework

Load the three libraries we will need for this assignment. If they are not installed on your computer, install them with install.packages('nameoflibrary').

Take a quick look at Parliamentarians.xlsx in Excel or Numbers. When you load it in R you can also inspect it with `View()

Change the path of the file. On my computer, Parliamentarians.xlsx is in Downloads. Change this on your computer so it’s wherever the file is.

library(readxl)
library(tidyverse)
library(lubridate)

parliamentarians <- read_excel("~/Downloads/Parliamentarians.xlsx")

Run the code below to clean the data. It’s fine if this appears very complicated. It should get clearer as the semester advances. Each line has a comment (# are comments) explaining what is going on in the code.

But really, it’s also OK if you just copy paste this without understanding the details.

# filter by whether the character string "MP" is in the variable `Type of Parliamentarian`
parliamentarians <- parliamentarians |>
  filter(str_detect(`Type of Parliamentarian`, "MP"))

# create (mutate) variables for start and end date of the parliamentarian
# str_sub(x,5,14) returns the values between positions 5 and 14
# e.g. str_sub("abcdefghytr56",3,10) returns "cdefghyt"
# yms() coerces dates-as-string to date format 
parliamentarians <- parliamentarians |>
  mutate(date_started = ymd(str_sub(`Type of Parliamentarian`, 5, 14))) |>
  mutate(date_ended = str_sub(`Type of Parliamentarian`, 18, 27))

# Clean and extract party and province. These are regular expressions and it's fine if it's not clear. Regular expression extract/replace data based on patterns.
# If you are curious, take a look at this:
# https://jfjelstul.github.io/regular-expressions-tutorial/
parliamentarians <- parliamentarians |>
  mutate(party = trimws(str_extract(`Political Affiliation`, "[^\\()]+")),
         province = str_extract(`Province/Territory`, "[^\\\r]+"))

# Change format of date to date format
parliamentarians <- parliamentarians |>
  mutate(date_ended = ymd(date_ended, quiet = TRUE))
# If no end date, Parliamentarian is still in HoC so put today as end date.
# If you are very curious, look up difference between ifelse() and if_else()
parliamentarians <- parliamentarians |>
  mutate(date_ended = if_else(is.na(date_ended), ymd(today()), date_ended))

# Keep only those parties, let's not keep the small parties / indps
names_keep <- c("Conservative", "Conservative Party of Canada", "Liberal Party of Canada", 
                "New Democratic Party", "Progressive Conservative Party")
# Keep only variables we need
# Delete rows where missing values
parliamentarians <- parliamentarians |>
  filter(party %in% names_keep) |>
  select(Name, Gender, party, date_started, date_ended, province) |>
  na.omit()

Output a basic bar chart plotting gender on the x axis and counts on the y axis. You can use this code:

ggplot(parliamentarians,aes(x=Gender)) +
  geom_bar()

Save it using ggsave(). Play with the width and height parameters so that the graph looks decent. I save the figures on my desktop, but you can save them elsewhere by changing the path. You can add axes-labels/title by adding +labs(x="yourxlabel",y="yourylabel",title="yourtitle").

ggsave("~/Desktop/figure1.png",width=8,height=6)

Output the exact same bar chart, but calculating counts yourself, and then plotting it. Save as figure2.

to_plot <- parliamentarians |>
  group_by(Gender) |>
  summarise(count = n())

ggplot(to_plot,aes(x=Gender,y=count)) +
  geom_bar(stat='identity')
ggsave("~/Desktop/figure2.png",width=8,height=6)

Create a variable called year_period that cuts years in 4 groups (i.e. those MPs that started before 1900, 1950, 2000 and after 2000).

parliamentarians <- parliamentarians |>
  mutate(year = year(date_started),
         year_period = cut(year, breaks = c(1860, 1900, 1950, 2000, 2025), dig.lab = 4))

Plot the number of male and female MPs per time period. Save as figure3 using ggsave().

ggplot(parliamentarians,aes(x=Gender)) +
  geom_bar() +
  facet_wrap(~year_period) + 
  scale_y_continuous(limits=c(0,1300))

Plot the number of male and female MPs per time period and party. Save as figure4 using ggsave().

ggplot(parliamentarians,aes(x=Gender,fill=party)) +
  geom_bar() +
  facet_grid(cols=vars(party),rows=vars(year_period)) +
  theme(legend.position = "none") +
  scale_fill_manual(values=c("darkblue","darkblue","red","orange","darkblue"))

Transform the party variable so that all the various “Conservative” parties are together mapped to “Conservative”. Then rerun the last figure. Save as figure5 using ggsave(). You will have to change the five color values to three.

parliamentarians <- parliamentarians |>
  mutate(party = recode(party, 
                        `Conservative Party of Canada` = "Conservative", 
                        `Progressive Conservative Party` = "Conservative"))

Lastly, try to replicate figure5 exactly by first grouping by party and year period and gender (a triple group_), then using summarise() to calculate counts by groups. Then replicate figure5 by adding stat='identity' inside geom_bar(). You don’t have to include this figure in the Word document, it’s just an additional challenge to see that you can do plots in different ways.

Optional: if you want to try more data analysis, mutate a new variable called year_in_office with mutate(year_in_office=year(date_ended)-year(date_started)). Filter parliamentarians before 2000 because those after 2000 might actually have a smaller number of years in office (e.g. all those who were elected in 2015 or 2019). Try to use geom_density() with various facets / colors. Try to use summarise() over year_in_office by different groups. Can you find something strange in the density plot? Why is it so “long” to the right? Why could it be like that?