Data

Data Description

Dataset #1:

Our first dataset contains a record of all 311 service requests made in the city of Boston in 2021, with 273,951 observations across 20 variables. These variables include the time, location, closure status, case title, type, and the queue to which each case was assigned, which together classify a case into different categories. We retrieved our data from Analyze Boston, an open data hub for Boston, which in turn sources its data from the City of Boston.

Source: https://data.boston.gov/dataset/311-service-requests

Dataset #2:

Our second dataset comes from the 2020 U.S. Census. We retrieved the data using the tidycensus package in R, but the data can also be found at this link: https://data.boston.gov/dataset/2020-census-for-boston

This dataset contains ten variables, including GEOID, geometry, 5 variables for race (White, Black, Native, Asian, and Hispanic), a variable “total,” which is the sum of the populations of the 5 racial categories, and a variable for income. The observations are made at the sub-district level (i.e., each row corresponds to a sub-district of Boston).

Description of Variables:

Dataset #1:

In our first dataset, the variables of interest are as follows:

  • open_dt, target_dt, closed_dt: the date and time at which each case was opened, expected to be closed, and closed, respectively.

  • duration: a variable we created by calculating the difference between closed_dt and open_dt

  • ontime: a variable indicating whether or not the case was completed on time

  • case_status: indicates whether the case is open or closed

  • subject: the subject of a specific 311 case (e.g., Public Works, Parks and Recreation Department, etc.). There are 11 unique subjects across the dataset.

  • department: similar to subject but specifies the department responsible for each case. This variable has 16 unique values.

  • neighborhood: specifies the neighborhood in which the case request was made.

  • latitude and longitude: the coordinates of the location of each case request.

Dataset #2:

For our second dataset, we are interested in the following variables:

  • GEOID and geometry: these variables give us information about the geographies of the sub-districts of Boston. The geometry variable allows us to create maps of our data.

  • Race: these 5 variables (White, Black, Hispanic, Asian, and Native American) give the number of people of each race living in a given sub-district of Boston.

  • total: this variable states the total population of each sub-district of Boston (for all racial groups)

  • Income: this variable measures the mean income for each of the sub-districts of Boston

Data Cleaning

Our load_and_clean_data file can be found here: cleaning script

Dataset #1

Our first dataset required limited cleaning. The variables “open_dt,” “target_dt,” and “closed_dt” were initially date-time variables, so we converted them to dates for simplicity. We then used the difftime function to calculate the duration of each case from “open_dt” and “closed_dt.” We also removed several variables that we deemed unnecessary for our analysis and made minor modifications to the values of the “neighborhood” variable, since there were initially separate categories for “Allston,” “Brighton,” and “Allston / Brighton.”
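The cleaning steps for the first dataset can be sketched in base R roughly as follows. This is a minimal illustration and not the actual cleaning script: the three rows are invented, the date-time format is assumed, and collapsing the three neighborhood labels into “Allston / Brighton” is an assumption about how the categories were reconciled.

```r
# Toy stand-in for the 311 data; these rows are invented for illustration.
requests <- data.frame(
  open_dt      = c("2021-01-04 09:15:00", "2021-03-10 14:02:00", "2021-06-01 08:00:00"),
  closed_dt    = c("2021-01-06 16:40:00", "2021-03-10 17:30:00", NA),
  neighborhood = c("Allston", "Brighton", "Allston / Brighton"),
  stringsAsFactors = FALSE
)

# Parse a date-time string and keep only the calendar date.
# The "%Y-%m-%d %H:%M:%S" format is an assumption about the raw file.
to_date <- function(x) {
  as.Date(as.POSIXct(x, tz = "UTC", format = "%Y-%m-%d %H:%M:%S"), tz = "UTC")
}
requests$open_dt   <- to_date(requests$open_dt)
requests$closed_dt <- to_date(requests$closed_dt)

# Duration of each case in days; cases that are still open yield NA.
requests$duration <- difftime(requests$closed_dt, requests$open_dt, units = "days")

# Assumed recoding: fold the separate labels into one combined category.
split_labels <- c("Allston", "Brighton")
requests$neighborhood[requests$neighborhood %in% split_labels] <- "Allston / Brighton"

print(requests)
```

Because the date-times are truncated to dates before subtraction, a case opened and closed on the same day has a duration of zero days in this sketch.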

Dataset #2

Our second dataset did not require any initial cleaning, since we imported it directly with the “tidycensus” package. However, because of the nature of spatial data frames and the types of analysis we wanted to do, reformatting the data so that it could be merged with our first dataset was somewhat involved. To build our bos_census dataset, we imported each variable with “tidycensus” individually, then stitched the resulting data frames together using st_join(). We then used st_join() again to combine bos_census with our clean_311 dataset, matching the coordinates of each 311 case to the corresponding geometry in the census dataset. This allows us to analyze the distribution of 311 cases across the sub-districts of Boston while taking into account the racial demographics and income level of each sub-district.

We also wanted to create a map showing the relative populations of each sub-district of Boston when grouped by race. To do this, we needed to reshape our data with the pivot_longer function. Because pivot_longer is not designed for spatial data frames, we worked around the resulting error by creating a second dataset, bos_race_long, which conveys the same racial information as bos_census but stores it in 2 columns instead of 5: “variable” and “estimate.” “variable” specifies the race whose population “estimate” measures in the sub-district identified by “GEOID.” This format allows us to group by “variable” and create our map.
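As a rough illustration of the two reshaping steps described above, the sketch below joins points to polygons with st_join() and then builds a long-format race table. It is not the project's actual code. The square “sub-districts,” the population counts, and the three cases are invented; only two race columns are shown for brevity; the st_within predicate is an assumed choice; and dropping the geometry before pivot_longer and re-attaching it with left_join is just one possible workaround.

```r
library(sf)
library(dplyr)
library(tidyr)

# Helper that builds a square polygon (hypothetical, for the toy data only).
square <- function(xmin, ymin, size = 1) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin), c(xmin + size, ymin + size),
    c(xmin, ymin + size), c(xmin, ymin)
  )))
}

# Two invented sub-districts with invented population counts.
bos_census <- st_sf(
  GEOID = c("A", "B"),
  White = c(100, 40),
  Black = c(50, 60),
  geometry = st_sfc(square(0, 0), square(1, 0), crs = 4326)
)

# Three invented 311 cases, converted to point geometries.
clean_311 <- data.frame(
  case_id   = 1:3,
  longitude = c(0.5, 1.5, 1.2),
  latitude  = c(0.5, 0.5, 0.8)
)
cases_sf <- st_as_sf(clean_311, coords = c("longitude", "latitude"), crs = 4326)

# Spatial join: each case picks up the attributes of the polygon containing it.
cases_joined <- st_join(cases_sf, bos_census, join = st_within)

# Long format: drop the geometry, then pivot the race columns into
# a "variable" (race) column and an "estimate" (population) column.
bos_race_long <- pivot_longer(
  st_drop_geometry(bos_census),
  cols = c("White", "Black"),
  names_to = "variable",
  values_to = "estimate"
)

# Re-attach each sub-district's geometry so the long table can be mapped.
race_map_data <- left_join(bos_census["GEOID"], bos_race_long, by = "GEOID")

print(cases_joined)
print(race_map_data)
```

The resulting race_map_data has one row per sub-district and race, so a map can be faceted or grouped by “variable.”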

Header image source: https://healthitanalytics.com/news/data-integration-analytics-support-public-health-in-rhode-island

Homepage image source: https://medium.com/@MarkCoolski/what-is-digital-transformation-it-isnt-all-about-1s-and-0s-5da04383219e
