Dataset Proposal

2022-03-04

Dataset 1

https://data.boston.gov/dataset/311-service-requests

The dataset contains 273951 observations across 20 variables, recording every Boston 311 request for non-emergency city services made in 2021. The variables include time, location, closure status, case title, type, and the queue to which each case was assigned, which together classify a case into different categories. The dataset is already clean, but we can do further cleaning on some columns to facilitate the analysis. We hope to use this dataset to examine the distribution of reasons for non-emergency service requests. We are interested in seeing which types of requests are most common, and whether any types of services are more or less likely to be fulfilled. We will also use the zip code variable to determine whether there is any association between location and the type or frequency of requests. The dataset also includes the date and time at which each case was opened and closed, so we can examine whether there is any relationship between the type and duration of a case. The analysis may, however, require expertise in city services, which we lack. We may also run into issues with missing values.
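As a rough sketch of the request-type and case-duration analysis described above, the snippet below counts cases per type and computes the median time to closure for each type. It uses only the Python standard library and a tiny made-up sample. The column names (`type`, `open_dt`, `closed_dt`) and the timestamp format are assumptions and would need to be checked against the downloaded file.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime
from statistics import median

# Made-up rows standing in for the real CSV; column names are assumptions.
SAMPLE = """type,open_dt,closed_dt
Parking Enforcement,2021-01-01 08:00:00,2021-01-01 10:00:00
Parking Enforcement,2021-01-02 09:00:00,2021-01-02 15:00:00
Missed Trash,2021-01-03 07:00:00,2021-01-04 07:00:00
Missed Trash,2021-01-05 07:00:00,
"""

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def summarize(rows):
    """Return (cases per type, median hours to closure per type)."""
    counts = Counter()
    hours = defaultdict(list)
    for row in rows:
        counts[row["type"]] += 1
        # Cases with no closing time are counted but have no duration.
        if row["closed_dt"]:
            opened = datetime.strptime(row["open_dt"], FMT)
            closed = datetime.strptime(row["closed_dt"], FMT)
            hours[row["type"]].append((closed - opened).total_seconds() / 3600)
    medians = {case_type: median(values) for case_type, values in hours.items()}
    return counts, medians


counts, medians = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(counts.most_common())
print(medians)
```

The median is used here rather than the mean because a few very long-running cases could otherwise dominate the comparison; that choice is ours and is open to revision once we see the real distribution.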

Dataset 2

https://carnegieclassifications.iu.edu/downloads.php

This dataset contains information about all US colleges and universities for the year 2021. It includes 101 variables and nearly 4000 observations. Most of the data comes from the CCIHE (Carnegie Classification of Institutions of Higher Education) or IPEDS (Integrated Postsecondary Education Data System), with a few of the variables coming from other sources. The data is already mostly clean, although we may find that some cleaning is necessary once we begin working with the dataset. With this many variables, there are many interesting trends to look into. We might want to compare data on admissions and selectivity across different types of colleges (e.g. public vs private, minority serving vs non-minority serving institutions, city vs rural/suburban, etc). We could also look at how funding for research and development differs across different types of institutions. We are interested in looking at the distribution of resources available at different types of institutions, and determining whether or not this distribution seems equitable. Some variables have a lot of missing values, so we may have difficulties if we want to examine them. This dataset also contains no categorical data (aside from college name and city), which might make some comparisons more difficult.

Dataset 3

https://www.kaggle.com/jpmiller/police-violence-in-the-us?select=police_killings_MPV.csv

This dataset comes from the larger directory entitled Police Violence and Racial Equity, and the source for the dataset we are interested in is called Mapping Police Violence. This specific dataset contains information on police killings. The variables included in the dataset are the victim’s name, gender, age, and race, as well as the date, city, and state of the incident. Additional variables contextualize the incidents (such as cause of death, agency held responsible, criminal charges, etc). This dataset has 29 variables in total, and roughly 8000 rows. The data is already mostly clean, although we expect to do some further cleaning after loading it. We are interested in looking at the overall demographics of victims of police killings, as well as identifying any trends in the context in which these incidents happen. We will look in depth at data regarding the victim’s gender and race, as well as data on the date and location of the incident. There is a lot of missing data, especially for some of the demographic variables, which will make some of our analysis difficult. This is also an extremely sensitive subject, so it may be difficult to conduct a purely objective analysis of the data.
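Since missing data is a concern for this dataset, a first step could be to measure how much of each column is missing. The snippet below is a minimal sketch using only the Python standard library. The sample rows and column names are made up for illustration, and the set of strings treated as missing (`""`, `"NA"`, `"Unknown"`) is an assumption that would need to be checked against how the real file encodes missing values.

```python
import csv
import io

# Made-up rows for illustration only; not taken from the real dataset.
SAMPLE = """id,age,race,state
1,34,,CA
2,,White,TX
3,27,Black,
4,,Unknown,NY
"""


def missing_share(rows, blanks=("", "NA", "Unknown")):
    """Return the fraction of missing entries in each column."""
    rows = list(rows)
    if not rows:
        return {}
    return {
        column: sum(row[column].strip() in blanks for row in rows) / len(rows)
        for column in rows[0].keys()
    }


shares = missing_share(csv.DictReader(io.StringIO(SAMPLE)))
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

Columns with a high missing share could then be set aside or examined separately before any demographic comparison.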