Developing a data pipeline to improve accessibility and utilization of Charlottesville’s Open Data Portal

To improve democratic engagement between the public and the government, the city of Charlottesville created an online portal containing data from city departments. This move was an effort to promote access to data pertinent to policy debates in the city, and to incentivize the public to contribute to the policy-making process with informed participation.

Unfortunately, such efforts, while successful at their start, have gradually stagnated, and the end objective of the portal has not been reached.

SIEDS posterMaster of Science in Data Science researchers Lucas Beane and Elena Gillis, advised by Data Science Institute professor Rafael Alvarado and School of Engineering professor Caitlin Wylie, investigated possible reasons for this stagnation – including inconsistent formatting of the datasets, variables that are not meant for human legibility, and limited data with disproportional representation from the city departments.

The pair proposed a data pipeline that would serve as a tool to extract utility from the data. The pipeline does so by converting the datasets into a consistent format, merges the datasets, and allows for creation of simple visualizations. The pipeline acts as a link between the raw data published by the government units and the city by increasing the data’s interpretability and legibility and outputting results that are easily relatable to the policy issues at hand.

An analysis of datasets for crime and real estate, and relating findings to the affordable housing debate, demonstrate the advantages of a data pipeline.

The team encountered myriad problems that could potentially explain the dearth of users of the portal. Chief among them is the lack of interpretability of the data—though the datasets have descriptions, they lack extensive data dictionaries, so many of the features therein are unusable. This is likely due to the data collection policy, which has few guidelines to allow for consistency across datasets. Additionally, Beane and Gillis found a disproportionate number of representative datasets for some departments, perhaps because they were willing and able to offer up more data. If a user isn’t interested in the overrepresented departments, this trend could disenfranchise them simply because the available data do not form a reliable picture of Charlottesville. There also still exist simple problems of accessibility; at times, some datasets on the portal are not available. On top of that, several datasets are simply too large for the average home computer to utilize.

Addressing these problems would be an important first step toward achieving a wider user base for Charlottesville’s open data portal.

Completed in:
2019
Project type:
Advisor:

Caitlin Wylie