The U.S. Census Bureau engaged a team of researchers to define essential metadata fields and automate citation generation, with the goal of improving the discoverability of its statistical products. 

Researchers Angela Albarano, Chi Do, Will Milch, and Rebecca Van Nostrand undertook a capstone project on statistical product metadata and citation as part of their Master's in Data Science curriculum.

The team's objectives were to create an inventory of data products and their associated metadata, generate citations for those products, document site issues, and investigate how other organizations use metadata.
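
To illustrate the citation objective, here is a minimal sketch of how a citation string might be assembled from a product's metadata. The field names (`author`, `year`, `title`, `url`, `accessed`) and the citation layout are assumptions made for this example; they are not the schema or citation style the project adopted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductMetadata:
    # Hypothetical field names; the project's actual metadata schema is not given here.
    author: str
    year: int
    title: str
    url: str
    accessed: Optional[str] = None  # optional access date, e.g. "2025-05-01"


def format_citation(meta: ProductMetadata) -> str:
    """Assemble a citation string from metadata fields (illustrative layout only)."""
    citation = f"{meta.author} ({meta.year}). {meta.title}. {meta.url}"
    # Append the access date only when one was recorded.
    if meta.accessed:
        citation += f" (accessed {meta.accessed})"
    return citation
```

Keeping the metadata in a typed record means a missing required field fails when the record is built rather than producing an incomplete citation.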

User feedback revealed that navigating Census websites is often challenging, with users frequently struggling to locate the information they need and to discover existing data. The team encountered many of the same difficulties and documented specific instances below to support actionable remediation.

To better understand site content and structure, the team conducted an analysis using both manual and automated methods. The manual assessment provided insights into the user experience and helped contextualize existing end-user feedback.

In parallel, automated scripts were used to scrape and crawl Census websites, and the datasets API endpoint was used to collect information on available data products across Census and related domains. The resulting inventory is a CSV file listing the available data products and their corresponding paths. Additionally, random samples of Census PDF files were downloaded for ingestion and metadata extraction to help identify key descriptive fields.
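
As a rough sketch of the inventory step, the code below turns an already-parsed catalog into CSV rows of product titles and paths. It assumes the catalog is a dictionary with a `dataset` list whose entries carry a `title` and a `distribution` list with `accessURL` values, a shape common to open-data catalog JSON. The function names and the CSV columns (`title`, `domain`, `path`) are hypothetical and are not taken from the team's scripts.

```python
import csv
from urllib.parse import urlparse

FIELDNAMES = ["title", "domain", "path"]  # hypothetical inventory columns


def build_inventory(catalog: dict) -> list:
    """Flatten an assumed catalog structure into one row per distribution URL."""
    rows = []
    for entry in catalog.get("dataset", []):
        title = entry.get("title", "")
        for dist in entry.get("distribution", []):
            url = dist.get("accessURL", "")
            if not url:
                continue  # skip distributions that list no URL
            parsed = urlparse(url)
            rows.append({"title": title, "domain": parsed.netloc, "path": parsed.path})
    return rows


def write_inventory_csv(rows: list, handle) -> None:
    """Write inventory rows to an open text file handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```

In use, the catalog would be loaded from a JSON response or file, passed to `build_inventory`, and written with `write_inventory_csv` to a file opened with `newline=""`, as the `csv` module documentation recommends.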

Researchers: Angela Albarano, Chi Do, Will Milch, Rebecca Van Nostrand

Sponsors: Cass Dorius, Emily Molfino

Advisors: Philip Waggoner

Completed in: 2025