M.S. in Data Science Students Showcase Predictive Modeling Projects
M.S. in Data Science students at the University of Virginia recently presented in the School's Capital One Hub as part of their class "Machine Learning I: Introduction to Predictive Modeling." Teams covered a variety of topics including music, medicine, sports, food insecurity, and environmental concerns for their semester-long final project. The course challenges students to demonstrate their ability to discover reputable data, build predictive models, and discuss their research in the company of their peers and professor.
Professor Prince Afriyie taught "Introduction to Predictive Modeling" this fall and said, "Watching the students present is genuinely the most rewarding part of the semester. I intentionally kept the rubric open so they could choose datasets that genuinely interested them, frame their own research questions, explore the data, and build models to generate meaningful insights."
He reflected on the wide range of topics and pointed out that project diversity "powerfully illustrates how pervasive the field of data science is and how adaptable our students are in applying core methods across domains." He went on to say how incredibly proud he was of this cohort. "Their curiosity, creativity, and professionalism made this a special semester, and I will genuinely miss teaching them in the spring."
Teams competed for awards in a student-run voting competition after all presentations concluded. The team who presented "Congressional Transparency & Legislative Outcomes Prediction" swept the stage, placing first for Best Overall Presentation, Best Data Visualization, Best Dashboard, Most Impactful Model, and Best Storytelling. The "Golden Ratio Project" Team won first place for Most Creative Topic & Dataset.
Project Descriptions
Bean There, Analyzed That!
Using data gathered by the Coffee Quality Institute, these students built several statistical models to predict coffee quality. By fitting linear and logistic regression, K-Means, KNN, and MLP models, they gained insight into variables such as country of origin, coffee species, coffee sweetness, and altitude. They found that both country of origin and species are robust predictors of overall coffee quality, that sweetness alone serves as a reliable indicator of coffee bean species, and that age and altitude have opposing influences on taste and quality for different species.
Student Team: Marissa Burton, Hayeon Chung, Maggie Crowner, Asmita Kadam, and Ashrita Kodali
Bias, Popularity, and Timing: A Data-Driven Study of Album Review Scores
Music criticism often feels subjective, but these students aimed to predict album review scores using data. Their project merges web-scraped reviews from Pitchfork with streaming metrics from Spotify to uncover what drives an album’s rating. By analyzing factors such as artist popularity and release timing, they identified trends and biases in the data to gain a better understanding of how critics rate music.
Student Team: Rameez Ali, Sam Kunitz-Levy, Finn Sjue, Mauricio Torres, and Heywood Williams-Tracy
Congressional Transparency and Legislative Outcomes Prediction
VoteScope is an app designed to make Congressional behavior clearer, more accessible, and more accountable to the public. It provides clean, data-driven snapshots of how every member of Congress votes—revealing their positions across major policy areas, their ideological and party-loyalty patterns, and which lawmakers most closely align with a citizen’s own values. This app also includes a forecasting engine that predicts how legislators are likely to vote on hypothetical or future bills based on historical voting trends. Together, these features offer an unprecedented level of transparency into Congress and empower voters to better understand, compare, and anticipate legislative decision-making.
Student Team: Lino De Ros, Steve Ferenzi, Jackson Kennedy, Hudson Noyes, James Sweat, and Nathan Todd
A Data-Driven Analysis of SNAP Benefits and Policy Across U.S. States
Students analyzed U.S. SNAP (food stamp) data, inspired by how government shutdowns can severely impact millions of Americans who rely on these programs. Food assistance is often surrounded by assumptions about who benefits from it, and they wanted to challenge those assumptions using data. By examining benefits per person, participation rates, and state-level policy classifications across different demographic and socioeconomic factors, they found that SNAP benefits support a wide range of communities. Using regression, classification models, and unsupervised methods like K-means and PCA, they showed that SNAP policy generosity behaves more like a continuous spectrum than like clearly separated ‘Low,’ ‘Moderate,’ or ‘High’ categories. This helps explain why benefits and outcomes overlap across states and highlights that SNAP is a broadly impactful program shaped by many interconnected factors.
Student Team: Razan Habboub, Aeon Levy, Arnav Jai, Shawn Ding, and Grace George
Fuel Economy Analysis
Have cars actually gotten more fuel efficient? Do the physical attributes of cars actually affect their gas mileage? What factors tend to identify which cars are most like each other? With data from the U.S. Department of Energy on over 36,000 vehicles from 1984 to 2017, we can begin to answer these questions and understand just what has been happening with our cars.
Student Team: Andrew Pavlak, Aiden Rocha, Joseph Hudson, Will Novak, Sammy Aridi, and Hongfei Zhu
Golden Ratio Project
While there is no single definition of beauty, humans have a natural ability to recognize recurring patterns and proportions, such as the golden ratio, that appear throughout art and aesthetics. After gathering headshots of their fellow cohort members, these Data Science Master’s students used Google’s AI image analysis package, MediaPipe, to analyze faces. After creating five separate models, they explored research questions about how facial feature proportions, as well as various demographics, relate to perceived symmetry and alignment with the golden ratio.
Student Team: Stephanie Delgadillo, Sheyi Faparusi, Jillian Howe, Sophie Kim, and Bella Lu
Mental Health and Adverse Childhood Experiences: A Relational Investigation
The goal of this group was to investigate the relationship between adverse childhood experiences (ACE) and mental health. They used unsupervised models like Hierarchical Cluster Analysis to explore patterns and find clusters within the data, identifying two main groups: people who experienced high ACE and people who did not. Using supervised models like logistic regression and KNN, they were able to predict whether or not a person would be diagnosed with depression based on ACE with 73-79% accuracy. To round out their investigation, they used a multiple linear regression model to assess how much ACE affects mental health when other potential mental health predictors are included.
Student Team: Randa Ampah, Isabel Delgado, Jessica Oseghale, and Aniyah McWilliams
Predicting and Classifying Player Movement in the National Football League
In this project, the group used NFL spatial tracking data, sourced from Kaggle’s NFL Big Data Bowl 2026, to implement various machine learning models and methods. In predicting outcomes such as player position on the field, the expected play type, or the result of a pass play, they explored features like player speed, acceleration, and orientation in an attempt to outline the intricacies and dynamics of American football. Overall, the final analysis aims to inform both fundamental knowledge of the sport and the design of machine learning models, while clarifying the strengths and limitations that come with trying to capture the complexity of a sport through numbers alone. Their models can help inform decision-making and predict different outcomes in a given play.
Student Team: Emmett Hannam, Jarrett Markman, Weston Williams, and Jeffery Zhang
Predictive Modeling Approaches to U.S. Food Insecurity
Food insecurity, one of our nation's most rampant problems that impacts millions of Americans each and every day, is a complex issue that requires complex solutions. Using several different modeling techniques, these students aim to provide clarity on the subject and supply policy makers, non-profits, and other key stakeholders with valuable context on the issue. Though this research is just one of the countless steps required to solve food insecurity across the United States, they believe their contributions add to the discussion of potential solutions that decision makers can use to steer the nation in a more effective direction.
Student Team: Muhammad Amjad, Reed Baumgardner, Thomas Blalock, Helen Corbat, Max Ellingsen, and John Twomey
Predictive Modeling Framework for MLB Free-Agent Contracts
For an MLB general manager, free-agent contract negotiations are critical to a team’s long-term success. Overpaying a player can strain a team’s finances for years, while underpaying may result in losing a potential franchise cornerstone. Accurate predictions of a player’s expected contract value would allow teams to make competitive offers without overextending financially. This team’s MLB contract prediction tool uses advanced modeling techniques to forecast both the likely contract length and the average annual value (AAV) for free-agent players.
Student Team: William Brannock, Joseph Kaminetz, Faizan Khan, Garret Knapp, and Nathan Wan
Predicting Movie Ratings Using Engagement and Production Features
This project analyzes a large collection of movie metadata sourced from The Movie Database API and audience rating records from the GroupLens MovieLens platform. These students set out to understand which characteristics of a film, such as production scale or audience engagement, are most predictive of its average viewer rating. Using models ranging from linear regression to neural networks, they found that an MLP best captures the nonlinear patterns that drive audience response. Their results highlight how engagement indicators dominate rating behavior and show the value of combining interpretable analytics with advanced predictive modeling.
Student Team: Mason Earp, Nathan Harris, Tianyin Mao, Sabine Segaloff, and Nicholas Thorton
Something in the Air: Wildfire Impact on Air Quality
Wildfires pose a clear and present danger to humans, wildlife, and flora alike. A notable impact is air quality degradation. Here, these students explore the many ways that wildfires, weather, time, and location interact to affect air quality. Their target variable for quantifying air quality, PM2.5 concentration (in µg/m³), describes the amount of fine particulate matter in the air. They faced prediction challenges in their multi-model approach, but ultimately captured trends and conditions, providing actionable insights for protecting environmental and public health. Their project also includes an interactive web app.
Student Team: Caroline Kranefuss, Anna Li, Karina Mehta, and Shaveen Saadee
Tundra Tree Stress Modeling
This group investigated how environmental factors such as humidity and soil volumetric water content interact to influence tree stem amplitude changes in spruces at the Arctic treeline. This work can help reveal how Arctic trees respond to environmental stress, which can inform predictions about ecosystem resilience under climate change as the Arctic warms at more than four times the global average.
Student Team: Isaac Tabor, Lucas Rayder, Seth Spire, Michael Dunlap, and Cole Whittington
Utilizing Machine Learning to Understand Adolescent Psychiatric Development
Adolescence is a period of rapid neurodevelopment when psychiatric symptoms often emerge. Different forms of brain imaging data, such as resting-state fMRI functional connectivity (FC), sMRI structural connectivity (SC), and T1 morphometric features, capture complementary aspects of brain organization. Similarly, cognitive testing provides a higher-level look into brain organization that cannot easily be identified by imaging alone. This project seeks the best machine learning methods for incorporating all of these data, using standard machine learning practice as well as a new graph learning method.
Student Team: Ethan Meidinger



