Research Papers from Tom Hartvigsen of the School of Data Science to Be Featured at Prestigious Conferences
Six papers co-authored by Tom Hartvigsen and his colleagues have been accepted by two of the top conferences in machine learning and natural language processing.
Hartvigsen, who joined the University of Virginia’s School of Data Science in 2023, leads a research group focused on responsible AI in ever-changing environments. This includes learning from changing data, often time series, and developing ways to keep models up to date and aligned with people’s needs, with a particular emphasis on ensuring models remain equitable.
Three of the papers were accepted by the Annual Conference on Neural Information Processing Systems, known as NeurIPS, which will be held in Vancouver in December. NeurIPS is widely considered the top publication venue for machine learning research.
Additionally, Hartvigsen and others will present three papers at the 2024 Conference on Empirical Methods in Natural Language Processing in Miami in November. Known as EMNLP, this conference is a leading publication venue for natural language processing.
Check out each of the papers co-authored by Hartvigsen to learn more about these research projects:
Accepted at NeurIPS 2024
- In a paper that examines the effectiveness of large language models in time series forecasting — titled “Are Language Models Actually Useful for Time Series Forecasting?” — Hartvigsen worked with Mingtian Tan of UVA and Mike Merrill, Vinayak Gupta, and Tim Althoff of the University of Washington to identify a major oversight in popular recent methods that integrate LLMs into time series forecasting. By removing the LLM components entirely, Hartvigsen’s team found that forecasting performance often improved, suggesting an expensive illusion of progress in recent forecasting research.
- A collaboration with Walter Gerych, Haoran Zhang, Kimia Hamidieh, Eileen Pan, Maanas Sharma, and Marzyeh Ghassemi of MIT, this work, titled “Test-Time Debiasing of Vision-Language Embeddings,” develops a new method for correcting harmful societal biases in powerful vision-language models, which drive the image-generation abilities of popular systems like ChatGPT. Without incurring expensive training costs, their method is the first to consider potential biases in individual inputs to these systems, lowering the chances that they perpetuate harmful stereotypes.
- Teaming up with Shanghua Gao, Owen Queen, and Marinka Zitnik of Harvard, as well as Teddy Koker and Theodoros Tsiligkaridis of the MIT Lincoln Laboratory, this project — titled “UniTS: A Unified Multi-Task Time Series Model” — develops a new foundation model for time series data. By training one unified model to perform multiple tasks simultaneously, the team achieved state-of-the-art performance on a wide range of tasks, outperforming 66 models on 38 datasets. The model can also be prompted and learns to solve new tasks using very little data.
Accepted at EMNLP 2024
- Working with Bryan Christ, a Ph.D. student at the School of Data Science, and Jonathan Kropko, a UVA Quantitative Foundation Associate Professor of Data Science, Hartvigsen co-authored a paper titled “MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations,” which demonstrates how large language models can generate educationally appropriate word problems for K-8 mathematics classrooms.
- In a second collaboration with Merrill, Gupta, Althoff, and Tan, this paper — titled “Language Models Still Struggle to Zero-shot Reason about Time Series” — develops the first evaluation framework for time series reasoning. Backed by a large, human-validated benchmark, this work sets the stage for developing AI systems that can reason about the world through the lens of time series data, a pervasive data modality.
- Teaming up with a large group of researchers, including co-first authors Jack Gallifant of MIT and Shan Chen of Harvard, Mass General Brigham, and Boston Children’s Hospital, this study, titled “Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks,” evaluates the resilience of 23 recent and popular LLMs to swapping brand and generic drug names in simple medical question-answering tasks, a change that should not affect performance. Surprisingly, these models proved fragile, with performance dropping by 1% to 10%. This suggests a key weakness of these models in health care applications, where knowledge of drug names can be critical.