The School Explores Latest Developments in Computer Vision

Researchers at the University of Virginia School of Data Science recently shared new work on computer vision and generative AI during the Computer and Autonomous Vision Systems symposium, an event highlighting advances in how machines understand and generate visual information. The symposium, organized by Lei Li, assistant professor of data science, featured talks from Yen-Ling Kuo and Zezhou Cheng, assistant professors of computer science at the UVA School of Engineering, along with guest speaker Jia-Bin Huang, associate professor of computer science at the University of Maryland, whose work focuses on controllable generative models.

Computer vision, the effort to build systems that interpret visual data, has advanced rapidly in recent years. Researchers emphasized, however, that many systems still struggle to interpret the physical structure of the world or give users precise control over generated images and videos.

Teaching Machines to Understand the Physical World

Yen-Ling Kuo studies how visual systems can learn about the physical environment in ways that resemble human reasoning. Kuo's work, explored in her talk, "Learning Physically and Cognitively Grounded Robotic Assistants," examines how machines can infer properties like object interactions and spatial relationships from visual inputs.

Kuo's talk highlighted the difficulty of transferring spatial skills that come naturally to humans to robots, while emphasizing ways robots can bridge this gap. She presented two approaches: integrating multiple modalities and employing Theory of Mind reasoning to ground human behaviors. "The system is designed to acquire a theory of mind: the ability to infer a partner's intentions, beliefs, and needs so the agent can coordinate and provide appropriate assistance," Kuo said.

Her research highlights the importance of grounding artificial intelligence in the physical world rather than relying only on pattern recognition. By incorporating knowledge about how objects behave and interact, computer vision systems could better support applications such as robotics, autonomous systems, and embodied AI.

Yen-Ling Kuo presenting at the School of Data Science.

Rethinking How Vision Models Learn

In his talk, "Rethinking Self-Supervised Learning in Computer Vision: Challenges, Tasks, and Methods," Zezhou Cheng presented research on self-supervised learning, a method that allows models to learn visual representations from large amounts of unlabeled data. While such models have achieved strong results on common benchmarks, Cheng argued that these metrics do not always reflect real visual understanding.

“The classical definition of computer vision is to build a machine that can perceive the three-dimensional world,” Cheng said. He evaluated 22 self-supervised models across tasks like depth, geometry, and motion. The findings suggest that strong performance on popular image benchmarks does not necessarily translate to better understanding of 3D structure.

Cheng also introduced new tools for studying visual motion and geometry, including datasets of dynamic indoor scenes and methods for learning camera motion and object movement directly from video.

Zezhou Cheng presenting research on self-supervised learning.

Making Generative AI More Controllable

Guest speaker Jia-Bin Huang discussed work aimed at giving users greater control over generative AI systems. While modern tools can generate realistic images from text prompts, they often behave like “black boxes,” making it difficult to produce exactly the scene a user intends.

His talk, "Controllable Visual Imagination," explored new interfaces and algorithms that allow more precise interaction with generative models. These include techniques for editing specific regions of an image, separating video scenes into layers for flexible manipulation, and controlling motion using three-dimensional trajectories. Huang concluded that the goal is to give users more agency over visual imagination. "To achieve this," he said, "we expose a variety of control handles that enable finer-grained, more intuitive control over generated and edited visual content." 

Jia-Bin Huang presenting at the Computer and Autonomous Vision Systems symposium.

Together, the talks highlighted the next challenges in computer vision: building models that not only recognize visual patterns but also understand the physical world and respond directly to human intent. As research at UVA continues to explore these questions, the field is moving closer to systems that can interpret the visual world with greater accuracy and control.