1-2 p.m.; Controllable Visual Imagination with Jia-Bin Huang
Abstract: Generative models have empowered human creators to visualize their imaginations without requiring artistic skill or labor. A prominent example is large-scale text-to-image/video generation models. However, these models are often difficult to control and frequently fail to respect 3D perspective geometry or the temporal consistency of videos. In this talk, I will showcase several of our recent efforts to improve controllability for visual imagination. Specifically, I will discuss how we enable semantic and spatial control for 2D image generation, facilitate layered decompositions for video editing, and synthesize object and camera motions from monocular videos.
Bio: Jia-Bin Huang is a Capital One-endowed Associate Professor of Computer Science at the University of Maryland, College Park. Before coming to UMD, Huang was a research scientist at Meta Reality Labs and an Assistant Professor of Electrical and Computer Engineering at Virginia Tech. Huang received his Ph.D. from the University of Illinois, Urbana-Champaign (UIUC) in 2016. His research interests include 3D computer vision, generative models, and computational photography. Huang is the recipient of the Thomas & Margaret Huang Award, the NSF CRII Award, faculty awards from Samsung, Google, 3M, Qualcomm, and Meta, and a Google Research Scholar Award.
2-3 p.m.; Rethinking Self-Supervised Learning in Computer Vision: Challenges, Tasks, and Methods with Zezhou Cheng
Abstract: Self-supervised learning has become a cornerstone of modern computer vision and multimodal foundation models, enabling scalable training on massive unlabeled data and driving recent advances in perception for autonomous agents. However, despite impressive performance on high-level semantic recognition, these models remain brittle in dynamic, open-world physical environments that demand robust spatial and temporal reasoning.
A key limitation is that spatiotemporal understanding is typically learned only indirectly from objectives targeting semantic recognition, rather than being treated as a foundational visual capability. In this talk, I will present our recent efforts to rethink self-supervised representation learning for vision, including:
(1) a comprehensive benchmark of self-supervised learning methods across diverse non-semantic, mid-level vision tasks;
(2) a scalable framework that directly learns 3D geometry and scene dynamics from large-scale, in-the-wild video data; and
(3) a method for learning 3D point cloud representations without any human-created 3D shapes.
Bio: Zezhou Cheng is an Assistant Professor of Computer Science at the University of Virginia. His research interests include computer vision, machine learning, and their applications to ecology, material discovery, VR/AR, and autonomous vehicles. His work has been recognized with awards including the Best Synthesis Award from the Computer Science Department at UMass Amherst in 2020 and the Best Poster Award at the New England Computer Vision Workshop in 2019. He has also served on the program committees of major computer vision conferences and was recognized as an Outstanding Reviewer at CVPR 2021.
3-4 p.m.; Learning Physically and Cognitively Grounded Robotic Assistants with Yen-Ling Kuo
Abstract: For robotic assistants to be useful and helpful in our homes or on the road, they must do more than execute motor commands. They must autonomously plan and reason about tasks, recognize human needs, and offer proactive assistance across diverse scenarios. While these physical and social interactions seem effortless for humans, they remain very difficult for robots. In this talk, inspired by how humans understand and interact with the physical and social world, I will explore how robots can bridge this gap. First, I will demonstrate how integrating multiple modalities (e.g., vision, tactile sensing, and force control) can enable efficient and robust manipulation of objects. Second, I will show how robots can employ Theory of Mind reasoning to ground human behaviors, enabling diverse and effective social interaction. I will conclude by discussing how these learned abilities can augment human capabilities to create more useful and helpful robotic assistants.
Bio: Yen-Ling Kuo is an Assistant Professor in Computer Science and a member of the Link Lab at the University of Virginia. Her research interests lie at the intersection of artificial intelligence and cognitive science, with a focus on integrating them into robotic systems. Her work develops machine learning models that provide robots with generalizable reasoning skills to interact with humans, including language understanding, social interaction, and common-sense reasoning.