Vision through Language: Towards Open-world Recognition
The Department of Information Science and Media Studies and the I2S research group welcome you to this seminar with Paolo Rota of the University of Trento.
The rapid evolution of vision-language models is transforming the landscape of image and video understanding, going beyond traditional classification and localization paradigms. We will explore two recent methodologies that challenge the conventional reliance on predefined vocabularies and training data. The first part of the talk introduces Vocabulary-Free Image Classification (VIC), a novel approach that assigns classes to images without the constraints of a fixed vocabulary. We will delve into the challenges of operating within an unconstrained semantic space containing millions of concepts and present Category Search from External Databases (CaSED), a training-free method that leverages external vision-language databases for efficient and accurate classification. In the second part, we will shift focus to Test-Time Zero-Shot Temporal Action Localization (ZS-TAL), which tackles the problem of identifying and locating unseen actions in untrimmed videos without the need for annotated training data. We will introduce the Test-Time adaptation for Temporal Action Localization (T3AL) approach, which adapts a pre-trained Vision and Language Model (VLM) to perform action localization in a self-supervised manner, significantly improving generalization across diverse video domains. Finally, we will show how LLMs can act as orchestrators that solve research problems autonomously through visual programming.
Paolo is an assistant professor at the Center for Mind and Brain (CIMeC) at the University of Trento. He received his Ph.D. from the same university and has worked as a postdoctoral Marie Curie fellow at TU Wien and as a postdoc at the Istituto Italiano di Tecnologia in Genoa. He also worked as an ML researcher at the ProM Facility in Rovereto. He has been an assistant professor at the University of Trento since 2019 and started his tenure track in 2022. His research focuses on image and video classification using vision and language.