ICDM 2022 Workshop on Foundation Models in Vision and Language

Overview

State-of-the-art AI systems can learn directly from whatever information they perceive, without relying on heavily labeled data sets for guidance. Such easy-to-collect data provide a more flexible form of supervision and a more affordable solution to data scalability. By training large deep neural networks on such heterogeneous data, recent foundation models have shown great promise in generality and usability. For example:

Language: BERT, GPT family
Vision: SimCLR, MAE
Vision-and-Language: CLIP, ALIGN and DALLE

One appealing property of these foundation models is their astonishing performance on zero-shot and few-shot adaptation to various new real-world tasks. We organize this "Foundation Models in Vision and Language (FOMO-VL)" workshop to bring together academic and industry communities working on foundation models to solve real-world problems, focusing on the challenge of building scalable AI models that can learn from heterogeneous data to gain generalized task-level transfer ability. This year, our FOMO-VL workshop will be held (tentatively in a hybrid mode) in conjunction with ICDM 2022, Orlando, FL, USA.

Important Dates

Workshop paper submission deadline: October 10, 2022

Workshop paper acceptance decision to authors: October 13, 2022

Workshop dates: November 28, 2022

How to Submit

Please submit your papers to the Online Submission System. Please refer to the Call for Papers for details on the topics and other related information. Thanks to the support of Amazon, cash awards will be given to the best papers. We look forward to your excellent work!

Invited Speakers

Danqi Chen
Princeton University

[Bio & Talk Info]

Danqi Chen is an Assistant Professor of Computer Science at Princeton University and co-leads the Princeton NLP Group. Her recent research focuses on training, adapting, and understanding large language models, and developing scalable and generalizable NLP systems for question answering, information extraction, and conversational agents. Before joining Princeton, Danqi worked as a visiting scientist at Facebook AI Research. She received her Ph.D. from Stanford University (2018) and B.E. from Tsinghua University (2012), both in Computer Science. Danqi is a recipient of a Sloan Fellowship, a Samsung AI Researcher of the Year award, outstanding paper awards from ACL 2016, EMNLP 2017, and ACL 2022, and multiple research awards from industry.

Talk Title: Building Language Models Based on Retrieval

Abstract: Large language models (LLMs) have utterly transformed the field of natural language processing. However, training LLMs comes at a massive financial and environmental cost, putting them out of reach of academic research labs. Meanwhile, these models are costly to update and prone to leaking private text data. In this talk, I will argue that retrieval-based language models are a promising way of scaling LMs and overcoming the above limitations. I will discuss recent developments in retrieval-based language models, compare their pros and cons, and show their benefits in interpretability, adaptability, and privacy. In particular, I will introduce a new training approach for retrieval-based language models called TRIME (TRaining with In-batch MEmories), which can train LMs to retrieve better from the text during inference.

Xifeng Yan
UC at Santa Barbara

[Bio & Talk Info] [Slides]

Xifeng Yan is a professor at the University of California at Santa Barbara. He holds the Venkatesh Narayanamurti Chair of Computer Science. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2006. He was a research staff member at the IBM T. J. Watson Research Center between 2006 and 2008. His work is centered on knowledge discovery, knowledge bases, and conversational AI. His contributions can be found in data mining, database systems, natural language processing, and their applications in interdisciplinary areas. His work has been extensively referenced, with over 23,000 citations per Google Scholar and thousands of software downloads. He received the NSF CAREER Award, the IBM Invention Achievement Award, the ACM-SIGMOD Dissertation Runner-Up Award, the IEEE ICDM 10-year Highest Impact Paper Award, the 2022 PLDI Distinguished Paper Award, and the 2022 VLDB Test of Time Award.

Talk Title: Limitations of Language Models in Arithmetic Induction

Abstract: Recent work has shown that large pretrained Language Models (LMs) can not only perform remarkably well on a range of Natural Language Processing (NLP) tasks but also start improving on reasoning tasks. Surprisingly, we find that these models have limitations on certain basic symbolic manipulation tasks such as copy, reverse, and addition. When the total number of symbols or repeating symbols increases, the model performance drops quickly. We investigate the potential causes behind this phenomenon and examine a set of possible methods, including explicit positional markers, fine-grained computation steps, and LMs with callable programs. Experimental results show that none of these techniques can solve the simplest addition induction problem completely. In the end, we introduce LMs with tutor, which demonstrates every single step of teaching. By limiting the types of operations it can conduct, LMs with tutor is able to deliver 100% accuracy in out-of-distribution (OOD) situations for simple tasks, shedding new light on the boundary of large LMs in induction.

Tengyu Ma
Stanford University

[Bio & Talk Info] [Slides]

Tengyu Ma is an assistant professor of Computer Science and Statistics at Stanford University. He received his Ph.D. from Princeton University and B.E. from Tsinghua University. His research interests include topics in machine learning and algorithms, such as deep learning and its theory, non-convex optimization, deep reinforcement learning, representation learning, and high-dimensional statistics. He is a recipient of the ACM Doctoral Dissertation Award Honorable Mention, the Sloan Fellowship, and the NSF CAREER Award.

Talk Title: Toward understanding foundation models

Abstract: AI is undergoing a paradigm shift with the rise of models pre-trained with self-supervision and then adapted to a wide range of downstream tasks. However, their workings largely remain a mystery; classical learning theory cannot explain why pre-training on an unsupervised task can help many different downstream tasks. This talk will first investigate the role of pre-training losses in extracting meaningful structural information from unlabeled data, especially in the infinite data regime. Concretely, I will show that the contrastive loss can give rise to embeddings whose Euclidean distance captures the manifold distance between raw data (or, more generally, the graph distance of a so-called positive-pair graph). Then, I will discuss two other elements that seem necessary for a sharp explanation of the behavior of practical pre-trained models: inductive bias of architectures and implicit bias of optimizers. I will introduce a recent project that demonstrates the implicit bias of optimizers in pre-training, even with infinite pre-training data, empirically and theoretically.

Letitia Parcalabescu
University of Heidelberg

[Bio & Talk Info] [Slides]

Letitia Parcalabescu studied Physics and Computer Science and is now a PhD candidate at Heidelberg University in the Heidelberg Natural Language Processing Group. Her research focuses on vision and language integration in multimodal machine learning. Her side project revolves around the "AI Coffee Break with Letitia" YouTube channel, where the animated Ms. Coffee Bean explains and visualizes concepts from the latest natural language processing, computer vision and multimodal research.

Talk Title: VALSE: Phenomenon-centered testing of Vision and Language models

Abstract: In this talk, I will introduce Vision and Language (VL) models by (a) explaining what pretrained VL models are, (b) giving a short overview of the architecture of VL models, and (c) stating their contributions towards vision and language integration with neural networks. I will present our recent work on the new VALSE benchmark to assess the degree of success of VL integration. VL models are usually evaluated on tasks such as image-sentence alignment or visual question answering. While performance on these tasks is important, task-centered evaluation does not disentangle the fine-grained linguistic capabilities of VL models. By contrast, VALSE comprises a suite of six specific linguistic phenomena grounded in the visual modality. Our zero-shot experiments for five widely used pretrained VL models on VALSE (CLIP, LXMERT, VisualBERT, ViLBERT and ViLBERT 12-in-1) suggest that current VL models have considerable difficulty addressing most phenomena.

Jason Baldridge
Google Brain

[Bio & Talk Info] [Slides]

Jason Baldridge is a research scientist at Google, where he works on grounded language understanding. He was previously an Associate Professor of Computational Linguistics at the University of Texas at Austin. His most recent work has focused on vision-and-language navigation, text-and-image representation learning and generating images from language descriptions. Jason received his Ph.D. from the University of Edinburgh in 2002, where his doctoral dissertation on Multimodal Combinatory Categorial Grammar was awarded the 2003 Beth Dissertation Prize from the European Association for Logic, Language and Information.

Talk Title: What's Missing in Text-to-Image Generation? Current Models and Paths Forward

Abstract: Text-to-image models have made quite remarkable and surprising progress in just the past year. I'll provide some broader context for research in this area, focusing especially on WordsEye (from 2000) and Google's recent Parti model. I'll highlight important failure modes of modern models in accurately depicting detailed textual descriptions and argue we need to be more precise in defining and developing description-to-depiction tasks as we consider data and modeling strategies that will allow the next generation of models to better represent descriptions and reflect them visually.

Lu Yuan
Microsoft Cloud and AI

[Bio & Talk Info] [Slides]

Lu Yuan is a Principal Research Manager in Cognitive Services Research at Microsoft Cloud and AI, Bellevue, WA. He received his MS degree from the Department of Electronic Engineering at Tsinghua University in 2005 and his Ph.D. degree from the Department of Computer Science & Engineering at the Hong Kong University of Science and Technology (HKUST) in 2009. He then joined the Visual Computing Group at Microsoft Research Asia in August 2009. His current research interests include computer vision, large-scale pretraining, human understanding and computational photography.

Talk Title: Florence: A New Computer Vision Foundation Model

Abstract: In this talk, I will introduce Florence, a new computer vision foundation model, which is trained on a noisy web-scale dataset and demonstrates outstanding performance in a wide range of diverse downstream tasks and many types of transfer learning. This capability is critical for solving real-world computer vision applications. To develop the foundation model, we incorporate the latest hierarchical transformer architecture as encoders, and leverage unified contrastive learning to learn a good image-text embedding, which enables generic image understanding. The purpose of building a foundation model is not solely to improve performance in closed-world recognition tasks. Most importantly, it pushes perception toward the open world and cognition. I will showcase how Florence enables open vocabulary recognition, self-evolving learning, and reasoning as humans do. Florence is designed to pave the way for building vision foundation models to power millions of real-world vision tasks and applications. In addition, this preliminary progress may motivate more research toward breakthroughs in integrative AI.

Jiasen Lu
Allen Institute for Artificial Intelligence

[Bio & Talk Info] [Slides]

Jiasen Lu is a research scientist in Prior at the Allen Institute for AI, Seattle. His research interests span computer vision and vision & language. Prior to joining AI2 in February 2020, he earned his Ph.D. in Computer Science from the School of Interactive Computing at Georgia Tech with the thesis “Visually Grounded Language understanding and Generation”.

Talk Title: Unified-IO: A Unified Model for Vision, Language and Multi-Modal Tasks

Abstract: In this talk, I will present Unified-IO, which is the first neural model to perform a large and diverse set of AI tasks spanning classical computer vision, image synthesis, vision-and-language, and natural language processing (NLP). Unified-IO achieves this broad unification by homogenizing every task's input and output into a sequence of tokens drawn from a discrete and finite vocabulary.

Justin Lin
DAMO Academy, Alibaba Group

[Bio & Talk Info] [Slides]

Junyang Lin is a staff engineer at DAMO Academy, Alibaba Group. He received his master's degree at Peking University in 2019. His research interests include multimodal representation learning, natural language processing, and computer vision, with a focus on large-scale pretraining. He has been in charge of developing large-scale unified multimodal pretrained models, including M6, OFA, etc., and also the real-world applications of foundation models.

Talk Title: Towards a One-For-All System for Multimodal Multitask Learning

Abstract: In recent years, mirroring large language models, multimodal pretraining that incorporates multimodal representation learning and large-scale pretraining has been developing rapidly and creating a series of new state-of-the-art performances in vision-language downstream tasks and even vision tasks. A trend this year is the unified models that enable multimodal multitask learning. They achieve both the unification of tasks as well as modalities and outstanding performance. This talk first reviews the developments of multimodal pretraining, and then provides an introduction to the One-For-All (OFA) model for vision-language pretraining and downstream transfer, and also a recently introduced framework OFASys that facilitates multimodal multitask learning with a simple interface. We hope this can promote the research in building generalist models.

Advisory Committee -- Panelists

Jianfeng Gao
Microsoft Research, Redmond

Trishul Chilimbi
Amazon

Ruslan Salakhutdinov
Carnegie Mellon University

Christoph Schuhmann
LAION

Ludwig Schmidt
University of Washington, OpenCLIP

Mohammad Norouzi
Google Brain

Accepted Papers

Convolutional K-Nearest Neighbors: An End-to-End Hybrid Approach
Carlos Guzman, Thomas Florian, Dadetyn Wood, Jacob Peterson, and Jacob Furst

A New Design of VQA System based on Weighted Contextual Features
Yu Wang, Vijay Srinivasan, and Hongxia Jin

What Do All Audio Transformer Models Hear? Probing Acoustic Representations For Language Delivery And Structure
Yaman Kumar, Jui Shah, Rajiv Ratn Shah, and Changyou Chen

Join-Chain Network: A Logical Reasoning View of Multi-head Attention in Transformer
Jianyi Zhang, Yiran Chen, and Jianshu Chen

A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang, Feiyang Niu, Qing Ping, and Govind Thattai

STT: Soft Template Tuning for Few-Shot Adaptation
Ping Yu, Wei Wang, Chunyuan Li, Ruiyi Zhang, Zhanpeng Jin, and Changyou Chen

Zero-shot Object Detection Through Vision-Language Embedding Alignment
Johnathan Xie and Shuai Zheng
