The First International Workshop on
Retrieval-Enhanced Machine Learning
(REML @ SIGIR 2023)
July 27th, 2023, Taipei, Taiwan. Co-located with SIGIR 2023.

About the Workshop

Most machine learning models are designed to be self-contained, encoding both “knowledge” and “reasoning” in their parameters. However, such models cannot perform effectively on tasks that require knowledge grounding or that deal with non-stationary data, such as news and social media. Moreover, these models often require a huge number of parameters to encode all of the required knowledge. These issues can be addressed by augmenting the model with a retrieval model. This category of models, called retrieval-enhanced machine learning (REML), has recently attracted considerable attention in multiple research communities. For instance, REML models have been studied in the context of open-domain question answering, fact verification, and dialogue systems, and also in the context of generalization through memorization in language models and memory networks. We believe that the information retrieval community can significantly contribute to this growing research area by designing, implementing, analyzing, and evaluating various aspects of retrieval models with applications to REML tasks. The goal of this full-day hybrid workshop is to bring together researchers from industry and academia to discuss various aspects of retrieval-enhanced machine learning, including the effectiveness, efficiency, and robustness of these models, in addition to their impact on real-world applications.
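
To make the general pattern concrete, the following is a minimal sketch in Python of a retrieval-enhanced prediction loop. The toy corpus, the term-overlap scoring, and the `generate` callable (standing in for any parametric model) are illustrative assumptions, not part of any specific REML system discussed at the workshop.

```python
# Minimal sketch of retrieval-enhanced prediction: retrieve evidence from an
# external corpus and condition the (parametric) model on it. Toy example only.
from collections import Counter
from typing import Callable, List

def score(query: str, doc: str) -> float:
    """Term-overlap score; a real system would use BM25 or a dense retriever."""
    q_terms, d_terms = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q_terms[t], d_terms[t]) for t in q_terms)

def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Return the top-k documents for the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def retrieval_enhanced_predict(query: str, corpus: List[str],
                               generate: Callable[[str], str]) -> str:
    """Ground the prediction in retrieved evidence instead of relying solely on
    knowledge stored in the model's parameters."""
    evidence = retrieve(query, corpus)
    prompt = "\n".join(evidence) + "\n\nQuestion: " + query
    return generate(prompt)

if __name__ == "__main__":
    corpus = [
        "SIGIR 2023 is held in Taipei, Taiwan.",
        "The REML workshop is co-located with SIGIR 2023.",
        "Retrieval models can ground predictions in external knowledge.",
    ]
    # Stand-in for a real generator; here it just echoes the top retrieved line.
    echo = lambda prompt: prompt.splitlines()[0]
    print(retrieval_enhanced_predict("Where is SIGIR 2023 held?", corpus, echo))
```

The key design point the sketch illustrates is the decoupling: the corpus can be updated or swapped without touching the model that consumes the retrieved evidence.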

Motivation:

The vast majority of machine learning (ML) systems are designed to be self-contained, with both knowledge and reasoning encoded in model parameters. Such systems suffer from a number of major shortcomings that can be fully or partially addressed if the models have access to efficient and effective retrieval models:

  • Knowledge grounding: A number of important real-world problems, often called knowledge-intensive tasks, require access to external knowledge; examples include open-domain question answering, task-oriented dialogue, and fact checking. ML systems that make predictions based solely on the data observed during training fall short on such tasks. In addition, ML models applied to non-stationary domains, such as news or social media, can significantly benefit from access to fresh data. An information retrieval (IR) system can decouple reasoning from knowledge, allowing the knowledge to be maintained and updated independently of the model parameters, at a cadence aligned with the corpus.
  • Generalization: Recent work has shown that many ML models can significantly benefit from retrieval augmentation. For example, enhancing generative models such as language models and dialogue systems with retrieval has been shown to substantially improve their generalization (see the sketch following this list).
  • Significant growth in model parameters: Since all of the information required for making predictions is often encoded in a model's parameters, increasing model capacity by adding parameters generally leads to higher accuracy. For example, the number of parameters in language models has grown from 94 million in ELMo to 1.6 trillion in Switch Transformers, a more than 16,000x increase in just three years (2018 -- 2021). Despite these successes, improving performance by increasing the number of model parameters incurs significant cost and limits access to the handful of organizations that have the resources to train such models. As such, this approach is neither scalable nor sustainable in the long run, and providing access to a scalable external collection (or memory) can potentially mitigate this issue.
  • Interpretability and explainability: Because the knowledge in training data is encoded in learned model parameters, explanations of model predictions often appeal to abstract and difficult-to-interpret distributed representations. By grounding inference on retrieved information, predictions can more easily be traced to specific data, often stored in a human-readable format such as text.
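
The generalization bullet above can be made concrete with a toy nearest-neighbor language model. The sketch below is a minimal illustration, assuming NumPy, toy context vectors, and illustrative hyperparameters (`k`, `temp`, `lam`); it follows the spirit of kNN-style augmentation rather than any particular published implementation: the parametric next-token distribution is interpolated with a distribution induced by neighbors retrieved from a datastore.

```python
# Toy sketch of "generalization through memorization": mix the parametric
# next-token distribution with one induced by nearest-neighbor retrieval over
# stored (context vector, next token) pairs. All values below are illustrative.
import numpy as np

def knn_distribution(query_vec, keys, next_tokens, vocab_size, k=2, temp=1.0):
    """Turn the k nearest datastore entries into a distribution over the vocabulary."""
    dists = np.linalg.norm(keys - query_vec, axis=1)   # distance to each stored context
    nearest = np.argsort(dists)[:k]                    # indices of the k closest contexts
    weights = np.exp(-dists[nearest] / temp)           # closer contexts get more weight
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for w, idx in zip(weights, nearest):
        p[next_tokens[idx]] += w                       # mass goes to the stored next token
    return p

def interpolate(p_lm, p_knn, lam=0.25):
    """Mixture: p(w|x) = lam * p_kNN(w|x) + (1 - lam) * p_LM(w|x)."""
    return lam * p_knn + (1.0 - lam) * p_lm

if __name__ == "__main__":
    vocab_size = 5
    p_lm = np.array([0.1, 0.2, 0.4, 0.2, 0.1])             # parametric model's prediction
    keys = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])  # stored context representations
    next_tokens = np.array([3, 1, 3])                       # token that followed each context
    query_vec = np.array([0.85, 0.15])                      # representation of current context
    p_knn = knn_distribution(query_vec, keys, next_tokens, vocab_size)
    print(interpolate(p_lm, p_knn))
```

Because the mixing happens at the output distribution, the base model stays frozen; updating the datastore requires no retraining, which is also what makes this style of augmentation attractive for non-stationary data.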

Recent research has demonstrated that errors in many REML models are mostly due to failures of the retrieval model rather than of the augmented machine learning model, confirming the well-known "garbage in, garbage out" phenomenon. We believe that the expertise of the IR research community is pivotal for further progress on REML models. This workshop therefore takes a fresh perspective on retrieval-enhanced machine learning through an information retrieval lens. For more information, refer to the following perspective paper:

H. Zamani, F. Diaz, M. Dehghani, D. Metzler, M. Bendersky. "Retrieval-Enhanced Machine Learning". In Proc. of SIGIR 2022.

Theme and Scope:

The workshop will focus on models, techniques, data collections, and evaluation methodologies for various retrieval-enhanced machine learning problems. These include but are not limited to:

  • Effectiveness and/or efficiency of retrieval models for knowledge grounding, e.g., for open-domain question answering, dialogue systems, fact verification, and information extraction.
  • Effectiveness and/or efficiency of retrieval models for generalization through memorization, e.g., nearest neighbor language models.
  • Effectiveness and/or efficiency of retrieval models for memory networks.
  • Effectiveness and/or efficiency of retrieval models for retrieval-augmented representation learning.
  • Retrieval-enhanced optimization.
  • Retrieval-enhanced domain adaptation.
  • Retrieval-enhanced models for multi-media and multi-modal learning.
  • Query generation for retrieval-enhanced models.
  • Retrieval result utilization by machine learning models.
  • Interactive retrieval-enhanced machine learning models.
  • Retrieval-enhanced models for non-stationary data, such as news, social media, etc.

Call for Papers

Submissions must be in English, in PDF format, and in the current ACM two-column conference format. Suitable LaTeX, Word, and Overleaf templates are available from the ACM website (use the “sigconf” proceedings template for LaTeX and the Interim Template for Word). Submissions may consist of up to 6 pages of content, plus unrestricted space for references and appendices. Papers may report original research or preliminary research results.

The review process is double-blind, and authors are required to take all reasonable steps to preserve the anonymity of their submission. The submission must not include author information and must not include citations or discussion of related work that would make the authorship apparent. However, it is acceptable to refer to companies or organizations that provided datasets, hosted experiments, or deployed solutions, as long as there is no implication that the authors are currently affiliated with these organizations.

Papers will be evaluated according to their significance, originality, technical content, style, clarity, relevance to the workshop, and likelihood of generating discussion. Changes to the author list after the submission deadline are not allowed. At least one author of each accepted paper is required to register for, attend, and present the work at the workshop (either in person or remotely). All papers must be submitted via EasyChair.

Papers presented at the workshop must be uploaded to arXiv.org after the acceptance notification, but they are considered non-archival and may be submitted elsewhere (modified or not); the workshop site will maintain links to the arXiv versions. Articles currently under consideration at other venues may therefore be submitted to REML, provided this is permitted by the submission policies of those venues. This makes the workshop a forum for the presentation and discussion of current work without preventing that work from being published elsewhere.

Important Dates:

  • Submission deadline: May 2, 2023 (extended from April 25, 2023)
  • Paper notifications: June 14, 2023 (extended from May 23, 2023)
  • Camera-ready deadline: June 30, 2023 (extended from June 23, 2023)
  • Workshop Day: July 27, 2023

Program

9:00 - 9:30 Opening and Overview of REML Research
9:30 - 10:30 Keynote by Mohit Iyyer [Link]
10:30 - 11:00 Coffee break
11:00 - 12:30 Joint Poster Session with the Gen-IR and ReNeuIR Workshops
12:30 - 13:30 Lunch
13:30 - 15:00 Invited Talks
      Invited Talk by Akari Asai (University of Washington)
      Invited Talk by Omar Khattab (Stanford University)
      Invited Talk by Laura Dietz (University of New Hampshire)
      Invited Talk by Anirudh Goyal (University of Montreal)
15:00 - 15:30 Coffee break
15:30 - 17:00 Panel Discussion and Closing
      Panelists: Jiafeng Guo, Junxian He, Yoav Levine, and Hamed Zamani.
      Moderator: Fernando Diaz.

Keynote

How can retrieval aid long-form text generation?

Mohit Iyyer

Large language models (LLMs) encode huge amounts of knowledge about the world into their parameters. However, this parametric knowledge is often insufficient to generate accurate and specific information about arbitrary user-selected topics. The emerging area of retrieval-augmented language models thus aims to give LLMs access to external data, usually in the form of a vector database. There are many ways to integrate retrieved data into an LLM for text generation, from late-fusion inference-only methods such as kNN-LM to end-to-end trainable systems such as RETRO. How well do these methods improve long-form text generation, in which a model must generate paragraph-length responses to user queries? In this talk, I first provide an overview of the modeling and evaluation challenges associated with retrieval-augmented long-form question answering. Next, I show that methods such as the kNN-LM may not actually improve the generation quality of language models, again highlighting the challenge of evaluation for future research. Finally, I switch gears and discuss another usage of retrieval entirely: detecting whether a piece of long-form text has been generated by an LLM.
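
The last point of the abstract mentions retrieval as a detection tool. One way retrieval could be used for such detection is sketched below, under the assumption that the text provider keeps a datastore of its own past generations; the bag-of-words cosine similarity and the 0.7 threshold are illustrative simplifications, not the method presented in the talk.

```python
# Hedged sketch of retrieval-based detection of LLM-generated text: the provider
# stores every generation it produces, and a candidate passage is flagged if it
# is highly similar to some stored generation. Toy similarity and threshold.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class GenerationStore:
    """Provider-side datastore of previously generated passages."""
    def __init__(self):
        self.generations = []

    def add(self, text: str) -> None:
        self.generations.append(Counter(text.lower().split()))

    def looks_generated(self, candidate: str, threshold: float = 0.7) -> bool:
        """Retrieve against stored generations and flag near-duplicates."""
        cand = Counter(candidate.lower().split())
        return any(cosine(cand, g) >= threshold for g in self.generations)

if __name__ == "__main__":
    store = GenerationStore()
    store.add("The REML workshop brings together researchers on retrieval-enhanced models.")
    print(store.looks_generated(
        "The REML workshop brings together researchers on retrieval-enhanced models."))  # True
    print(store.looks_generated("Taipei hosts SIGIR in July."))  # False
```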

Mohit Iyyer is an associate professor in computer science at the University of Massachusetts Amherst. His research focuses broadly on designing machine learning models for long-form language generation (e.g., for story generation and machine translation), and his group also works on tasks involving creative language understanding (e.g., modeling fictional narratives and characters). He is the recipient of best paper awards at NAACL (2016, 2018), an outstanding paper award at EACL 2023, and a best demo award at NeurIPS 2015, and he received the 2022 Samsung AI Researcher of the Year award. He received his PhD in computer science from the University of Maryland, College Park in 2017 and spent the following year as a researcher at the Allen Institute for Artificial Intelligence.

Accepted Papers

  • Retrieving Supporting Evidence for LLMs Generated Answers, Siqing Huo, Negar Arabzadeh, and Charles Clarke (University of Waterloo). [Link]

  • Citations as Queries: Source Attribution Using Language Models as Rerankers, Ryan Muther and David Smith (Northeastern University). [Link]

  • Resources and Evaluations for Multi-Distribution Dense Information Retrieval, Soumya Chatterjee, Omar Khattab, and Simran Arora (Stanford University). [Link]

  • LaMP: When Large Language Models Meet Personalization, Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani (University of Massachusetts Amherst & Google Research). [Link]

Organizers

  • Michael Bendersky, Google Research
  • Danqi Chen, Princeton University
  • Fernando Diaz, Google Research
  • Hamed Zamani, University of Massachusetts Amherst

Program Committee:

  • Qingyao Ai, Tsinghua University, China
  • Uri Alon, Carnegie Mellon University, USA
  • Zhuyun Dai, Google, USA
  • Andrew Drozdov, University of Massachusetts Amherst, USA
  • Claudia Hauff, TU Delft & Spotify, The Netherlands
  • Jinhyuk Lee, Korea University & Google DeepMind, South Korea
  • Don Metzler, Google, USA
  • Sewon Min, University of Washington, USA
  • Rodrigo Nogueira, University of Campinas and NeuralMind, Brazil
  • Andrew Yates, University of Amsterdam, The Netherlands
  • Mingyang Zhang, Google, USA
  • Zexuan Zhong, Princeton University, USA

Sponsors