The First International Workshop on
Retrieval-Enhanced Machine Learning
(REML @ SIGIR 2023)
July 27th, 2023, Taipei, Taiwan. Co-located with SIGIR 2023.

About the Workshop

Most machine learning models are designed to be self-contained, encoding both “knowledge” and “reasoning” in their parameters. However, such models cannot perform effectively on tasks that require knowledge grounding or that deal with non-stationary data, such as news and social media. Moreover, these models often require a huge number of parameters to encode all the required knowledge. These issues can be addressed by augmenting the model with a retrieval component. This category of machine learning models, called retrieval-enhanced machine learning (REML), has recently attracted considerable attention in multiple research communities. For instance, REML models have been studied in the context of open-domain question answering, fact verification, and dialogue systems, as well as in the context of generalization through memorization in language models and memory networks. We believe that the information retrieval community can significantly contribute to this growing research area by designing, implementing, analyzing, and evaluating various aspects of retrieval models with applications to REML tasks. The goal of this full-day hybrid workshop is to bring together researchers from industry and academia to discuss various aspects of retrieval-enhanced machine learning, including the effectiveness, efficiency, and robustness of these models, in addition to their impact on real-world applications.

Motivation:

The vast majority of machine learning (ML) systems are designed to be self-contained, with both knowledge and reasoning encoded in model parameters. Such systems suffer from a number of major shortcomings that can be fully or partially addressed if machine learning models have access to efficient and effective retrieval models:

  • Knowledge grounding: A number of important real-world problems, often called knowledge-intensive tasks, require access to external knowledge. They include (among others) open-domain question answering, task-oriented dialogues, and fact checking. Therefore, ML systems that make predictions solely based on the data observed during training fall short when dealing with knowledge-intensive tasks. In addition, ML models applied to non-stationary domains, such as news or social media, can significantly benefit from accessing fresh data. An information retrieval (IR) system can decouple reasoning from knowledge, allowing the knowledge to be maintained and updated independently of model parameters, at a cadence aligned with the corpus.
  • Generalization: Recent work has shown that many ML models can significantly benefit from retrieval augmentation. For example, enhancing generative ML models, such as language models and dialogue systems, with retrieval has a large impact on their generalization.
  • Significant growth in model parameters: Since all the required information for making predictions is often encoded in the ML models' parameters, increasing their capacity by increasing the number of parameters generally leads to higher accuracy. For example, the number of parameters used in LMs has increased from 94 million in ELMo to 1.6 trillion in Switch Transformers, an increase of over four orders of magnitude in just three years (2018 -- 2021). Despite these successes, improving performance by increasing the number of model parameters incurs significant cost and limits access to a handful of organizations that have the resources to train such models. As such, this approach is neither scalable nor sustainable in the long run, and providing access to a scalable large collection (or memory) can potentially mitigate this issue.
  • Interpretability and explainability: Because the knowledge in training data is encoded in learned model parameters, explanations of model predictions often appeal to abstract and difficult-to-interpret distributed representations. By grounding inference on retrieved information, predictions can more easily be traced to specific data, often stored in a human-readable format such as text.
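The decoupling of knowledge and reasoning described above can be sketched in a few lines: a retriever over an updatable corpus supplies evidence to a prediction step, so the corpus can be refreshed without retraining anything. This is a minimal illustrative sketch only; the names (`SimpleRetriever`, `answer`) and the term-overlap scoring are our own simplifications, not components of any actual REML system.

```python
# Minimal sketch of a retrieval-enhanced prediction loop.
# The retriever and the "reasoning" step are deliberately toy-sized.

from collections import Counter


class SimpleRetriever:
    """Term-overlap retriever over an updatable corpus (the knowledge store)."""

    def __init__(self, corpus):
        # The corpus can be replaced or extended at any time,
        # independently of the downstream model.
        self.corpus = list(corpus)

    def retrieve(self, query, k=2):
        q_terms = Counter(query.lower().split())
        scored = []
        for doc in self.corpus:
            overlap = sum((q_terms & Counter(doc.lower().split())).values())
            scored.append((overlap, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in scored[:k] if score > 0]


def answer(query, retriever):
    """'Reasoning' step: grounds the output in retrieved evidence."""
    evidence = retriever.retrieve(query)
    if not evidence:
        return "no supporting evidence found"
    return evidence[0]


corpus = [
    "SIGIR 2023 is held in Taipei, Taiwan.",
    "Retrieval-enhanced models query an external corpus at inference time.",
]
retriever = SimpleRetriever(corpus)
print(answer("Where is SIGIR 2023 held?", retriever))
```

In a real REML system the retriever would be a learned dense or sparse ranker and the prediction step a parametric model conditioned on the retrieved text, but the division of labor is the same: knowledge lives in the corpus, reasoning in the model.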

Recent research has demonstrated that errors in many REML models are mostly due to failures of the retrieval model rather than of the augmented machine learning model, confirming the well-known "garbage in, garbage out" phenomenon. We believe that the expertise of the IR research community is pivotal for further progress in REML models. This workshop therefore offers a fresh perspective on retrieval-enhanced machine learning through an information retrieval lens. For more information, refer to the following perspective paper:

H. Zamani, F. Diaz, M. Dehghani, D. Metzler, M. Bendersky. "Retrieval-Enhanced Machine Learning". In Proc. of SIGIR 2022.

Theme and Scope:

The workshop will focus on models, techniques, data collections, and evaluation methodologies for various retrieval-enhanced machine learning problems. These include but are not limited to:

  • Effectiveness and/or efficiency of retrieval models for knowledge grounding, e.g., for open-domain question answering, dialogue systems, fact verification, and information extraction.
  • Effectiveness and/or efficiency of retrieval models for generalization through memorization, e.g., nearest neighbor language models.
  • Effectiveness and/or efficiency of retrieval models for memory networks.
  • Effectiveness and/or efficiency of retrieval models for retrieval-augmented representation learning.
  • Retrieval-enhanced optimization.
  • Retrieval-enhanced domain adaptation.
  • Retrieval-enhanced models for multi-media and multi-modal learning.
  • Query generation for retrieval-enhanced models.
  • Retrieval result utilization by machine learning models.
  • Interactive retrieval-enhanced machine learning models.
  • Retrieval-enhanced models for non-stationary data, such as news, social media, etc.

Call for Papers

Submissions must be in English, in PDF format, and in the current ACM two-column conference format. Suitable LaTeX, Word, and Overleaf templates are available from the ACM Website (use the “sigconf” proceedings template for LaTeX and the Interim Template for Word). Submissions may consist of up to 6 pages of content, plus unrestricted space for appendices and references. Papers may report original research or preliminary research results.

The review process is double-blind. Authors are required to take all reasonable steps to preserve the anonymity of their submission. The submission must not include author information and must not include citations or discussion of related work that would make the authorship apparent. However, it is acceptable to refer to companies or organizations that provided datasets, hosted experiments, or deployed solutions, as long as there is no implication that the authors are currently affiliated with these organizations.

Papers will be evaluated according to their significance, originality, technical content, style, clarity, relevance to the workshop, and likelihood of generating discussion. Authors should note that changes to the author list after the submission deadline are not allowed. At least one author of each accepted paper is required to register for, attend, and present the work at the workshop (either in person or remotely). All papers are to be submitted via EasyChair.

Papers presented at the workshop will be required to be uploaded to arXiv.org after the acceptance notification, but will be considered non-archival and may be submitted elsewhere (modified or not), although the workshop site will maintain a link to the arXiv versions. Articles under consideration at other venues may therefore be submitted to REML, provided this is permitted by those venues' submission policies. This makes the workshop a forum for the presentation and discussion of current work, without preventing the work from being published elsewhere.

Important Dates:

  • Submission deadline: April 25, 2023
  • Paper notifications: May 23, 2023
  • Camera-ready deadline: June 23, 2023
  • Workshop Day: July 27, 2023

Program

Keynote

Will be announced shortly!

Accepted Papers


Organizers

Michael Bendersky

Google Research

Danqi Chen

Princeton University

Fernando Diaz

Google Research

Hamed Zamani

University of Massachusetts Amherst

Program Committee:

TBD.

Sponsors