no code implementations • 30 Apr 2024 • Arsalan SharifNassab, Sina Ghiassian, Saber Salehkaleybar, Surya Kanoria, Dale Schuurmans
We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as Large Language Models (LLMs), with human preferences, without the need for a reward model.
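The abstract describes alignment from pairwise preferences without training a separate reward model. As a hedged illustration (not the paper's actual SPO objective), the sketch below shows the general shape of such a reward-model-free preference loss: the model's own sequence log-likelihoods for a preferred and a dispreferred response are compared directly, and the loss is the negative log-probability that the preferred response wins under a softmax over the two; the function name, the `beta` scale, and the simple two-response logistic form are all illustrative assumptions.

```python
import math

def preference_loss(logp_chosen, logp_rejected, beta=1.0):
    """Illustrative reward-model-free preference loss (NOT the exact SPO loss).

    logp_chosen / logp_rejected: the policy's log-likelihoods of the
    human-preferred and dispreferred responses. No reward model appears:
    the objective is computed directly from the generative model's own
    probabilities.
    """
    margin = beta * (logp_chosen - logp_rejected)
    # -log sigmoid(margin): the negative log-probability that the
    # preferred response wins a softmax over the two responses.
    return math.log1p(math.exp(-margin))
```

A model that assigns higher likelihood to the preferred response incurs lower loss, which is the basic mechanism shared by this family of direct-alignment methods.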
no code implementations • 13 Oct 2023 • Federico Tomasi, Joseph Cauteruccio, Surya Kanoria, Kamil Ciosek, Matteo Rinaldi, Zhenwen Dai
In this paper, we present a reinforcement learning framework that addresses these limitations by directly optimizing user satisfaction metrics through a simulated playlist-generation environment.
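The key idea in this abstract is training a policy against a simulator that emits user-satisfaction rewards instead of logged data. The toy sketch below shows that loop in its simplest form; the simulator, the epsilon-greedy action-value policy, and all names are illustrative assumptions, not the paper's architecture.

```python
import random

def simulated_satisfaction(track, user_pref):
    # Toy stand-in for a simulated playlist environment: reward is a
    # user-satisfaction signal, here 1.0 when the chosen track matches
    # the simulated user's preference and 0.0 otherwise.
    return 1.0 if track == user_pref else 0.0

def train_policy(tracks, user_pref, episodes=2000, eps=0.1, seed=0):
    """Epsilon-greedy RL loop that optimizes the simulated satisfaction
    metric directly (a minimal sketch of the training setup, not the
    paper's agent)."""
    rng = random.Random(seed)
    value = {t: 0.0 for t in tracks}   # estimated satisfaction per track
    count = {t: 0 for t in tracks}
    for _ in range(episodes):
        if rng.random() < eps:
            t = rng.choice(tracks)                    # explore
        else:
            t = max(tracks, key=lambda x: value[x])   # exploit
        r = simulated_satisfaction(t, user_pref)
        count[t] += 1
        value[t] += (r - value[t]) / count[t]  # incremental mean update
    return max(tracks, key=lambda x: value[x])
```

Because rewards come from the simulator, the policy can be optimized for satisfaction metrics that are hard to supervise from static playlists.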
1 code implementation • Findings (ACL) 2022 • Samuel Carton, Surya Kanoria, Chenhao Tan
Learning from rationales seeks to improve model prediction accuracy using human-annotated rationales (i.e., subsets of input tokens) that justify the chosen labels, often in the form of intermediate or multitask supervision.