no code implementations • 2 May 2024 • Hao Wang, Tetsuro Morimura, Ukyo Honda, Daisuke Kawahara
Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT).
1 code implementation • 22 Apr 2024 • Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu
This paper addresses the issue of text quality within the preference dataset by focusing on Direct Preference Optimization (DPO), an increasingly adopted reward-model-free RLHF method.
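For orientation, the standard DPO objective (Rafailov et al., 2023), which this line of work builds on, trains the policy $\pi_\theta$ directly on preference pairs without a separate reward model:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\pi_{\mathrm{ref}}$ is a frozen reference policy, $\beta$ controls the regularization strength, and $\sigma$ is the logistic sigmoid.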
1 code implementation • 1 Apr 2024 • Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at decoding time.
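A minimal sketch of BoN sampling, assuming hypothetical `generate` and `reward` callables that stand in for a language model sampler and a trained reward model (neither is from the paper's codebase):

```python
def best_of_n(prompt, generate, reward, n=16):
    """Sample n candidate responses, score each with the reward
    model, and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```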
1 code implementation • 31 Mar 2024 • Atsumoto Ohashi, Ukyo Honda, Tetsuro Morimura, Yuu Jinnai
Minimum Bayes-risk (MBR) decoding has recently gained renewed attention in text generation.
no code implementations • 6 Feb 2024 • Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra
Traditional approaches in offline reinforcement learning aim to learn the optimal policy that maximizes the cumulative reward, also known as the return.
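Here the return is the standard discounted cumulative reward,

$$G = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad \gamma \in [0, 1),$$

where $r_t$ is the reward at step $t$ and $\gamma$ is the discount factor.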
1 code implementation • 10 Jan 2024 • Yuu Jinnai, Ukyo Honda, Tetsuro Morimura, Peinan Zhang
We propose two variants of MBR, Diverse MBR (DMBR) and $k$-medoids MBR (KMBR), which generate a set of sentences with both high quality and diversity.
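A hedged sketch of the DMBR idea only: greedily trade expected utility (quality) against similarity to already-selected sentences (diversity). The paper's exact objective and optimization may differ, and `utility` is a hypothetical pairwise metric such as BLEU or BERTScore:

```python
def dmbr_like_select(hypotheses, references, utility, k=4, lam=0.5):
    """Greedily pick k hypotheses, rewarding average utility against
    the references and penalizing similarity to prior picks."""
    def quality(h):
        return sum(utility(h, r) for r in references) / len(references)
    selected, remaining = [], list(hypotheses)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda h: quality(h) - lam * sum(utility(h, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```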
1 code implementation • 9 Nov 2023 • Yuu Jinnai, Tetsuro Morimura, Ukyo Honda, Kaito Ariu, Kenshi Abe
MBR decoding selects, from a pool of hypotheses, the hypothesis with the least expected risk under a probability model according to a given utility function.
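A minimal sketch of plain MBR decoding under this definition; `utility` is again a hypothetical pairwise metric, and the pseudo-references are typically samples drawn from the same model:

```python
def mbr_decode(hypotheses, references, utility):
    """Return the hypothesis with the highest average utility against
    the pseudo-references (equivalently, the least expected risk)."""
    def expected_utility(h):
        return sum(utility(h, r) for r in references) / len(references)
    return max(hypotheses, key=expected_utility)
```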
no code implementations • 23 Oct 2023 • Satoshi Hayakawa, Tetsuro Morimura
Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks.
no code implementations • 25 Aug 2023 • Yuu Jinnai, Tetsuro Morimura, Ukyo Honda
To this end, we introduce Lookahead Beam Search (LBS), a multi-step lookahead search that optimizes the objective by considering a fixed number of future steps.
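A hedged sketch of the lookahead idea only, not of the paper's LBS algorithm: a prefix is scored by the best total log-probability reachable within `d` further steps, where `step` is a hypothetical function returning `(token, logprob)` continuations of a prefix:

```python
def lookahead_score(prefix, logp, step, d):
    """Score a prefix by the best cumulative log-probability found
    over up to d future expansion steps."""
    if d == 0:
        return logp
    continuations = step(prefix)
    if not continuations:
        return logp
    return max(
        lookahead_score(prefix + [tok], logp + lp, step, d - 1)
        for tok, lp in continuations
    )
```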
no code implementations • 13 Jul 2023 • Sho Shimoyama, Tetsuro Morimura, Kenshi Abe, Toda Takamichi, Yuta Tomomatsu, Masakazu Sugiyama, Asahi Hentona, Yuuki Azuma, Hirotaka Ninomiya
One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL).
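One common instance of this idea (shown for orientation; the paper's AL setup may differ) is a GAIL-style discriminator $D$ trained to separate demonstrated from policy-generated state-action pairs,

$$\max_{D} \; \mathbb{E}_{(s,a) \sim \pi_E}\!\left[\log D(s,a)\right] + \mathbb{E}_{(s,a) \sim \pi}\!\left[\log\!\left(1 - D(s,a)\right)\right],$$

after which the policy is trained with a reward derived from $D$, e.g. $r(s,a) = \log D(s,a)$.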
1 code implementation • 8 Jun 2023 • Riku Togashi, Tatsushi Oka, Naoto Ohsaka, Tetsuro Morimura
Excellent tail performance is crucial for modern machine learning tasks, such as algorithmic fairness, class imbalance, and risk-sensitive decision making, as it ensures the effective handling of challenging samples within a dataset.
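One standard way to make "tail performance" precise (whether this paper uses exactly this measure is not stated in the snippet) is the conditional value-at-risk of a loss $L$ at tail level $\alpha$ (Rockafellar and Uryasev):

$$\mathrm{CVaR}_\alpha(L) = \inf_{\rho \in \mathbb{R}} \left\{ \rho + \frac{1}{\alpha} \, \mathbb{E}\!\left[ (L - \rho)_+ \right] \right\},$$

i.e. the expected loss over the worst $\alpha$-fraction of samples.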
no code implementations • 2 Jun 2022 • Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, Peinan Zhang
However, since standard MCTS cannot learn state representations, the tree-search space can be too large to explore effectively.
no code implementations • 3 Oct 2020 • Masahiro Kato, Kei Nakagawa, Kenshi Abe, Tetsuro Morimura
To achieve this purpose, we train an agent to maximize the expected quadratic utility function, a common objective of risk management in finance and economics.
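A common parameterization of quadratic utility over wealth (or return) $w$, with risk-aversion parameter $\lambda > 0$:

$$U(w) = w - \frac{\lambda}{2} w^2, \qquad \mathbb{E}[U(w)] = \mathbb{E}[w] - \frac{\lambda}{2}\left( \mathrm{Var}(w) + \mathbb{E}[w]^2 \right),$$

which makes explicit that maximizing expected quadratic utility trades expected return against variance, the usual risk-management objective.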
no code implementations • 2 Jul 2019 • Kun Zhao, Takayuki Osogami, Tetsuro Morimura
To solve this problem, we model a whole match as a Markov chain of significant events, so that event values can be estimated over a continuous parameter space by solving the Markov chain with a machine learning model.
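One standard way to "solve" such a chain for event values (the paper's exact formulation may differ): with transition matrix $P$ over transient event states and immediate rewards $r$, the value vector satisfies

$$v = r + P v \quad \Longrightarrow \quad v = (I - P)^{-1} r,$$

where the inverse exists when every match eventually reaches a terminal state.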
no code implementations • 16 Jun 2019 • Yachiko Obara, Tetsuro Morimura, Hiroki Yanagisawa
The key points of our approach are (1) designing an appropriate target distribution by using a condition on the number of nonzero elements, and (2) changing values only between a certain pair of elements in each iteration.
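A hedged sketch of point (2), assuming a Metropolis-style sampler whose proposal perturbs exactly one pair of elements so that the total sum stays invariant; the paper's actual proposal and target may differ, and `log_target` is a hypothetical unnormalized log-density:

```python
import math
import random

def pairwise_step(x, log_target, scale=1.0):
    """One Metropolis step: move a random amount between one pair of
    elements (sum-preserving, symmetric proposal), then accept/reject."""
    i, j = random.sample(range(len(x)), 2)
    delta = random.gauss(0.0, scale)
    proposal = list(x)
    proposal[i] += delta
    proposal[j] -= delta  # the total sum of elements is unchanged
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        return proposal   # accept
    return x              # reject
```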
no code implementations • NeurIPS 2013 • Tetsuro Morimura, Takayuki Osogami, Tsuyoshi Ide
The Markov chain is a convenient tool for representing the dynamics of complex systems, such as traffic and social systems, where probabilistic transitions take place between internal states.
no code implementations • NeurIPS 2009 • Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya
In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs, and we propose an efficient implementation of gNG learning based on the theory of estimating functions: the generalized Natural Actor-Critic (gNAC).
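In symbols (notation mine, not necessarily the paper's): writing the two Fisher information matrices as $F_1$ and $F_2$, the interpolated metric and the resulting generalized natural gradient of the objective $J(\theta)$ are

$$G_\kappa = (1 - \kappa) F_1 + \kappa F_2, \qquad \widetilde{\nabla} J(\theta) = G_\kappa^{-1} \nabla_\theta J(\theta), \qquad \kappa \in [0, 1].$$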