no code implementations • 14 Feb 2024 • Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
no code implementations • 12 Oct 2023 • Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal
Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
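DistillSpec's full pipeline is not reproduced here; purely as an illustration of why a well-aligned draft model reduces latency, the sketch below shows the standard token-level accept/reject rule of speculative sampling: the closer the draft distribution q is to the target distribution p, the more drafted tokens are accepted per target-model call. The function name and NumPy setup are illustrative assumptions, not code from the paper.

```python
import numpy as np

def speculative_accept(p_target, q_draft, drafted_token, rng):
    """Token-level accept/reject step of speculative sampling (illustrative).

    p_target, q_draft: probability vectors over the vocabulary from the large
    target model and the small draft model at the current position.
    drafted_token: token index proposed by the draft model.
    """
    # Accept the drafted token with probability min(1, p/q); a draft model
    # distilled to match the target keeps this ratio near 1 most of the time.
    accept_prob = min(1.0, p_target[drafted_token] / max(q_draft[drafted_token], 1e-12))
    if rng.random() < accept_prob:
        return drafted_token
    # On rejection, resample from the residual distribution max(0, p - q),
    # renormalised, so the overall output still follows the target model exactly.
    residual = np.clip(p_target - q_draft, 0.0, None)
    return rng.choice(len(p_target), p=residual / residual.sum())
```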
1 code implementation • 25 Sep 2023 • Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks.
no code implementations • 23 Jun 2023 • Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences.
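As a rough, toy-scale illustration of that on-policy idea (not the paper's implementation), the sketch below lets a small "student" sample its own token sequences and then minimises a token-level reverse KL against a frozen "teacher" on those same sequences. The tiny models, vocabulary size, and choice of divergence are assumptions made for brevity; GKD itself supports several divergences and mixtures of on- and off-policy data.

```python
import torch
import torch.nn.functional as F

vocab, hidden, seq_len, batch = 100, 32, 16, 8

# Toy stand-ins for the teacher and student; in practice these would be
# pretrained language models sharing a tokenizer.
teacher = torch.nn.Sequential(torch.nn.Embedding(vocab, hidden), torch.nn.Linear(hidden, vocab))
student = torch.nn.Sequential(torch.nn.Embedding(vocab, hidden), torch.nn.Linear(hidden, vocab))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(10):
    # 1) The student generates its own sequences (ancestral sampling from a random prompt token).
    with torch.no_grad():
        tokens = torch.randint(0, vocab, (batch, 1))
        for _ in range(seq_len - 1):
            logits = student(tokens[:, -1])                      # next-token logits, (batch, vocab)
            next_tok = torch.multinomial(F.softmax(logits, dim=-1), 1)
            tokens = torch.cat([tokens, next_tok], dim=1)

    # 2) Both models score the student-generated sequences.
    s_logits = student(tokens)                                   # (batch, seq_len, vocab)
    with torch.no_grad():
        t_logits = teacher(tokens)

    # 3) Distillation loss on the student's own samples: reverse KL(student || teacher).
    loss = F.kl_div(F.log_softmax(t_logits, dim=-1),
                    F.log_softmax(s_logits, dim=-1),
                    log_target=True, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```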
1 code implementation • 24 May 2023 • Yongchao Zhou, Hshmat Sahak, Jimmy Ba
In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model, Stable Diffusion, to generate diverse, high-quality training data for image classification.
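The inversion procedure itself is not shown here; as a minimal sketch of the broader recipe of drawing classifier training data from a pretrained Stable Diffusion checkpoint, the snippet below generates labelled images with Hugging Face diffusers from plain class-name prompts. The checkpoint name, class list, and sample counts are illustrative assumptions, and prompt-based generation is only a stand-in, not the paper's inversion procedure.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative setup: checkpoint, classes, and counts are assumptions, not from the paper.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_names = ["golden retriever", "tabby cat", "school bus"]
images_per_class = 4

dataset = []  # list of (PIL.Image, class index) pairs for classifier training
for label, name in enumerate(class_names):
    out = pipe(f"a photo of a {name}", num_images_per_prompt=images_per_class)
    dataset.extend((img, label) for img in out.images)
```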
2 code implementations • 3 Nov 2022 • Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba
By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers.
2 code implementations • 1 Jun 2022 • Yongchao Zhou, Ehsan Nezhadarya, Jimmy Ba
Dataset distillation can be formulated as a bi-level meta-learning problem where the outer loop optimizes the meta-dataset and the inner loop trains a model on the distilled data.
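A minimal sketch of that bi-level structure, assuming a toy linear-regression task: the inner loop unrolls a few SGD steps on the learned synthetic examples, and the outer loop differentiates through that unrolling to update the synthetic data against a real-data loss. This is the generic formulation the sentence describes, not the specific algorithm proposed in the paper.

```python
import torch

torch.manual_seed(0)
d, n_real, n_syn, inner_steps, inner_lr = 20, 256, 10, 5, 0.1

x_real = torch.randn(n_real, d)      # "real" data the distilled set should stand in for
y_real = torch.randn(n_real, 1)

# Outer variables: the distilled dataset itself.
x_syn = torch.randn(n_syn, d, requires_grad=True)
y_syn = torch.randn(n_syn, 1, requires_grad=True)
outer_opt = torch.optim.Adam([x_syn, y_syn], lr=1e-2)

for outer_step in range(100):
    # Inner loop: train a fresh model on the distilled data, keeping the
    # computation graph so gradients can flow back to (x_syn, y_syn).
    w = torch.zeros(d, 1, requires_grad=True)
    for _ in range(inner_steps):
        inner_loss = ((x_syn @ w - y_syn) ** 2).mean()
        (grad,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w = w - inner_lr * grad

    # Outer loop: evaluate the inner-trained model on real data and update the
    # distilled dataset by differentiating through the unrolled training steps.
    outer_loss = ((x_real @ w - y_real) ** 2).mean()
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```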