no code implementations • 22 Dec 2023 • Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen
We evaluate VIESCORE on seven prominent tasks in conditional image generation and find: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of 0.3 with human evaluations, while the human-to-human correlation is 0.45.
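Spearman correlation, used above to compare metric scores against human judgments, is just the Pearson correlation of the two rank sequences. A minimal sketch with hypothetical data (VIESCORE's actual evaluation uses its own datasets and protocol):

```python
# Spearman rank correlation between model scores and human ratings.
# The data below is hypothetical, for illustration only.

def rank(values):
    # 1-based ranks, averaging ranks across ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation computed on the ranks of x and y.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

model_scores = [0.9, 0.4, 0.7, 0.2, 0.6]  # hypothetical metric outputs
human_scores = [5, 1, 4, 2, 3]            # hypothetical human ratings
print(round(spearman(model_scores, human_scores), 3))  # -> 0.9
```

A value of 1.0 would mean the metric orders the images exactly as humans do; the paper's 0.3 versus a 0.45 human-to-human ceiling contextualizes how hard the task is.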
2 code implementations • 27 Nov 2023 • Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
1 code implementation • 1 Oct 2023 • Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, Wenhu Chen
To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, showing that TIGERScore achieves the open-source state-of-the-art correlation with human ratings across these datasets and nearly matches the GPT-4 evaluator.
3 code implementations • 5 Jun 2023 • Dongfu Jiang, Xiang Ren, Bill Yuchen Lin
We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs).
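The core ensembling idea can be sketched as rank-then-fuse: compare candidate outputs from different LLMs pairwise, keep the winners, and produce a final answer from the top candidates. The comparator below is a toy length heuristic standing in for LLM-Blender's learned pairwise ranker, and the "fusion" step simply returns the top-ranked candidate rather than generating a new one:

```python
# Hypothetical sketch of ensembling by pairwise ranking, the idea behind
# LLM-Blender. `better` is a toy stand-in for a learned comparator.

from itertools import combinations

def better(a: str, b: str) -> bool:
    # Toy heuristic: prefer the longer answer. A real system would use
    # a trained pairwise ranking model here.
    return len(a) > len(b)

def rank_candidates(candidates):
    # Count pairwise wins for each candidate, then sort by wins.
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[a if better(a, b) else b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

def blend(candidates, top_k=2):
    # In the paper, a fusion model generates a new answer conditioned on
    # the top-k candidates; this sketch just returns the top-ranked one.
    top = rank_candidates(candidates)[:top_k]
    return top[0]

outputs = ["short", "a longer answer", "mid size"]  # hypothetical LLM outputs
print(blend(outputs))  # -> a longer answer
```

The pairwise formulation matters because absolute quality scores from different models are hard to calibrate, whereas a comparator only has to decide which of two candidates is better.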
no code implementations • 20 Dec 2022 • Dongfu Jiang, Bill Yuchen Lin, Xiang Ren
Pre-trained language models have been successful in natural language generation (NLG) tasks.