AgentBench

Introduced by Liu et al. in AgentBench: Evaluating LLMs as Agents

AgentBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) as agents in interactive environments. As LLMs become more capable and autonomous, they are increasingly applied beyond traditional natural language processing tasks to practical, real-world missions. Here are the key details about AgentBench:

  1. Purpose: AgentBench assesses LLMs' reasoning and decision-making abilities in a multi-turn, open-ended generation setting across challenging tasks (the interaction loop is sketched after this list).
  2. Environments: It currently consists of eight distinct environments spanning code-grounded, game-grounded, and web-grounded scenarios:
    • Operating System (OS): executing tasks in a real bash shell.
    • Database (DB): operating on real SQL databases.
    • Knowledge Graph (KG): querying large-scale knowledge graphs.
    • Digital Card Game (DCG): playing a digital card game.
    • Lateral Thinking Puzzles (LTP): multi-turn puzzle solving.
    • House-Holding (HH): embodied household tasks.
    • Web Shopping (WS): purchasing items on a simulated shopping site.
    • Web Browsing (WB): navigating real-world websites.
  3. Performance Disparity: Extensive testing of 27 API-based and open-sourced LLMs shows that while top commercial LLMs exhibit strong agent capabilities in complex environments, a significant performance gap separates them from their open-source competitors.
  4. Challenges: Poor long-term reasoning, decision-making, and instruction-following abilities are identified as the main obstacles to developing usable LLM agents.
  5. Improvement Strategies: Training on code and on high-quality multi-turn alignment data could enhance agent performance (a sketch of one possible alignment record follows this list).
  6. Resources: Datasets, environments, and an integrated evaluation package for AgentBench are available¹.
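
The interaction pattern shared by these environments is a multi-turn loop: the model receives an observation, emits an action, and the environment returns feedback until the task ends and a score is computed. Below is a minimal sketch of such a loop; `Environment` and `llm_act` are hypothetical stand-ins for illustration, not the actual AgentBench API.

```python
# Minimal sketch of a multi-turn agent evaluation loop in the style of
# AgentBench. `Environment` and `llm_act` are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class Environment:
    """Toy stand-in for one task environment (e.g., an OS shell session)."""
    max_turns: int = 10
    turn: int = 0

    def observe(self) -> str:
        return f"observation at turn {self.turn}"

    def step(self, action: str) -> tuple[str, bool, float]:
        """Apply the agent's action; return (feedback, done, reward)."""
        self.turn += 1
        done = self.turn >= self.max_turns or action == "submit"
        reward = 1.0 if action == "submit" else 0.0
        return f"result of {action!r}", done, reward


def llm_act(history: list[dict]) -> str:
    """Placeholder for an LLM chat call that maps history to the next action."""
    return "submit" if len(history) >= 7 else "explore"


def run_episode(env: Environment) -> float:
    """Roll out one task: alternate model actions and environment feedback."""
    history = [{"role": "user", "content": env.observe()}]
    while True:
        action = llm_act(history)
        history.append({"role": "assistant", "content": action})
        feedback, done, reward = env.step(action)
        history.append({"role": "user", "content": feedback})
        if done:
            return reward  # task-specific score, e.g., success or F1


print(run_episode(Environment()))
```

In the real benchmark, `llm_act` would be a call to the model under test, and `step` would execute the action against a live system (a shell, a database, a web page) rather than returning canned text.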
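On point 5, "multi-turn alignment data" refers to supervised conversations in which model actions alternate with environment feedback. The snippet below is a hedged sketch of what one such training record might look like; the field names and format are assumptions for illustration, not a schema from the paper.

```python
# Hypothetical shape of one multi-turn alignment training record; field
# names are illustrative assumptions, not a schema from the paper.
import json

record = {
    "env": "db",  # which environment produced the trajectory
    "conversations": [
        {"role": "user", "content": "List the names of users older than 30."},
        {"role": "assistant", "content": "Action: SELECT name FROM users WHERE age > 30;"},
        {"role": "user", "content": "Output: [('alice',), ('bob',)]"},
        {"role": "assistant", "content": "Answer: alice, bob"},
    ],
}
print(json.dumps(record, indent=2))
```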

In summary, AgentBench provides a rigorous evaluation framework for assessing LLMs' performance as autonomous agents across diverse scenarios.

(1) AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688. https://arxiv.org/abs/2308.03688 (DOI: https://doi.org/10.48550/arXiv.2308.03688)
(2) AGENTBENCH: Evaluating the Capabilities of LLMs as Agents. Zhihu. https://zhuanlan.zhihu.com/p/664598024
(3) AgentBench, a comprehensive benchmark for evaluating AI agents. AI-Scholar. https://ai-scholar.tech/en/articles/agent-simulation%2Fagentbench
(4) AI-Natural-Language-Processing-Lab/AgentBench-LLM-as-Agent. GitHub. https://github.com/AI-Natural-Language-Processing-Lab/AgentBench-LLM-as-Agent
(5) AgentBench: Evaluating LLMs as Agents. GitHub. https://github.com/THUDM/AgentBench
