AgentBench

Introduced by Liu et al. in AgentBench: Evaluating LLMs as Agents

AgentBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) as agents in interactive environments. As LLMs become more capable and autonomous, they are increasingly applied beyond traditional natural language processing tasks to practical, real-world missions. Here are the key details about AgentBench:

  1. Purpose: AgentBench assesses LLMs' reasoning and decision-making abilities in a multi-turn, open-ended generation setting across challenging tasks (the interaction loop is sketched after this list).
  2. Environments: It currently consists of eight distinct environments spanning code-grounded, game-grounded, and web-grounded scenarios:
    • Operating System (OS): executing tasks in a real bash shell.
    • Database (DB): operating on real SQL databases.
    • Knowledge Graph (KG): querying large-scale knowledge graphs.
    • Digital Card Game (DCG): playing a digital card game.
    • Lateral Thinking Puzzles (LTP): multi-turn puzzle solving.
    • House-Holding (HH): embodied household tasks.
    • Web Shopping (WS): purchasing items on a simulated shopping site.
    • Web Browsing (WB): navigating real-world websites.
  3. Performance Disparity: Extensive testing of 27 API-based and open-sourced LLMs shows that while top commercial LLMs exhibit strong agent capabilities in complex environments, a significant performance gap separates them from their open-source competitors.
  4. Challenges: Poor long-term reasoning, decision-making, and instruction-following abilities are identified as the main obstacles to developing usable LLM agents.
  5. Improvement Strategies: Training on code and on high-quality multi-turn alignment data could enhance agent performance (a sketch of one possible alignment record follows this list).
  6. Resources: Datasets, environments, and an integrated evaluation package for AgentBench are available¹.
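
The interaction pattern shared by these environments is a multi-turn loop: the model receives an observation, emits an action, and the environment returns feedback until the task ends and a score is computed. Below is a minimal sketch of such a loop; `Environment` and `llm_act` are hypothetical stand-ins for illustration, not the actual AgentBench API.

```python
# Minimal sketch of a multi-turn agent evaluation loop in the style of
# AgentBench. `Environment` and `llm_act` are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class Environment:
    """Toy stand-in for one task environment (e.g., an OS shell session)."""
    max_turns: int = 10
    turn: int = 0

    def observe(self) -> str:
        return f"observation at turn {self.turn}"

    def step(self, action: str) -> tuple[str, bool, float]:
        """Apply the agent's action; return (feedback, done, reward)."""
        self.turn += 1
        done = self.turn >= self.max_turns or action == "submit"
        reward = 1.0 if action == "submit" else 0.0
        return f"result of {action!r}", done, reward


def llm_act(history: list[dict]) -> str:
    """Placeholder for an LLM chat call that maps history to the next action."""
    return "submit" if len(history) >= 7 else "explore"


def run_episode(env: Environment) -> float:
    """Roll out one task: alternate model actions and environment feedback."""
    history = [{"role": "user", "content": env.observe()}]
    while True:
        action = llm_act(history)
        history.append({"role": "assistant", "content": action})
        feedback, done, reward = env.step(action)
        history.append({"role": "user", "content": feedback})
        if done:
            return reward  # task-specific score, e.g., success or F1


print(run_episode(Environment()))
```

In the real benchmark, `llm_act` would be a call to the model under test, and `step` would execute the action against a live system (a shell, a database, a web page) rather than returning canned text.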
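On point 5, "multi-turn alignment data" refers to supervised conversations in which model actions alternate with environment feedback. The snippet below is a hedged sketch of what one such training record might look like; the field names and format are assumptions for illustration, not a schema from the paper.

```python
# Hypothetical shape of one multi-turn alignment training record; field
# names are illustrative assumptions, not a schema from the paper.
import json

record = {
    "env": "db",  # which environment produced the trajectory
    "conversations": [
        {"role": "user", "content": "List the names of users older than 30."},
        {"role": "assistant", "content": "Action: SELECT name FROM users WHERE age > 30;"},
        {"role": "user", "content": "Output: [('alice',), ('bob',)]"},
        {"role": "assistant", "content": "Answer: alice, bob"},
    ],
}
print(json.dumps(record, indent=2))
```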

In summary, AgentBench provides a rigorous evaluation framework for assessing LLMs' performance as autonomous agents across diverse scenarios.

(1) AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688. https://arxiv.org/abs/2308.03688 (DOI: https://doi.org/10.48550/arXiv.2308.03688)
(2) AGENTBENCH: Evaluating the Capabilities of LLMs as Agents. Zhihu. https://zhuanlan.zhihu.com/p/664598024
(3) AgentBench, a comprehensive benchmark for evaluating AI agents. AI-Scholar. https://ai-scholar.tech/en/articles/agent-simulation%2Fagentbench
(4) AI-Natural-Language-Processing-Lab/AgentBench-LLM-as-Agent. GitHub. https://github.com/AI-Natural-Language-Processing-Lab/AgentBench-LLM-as-Agent
(5) AgentBench: Evaluating LLMs as Agents. GitHub. https://github.com/THUDM/AgentBench
