AgentBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) as agents in interactive environments. As LLMs grow in capability and autonomy, they have expanded beyond traditional natural language processing tasks to tackle pragmatic, real-world tasks. Key details about AgentBench:

- Scope: a multi-dimensional benchmark spanning eight distinct interactive environments: Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing.
- Setting: tasks are multi-turn and open-ended; the model observes the environment, reasons, and issues actions over several rounds rather than answering a single static prompt (a sketch of this loop follows the list).
- Coverage: more than two dozen commercial (API-based) and open-source LLMs are evaluated under a unified protocol.
- Findings: the paper reports a substantial gap between top commercial models such as GPT-4 and open-source competitors on agent tasks.
- Availability: the paper (arXiv:2308.03688) and the evaluation toolkit (https://github.com/THUDM/AgentBench) are publicly available.
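The core of such an evaluation is a multi-turn loop between the model and an environment. Below is a minimal, self-contained Python sketch of that loop; the names (`ToyEnv`, `run_episode`, `scripted_agent`) are illustrative stand-ins, not the actual AgentBench API, and the scripted agent is a placeholder for a real LLM call.

```python
# Minimal sketch of a multi-turn agent-environment evaluation loop.
# All names here are illustrative and are NOT the AgentBench API.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyEnv:
    """A stand-in interactive task: the agent must produce the target command."""
    target: str = "echo hello"
    max_turns: int = 5
    history: List[Tuple[str, str]] = field(default_factory=list)

    def reset(self) -> str:
        self.history.clear()
        return "Task: produce the shell command that prints 'hello'."

    def step(self, action: str) -> Tuple[str, bool, bool]:
        """Return (observation, done, success) for one turn."""
        self.history.append(("agent", action))
        if action.strip() == self.target:
            return "Command accepted.", True, True
        done = len(self.history) >= self.max_turns
        return "Command rejected, try again.", done, False


# An agent maps the dialogue so far to its next action.
Agent = Callable[[List[str]], str]


def run_episode(env: ToyEnv, agent: Agent) -> bool:
    """Play one multi-turn episode and report whether the task was solved."""
    dialogue = [env.reset()]
    while True:
        action = agent(dialogue)
        observation, done, success = env.step(action)
        dialogue.extend([action, observation])
        if done:
            return success


def scripted_agent(dialogue: List[str]) -> str:
    """Placeholder for an LLM call; a real harness would prompt the model here."""
    return "echo hello" if len(dialogue) > 1 else "ls"


if __name__ == "__main__":
    solved = sum(run_episode(ToyEnv(), scripted_agent) for _ in range(10))
    print(f"success rate: {solved / 10:.0%}")
```

In a real harness, `scripted_agent` would be replaced by a prompted LLM, and per-environment success flags like the one returned here would be aggregated into the benchmark's overall scores.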
In summary, AgentBench provides a rigorous evaluation framework for assessing LLMs' performance as autonomous agents across diverse scenarios.