MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. It covers 57 subjects across STEM, the humanities, the social sciences, and more, ranging in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.
831 PAPERS • 25 BENCHMARKS
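To make the zero- and few-shot protocol concrete, below is a minimal sketch of how a k-shot multiple-choice prompt can be assembled from MMLU-style items. The example questions, field names, and prompt template are illustrative assumptions, not the official evaluation harness.

```python
# A minimal sketch of few-shot prompting on MMLU-style items.
# The example questions and the prompt template are invented for
# illustration; MMLU's official harness may format items differently.

EXAMPLE_ITEMS = [
    {
        "question": "What is the derivative of x^2?",
        "choices": ["x", "2x", "x^2", "2"],
        "answer": 1,  # index into choices
    },
    {
        "question": "Which body proposes legislation in the EU?",
        "choices": ["European Commission", "European Council",
                    "Court of Justice", "Eurogroup"],
        "answer": 0,
    },
]

LETTERS = "ABCD"

def format_item(item: dict, with_answer: bool) -> str:
    """Render one multiple-choice item as a prompt fragment."""
    lines = [item["question"]]
    for letter, choice in zip(LETTERS, item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:" + (f" {LETTERS[item['answer']]}" if with_answer else ""))
    return "\n".join(lines)

def build_few_shot_prompt(shots: list[dict], target: dict) -> str:
    """Concatenate k solved examples followed by the unsolved target item."""
    parts = [format_item(s, with_answer=True) for s in shots]
    parts.append(format_item(target, with_answer=False))
    return "\n\n".join(parts)

if __name__ == "__main__":
    # One-shot prompt: the first item is the solved example, the second the query.
    print(build_few_shot_prompt(EXAMPLE_ITEMS[:1], EXAMPLE_ITEMS[1]))
```

In a typical evaluation of this kind, the model's prediction is the answer letter it assigns the highest likelihood after the final "Answer:".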
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. BIG-bench includes more than 200 tasks.
238 PAPERS • 134 BENCHMARKS
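Many BIG-bench tasks are specified declaratively as JSON files of input/target examples plus task metadata. The sketch below shows an assumed minimal task of that shape and a simple exact-match scorer; the field names follow the public task format, but the concrete task content is invented for illustration.

```python
import json

# A minimal sketch of a BIG-bench-style JSON task: metadata plus an
# "examples" list of input/target pairs. This toy task is invented for
# illustration and is not part of BIG-bench.

TASK_JSON = """
{
  "name": "toy_arithmetic",
  "description": "Answer simple arithmetic questions.",
  "keywords": ["arithmetic", "zero-shot"],
  "metrics": ["exact_str_match"],
  "examples": [
    {"input": "What is 2 + 2?", "target": "4"},
    {"input": "What is 7 - 3?", "target": "4"}
  ]
}
"""

def score_exact_match(predict, task: dict) -> float:
    """Fraction of examples where the model's output equals the target string."""
    examples = task["examples"]
    hits = sum(predict(ex["input"]).strip() == ex["target"] for ex in examples)
    return hits / len(examples)

if __name__ == "__main__":
    task = json.loads(TASK_JSON)
    # Stand-in "model" that always answers "4"; replace with a real LM call.
    print(score_exact_match(lambda prompt: "4", task))
```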
This dataset contains 20 multiple-choice questions covering various aspects of legal knowledge, such as the workings of the European Commission, types of legal documents, court procedures, legal definitions, and European Union and United Kingdom law, among others.
0 PAPERS • NO BENCHMARKS YET