M3Exam is a multilingual, multimodal, and multilevel benchmark designed for evaluating Large Language Models (LLMs). Unlike traditional benchmarks, which often focus on specific tasks or datasets, M3Exam takes a more comprehensive approach by sourcing real and official human exam questions. Let's delve into its unique characteristics:
Multilingualism: M3Exam encompasses questions from multiple countries, requiring strong multilingual proficiency and cultural knowledge. It evaluates how well LLMs handle diverse languages.
Multimodality: Many exam questions are multimodal, combining text with images. M3Exam tests the model's ability to understand and process such complex, multimodal content.
Multilevel Structure: M3Exam features exams from three critical educational periods (primary, middle, and high school exit exams), allowing a comprehensive assessment of a model's proficiency at different levels.
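The three dimensions can be pictured as fields on a single question record. Below is a minimal sketch in Python; the field names are illustrative assumptions, not the benchmark's actual data schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExamQuestion:
    """Hypothetical record combining M3Exam's three dimensions."""
    language: str                      # multilingual: e.g. "en", "th", "sw"
    level: str                         # multilevel: e.g. "primary", "middle", "high"
    question: str                      # question text
    options: list[str]                 # multiple-choice options
    answer: str                        # gold option label, e.g. "C"
    image_path: Optional[str] = None   # multimodal: set when the question includes an image

    @property
    def is_multimodal(self) -> bool:
        # A question is multimodal when it carries an image alongside text.
        return self.image_path is not None

# Example: a Thai high-school question with an accompanying figure
q = ExamQuestion(
    language="th",
    level="high",
    question="...",
    options=["A ...", "B ...", "C ...", "D ..."],
    answer="C",
    image_path="figures/q17.png",
)
print(q.is_multimodal)  # True
```

Keeping language, level, and image on the same record is what lets a single benchmark slice results along any of the three axes.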
Why human exams? Despite the existence of many benchmarks, the M3Exam authors argue that real, official human exams are a more suitable probe of general intelligence in LLMs: passing them inherently demands language understanding, domain knowledge, and problem-solving skills at once.
Top-performing LLMs, including GPT-4, have been assessed on M3Exam. However, they still face challenges with multilingual text, especially in low-resource and non-Latin script languages. Additionally, multimodal LLMs struggle with complex multimodal questions.
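Reporting results like these typically reduces to exact-match accuracy over predicted option labels, broken down by language so that gaps on low-resource languages become visible. A minimal scoring sketch, assuming predictions and gold answers are option letters (this is not the benchmark's official evaluation script):

```python
from collections import defaultdict

def per_language_accuracy(records):
    """Compute exact-match multiple-choice accuracy per language.

    records: iterable of (language, predicted_label, gold_label) tuples.
    Returns {language: accuracy} with accuracies in [0, 1].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, pred, gold in records:
        total[lang] += 1
        # Normalize case/whitespace so "c" matches "C".
        if pred.strip().upper() == gold.strip().upper():
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy example: a high-resource language scoring above a low-resource one
records = [
    ("en", "B", "B"), ("en", "C", "C"), ("en", "A", "D"),
    ("sw", "A", "B"), ("sw", "c", "C"),
]
print(per_language_accuracy(records))  # {'en': 0.666..., 'sw': 0.5}
```

Grouping scores per language (rather than one pooled number) is precisely what exposes the multilingual weaknesses the benchmark reports.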