M3Exam is a multilingual, multimodal, and multilevel benchmark designed for evaluating Large Language Models (LLMs). Unlike traditional benchmarks, which often focus on specific tasks or datasets, M3Exam takes a more comprehensive approach by sourcing real and official human exam questions. Let's delve into its unique characteristics:

  1. Multilingualism: M3Exam draws questions from official exams in multiple countries, so solving them demands strong multilingual proficiency and country-specific cultural knowledge, not just translation ability.

  2. Multimodality: Many exam questions are multimodal, combining text with images. M3Exam tests the model's ability to understand and process such complex, multimodal content.

  3. Multilevel Structure: M3Exam features exams from three critical educational periods (the end of primary school, middle school, and high school), allowing a comprehensive assessment of a model's proficiency at different levels of difficulty.

Here are some key details about M3Exam:

  • Number of Questions: M3Exam contains 12,317 questions in 9 diverse languages across three educational levels.
  • Image Processing: Approximately 23% of the questions require image understanding to solve.
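To make the dataset's shape concrete, the sketch below models a question as a simple record with language, educational level, answer options, and an image flag, and computes the share of image-dependent questions. The field names are illustrative, not M3Exam's actual schema.

```python
# Hypothetical record layout for an M3Exam-style multiple-choice question.
# Field names ("lang", "level", "needs_image", ...) are assumptions for
# illustration, not the dataset's real schema.
questions = [
    {"lang": "english", "level": "primary", "question": "...",
     "options": ["A ...", "B ...", "C ...", "D ..."],
     "answer": "B", "needs_image": False},
    {"lang": "thai", "level": "middle", "question": "...",
     "options": ["A ...", "B ...", "C ...", "D ..."],
     "answer": "A", "needs_image": True},
]

def image_share(items):
    """Fraction of questions that require an image to solve."""
    return sum(q["needs_image"] for q in items) / len(items)

print(f"{image_share(questions):.0%}")  # 50% for this two-item toy sample
```

In the real benchmark this fraction is roughly 23%, which is why a text-only model can attempt most, but not all, of the questions.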

Despite the existence of various benchmarks, the M3Exam authors argue that human exams provide a more suitable means of evaluating general intelligence in large language models: such exams inherently demand a wide range of abilities, including language understanding, domain knowledge, and problem-solving skills.

Top-performing LLMs, including GPT-4, have been assessed on M3Exam. Even these models still struggle with multilingual questions, especially in low-resource and non-Latin-script languages, and current multimodal LLMs perform poorly on complex multimodal questions.
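Findings like these come from slicing exact-match accuracy by language rather than reporting a single aggregate score. A minimal sketch of that breakdown, using the hypothetical record fields from above:

```python
from collections import defaultdict

def accuracy_by_language(predictions, gold):
    """Exact-match accuracy grouped by language, as one would compute to
    compare high- vs. low-resource performance. Field names are illustrative."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, item in zip(predictions, gold):
        total[item["lang"]] += 1
        correct[item["lang"]] += int(pred == item["answer"])
    return {lang: correct[lang] / total[lang] for lang in total}

gold = [
    {"lang": "english", "answer": "B"},
    {"lang": "english", "answer": "C"},
    {"lang": "swahili", "answer": "A"},
]
print(accuracy_by_language(["B", "C", "D"], gold))
# {'english': 1.0, 'swahili': 0.0}
```

Per-language (and per-level) slices like this are what expose the gap between Latin-script, high-resource languages and the rest.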

