We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
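The routing step described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the released implementation: the function names, the toy experts (`make_toy_expert`), and the small dimensions are hypothetical, and the gating rule (a softmax over only the two selected router logits) is an assumption not stated in the abstract.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_toy_expert(scale):
    """Hypothetical stand-in for a feedforward block: scales its input."""
    return lambda x: [scale * v for v in x]


def top2_moe_layer(x, router_weights, experts):
    """Route one token's hidden state through its two highest-scoring experts.

    x              : hidden state for a single token (list of floats)
    router_weights : one row of router weights per expert
    experts        : list of callables, one per expert
    Returns (output, chosen_expert_indices, gate_weights).
    """
    logits = matvec(router_weights, x)
    # Pick the two experts with the largest router logits.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Assumption: gate weights are a softmax over the two selected logits only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](x)  # only the selected experts are evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen, gates


if __name__ == "__main__":
    # Eight toy experts, mirroring the expert count given in the abstract.
    experts = [make_toy_expert(float(i + 1)) for i in range(8)]
    router_weights = [[math.sin(i + j) for j in range(4)] for i in range(8)]
    token_state = [0.5, -1.0, 0.25, 2.0]
    out, chosen, gates = top2_moe_layer(token_state, router_weights, experts)
    print("experts used:", chosen, "gate weights:", gates)
```

Because only the two selected experts run for a given token, per-token compute depends on the active parameters rather than on the full parameter count, which is the distinction the abstract draws between 47B and 13B.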


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Easy) | Mistral 7B (0-shot) | Accuracy | 80.5 | #11 |
| Common Sense Reasoning | ARC (Easy) | Mixtral 8x7B (0-shot) | Accuracy | 83.1 | #9 |
| Code Generation | HumanEval | Mistral 7B (0-shot) | Pass@1 | 26.2 | #87 |
| Code Generation | HumanEval | Mixtral 8x7B (0-shot) | Pass@1 | 40.2 | #61 |
| Math Word Problem Solving | MATH | Mixtral 8x7B (maj@4) | Accuracy | 28.4 | #67 |
| Math Word Problem Solving | MATH | Mistral 7B (maj@4) | Accuracy | 12.7 | #86 |
| | | | Parameters (Billions) | 7 | #58 |
| Code Generation | MBPP | Mixtral 8x7B (3-shot) | Accuracy | 60.7 | #35 |
| Multi-task Language Understanding | MMLU | Mixtral 8x7B (5-shot) | Average (%) | 70.6 | #22 |
| Question Answering | PIQA | Mixtral 8x7B (0-shot) | Accuracy | 83.6 | #9 |
| Question Answering | PIQA | Mistral 7B (0-shot) | Accuracy | 82.2 | #16 |
| Common Sense Reasoning | WinoGrande | Mixtral 8x7B (0-shot) | Accuracy | 77.2 | #17 |
| Common Sense Reasoning | WinoGrande | Mistral 7B (0-shot) | Accuracy | 74.2 | #25 |

Methods