Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens, and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and the Llama 2 70B Chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
2024-01-08 · arXiv · Inference Optimization · Language Model Architecture · Sparse Mixture of Experts
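
A minimal sketch of the top-2 expert routing described in the abstract: a small gating network scores 8 feed-forward experts per token, keeps the 2 highest-scoring ones, and sums their softmax-weighted outputs. Class names, the SwiGLU expert shape, and the toy sizes in the usage lines are illustrative assumptions, not the released Mixtral code.

# Illustrative sketch only: class names, the SwiGLU expert shape, and the toy
# sizes below are assumptions, not the released Mixtral implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardExpert(nn.Module):
    """One feed-forward block (an 'expert'), here with a SwiGLU shape."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoELayer(nn.Module):
    """Routes every token to top_k of n_experts experts and sums their weighted outputs."""
    def __init__(self, dim: int, hidden_dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)      # gating network
        self.experts = nn.ModuleList(
            FeedForwardExpert(dim, hidden_dim) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, dim); flatten batch and sequence dimensions before calling.
        logits = self.router(x)                                  # (n_tokens, n_experts)
        top_logits, top_idx = logits.topk(self.top_k, dim=-1)    # pick 2 experts per token
        weights = F.softmax(top_logits, dim=-1)                  # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (top_idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if rows.numel() == 0:
                continue                                         # unselected experts do no work
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

# Toy usage: 16 tokens of width 64; only 2 of the 8 expert FFNs run for each token.
tokens = torch.randn(16, 64)
layer = SparseMoELayer(dim=64, hidden_dim=128)
print(layer(tokens).shape)  # torch.Size([16, 64])

Because only 2 of the 8 expert blocks run for any given token, the parameters touched per token stay well below the total parameter count, which is the effect the abstract quantifies as 13B active parameters out of 47B.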

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B - Instruct, that surpasses the Llama 2 13B - Chat model on both human and automated benchmarks. Our models are released under the Apache 2.0 license.
2023-10-10 · arXiv · Inference Optimization · Instruction-following Model · Language Model Engineering
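
A minimal sketch of the two attention mechanisms named in the abstract, under assumed head counts, window size, and function names; it is not the released Mistral 7B implementation. Grouped-query attention lets several query heads share one key/value head (shrinking the KV cache), and sliding window attention restricts each position to the most recent W keys.

# Illustrative sketch only: head counts, window size, and function names are
# assumptions, not the released Mistral 7B code.
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: causal and within the last `window` positions."""
    pos = torch.arange(seq_len)
    rel = pos[:, None] - pos[None, :]              # query index minus key index
    return (rel >= 0) & (rel < window)

def grouped_query_attention(q, k, v, window: int):
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)          # each KV head serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    mask = sliding_window_causal_mask(q.shape[2], window).to(q.device)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 8 query heads share 2 KV heads, each position sees at most 128 keys.
# (Mistral 7B itself uses 32 query heads, 8 KV heads, and a 4096-token window.)
q = torch.randn(1, 8, 512, 64)
k = torch.randn(1, 2, 512, 64)
v = torch.randn(1, 2, 512, 64)
print(grouped_query_attention(q, k, v, window=128).shape)  # torch.Size([1, 8, 512, 64])

With the window mask, the attention cost per token is bounded by the window size rather than the full sequence length, which is how SWA keeps inference cost down on long inputs; information can still propagate beyond the window through stacked layers.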