How Mixtral 8x7B Sets New Standards in Open-Source AI with Innovative Design

Mixtral 8x7B is the first mixture-of-experts network to reach state-of-the-art performance among open-source models, and its instruction-tuned variant, Mixtral 8x7B Instruct, outperforms Claude-2.1, Gemini Pro, and GPT-3.5 Turbo on human evaluation benchmarks. Because it routes each token through only two experts, Mixtral uses just 13B active parameters per token. The trained and fine-tuned models are released under the Apache 2.0 license to support new techniques and applications across a wide range of fields.
Abstract and 1. Introduction

2 Architectural details and 2.1 Sparse Mixture of Experts

3 Results

3.1 Multilingual benchmarks, 3.2 Long range performance, and 3.3 Bias Benchmarks

4 Instruction Fine-tuning

5 Routing analysis

6 Conclusion, Acknowledgements, and References

6 Conclusion

In this paper, we introduced Mixtral 8x7B, the first mixture-of-experts network to reach state-of-the-art performance among open-source models. Mixtral 8x7B Instruct outperforms Claude-2.1, Gemini Pro, and GPT-3.5 Turbo on human evaluation benchmarks. Because it uses only two experts at each time step, Mixtral uses only 13B active parameters per token while outperforming the previous best model, which uses 70B parameters per token (Llama 2 70B). We are making our trained and fine-tuned models publicly available under the Apache 2.0 license. By sharing our models, we aim to facilitate the development of new techniques and applications that can benefit a wide range of industries and domains.
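The "two experts at each time step" idea can be illustrated with a small sketch. The code below is not the released implementation. It assumes a router that produces one logit per expert, keeps the two largest, and normalizes them with a softmax so their weights sum to 1. The names `top2_route`, `moe_layer`, and the toy scaling "experts" are hypothetical and chosen only for illustration.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token."""
    # Rank expert indices by router logit, highest first, and keep two.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    # Softmax over the two selected logits only, so the weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, experts):
    """Weighted sum of the outputs of the two selected experts."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup (assumption): 8 experts, where expert k multiplies its input by k.
experts = [lambda x, k=k: [k * v for v in x] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 1.0]

routes = top2_route(logits)
output = moe_layer([1.0, 2.0], logits, experts)
print(routes)
print(output)
```

Only two of the eight expert functions run for the token, which is why the active parameter count per token is much smaller than the total parameter count.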

Figure 8: Text samples where each token is colored with the first expert choice. The selection of experts appears to be more aligned with syntax than with domain, especially at the initial and final layers.


Acknowledgements

We thank the CoreWeave and Scaleway teams for technical support as we trained our models. We are grateful to NVIDIA for supporting us in integrating TensorRT-LLM and Triton and working alongside us to make a sparse mixture of experts compatible with TensorRT-LLM.


Figure 10: Repeated consecutive assignments per MoE layer. Repeated assignments occur a lot more often than they would with uniform assignments (materialized by the dashed lines). Patterns are similar across datasets, with fewer repetitions for DM Mathematics.
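To make the uniform baseline in the caption concrete, the following sketch computes how often two consecutive tokens would share an expert if assignments were drawn uniformly at random, and measures the repetition rate of a sequence of first-choice assignments. It assumes 8 experts per layer. The function names and the toy sequence are hypothetical, and the exact metric plotted in the figure may differ.

```python
from math import comb

def uniform_repeat_rate(num_experts, k):
    """Chance that consecutive tokens share at least one expert when each
    token independently picks k of num_experts uniformly at random."""
    no_overlap = comb(num_experts - k, k) / comb(num_experts, k)
    return 1.0 - no_overlap

def observed_repeat_rate(assignments):
    """Fraction of consecutive token pairs with the same first-choice expert."""
    pairs = list(zip(assignments, assignments[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

first_choice_baseline = uniform_repeat_rate(8, 1)  # 1/8
top_two_baseline = uniform_repeat_rate(8, 2)       # 1 - 15/28

# Toy first-choice sequence (made up for illustration, not measured data).
toy_assignments = [3, 3, 3, 5, 5, 1, 1, 1, 1, 7]
toy_rate = observed_repeat_rate(toy_assignments)

print(first_choice_baseline, top_two_baseline, toy_rate)
```

A measured rate well above the first-choice baseline of 1/8 is the kind of gap the dashed lines in the figure are meant to expose.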

This paper is available on arXiv under a CC 4.0 license.


