Authors:
(1) Lewis Tunstall, Equal contribution and The H4 (Helpful, Honest, Harmless, Huggy) Team (email: [email protected]);
(2) Edward Beeching, Equal contribution and The H4 (Helpful, Honest, Harmless, Huggy) Team;
(3) Nathan Lambert, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(4) Nazneen Rajani, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(5) Kashif Rasul, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(6) Younes Belkada, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(7) Shengyi Huang, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(8) Leandro von Werra, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(9) Clementine Fourrier, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(10) Nathan Habib, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(11) Nathan Sarrazin, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(12) Omar Sanseviero, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(13) Alexander M. Rush, The H4 (Helpful, Honest, Harmless, Huggy) Team;
(14) Thomas Wolf, The H4 (Helpful, Honest, Harmless, Huggy) Team.
We conduct all of our fine-tuning experiments using Mistral 7B (Jiang et al., 2023), which is the current state-of-the-art base LM at the 7B parameter scale, and matches the performance of much larger models like LLaMa 34B on many NLP benchmarks. We use the Transformer Reinforcement Learning (TRL) library for fine-tuning (von Werra et al., 2020), in conjunction with DeepSpeed ZeRO3 (Rajbhandari et al., 2020) and FlashAttention-2 (Dao, 2023) to optimize memory and improve training speed. All models are trained with the AdamW optimizer and no weight decay. We did not experiment with parameter-efficient techniques such as LoRA (Hu et al., 2021), but expect similar results to hold with these methods. All experiments were run on 16 A100s using bfloat16 precision and typically took 2-4 hours to complete. For the full set of hyperparameters and instructions on how to train the models, see: https://github.com/huggingface/alignment-handbook.
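The fixed optimizer and hardware settings above can be summarized in a small configuration sketch; the key names here are illustrative stand-ins, not the actual TRL or alignment-handbook configuration fields:

```python
# Shared settings used for every fine-tuning run described above.
# Key names are illustrative; see the alignment-handbook repository
# for the exact configuration format.
COMMON_TRAIN_CONFIG = {
    "optimizer": "adamw",     # AdamW optimizer
    "weight_decay": 0.0,      # no weight decay
    "precision": "bfloat16",  # mixed-precision training
    "num_gpus": 16,           # 16x A100
}
```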
We focus on two dialogue datasets that have been distilled from a mix of open and proprietary models, and have previously been shown to produce strong chat models such as UltraLM (Ding et al., 2023):
• UltraChat (Ding et al., 2023) is a self-refinement dataset consisting of 1.47M multi-turn dialogues generated by GPT-3.5-TURBO over 30 topics and 20 different types of text material. We initially ran dSFT over the whole corpus, but found the resulting chat model had a tendency to respond with incorrect capitalization and would preface its answers with phrases such as “I don’t have personal experiences”, even for straightforward questions like “How do I clean my car?”. To handle these issues in the training data, we applied truecasing heuristics to fix the grammatical errors (approximately 5% of the dataset), as well as several filters to focus on helpfulness and remove the undesired model responses. The resulting dataset contains approximately 200k examples.
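As a rough illustration of this clean-up step, the sketch below applies a naive truecasing pass and a disclaimer filter. The patterns and function names are hypothetical and far simpler than the heuristics actually used:

```python
import re

# Illustrative disclaimer patterns in the spirit of the filtering above;
# the authors' exact filter list is not reproduced here.
DISCLAIMER_PATTERNS = [
    re.compile(r"^i don'?t have personal experiences", re.IGNORECASE),
    re.compile(r"^as an ai language model", re.IGNORECASE),
]

def truecase_sentence(text: str) -> str:
    """Naive truecasing: capitalize the first letter and standalone 'i'."""
    text = re.sub(r"\bi\b", "I", text)
    return text[:1].upper() + text[1:] if text else text

def keep_response(response: str) -> bool:
    """Drop responses that open with an unhelpful disclaimer."""
    return not any(p.match(response.strip()) for p in DISCLAIMER_PATTERNS)
```

A real truecaser would also handle proper nouns and sentence boundaries; this version only fixes the two error classes mentioned above.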
• UltraFeedback (Cui et al., 2023) consists of 64k prompts, each of which has four LLM responses that are rated by GPT-4 according to criteria like instruction-following, honesty, and helpfulness. We construct binary preferences from UltraFeedback by selecting the response with the highest mean score as the “chosen” response and one of the remaining three at random as “rejected”. We opted for random selection instead of selecting the lowest-scored response to encourage diversity and make the DPO objective more challenging. As noted above, this step is computed offline and does not involve any sampling from the reference model.
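The binarization step can be sketched as follows; the field names (`responses`, `mean_score`, `text`) are illustrative stand-ins for the UltraFeedback schema rather than its actual keys:

```python
import random

def binarize(example, rng=random.Random(0)):
    """Turn one prompt with four scored completions into a (chosen, rejected)
    preference pair: chosen = highest mean score, rejected = a random one of
    the remaining three, as described above. Field names are illustrative."""
    ranked = sorted(example["responses"],
                    key=lambda r: r["mean_score"], reverse=True)
    chosen = ranked[0]
    rejected = rng.choice(ranked[1:])  # random, not the lowest-scored
    return {"prompt": example["prompt"],
            "chosen": chosen["text"],
            "rejected": rejected["text"]}
```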
We make the pre-processed datasets available on the Hugging Face Hub.1
Our main evaluations are on single-turn and multi-turn chat benchmarks that measure a model’s ability to follow instructions and respond to challenging prompts across a diverse range of domains:
• MT-Bench (Zheng et al., 2023) is a multi-turn benchmark that consists of 160 questions across eight different areas of knowledge. In this benchmark, the model must answer an initial question and then provide a second response to a predefined follow-up question. Each model response is rated by GPT-4 on a scale of 1 to 10, with the final score given by the mean over the two turns.
• AlpacaEval (Li et al., 2023) is a single-turn benchmark where a model must generate a response to 805 questions on different topics, mostly focused on helpfulness. Models are also scored by GPT-4, but the final metric is the pairwise win-rate against a baseline model (text-davinci-003).
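The final AlpacaEval metric reduces to a simple proportion over the judge's per-question verdicts. Counting ties as half a win is an assumption in this sketch, since tie handling varies across evaluation setups:

```python
def win_rate(judgements):
    """Pairwise win-rate against the baseline model. `judgements` is a list
    of 'win'/'loss'/'tie' labels from the GPT-4 judge; ties are counted as
    half a win here (an assumption, not the official convention)."""
    wins = sum(j == "win" for j in judgements)
    ties = sum(j == "tie" for j in judgements)
    return (wins + 0.5 * ties) / len(judgements)
```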
We also evaluate ZEPHYR-7B on the Open LLM Leaderboard (Beeching et al., 2023), which measures the performance of LMs across four multiclass classification tasks: ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2021), and TruthfulQA (Lin et al., 2022). Although this leaderboard does not directly measure the conversational quality of chat models, it does provide a useful signal to validate whether fine-tuning has introduced regressions on the base model’s reasoning and truthfulness capabilities.
Across all benchmarks, we compare ZEPHYR-7B against a variety of open and proprietary models, each with different alignment procedures. To facilitate comparison across open model sizes, we group our comparisons in terms of 7B models (XWIN-LM (Team, 2023), MISTRAL-INSTRUCT (Jiang et al., 2023), MPT-CHAT (ML, 2023), and STABLELM-α), as well as larger models up to 70B parameters (LLAMA2-CHAT (Touvron et al., 2023), VICUNA (Chiang et al., 2023), WizardLM (Xu et al.), and GUANACO (Dettmers et al., 2023)). For the chat benchmarks, we also compare against proprietary models, including CLAUDE 2, GPT-3.5-TURBO and GPT-4 (OpenAI, 2023).
We train our SFT models for one to three epochs. We use a cosine learning rate scheduler with a peak learning rate of 2e-5 and 10% warmup steps. We train all models with a global batch size of 512 and use packing with a sequence length of 2048 tokens.
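The SFT learning-rate schedule above can be sketched in a few lines; decaying to zero and the exact warmup rounding are assumptions, since scheduler implementations differ in these details:

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Cosine schedule with linear warmup over the first 10% of steps,
    matching the SFT settings above. The zero floor after decay is an
    assumption; implementations vary."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```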
Similar to SFT, we train our DPO models for one to three epochs. We use a linear learning rate scheduler with a peak learning rate of 5e-7 and 10% warmup steps. We train all models with a global batch size of 32 and use β = 0.1 from Eq. (1) to control the deviation from the reference model. The final ZEPHYR-7B model was initialized from the SFT model that was trained for one epoch and further optimized for three DPO epochs (see Figure 3 for an epoch ablation on MT-Bench).
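The per-example DPO objective with β = 0.1 from Eq. (1) can be written out in scalar form as a sanity check (sequence log-probabilities are taken as given scalars here):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss:
    -log sigmoid(beta * [(logpi_c - logref_c) - (logpi_r - logref_r)]),
    where beta controls deviation from the reference model."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy and reference agree (margin 0), the loss is log 2; widening the chosen-vs-rejected margin drives it toward zero.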
This paper is available on arXiv under a CC 4.0 license.
1 https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-6538c6d6d5ddd1cbb1744a66