
Textbooks are All You Need: Model Architecture and Training

by Knapsack, September 12th, 2024

Too Long; Didn't Read

In this study, researchers from Microsoft introduce phi-1, a new large language model for code that is significantly smaller than competing models.

Authors:

(1) Suriya Gunasekar, Microsoft Research;

(2) Yi Zhang, Microsoft Research;

(3) Jyoti Aneja, Microsoft Research;

(4) Caio César Teodoro Mendes, Microsoft Research;

(5) Allie Del Giorno, Microsoft Research;

(6) Sivakanth Gopi, Microsoft Research;

(7) Mojan Javaheripi, Microsoft Research;

(8) Piero Kauffmann, Microsoft Research;

(9) Gustavo de Rosa, Microsoft Research;

(10) Olli Saarikivi, Microsoft Research;

(11) Adil Salim, Microsoft Research;

(12) Shital Shah, Microsoft Research;

(13) Harkirat Singh Behl, Microsoft Research;

(14) Xin Wang, Microsoft Research;

(15) Sébastien Bubeck, Microsoft Research;

(16) Ronen Eldan, Microsoft Research;

(17) Adam Tauman Kalai, Microsoft Research;

(18) Yin Tat Lee, Microsoft Research;

(19) Yuanzhi Li, Microsoft Research.

2.3 Model architecture and training

We use a decoder-only transformer [VSP+17] model with the FlashAttention implementation of multi-head attention (MHA) [DFE+22]. Following recent models such as CodeGen [NPH+22], PaLM [CND+22], and GPT-NeoX [BBH+22], we place the MHA and MLP layers in a parallel configuration. The architecture of our 1.3B parameter phi-1 model consists of 24 layers, hidden dimension 2048, MLP inner dimension 8192, and 32 attention heads of dimension 64 each. The smaller 350M parameter phi-1-small model consists of 20 layers, hidden dimension 1024, MLP inner dimension 4096, and 16 attention heads of dimension 64 each. We also use rotary position embeddings [SLP+21] with rotary dimension 32. These architectural choices were adopted from [NPH+22], and we use the same tokenizer as codegen-350M-mono [NPH+22]. Aside from FlashAttention, our models do not use other techniques such as Fill-In-the-Middle (FIM) [BJT+22] or Multi-Query Attention (MQA) [RSR+20] that could further boost performance and efficiency [LAZ+23].
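To make the "parallel configuration" concrete, here is a minimal sketch of a decoder block in which the attention and MLP branches read the same layer-normalized input and both write to the residual stream. This is not the authors' code: class and argument names are illustrative, and FlashAttention and rotary embeddings are replaced by plain torch.nn.MultiheadAttention for brevity.

```python
# Hedged sketch of a parallel attention/MLP block (CodeGen/PaLM/GPT-NeoX style).
# Not the authors' implementation; rotary embeddings, causal masking, and
# FlashAttention are omitted to keep the example short.
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, d_model=2048, n_heads=32, d_mlp=8192, dropout=0.1):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp),
            nn.GELU(),
            nn.Linear(d_mlp, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        # Parallel configuration: attention and MLP outputs are added to the
        # same residual stream rather than applied sequentially.
        return x + self.dropout(attn_out) + self.dropout(self.mlp(h))

# phi-1 uses 24 such layers with these dimensions; phi-1-small uses 20 layers,
# d_model=1024, d_mlp=4096, and 16 heads.
block = ParallelBlock()
y = block(torch.randn(1, 16, 2048))
print(y.shape)  # torch.Size([1, 16, 2048])
```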


For both pretraining and finetuning, we concatenate our respective datasets into a single one-dimensional array, with an “⟨∣endoftext∣⟩” token separating the files. We train our models on sequences of length 2048 sliced from this array, using a next-token prediction loss. We use fp16 training with the AdamW optimizer, a linear-warmup-linear-decay learning rate schedule, and attention and residual dropout of 0.1. We train on 8 Nvidia A100 GPUs using DeepSpeed. Our pretrained base model, phi-1-base, was obtained in under 4 days of training; finetuning it to obtain phi-1 used an additional 7 hours on the same hardware.
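The data packing and learning rate schedule described above can be sketched as follows, under stated assumptions: the end-of-text token id, the stand-in model, and the helper names are placeholders, not the paper's code, and the real setup tokenizes with codegen-350M-mono and trains in fp16 via DeepSpeed.

```python
# Hedged sketch of dataset packing and the linear-warmup-linear-decay schedule.
# EOT_ID, pack_files, and the stand-in model are illustrative placeholders.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

SEQ_LEN = 2048
EOT_ID = 0  # placeholder id for the end-of-text separator token

def pack_files(tokenized_files):
    """Concatenate tokenized files into one flat array, separated by the
    end-of-text token, then slice it into fixed-length training sequences."""
    flat = []
    for ids in tokenized_files:
        flat.extend(ids)
        flat.append(EOT_ID)
    n = (len(flat) // SEQ_LEN) * SEQ_LEN  # drop the trailing remainder
    return torch.tensor(flat[:n]).view(-1, SEQ_LEN)

def linear_warmup_linear_decay(warmup_steps, total_steps):
    """LR multiplier: ramp linearly to 1 over warmup, then decay linearly to 0."""
    def fn(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return fn

batches = pack_files([[5, 6, 7]] * 2000)   # toy corpus
print(batches.shape)                        # torch.Size([3, 2048])

model = torch.nn.Linear(8, 8)               # stand-in for the transformer
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
sched = LambdaLR(opt, linear_warmup_linear_decay(750, 36_000))
```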


Pretraining. phi-1-base was trained on the CodeTextbook dataset (the filtered code-language corpus plus synthetic textbooks). We use an effective batch size of 1024 (including data parallelism and gradient accumulation), a maximum learning rate of 1e-3 with warmup over 750 steps, and weight decay 0.1, for a total of 36,000 steps. We use the checkpoint at 24,000 steps as our phi-1-base; this is equivalent to ∼8 epochs on our CodeTextbook dataset, or a little over 50B training tokens. Despite its small size and modest compute, this model already achieves 29% accuracy on HumanEval.
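As a back-of-the-envelope check on the "little over 50B tokens" figure, the token count at the 24,000-step checkpoint follows directly from the quoted batch size and sequence length:

```python
# Tokens seen by the phi-1-base checkpoint: steps x effective batch size x
# sequence length. This is a sanity check, not a figure from the paper's code.
steps, batch, seq_len = 24_000, 1024, 2048
print(f"{steps * batch * seq_len / 1e9:.1f}B tokens")  # 50.3B tokens
```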


Finetuning. phi-1 is obtained by finetuning phi-1-base on the CodeExercises dataset. Finetuning uses the same setup as pretraining but different hyperparameters: an effective batch size of 256, a maximum learning rate of 1e-4 with 50 steps of warmup, and weight decay 0.01. We train for a total of 6,000 steps and pick the best checkpoint (saved every 1,000 steps).
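For quick reference, the pretraining and finetuning hyperparameters quoted in this section can be collected side by side; the key names below are illustrative, not taken from the authors' configuration files.

```python
# Hedged one-place summary of the hyperparameters stated in the text.
PRETRAIN = dict(effective_batch_size=1024, max_lr=1e-3, warmup_steps=750,
                weight_decay=0.1, total_steps=36_000, checkpoint_step=24_000)
FINETUNE = dict(effective_batch_size=256, max_lr=1e-4, warmup_steps=50,
                weight_decay=0.01, total_steps=6_000, save_every=1_000)
print(PRETRAIN, FINETUNE, sep="\n")
```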


This paper is available on arXiv under the CC BY 4.0 DEED license.