Authors: Mayank Mishra⋆, Matt Stallone⋆, Gaoyuan Zhang⋆, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri†, Rameswar Panda† (IBM)

Abstract

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases.
Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g., code generation, fixing, and explanation), making it a versatile "all around" code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use. https://github.com/ibm-granite/granite-code-models

1 Introduction

Over the last several decades, software has been woven into the fabric of every aspect of our society. As demand for software development surges, it is more critical than ever to increase software development productivity, and LLMs provide a promising path for augmenting human programmers. Prominent enterprise use cases for LLMs in software development include code generation, code explanation, code fixing, unit test and documentation generation, application modernization, vulnerability detection, code translation, and more. Recent years have seen rapid progress in LLMs' ability to generate and manipulate code, and a range of models with impressive coding abilities are available today. Models range in size from single-digit billions of parameters (e.g., Llama-7B (Touvron et al., 2023), Gemma-7B (Gemma-Team et al., 2024)) to hundreds of billions (e.g., DBRX (Databricks), Arctic (Snowflake), Grok, Mixtral 8x22B (MistralAI), Command R+ (Cohere)), and vary in the generality of intended use: some models aim to cover a range of uses beyond code, while others focus primarily on coding-related tasks (e.g., StarCoder (Li et al., 2023a; Lozhkov et al., 2024), CodeGen (Nijkamp et al., 2023), CodeLlama (Rozière et al., 2023), and CodeGemma (CodeGemma Team et al., 2024)).
However, there remain important gaps in the current field of LLMs for code, especially in the context of enterprise software development. First, while very large, generalist LLMs can achieve excellent coding performance, their size makes them expensive to deploy. Smaller code-focused models (Li et al., 2023a; Lozhkov et al., 2024; Nijkamp et al., 2023; Rozière et al., 2023; CodeGemma Team et al., 2024) can achieve excellent code generation performance in a smaller and more flexible package, but performance on coding tasks beyond generation (e.g., fixing and explanation) can lag behind code generation performance. In many enterprise contexts, code LLM adoption can be further complicated by factors beyond the performance of the models. For instance, even open models are sometimes plagued by a lack of transparency about the data sources and data processing methods that went into the model, which can erode trust in models in mission-critical and regulated contexts. Furthermore, license terms in today's open LLMs can encumber and complicate an enterprise's ability to use a model. Here, we present Granite Code models, a series of highly capable code LLMs designed to support enterprise software development across a wide range of coding tasks. Granite Code models have two main variants, which we release in four different sizes (3B, 8B, 20B, and 34B):

Granite Code Base: base foundation models for code-related tasks;

Granite Code Instruct: instruction-following models finetuned using a combination of Git commits paired with human instructions and open-source synthetically generated code instruction datasets.
The base models in the series have been trained from scratch with a two-phase training strategy. In phase 1, our models are trained on 3 to 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, our models are further trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural language domains to improve the models' ability to reason. We use the unsupervised language modeling objective to train the base models in both phases of training. The instruct models are derived by further finetuning the above base models on a combination of a filtered variant of CommitPack (Muennighoff et al., 2023), natural language instruction-following datasets (OASST (Köpf et al., 2023), HelpSteer (Wang et al., 2023)), and open-source math datasets (MathInstruct (Yue et al., 2023) and MetaMathQA (Yu et al., 2023)), including synthetically generated code datasets for improving instruction-following and reasoning capabilities. We conduct extensive evaluations of our code LLMs on a comprehensive set of benchmarks, including HumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu et al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more. This set of benchmarks encompasses many different kinds of coding tasks beyond just code synthesis in Python, e.g., code fixing, code explanation, code editing, code translation, etc., across most major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.).
Our findings reveal that among open-source models, the Granite Code models overall show very strong performance across all model sizes and benchmarks (often outperforming other open-source code models twice their size). As an illustration, figure 1 (top) shows a comparison of Granite-8B-Code-Base with other open-source base code LLMs, including recent high-performing general-purpose base LLMs like Mistral (Jiang et al., 2023b) and Llama-3 (AI@Meta, 2024), on HumanEvalPack (Muennighoff et al., 2023). While CodeGemma and StarCoder2 perform reasonably well in generating code, they perform significantly worse on the code fixing and explanation variants of HumanEvalPack. On average, Granite-8B-Code-Base outperforms the most competitive CodeGemma-8B model by almost 12 points on HumanEvalPack (33.2% vs 21.3%), despite being trained on significantly fewer tokens (4.5T vs 7.5T tokens). Besides base models, the instruction-tuned variants of our Granite Code models also show strong performance on HumanEvalPack, outperforming other open-source (code) instruction models and demonstrating benefits on a wider set of coding tasks with natural language instructions (see figure 1 (bottom)). Furthermore, since reasoning is critical for solving complicated questions and tasks, we also test our Granite-8B-Code-Base model on six mathematical benchmarks, including MATH (Cobbe et al., 2021), GSM8K (Cobbe et al., 2021), and problem solving with access to computational tools, where our Granite 8B model achieves better performance compared to most state-of-the-art 7B or 8B LLMs. For example, Granite-8B-Code-Base outperforms Llama-3-8B-Base by ∼12 points on GSM8K and ∼6 points on MATH (see table 15).
The key advantages of Granite Code models include:

All-rounder Code LLM: Granite Code models achieve competitive or state-of-the-art performance on different kinds of code-related tasks, including code generation, explanation, fixing, editing, translation, etc., demonstrating their ability to solve diverse coding tasks;

Trustworthy Enterprise-Grade LLM: All our models are trained on license-permissible data collected following IBM's AI Ethics principles1 and guided by IBM's Corporate Legal team for trustworthy enterprise usage. All the Granite Code models are released under the Apache 2.0 license.

We describe our entire data collection, filtering, and preprocessing pipeline in Section 2. Section 3 describes the details of model architecture, followed by training details in Section 4. Section 5 provides the details about instruction tuning, and Section 6 describes the experiments and results comparing Granite Code models with other open-source LLMs.

2 Data Collection

In this section, we describe the process of crawling and filtering (Sec. 2.1), deduplication (Sec. 2.2), and HAP/PII filtering (Sec. 2.3) used to prepare the code data for model training. We also provide an overview of the high-quality natural language data used to enhance the model's language understanding and mathematical reasoning skills.
2.1 Data Crawling and Filtering

The pretraining code data was sourced from a combination of publicly available datasets like Github Code Clean2, StarCoderdata3, and additional public code repositories and issues from GitHub. We filter the raw data to retain 116 programming languages out of 300+ languages, as listed in Appendix A. The assignment of data to programming languages is performed based solely on file extension, similar to StarCoder (Li et al., 2023a). After language filtering, we apply four key filtering rules to filter out lower-quality code (Li et al., 2023a): (1) remove files with fewer than 25% alphabetic characters, (2) except for the XSLT language, filter out files where the string "<?xml version=" appears within the first 100 characters, (3) for HTML files, only keep files where the visible text makes up at least 20% of the HTML code and has a minimum length of 100 characters, (4) for JSON and YAML files, only keep files that have a character count ranging from 50 to 5000 characters. We also filter GitHub issues using a set of quality metrics that include removing auto-generated text, filtering out non-English issues, excluding comments from bots, and using the number of users engaged in the conversation as an indicator of quality. We also annotate each code file with the license information of the respective repository, found via GitHub APIs, and only keep files with permissive licenses for model training.

2.2 Exact and Fuzzy Deduplication

We adopt an aggressive deduplication strategy, including both exact and fuzzy deduplication, to remove documents having (near-)identical code content from our training set. For exact deduplication, we first compute the SHA256 hash of the document content and remove records having identical hashes.
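The exact-deduplication step can be sketched as follows. This is an illustrative, in-memory Python sketch (the function name and set-based bookkeeping are our own simplification, not the production pipeline):

```python
import hashlib

def exact_dedup(documents):
    """Keep only the first occurrence of each distinct document,
    where identity is the SHA256 hash of the UTF-8 content."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

At scale, the same idea is typically run as a distributed group-by on the precomputed hashes rather than a single in-memory set.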
Following exact deduplication, we apply fuzzy deduplication with the goal of removing code files that may have slight variations, thereby further de-biasing the data. We apply a two-step method: (1) compute MinHashes of all documents and then use Locality-Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints, (2) measure the Jaccard similarity between each pair of documents in the same bucket and annotate all but one as duplicates based on a similarity threshold of 0.7. We apply this near-deduplication process to all programming languages, including GitHub issues, to enhance the richness and diversity of the training dataset.

2.3 HAP, PII, Malware Filtering

To reduce the likelihood of the models generating hateful, abusive, or profane (HAP) language, we make diligent efforts to filter HAP content from the training set. We first create a dictionary of HAP keywords and then annotate each code document with the number of occurrences of such keywords in the content, including comments. We filter out documents which exceed a HAP threshold, computed based on a distributional analysis as well as manual inspection of code files. Moreover, to protect privacy, we follow StarCoder (Li et al., 2023a) and make diligent efforts to redact Personally Identifiable Information (PII) from the training set. Specifically, we leverage the StarPII4 model to detect IP addresses, keys, email addresses, names, user names, and passwords found in the content. The PII redaction step replaces the PII text with the corresponding tokens <NAME>, <EMAIL>, <KEY>, <PASSWORD> and replaces IP addresses with synthetically generated ones, as in Li et al. (2023a). We also scan our datasets to identify and remove instances of malware in the source code.
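The replacement scheme of the redaction step can be illustrated with a toy sketch. Note this is a simplified, regex-based stand-in for the StarPII model (which is a trained NER detector); the regexes, function name, and synthetic address below are our own illustrative assumptions:

```python
import re

# Toy detectors standing in for the StarPII NER model.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(text, synthetic_ip="10.0.0.1"):
    """Replace detected PII with placeholder tokens; IP addresses are
    swapped for a synthetically generated address rather than a token."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub(synthetic_ip, text)
    return text

print(redact_pii("Contact admin@example.com at 192.168.0.17"))
# Contact <EMAIL> at 10.0.0.1
```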
2.4 Natural Language Datasets

In addition to collecting code data for model training, we curate several publicly available high-quality natural language datasets to improve the model's proficiency in language understanding and mathematical reasoning. Representative datasets in this category include web documents (Stackexchange, CommonCrawl), mathematical web text (OpenWebMath (Paster et al., 2023), StackMathQA (Zhang, 2024)), academic text (Arxiv, Wikipedia), and instruction tuning datasets (FLAN (Longpre et al., 2023), HelpSteer (Wang et al., 2023)). We do not deduplicate these already preprocessed natural language datasets.

3 Model Architecture

We train a series of code models of varying sizes based on the transformer decoder architecture (Vaswani et al., 2017). The model hyperparameters for these models are given in Table 1. For all model architectures, we use pre-normalization (Xiong et al., 2020): normalization applied to the input of attention and MLP blocks.

3B: The smallest model in the Granite Code model family is trained with RoPE embeddings (Su et al., 2023) and Multi-Head Attention (Vaswani et al., 2017). This model uses the swish activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) for the MLP, commonly referred to as swiglu. For normalization, we use RMSNorm (Zhang & Sennrich, 2019), since it is computationally more efficient than LayerNorm (Ba et al., 2016). The 3B model is trained with a context length of 2048 tokens.

8B: The 8B model has a similar architecture to the 3B model, with the exception of using Grouped-Query Attention (GQA) (Ainslie et al., 2023). Using GQA offers a better tradeoff between model performance and inference efficiency at this scale.
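The efficiency of GQA comes from sharing each key/value head across a group of query heads, shrinking the KV cache. A minimal numpy sketch of the idea (illustrative only; shapes, names, and the causal-mask construction are our own, not the training implementation):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    Each group of n_q_heads // n_kv_heads query heads shares one KV head."""
    n_q_heads, T, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    # Broadcast the shared KV heads so every query head has a KV pair.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.full((T, T), -np.inf), 1)
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With n_kv_heads = n_q_heads this reduces to standard Multi-Head Attention; with n_kv_heads = 1 it reduces to Multi-Query Attention (used in the 20B model below).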
We train the 8B model with a context length of 4096 tokens.

20B: The 20B code model is trained with learned absolute position embeddings. We use Multi-Query Attention (Shazeer, 2019) during training for efficient downstream inference. For the MLP block, we use the GELU activation function (Hendrycks & Gimpel, 2023). For normalizing the activations, we use LayerNorm (Ba et al., 2016). This model is trained with a context length of 8192 tokens.

34B: To train the 34B model, we follow the approach of Kim et al. for depth upscaling of the 20B model. Specifically, we first duplicate the 20B code model with its 52 layers, then remove the final 8 layers from the original model and the initial 8 layers from its duplicate to form two models. Finally, we concatenate both models to form the Granite-34B-Code model with 88 layers (see Figure 2 for an illustration). After depth upscaling, we observe that the drop in performance compared to the 20B model is quite small, contrary to what is observed by Kim et al., and this performance is recovered quickly once we continue pretraining the upscaled 34B model. Similar to the 20B model, we use an 8192-token context during pretraining.

4 Pretraining

In this section, we provide details on two-phase training (Sec. 4.1), training objectives (Sec. 4.2), optimization (Sec. 4.3), and infrastructure (Sec. 4.4) used in pretraining the models.

4.1 Two Phase Training

Granite Code models are trained on 3.5T to 4.5T tokens of code data and natural language datasets related to code. Data is tokenized via byte pair encoding (BPE; Sennrich et al., 2015), employing the same tokenizer as StarCoder (Li et al., 2023a). Following Shen et al. (2024) and Hu et al. (2024), we utilize high-quality data with two phases of training as follows.
• Phase 1 (code only training): During phase 1, both the 3B and 8B models are trained on 4 trillion tokens of code data comprising 116 languages. The 20B parameter model is trained on 3 trillion tokens of code. The 34B model is trained on 1.4T tokens after depth upscaling, which is done from the 1.6T checkpoint of the 20B model.

• Phase 2 (code + language training): In phase 2, we include additional high-quality publicly available data from various domains, including technical, mathematics, and web documents, to further improve the model's performance in reasoning and problem-solving skills, which are essential for code generation. We train all our models for 500B tokens (80% code and 20% language data) in phase 2 training.

4.2 Training Objective

For training all our models, we use both the causal language modeling (CLM) objective and the Fill-In-the-Middle (FIM) (Bavarian et al., 2022) objective. The FIM objective tasks the model with predicting inserted tokens given the surrounding context and subsequent text. We train our models to work with both PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle) modes, with the relevant formatting control tokens, same as StarCoder (Li et al., 2023a).

The overall loss is computed as a weighted combination of the two objectives:

L = α · L_CLM + (1 − α) · L_FIM

We empirically set α = 0.5 during training and find that this works well in practice, leading to state-of-the-art performance on both code completion and code infilling tasks. Note that the FIM objective is used only during pretraining; we drop it during instruction finetuning, i.e., we set α = 1.

4.3 Optimization

We use the AdamW optimizer (Kingma & Ba, 2017) with β1 = 0.9, β2 = 0.95, and weight decay of 0.1 for training all our Granite Code models.
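The AdamW update with these hyperparameters can be written out as a single-parameter sketch (the actual training uses Megatron's fused distributed optimizer; this scalar version only illustrates the decoupled weight-decay rule):

```python
import math

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               weight_decay=0.1, eps=1e-8):
    """One AdamW step on a scalar parameter: Adam moment updates plus
    decoupled weight decay applied directly to the parameter."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

Unlike L2 regularization folded into the gradient, the weight-decay term here scales the parameter itself, independent of the adaptive moments.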
For phase-1 pretraining, the learning rate follows a cosine schedule starting from 3×10⁻⁴, decaying to 3×10⁻⁵ with an initial linear warmup of 2k steps. For phase-2 pretraining, we start from 3×10⁻⁴ (1.5×10⁻⁴ for the 20B and 34B models) and adopt an exponential decay schedule to anneal the learning rate to 10% of its initial value. We use a batch size of 4M-5M tokens, depending on the model size, during both phases of pretraining. To accelerate training, we use FlashAttention 2 (Dao et al., 2022; Dao, 2023), the persistent layernorm kernel, the fused RMSNorm kernel (depending on the model), and the fused Adam kernel available in NVIDIA's Apex library. We use a custom fork of NVIDIA's Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) for distributed training of all our models. We train with a mix of 3D parallelism: tensor parallel, pipeline parallel, and data parallel. We also use sequence parallelism (Korthikanti et al., 2023) to reduce the activation memory consumption of large context lengths during training. We use Megatron's distributed optimizer with mixed precision training (Micikevicius et al., 2018) in BF16 (Kalamkar et al., 2019), with gradient all-reduce and gradient accumulation in FP32 for training stability.

4.4 Infrastructure

We train the Granite Code models using IBM's two supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. In the Vela A100 GPU cluster, each node has 2 Intel Xeon Scalable Processors and 8× 80GB A100 GPUs connected to each other by NVLink and NVSwitch. The Vela cluster adopts RoCE (RDMA over Converged Ethernet) and GDR (GPU-direct RDMA) for high-performance networking. Similarly, each node in the Blue Vela cluster consists of dual 48-core Intel processors with 8× 80GB H100 GPUs.
Blue Vela employs a 3.2Tbps InfiniBand interconnect to facilitate seamless communication between nodes, known for its high throughput and low latency. In addition, Blue Vela employs a separate, dedicated InfiniBand storage fabric providing 800Gbps per compute node, backed by multiple ESS6000 storage appliances. Both clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs. We estimate the carbon emissions from pretraining the Granite Code models to be 455 tCO2eq, computed based on the total energy usage of the models and the US national average carbon intensity factor of 0.423 kg CO2eq/kWh, without taking the location of the data centers into consideration. The Blue Vela cluster runs on 100% renewable energy to minimize the environmental impact.

5 Instruction Tuning

Finetuning code LLMs on a variety of tasks explained via instructions has been shown to improve model usability and general performance. While there has been much progress in code instruction tuning, most work adopts synthetically generated data from OpenAI models, which limits the model's use in many enterprise applications. Thus, following OctoCoder (Muennighoff et al., 2023), we use only a combination of permissively licensed data, with the aim of enhancing the instruction-following capabilities of our models, including logical reasoning and problem-solving skills. Specifically, Granite Code Instruct models are trained on the following types of data.

• Code Commits Dataset: CommitPackFT (Muennighoff et al., 2023), a filtered version of the full CommitPack dataset across 92 programming languages6;
• Math Datasets: MathInstruct7 (Yue et al., 2023) and MetaMathQA (Yu et al., 2023);

• Code Instruction Datasets: Glaive-Code-Assistant-v38, Self-OSS-Instruct-SC29, Glaive-Function-Calling-v210, NL2SQL11, and a few synthetically generated API calling datasets (Basu et al., 2024);

• Language Instruction Datasets: High-quality datasets like HelpSteer (Wang et al., 2023) and an open-license-filtered version of Platypus12 (Lee et al., 2023), including a collection of hardcoded prompts to ensure the model generates correct outputs given inquiries about its name or developers.

For training, we use a cosine scheduler with 250 warmup steps, an initial learning rate of 10⁻⁵, and train for three epochs. Further, we add random, uniform noise with a magnitude of 5/√(Nh), where N is the sequence length and h is the embedding dimension, to the embedding vectors, as proposed by Jain et al.. The additional noise improved the overall answer quality of the instruction model. We use FlashAttention 2 (Dao, 2023; Dao et al., 2022) with a Padding-Free Transformer13 implementation to reduce GPU memory usage and redundant FLOPs during finetuning.
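The embedding-noise scheme described above can be sketched in a few lines of numpy (an illustrative sketch; the function name and seeding are our own, and in training the noise is applied inside the forward pass rather than to a standalone matrix):

```python
import numpy as np

def add_embedding_noise(embeddings, alpha=5.0, seed=0):
    """Add uniform noise of magnitude alpha / sqrt(N * h) to an
    (N, h) matrix of token embeddings, where N is the sequence
    length and h is the embedding dimension."""
    N, h = embeddings.shape
    scale = alpha / np.sqrt(N * h)
    noise = np.random.default_rng(seed).uniform(-1.0, 1.0, size=(N, h)) * scale
    return embeddings + noise
```

Because the scale shrinks with both sequence length and embedding width, the perturbation stays small relative to the embeddings while still regularizing the finetuning.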
We also use full activation checkpointing (Korthikanti et al., 2023), which allows us to finetune our Granite-20B-Code models with 8K context length within a few hours on a single node of 8×A100 GPUs.

6 Evaluation

We evaluate Granite Code models on a wide variety of tasks, including code generation, code explanation, code fixing, code editing, math reasoning, etc., as shown in Table 2. We compare our models with several open-source code LLMs: StableCode (Pinnaparaju et al., 2024), Code Llama (Roziere et al., 2023), StarCoder (Li et al., 2023b), StarCoder2 (Lozhkov et al., 2024), and CodeGemma14, as well as recent high-performing general-purpose open LLMs like Mistral (Jiang et al., 2023a) and Llama-315. For all benchmarks, we evaluate the baseline models (including ours) using the same script and environment for a fair comparison.

6.1 Code Generation

6.1.1 HumanEvalSynthesize: Multilingual Code Generation in 6 Languages

While most prior code LLMs evaluate code generation capabilities only on Python using HumanEval (Chen et al., 2021), we adopt the challenging HumanEvalSynthesize (Muennighoff et al., 2023) benchmark in our study, which extends the Python problems of the HumanEval benchmark to five additional commonly used programming languages, namely JavaScript, Java, Go, C++, and Rust. We evaluate all models in a zero-shot manner using greedy decoding, with the completion format for base models and the instruction template for instruction-tuned models. In constructing prompts for instruction-tuned models, we adhere to the formats provided in their official examples, searching for a suitable prompt format in the HuggingFace model card, GitHub repository, and formal publications or technical reports.
Table 3 shows the results of base and instruct models on the HumanEvalSynthesize benchmark. Granite-3B-Code-Base is the best performing small model, with a +3% improvement over CodeGemma-2B. Overall, among base models, Granite Code models achieve the best average performance at the 7B-8B scale and the second-best average performance among 13B-20B size models, very close to the best model (falling behind StarCoder2-15B by only 0.1%). While CodeLlama-34B achieves a better score on HumanEval Python, Granite-34B-Code-Base achieves much better performance on the other languages, leading to a 4% improvement on average across the 6 languages. Among the instruct models, Granite Code models consistently outperform equivalently sized CodeLlama models; the 3B, 8B, and 20B models even outperform CodeLlama models twice their size. It is worth noting that even our smallest model, Granite-3B-Code-Instruct, surpasses the performance of CodeLlama-34B-Instruct. Further, Granite Code models outperform much larger state-of-the-art open-source general-purpose language models, including the Gemma, Mixtral, and Llama 3 series models. This shows that domain-specific code models can achieve better performance and efficiency, making them more suitable for cost- and performance-sensitive enterprise environments.

6.1.2 MultiPL-E: Multilingual Code Generation in 18 Languages

MultiPL-E (Cassano et al., 2023) is a canonical benchmark for evaluating code models on a more diverse set of 18 programming languages. On MultiPL-E, we compare all the base models across the 18 languages, sampling 50 completions per prompt at temperature 0.2 with top-p 0.95, as in Lozhkov et al. (2024). Table 4 shows the results of different models on MultiPL-E. As can be seen from the table, there is no single model that works best on every language across all model sizes.
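When n completions are sampled per prompt, pass@k is typically computed with the unbiased estimator of Chen et al. (2021); the report does not spell out its estimator, so the following is the standard formulation, not a quote from the paper:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): given n samples
    for a problem, c of which pass the tests, estimate the probability
    that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to the empirical pass rate c/n, e.g. 10 passing samples out of 50 gives pass@1 = 0.2; the benchmark score is this value averaged over all problems.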
In comparison to the similarly sized open-source model CodeLlama-7B, Granite-8B-Code-Base performs best on 16 of 18 programming languages. Of the medium models, StarCoder2-15B performs best. Among the large models, Granite-34B-Code-Base does better than CodeLlama-34B on most languages, demonstrating its effectiveness in code generation across a diverse set of languages.

6.1.3 MBPP and MBPP+: Code Generation in Python

MBPP (Austin et al., 2021) and MBPP+ (Liu et al., 2023a) are two of the most widely studied benchmarks for evaluating code models. While the prompt for each MBPP problem includes a natural language description followed by a few tests, MBPP+ consists of 35× more tests than the original benchmark. We use greedy decoding and report the mean pass@1 for all models. Table 5 summarizes the results of different base models. As we can see, Granite-3B-Code-Base significantly outperforms CodeGemma-2B but falls short of StarCoder2-3B on both benchmarks. At mid parameter ranges, Granite Code models beat CodeLlama-7B and CodeLlama-13B by margins of 5% and 15% on average, respectively. Additionally, Granite-34B-Code-Base is very competitive with CodeLlama-34B, with only a 0.9% difference on average across both benchmarks.

6.1.4 DS1000: Data Science Tasks in Python

DS-1000 (Lai et al., 2023) is a widely studied benchmark which offers a comprehensive collection of 1,000 data science workflows across seven different libraries, from Matplotlib to TensorFlow. We use temperature 0.2 and top-p 0.95 to generate 40 samples per library and report mean pass@1 in the code completion setting for all models up to 8B parameters. Results on DS-1000 are summarized in Table 7. Of the small models, StarCoder2-3B performs best. Granite-3B-Code-Base is in second place, outperforming CodeGemma-2B by more than 12 points on average across the 7 libraries.
Granite-8B-Code-Base achieves the best average performance of 34.5%, outperforming all other models of similar parameter size. The Granite Code models achieve relatively high accuracy across all sizes (e.g., outperforming CodeGemma at 2B-3B scale, StarCoder2 at 7B-8B scale, and CodeLlama models twice their size). This shows that our Granite Code models are not only capable of generating good code but also of using libraries more accurately in real data science workflows. 6.1.5 RepoBench, CrossCodeEval: Repository-Level Code Generation Code generation in practice often occurs within the context of a repository rather than in isolated files. Thus, we use RepoBench (Liu et al., 2023b) and CrossCodeEval (Ding et al., 2024) to evaluate the repository-level code completion capabilities of different models. On RepoBench, we evaluate using level 2k across three settings: cross-file-first (12,000 data points), cross-file-random (5,000 data points), and in-file (7,000 data points). We report the average edit similarity and exact match across the settings. Following Liu et al. (2023b), we set the generation temperature to 0.2 and the top-p sampling parameter to 0.95 for all models. We constrain the models to generate a maximum of 64 new tokens per prompt, and the first non-empty and non-comment line of the output is selected as the prediction. For CrossCodeEval, following Ding et al. (2024), we use a max sequence length of 2k with the retrieve-and-generate (RG) method and OpenAI's ada embedding. We set the maximum cross-file context to 512 tokens and the maximum generation length to 50 tokens for all the models. We use the uniform prompt formatting from the original implementation, with a temperature of 0.2 and top-p of 0.95 for all model generations.
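The edit similarity metric reported for RepoBench is a Levenshtein-based ratio between the predicted and reference lines; a minimal sketch of one common formulation (not necessarily the benchmark's exact implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(pred: str, ref: str) -> float:
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))
```

Unlike exact match, this rewards predictions that are close to the reference completion, which matters for single-line repository-level completions where minor formatting differences are common.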
The max sequence length was set to 8,192 for all models, with the exception of Granite-3B-Code-Base (2,048) and Granite-8B-Code-Base (4,096), given their respective context lengths. Table 6 shows the performance of different models on RepoBench v1.1. Granite-3B-Code-Base demonstrates notable performance among the smaller models, with StarCoderBase-3B achieving the leading performance metrics. Among the medium models, Granite-8B-Code-Base shows very strong performance on Java and ranks second best on Python, with CodeGemma-7B being the best performing on both metrics. Among larger models, Granite-20B-Code outperforms not only StarCoder2-15B but also CodeLlama-34B on all 4 metrics across both programming languages. This demonstrates the strong repository-level code generation capabilities of the Granite Code models, despite not being trained with repo-level file packing as in Lozhkov et al. (2024) and CodeGemma Team et al. (2024); we leave this as interesting future work to further improve the performance of our models. Results on CrossCodeEval are shown in Table 8. As can be seen from the table, among the similarly sized models, CodeGemma-7B is best on Python and TypeScript, while StarCoder2-7B performs best on Java and C#. Likewise, Granite-20B-Code-Base outperforms CodeLlama-13B on 3 programming languages (Python, Java, C#), while falling behind on TypeScript. Across all model sizes and programming languages, there is no single model that is best on all the metrics, similar to the findings on MultiPL-E. This indicates that achieving uniformly high performance across all programming languages remains challenging. 6.1.6 FIM: Infilling Evaluations Granite Code models are trained for code completion using the FIM objective, as described in Sec. 4.2. We use the SantaCoder-FIM benchmark (Allal et al., 2023) for infilling evaluations, which tests the ability of models to fill in a single line of code in Python, JavaScript, and Java solutions to HumanEval.
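For context, FIM prompts are commonly assembled in the prefix-suffix-middle (PSM) format popularized by SantaCoder/StarCoder-family models; the sentinel token names below follow that convention and are illustrative, since the actual special tokens are tokenizer-specific:

```python
def build_fim_prompt(prefix: str, suffix: str,
                     fim_prefix: str = "<fim_prefix>",
                     fim_suffix: str = "<fim_suffix>",
                     fim_middle: str = "<fim_middle>") -> str:
    """Assemble a prefix-suffix-middle (PSM) infilling prompt.
    The model generates the missing middle after the final sentinel."""
    return f"{fim_prefix}{prefix}{fim_suffix}{suffix}{fim_middle}"

# Ask the model to fill in the body of a one-line function
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(2, 3))",
)
```

Exact match against the reference middle span is then a simple string comparison after generation stops, which is why greedy decoding is a natural choice for this benchmark.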
We use greedy decoding and report the mean exact match for all the models. Table 9 shows that Granite Code models significantly outperform StarCoder and StarCoder2 across all model sizes, demonstrating them to be excellent well-rounded models for code completion use cases. Moreover, we observe no performance improvement when scaling model size from 8B to 34B, indicating that smaller models are often more suitable for FIM code completion tasks. 6.2 Code Explanation and Fixing While most prior code LLMs primarily evaluate performance on code generation benchmarks, users may want to use these models in other challenging scenarios beyond synthesis, such as explaining and fixing code. Thus, following Muennighoff et al. (2023), we test the performance of different code models on the code explanation and fixing variants of the HumanEvalPack benchmark, spanning 6 different programming languages. For both HumanEvalExplain and HumanEvalFix, we evaluate all models in a zero-shot manner using greedy decoding, with the completion format for the base models and the instruction template for the instruction-tuned models. The results on the HumanEvalExplain benchmark are shown in Table 10. Granite Code base models significantly outperform other SOTA base code LLMs, including StarCoder2 and CodeGemma, by a large margin. Interestingly, Granite-8B-Code-Base beats CodeLlama-34B by 9.3% on average, while being close to CodeLlama-70B. We attribute this performance to our data mixture and base model training decisions. After instruction tuning, the performance of all the base models improves significantly across languages. Among code instruct models, Granite-34B-Code-Instruct performs best, reaching an average score of 41.9%, which is very close to the 41.1% score of CodeLlama-70B-Instruct. Remarkably, CodeGemma-7B-IT gains the most from instruction tuning but still falls behind Granite-8B-Code-Instruct by 2.5% on average.
Mixtral-8x22B-Instruct-v0.1 performs best among all benchmarked models by a significant margin, indicating that bigger models and training on general natural language data could help on this task. Table 11 reports the results on HumanEvalFix. As with HumanEvalExplain, Granite Code base models significantly outperform other base models. Notably, Granite-8B-Code-Base again shows impressive performance, coming close to CodeLlama-70B and Llama-3-70B. After instruction tuning, we observe a performance improvement on almost all models. Notably, our 8B and 20B instruct models achieve the best performance among models with less than 34B parameters. However, we see a significant performance improvement (about 10 points) when moving to larger models with more than 34B parameters. Among large instruct models, Granite-34B-Code-Instruct performs similarly to other models with at least twice the parameters, thus offering a better cost-performance balance. Figure 3 compares the performance of Granite-8B-Code-Instruct with state-of-the-art open-source instruction-tuned general LLMs. Granite-8B-Code-Instruct consistently outperforms the compared models, emphasizing the need for domain-specific code models. To summarize, these results show that both our base and instruct models are not only capable of generating good code but also of fixing and explaining code, demonstrating their ability to solve diverse coding tasks in enterprise software development. 6.3 Code Editing and Translation CanItEdit is a recent benchmark designed to evaluate code LLMs on instructional code editing tasks. The benchmark contains 105 hand-crafted Python programs where each problem consists of a code snippet accompanied by an instruction of two types: descriptive or lazy. The goal is to modify the code according to the instruction; both lazy and descriptive instructions should lead to the same edit. Following Cassano et al.
(2024), we compare different instruction-tuned models using their corresponding instruction format, by random sampling with a temperature of 0.2 and a top-p of 0.95, with 20 completions per problem. From Table 12, we make the following observations on the performance of different models on the CanItEdit benchmark. Granite Code models have better pass rates, as well as fewer unnecessary code changes, compared to CodeGemma and CodeLlama. This result shows that Granite Code models can better understand users' intentions and make accurate changes to existing code in practical situations. CodeLingua (Pan et al., 2024) is a dataset designed to test a model's capabilities in code translation. It contains two sets of programs: one containing 251 programs in Java and 251 programs in Python sampled from Avatar (Ahmad et al., 2021), and one from CodeNet (Puri et al., 2021) containing 250 programs for each of five languages: C, C++, Go, Java, and Python. For each program, a set of unit tests in the form of inputs and expected outputs is provided. The task consists of translating each program from the source language to five target languages (the ones sampled from CodeNet). Pass@1 is used as the metric to evaluate translation accuracy. For every generation, we used greedy decoding and the suggested prompt format for each instruction-tuned model. For base models, or cases where the instruction format was not specified, we used the default prompt from the dataset. Basic post-processing is applied to each generation to remove generation artifacts such as repetition of the input instruction, source language code, target language name, and formatting tokens (``` , for example). Table 13 shows the results on the CodeLingua benchmark.
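The basic post-processing described above can be approximated by extracting the first fenced code block from a generation; a simplified, hypothetical sketch:

```python
import re

def extract_code(generation: str) -> str:
    """Return the first fenced code block if one is present,
    otherwise the generation with surrounding whitespace stripped."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", generation, re.DOTALL)
    if match:
        return match.group(1).strip()
    return generation.strip()

answer = "Here is the translation:\n```python\nprint('hi')\n```\nHope this helps!"
code = extract_code(answer)  # "print('hi')"
```

Models that wrap their answer in a well-formed fence are trivial to parse this way, while free-form answers that interleave prose and code need heavier heuristics, which is the difficulty noted below for some model families.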
For the source languages C, C++, and Go, the results reported in the table are taken directly from the runs on CodeNet, whereas for Java and Python the results are reported as the average of the runs on Avatar and CodeNet. We report the numbers for Octocoder and CodeLlama from the CodeLingua leaderboard. The Granite Code models perform comparably to CodeGemma. It is worth noting that the correctness of a translation depends not only on the code generated by the model, but also on the extra metadata and explanation provided as part of the answer. We tested instruction-tuned models, as we observed that base models often struggle to understand the translation request itself. Instruct models, on the other hand, tend to add additional information besides the translated code as part of their generations. The CodeLlama family seems to suffer especially from this issue, as post-processing the generations to extract only the relevant code constitutes a non-trivial task. The CodeGemma and Granite models, on the other hand, produce nicely formatted output that can be easily parsed. Interestingly, Go seems to be the hardest target language to translate to, while C is the source language with the highest translation success rate for the Granite models. 6.4 Code Reasoning, Understanding and Execution CRUXEval (Gu et al., 2024) is a benchmark of 800 Python functions and input-output pairs, consisting of two tasks: CRUXEval-I (input prediction) and CRUXEval-O (output prediction). We use temperature 0.2 to report pass@1 and temperature 0.8 to report pass@5, both using 10 samples, as in Lozhkov et al. (2024) and Gu et al. (2024). Table 14 shows that Granite Code models perform competitively with other models. Granite-3B-Code-Base outperforms CodeGemma-2B on CRUXEval-I but lags behind on CRUXEval-O. Interestingly, there is not a single model which performs consistently best at 3B parameters.
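To make the two task formats concrete, here is a constructed example in the style of CRUXEval (not an actual benchmark item):

```python
def f(s: str) -> str:
    return s.replace("a", "b") + str(len(s))

# CRUXEval-O (output prediction): given f and the input,
# the model must predict the output string.
assert f("banana") == "bbnbnb6"

# CRUXEval-I (input prediction): given f and a target output,
# the model must propose any input that produces it.
candidate_input = "ab"
assert f(candidate_input) == "bb2"
```

Both directions require simulating the program's execution rather than pattern-matching its surface form, which is why these tasks probe reasoning rather than generation.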
However, at 7B-8B parameters, CodeGemma-7B outperforms all the models on both tasks. Among the large models, Granite-34B-Code-Base lags behind CodeLlama-34B on CRUXEval-I but outperforms it on CRUXEval-O. Performance on both CRUXEval-I and CRUXEval-O increases as we scale the Granite Code models from 3B to 34B parameters, demonstrating the advantage of larger models for code reasoning and execution tasks. 6.5 Math Reasoning We use the following four widely used benchmarks to assess the mathematical reasoning capabilities of Granite-8B-Code-Base and various 7B-8B baseline models: MATH (Hendrycks et al., 2021), a dataset from high-school math competitions, with the 4-shot experiment setting from Gao et al. (2023); GSM8K (Cobbe et al., 2021), a dataset of middle-school level math word problems, with the 5-shot experiment setting from Gao et al. (2023); SAT (Azerbayev et al., 2023), a dataset consisting of the 32 math questions with no figures from the May 2023 College Board SAT examination, with the experiment setting from Azerbayev et al. (2023); and OCW (Lewkowycz et al., 2022), a collection of undergraduate-level STEM problems harvested from MIT's OpenCourseWare, with the 4-shot experiment setting from Azerbayev et al. (2023). Following Azerbayev et al. (2023), we also evaluate models on two tasks that involve solving problems with access to computational tools: MATH+Py, solving the MATH task by writing a Python program that uses built-in numeric operations, the math module, and SymPy, with the 5-shot prompt and experiment setting from Azerbayev et al. (2023); and GSM8K+Py, solving the GSM8K task by writing a Python program that executes to generate an integer answer, with the 8-shot prompt and experiment setting from Azerbayev et al. (2023). Table 15 summarizes the results. Despite not being specifically tuned for mathematical reasoning, Granite-8B-Code-Base shows impressive reasoning ability, outperforming most existing 7B to 8B models. While other models may be particularly strong on a few tasks, our model consistently achieves top-1 or top-2 performance on all tasks. 6.6 Calling Functions and Tools We adopt the Berkeley Function-Calling Leaderboard (BFCL) (Yan et al., 2024) to evaluate LLMs' ability to call functions and tools. BFCL is a function-calling dataset with 1,700 functions across 4 categories: simple, multiple, parallel, and parallel multiple function calls, each differing in the number of potential functions the model has access to and the number of output functions the model has to generate.
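An AST-based fuzzy comparison of a generated function call against a reference can be sketched roughly as follows; this is a simplification for illustration, not BFCL's actual checker:

```python
import ast

def calls_match(generated: str, reference: str) -> bool:
    """Loosely compare two single function calls: same function name,
    same positional arguments, and same keyword arguments, ignoring
    whitespace and keyword order."""
    def parse(src: str):
        call = ast.parse(src, mode="eval").body
        if not isinstance(call, ast.Call):
            raise ValueError("not a function call")
        name = ast.unparse(call.func)
        args = [ast.unparse(a) for a in call.args]
        kwargs = {kw.arg: ast.unparse(kw.value) for kw in call.keywords}
        return name, args, kwargs
    return parse(generated) == parse(reference)

# Differences in whitespace and keyword-argument order are tolerated
assert calls_match("get_weather(city='Paris', unit='C')",
                   "get_weather(unit='C',  city='Paris')")
```

Comparing parsed structure rather than raw strings is what makes the evaluation "fuzzy": two syntactically different but semantically identical calls count as a match.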
We use two popular methods to evaluate the accuracy of the model-generated answers: AST evaluation, an Abstract Syntax Tree (AST)-based metric for fuzzy evaluation of outputs, and Executable evaluation, which matches the outputs of model-generated and ground-truth functions. Figure 4 shows the results of different Granite Code models on the BFCL benchmark. As can be seen from the figure, overall accuracy improves from 25.65% for Granite-3B-Code-Base to 57.12% for Granite-34B-Code-Base, showing the effectiveness of model scaling on function (tool) calling capabilities. We also compare Granite-8B-Code with CodeLlama-7B in Figure 5 and find that Granite-8B-Code-Instruct beats CodeLlama-7B-Instruct by 22%, 14%, and 12% on AST Summary, Execution Summary, and Overall accuracy, respectively. Additionally, Figure 5 shows that instruction tuning consistently improves the performance of both base models, with more noticeable improvements for Granite Code models (e.g., +17.88% in overall accuracy from Granite-8B-Code-Base to Granite-8B-Code-Instruct), indicating the effectiveness of our well-curated data mixture for finetuning base models. 6.7 Model Robustness While performance on canonical code generation tasks is essential, we argue that evaluating practical robustness is also necessary to characterize different models systematically. We therefore benchmark the robustness of code synthesis, one of the most representative downstream tasks for source code. ReCode (Wang et al., 2022) provides 30 different general perturbations on docstrings, function names, and code to evaluate the robustness of code-generation models. We use the perturbed version of the HumanEval benchmark with greedy generation and 5 seeds, as recommended in Wang et al. (2022). Table 16 shows the worst-case RP@1 of different models for each perturbation category.
While Granite-3B-Code-Base consistently outperforms CodeGemma-2B, Granite-8B-Code-Base lags behind CodeGemma-7B on all categories. Granite Code models obtain much better performance than CodeLlama models, demonstrating robust generalization at every size. Our largest model, Granite-34B-Code-Base, consistently outperforms CodeLlama-34B on all four categories, indicating that it has more capacity to deal with unseen instances and perturbations. In general, we also observe higher RP@1 for larger models within the Granite Code family (e.g., improving from 40.1% for Granite-3B-Code-Base to 52.0% for Granite-34B-Code-Base on average across all perturbations), showing that larger models help improve worst-case robustness. 7 Conclusion We presented a family of decoder-only Granite Code models, ranging in size from 3 to 34 billion parameters, that are highly versatile in their ability to accomplish a wide range of tasks, from code generation to fixing bugs, explaining and documenting code, maintaining repositories, and more. These models have proven to be suitable for applications ranging from complex application modernization tasks (IBM, 2023) to on-device memory-constrained use cases. Extensive evaluation demonstrates that Granite Code models consistently reach state-of-the-art performance among open-source code LLMs, matching or exceeding the performance of the recently released CodeGemma, StarCoder2, and Llama 3 models on average across a variety of code-related tasks, including code generation, explanation, and bug fixing, in a range of popular programming languages. Our experience and results demonstrate that Granite Code models are well suited to handle the diverse tasks of enterprise software development workflows. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use. We plan to continuously release updates to these models to improve their performance, e.g.
leveraging the CodeNet instruction dataset (Puri et al., 2021), and in the near future we plan to release long-context as well as Python- and Java-specialized model variants. Acknowledgments We would like to acknowledge the efforts of numerous teams at IBM Research AI and Hybrid Cloud Platform, the IBM AI Infrastructure team, and the IBM WatsonX Code Assistant and platform team. Special thanks to IBM Research leaders Dario Gil, Sriram Raghavan, Mukesh Khare, Danny Barnett, Talia Gershon, Priya Nagpurkar, and Nicholas Fuller for their support. Thanks and acknowledgement to Trent Gray-Donald, Keri Olson, Alvin Tan, Hillery Hunter, Dakshi Agrawal, Xuan Liu, Mudhakar Srivatsa, Raghu Kiran Ganti, Carlos Costa, Darrell Reimer, Maja Vukovic, Dinesh Garg, Akash Srivastava, Abhishek Bhandwaldar, Aldo Pareja, Shiv Sudalairaj, Atin Sood, Sandeep Gopisetty, Nick Hill, Ray Rose, Tulio Coppola, Állysson Oliveira, Aadarsh Sahoo, Apoorve Mohan, Yuan Chi Chang, Jitendra Singh, Yuya Ong, Eric Butler, David Brotherton, Rakesh Mohan, David Kung, Dinesh Khandelwal, Naigang Wang, Nelson Mimura Gonzalez, Olivier Tardieu, Tuan Hoang Trong, Luis Angel Bathen, Kevin O'Connor, Christopher Laibinis, Tatsuhiro Chiba, Sunyanan Choochotkaew, Robert Walkup, Antoni Viros i Martin, Adnan Hoque, Davis Wertheimer, and Marquita Ellis. References Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. Avatar: A parallel corpus for Java-Python program translation. arXiv preprint arXiv:2108.11590, 2021. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. SantaCoder: don't reach for the stars! arXiv preprint arXiv:2301.03988, 2023. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, Soham Dan, Maxwell Crouse, Asim Munawar, Sadhana Kumaravel, Vinod Muthusamy, Pavan Kapanipathi, and Luis A Lastras. API-BLEND: A comprehensive corpora for training and benchmarking API LLMs. arXiv preprint arXiv:2402.15491, 2024. Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle, 2022. Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 2023. Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, and Arjun Guha. Can it edit? Evaluating the ability of large language models to follow code editing instructions, 2024.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. CodeGemma Team, Ale Jakse Hartman, Andrea Hu, Christopher A. Choquette-Choo, Heri Zhao, Jane Fine, Jeffrey Hui, Jingyue Shen, Joe Kelley, Joshua Howland, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Nam Nguyen, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Siqi Zuo, Tris Warkentin, Zhitao Gong, et al. CodeGemma: Open code models based on Gemma. 2024. URL https://goo.gle/codegemma. Cohere. Command R+. https://docs.cohere.com/docs/command-r-plus. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré.
FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 16344–16359. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf. Databricks. Introducing DBRX: A new state-of-the-art open LLM. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm. Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=wgDcbBMSfh. Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems, 36, 2024.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836. Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu-hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol
Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on Gemini research and technology, 2024. Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024. Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs), 2023. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021. Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024. IBM. watsonx Code Assistant, 2023. URL https://www.ibm.com/products/watsonx-code-assistant. Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. NEFTune: Noisy embeddings improve instruction finetuning, 2023. Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023a. Albert Q.
Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023b. Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A study of BFLOAT16 for deep learning training, 2019. Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, and Sunghun Kim. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling, 2024. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023. URL https://proceedings.mlsys.org/paper_files/paper/2023/hash/e851ca7b43815718fbbac8afb2246bf8-Abstract-mlsys2023.html. Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick.
Openassistant conversations – democratizing large language model alignment, 2023.
Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pp. 18319–18345. PMLR, 2023.
Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. 2023.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source be with you!, 2023a.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023b.
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=1qvx610Cu7.
Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023b.
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, pp. 22631–22648. PMLR, 2023.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1gs9JgRZ.
MistralAI. Mixtral 8x22b. URL https://mistral.ai/news/mixtral-8x22b/.
Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models, 2023.
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384421. doi: 10.1145/3458817.3476209. URL https://doi.org/10.1145/3458817.3476209.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis, 2023.
Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. Lost in translation: A study of bugs introduced by large language models while translating code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24. Association for Computing Machinery, 2024.
Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. Openwebmath: An open dataset of high-quality mathematical web text. arXiv preprint arXiv:2310.06786, 2023.
Nikhil Pinnaparaju, Reshinth Adithyan, Duy Phung, Jonathan Tow, James Baicoianu, Ashish Datta, Maksym Zhuravinskyi, Dakota Mahan, Marco Bellagente, Carlos Riquelme, et al. Stable code technical report. arXiv preprint arXiv:2404.01226, 2024.
Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, and Frederick Reiss. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks. NeurIPS, 2021.
Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions, 2017.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2023.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019.
Noam Shazeer. Glu variants improve transformer, 2020.
Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. Jetmoe: Reaching llama2 performance with 0.1m dollars. arXiv preprint arXiv:2404.07413, 2024.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
Snowflake. Snowflake arctic - llm for enterprise ai. URL https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/.
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Shiqi Wang, Li Zheng, Haifeng Qian, Chenghao Yang, Zijian Wang, Varun Kumar, Mingyue Shang, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, and Bing Xiang. Recode: Robustness evaluation of code generation models. 2022. doi: 10.48550/arXiv.2212.10264. URL https://arxiv.org/abs/2212.10264.
Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, and Oleksii Kuchaiev. Helpsteer: Multi-attribute helpfulness dataset for steerlm, 2023.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10524–10533. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/xiong20b.html.
Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
Biao Zhang and Rico Sennrich. Root mean square layer normalization, 2019.
Yifan Zhang. Stackmathqa: A curated collection of 2 million mathematical questions and answers sourced from stack exchange, 2024.
A Programming Languages

ABAP, Ada, Agda, Alloy, ANTLR, AppleScript, Arduino, ASP, Assembly, Augeas, Awk, Batchfile, Bison, Bluespec, C, C-sharp, C++, Clojure, CMake, COBOL, CoffeeScript, Common-Lisp, CSS, Cucumber, Cuda, Cython, Dart, Dockerfile, Eagle, Elixir, Elm, Emacs-Lisp, Erlang, F-sharp, FORTRAN, GLSL, GO, Gradle, GraphQL, Groovy, Haskell, Haxe, HCL, HTML, Idris, Isabelle, Java, Java-Server-Pages, JavaScript, JSON, JSON5, JSONiq, JSONLD, JSX, Julia, Jupyter, Kotlin, Lean, Literate-Agda, Literate-CoffeeScript, Literate-Haskell, Lua, Makefile, Maple, Markdown, Mathematica, Matlab, Objective-C++, OCaml, OpenCL, Pascal, Perl, PHP, PowerShell, Prolog, Protocol-Buffer, Python, Python-traceback, R, Racket, RDoc, Restructuredtext, RHTML, RMarkdown, Ruby, Rust, SAS, Scala, Scheme, Shell, Smalltalk, Solidity, SPARQL, SQL, Stan, Standard-ML, Stata, Swift, SystemVerilog, Tcl, Tcsh, Tex, Thrift, Twig, TypeScript, Verilog, VHDL, Visual-Basic, Vue, Web-Ontology-Language, WebAssembly, XML, XSLT, Yacc, YAML, Zig

This paper is available on arxiv under CC BY 4.0 Deed (Attribution 4.0 International) license.