Authors: Mayank Mishra⋆, Matt Stallone⋆, Gaoyuan Zhang⋆, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri†, Rameswar Panda† (all IBM)

Abstract

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs.
The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g. code generation, fixing, and explanation), making it a versatile "all around" code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use. https://github.com/ibm-granite/granite-code-models

1 Introduction

Over the last several decades, software has been woven into the fabric of every aspect of our society. As demand for software development surges, it is more critical than ever to increase software development productivity, and LLMs provide a promising path for augmenting human programmers. Prominent enterprise use cases for LLMs in software development productivity include code generation, code explanation, code fixing, unit test and documentation generation, application modernization, vulnerability detection, code translation, and more. Recent years have seen rapid progress in LLMs' ability to generate and manipulate code, and a range of models with impressive coding abilities are available today. Models range in size from single-digit billions of parameters (e.g. Llama-7B (Touvron et al., 2023), Gemma-7B (Gemma-Team et al., 2024), etc.) to hundreds of billions: DBRX (Databricks), Arctic (Snowflake), Grok, Mixtral 8x22B (MistralAI), Command R+ (Cohere). They also vary in the generality of intended use, with some models aiming to cover a range of uses outside of code, while others focus primarily on coding-related tasks (e.g. StarCoder (Li et al., 2023a; Lozhkov et al., 2024), CodeGen (Nijkamp et al., 2023), CodeLlama (Rozière et al., 2023), and CodeGemma (CodeGemma Team et al., 2024)). However, there remain important gaps in the current field of LLMs for code, especially in the context of enterprise software development. First, while very large, generalist LLMs can achieve excellent coding performance, their size makes them expensive to deploy.
Smaller code-focused models (Li et al., 2023a; Lozhkov et al., 2024; Nijkamp et al., 2023; Rozière et al., 2023; CodeGemma Team et al., 2024) can achieve excellent code generation performance in a smaller and more flexible package, but performance on coding tasks beyond generation (e.g. fixing and explanation) can lag behind code generation performance. In many enterprise contexts, code LLM adoption can be further complicated by factors beyond the performance of the models. For instance, even open models are sometimes plagued by a lack of transparency about the data sources and data processing methods that went into the model, which can erode trust in models in mission-critical and regulated contexts. Furthermore, license terms in today's open LLMs can encumber and complicate an enterprise's ability to use a model. Here, we present Granite Code models, a series of highly capable code LLMs designed to support enterprise software development across a wide range of coding tasks. Granite Code models come in two main variants, each released in four sizes (3B, 8B, 20B, and 34B): Granite Code Base, base foundation models for code-related tasks; and Granite Code Instruct, instruction-following models finetuned using a combination of Git commits paired with human instructions and open-source synthetically generated code instruction datasets. The base models in the series have been trained from scratch with a two-phase training strategy. In phase 1, our model is trained on 3 to 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, our model is further trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural language domains to improve the model's ability to reason. We use the unsupervised language modeling objective to train the base models in both phases of training.
The instruct models are derived by further finetuning the above trained base models on a combination of a filtered variant of CommitPack (Muennighoff et al., 2023), natural language instruction following datasets (OASST (Köpf et al., 2023), HelpSteer (Wang et al., 2023)), and open-source math datasets (MathInstruct (Yue et al., 2023) and MetaMathQA (Yu et al., 2023)), including synthetically generated code datasets for improving instruction following and reasoning capabilities. We conduct extensive evaluations of our code LLMs on a comprehensive set of benchmarks, including HumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu et al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more. This set of benchmarks encompasses many different kinds of coding tasks beyond just code synthesis in Python, e.g., code fixing, code explanation, code editing, code translation, etc., across most major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.). Our findings reveal that among open-source models, the Granite Code models overall show very strong performance across all model sizes and benchmarks (often outperforming other open-source code models twice their size). As an illustration, Figure 1 (top) shows a comparison of Granite-8B-Code-Base with other open-source base code LLMs, including recent high-performing general purpose base LLMs like Mistral (Jiang et al., 2023b) and Llama-3 (AI@Meta, 2024), on HumanEvalPack (Muennighoff et al., 2023). While CodeGemma and StarCoder2 perform reasonably well in generating code, they perform significantly worse on the code fixing and explanation variants of HumanEvalPack. On average, Granite-8B-Code-Base outperforms the most competitive CodeGemma-8B model by almost 12 points on HumanEvalPack (33.2% vs 21.3%), despite being trained on significantly fewer tokens (4.5T vs 7.5T tokens).
Besides base models, the instruction-tuned variants of our Granite Code models also show strong performance on HumanEvalPack, outperforming other open-source (code) instruction models and demonstrating benefits on a wider set of coding tasks with natural language instructions (see Figure 1 (bottom)). Furthermore, since reasoning is critical for solving complicated questions and tasks, we also test our Granite-8B-Code-Base model on six mathematical benchmarks, including MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021), as well as problem solving with access to computational tools, where our Granite 8B model achieves better performance compared to most state-of-the-art 7B or 8B LLMs. For example, Granite-8B-Code-Base outperforms Llama-3-8B-Base by ∼12 points on GSM8K and ∼6 points on MATH (see Table 15).

The key advantages of Granite Code models include:
• All-rounder Code LLM: Granite Code models achieve competitive or state-of-the-art performance on different kinds of code-related tasks, including code generation, explanation, fixing, editing, translation, etc., demonstrating their ability to solve diverse coding tasks;
• Trustworthy Enterprise-Grade LLM: All our models are trained on license-permissible data collected following IBM's AI Ethics principles and guided by IBM's Corporate Legal team for trustworthy enterprise usage. All the Granite Code models are released under the Apache 2.0 license.

We describe our entire data collection, filtering, and preprocessing pipeline in Section 2. Section 3 describes the details of the model architecture, followed by training details in Section 4. Section 5 provides the details about instruction tuning, and Section 6 describes the experiments and results comparing Granite Code models with other open-source LLMs.

2 Data Collection

In this section, we describe the process of crawling and filtering (Sec. 2.1), deduplication (Sec. 2.2), and HAP/PII filtering (Sec.
2.3) used to prepare the code data for model training. We also provide an overview of the high-quality natural language data used to enhance the model's language understanding and mathematical reasoning skills.

2.1 Data Crawling and Filtering

The pretraining code data was sourced from a combination of publicly available datasets like Github Code Clean and StarCoderdata, plus additional public code repositories and issues from GitHub. We filter the raw data to retain 116 programming languages out of 300+ languages, as listed in Appendix A. The assignment of data to programming languages is performed based solely on file extension, similar to StarCoder (Li et al., 2023a). After language filtering, we apply four key filtering rules to filter out lower-quality code (Li et al., 2023a): (1) remove files with fewer than 25% alphabetic characters; (2) except for the XSLT language, filter out files where the string "<?xml version=" appears within the first 100 characters; (3) for HTML files, only keep files where the visible text makes up at least 20% of the HTML code and has a minimum length of 100 characters; (4) for JSON and YAML files, only keep files that have a character count ranging from 50 to 5000 characters. We also filter GitHub issues using a set of quality metrics that include removing auto-generated text, filtering out non-English issues, excluding comments from bots, and using the number of users engaged in the conversation as an indicator of quality. We also annotate each code file with the license information associated with the respective repository, found via the GitHub APIs, and only keep files with permissive licenses for model training.

2.2 Exact and Near Deduplication

We adopt an aggressive deduplication strategy, including both exact and fuzzy deduplication, to remove documents having (near-)identical code content from our training set.
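The four quality filters of Sec. 2.1 can be sketched as follows. This is a hypothetical minimal implementation for illustration only: the extension mapping, the crude regex tag stripper, and the `keep_file` helper are assumptions, not IBM's production pipeline.

```python
import re

def keep_file(path: str, content: str) -> bool:
    """Apply the four quality filters from Sec. 2.1 to one code file."""
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    # (1) remove files with fewer than 25% alphabetic characters
    if content and sum(c.isalpha() for c in content) / len(content) < 0.25:
        return False
    # (2) except for XSLT, drop files with "<?xml version=" in the first 100 chars
    if ext not in ("xsl", "xslt") and "<?xml version=" in content[:100]:
        return False
    # (3) HTML: keep only if visible text is >= 20% of the code and >= 100 chars
    if ext in ("html", "htm"):
        visible = re.sub(r"<[^>]+>", "", content)  # crude tag stripper
        if len(visible) < 100 or len(visible) / max(len(content), 1) < 0.2:
            return False
    # (4) JSON/YAML: keep only files of 50 to 5000 characters
    if ext in ("json", "yaml", "yml") and not 50 <= len(content) <= 5000:
        return False
    return True
```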
For exact deduplication, we first compute the SHA256 hash of the document content and remove records having identical hashes. After exact deduplication, we apply fuzzy deduplication with the goal of removing code files that may have slight variations, thereby further debiasing the data. We apply a two-step method: (1) compute MinHashes of all documents and then use Locality-Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints; (2) measure the Jaccard similarity between each pair of documents in the same bucket and annotate all documents except one as duplicates based on a similarity threshold of 0.7. We apply this near-deduplication process to all programming languages, including GitHub issues, to enhance the richness and diversity of the training dataset.

2.3 HAP, PII, and Malware Filtering

To reduce the likelihood of generating hateful, abusive, or profane (HAP) language from the models, we make diligent efforts to filter HAP content from the training set. We first create a dictionary of HAP keywords and then annotate each code document with the number of occurrences of such keywords in its content, including comments. We filter out documents that exceed a HAP threshold computed from a distributional analysis as well as manual inspection of code files. Moreover, to protect privacy, we follow StarCoder (Li et al., 2023a) and make diligent efforts to redact Personally Identifiable Information (PII) from the training set. Specifically, we leverage the StarPII model to detect IP addresses, keys, email addresses, names, user names, and passwords found in the content. The PII redaction step replaces the PII text with the corresponding tokens ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩, and changes IP addresses to synthetically generated ones, as in Li et al. (2023a). We also scan our datasets to identify and remove instances of malware in the source code.
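The two-stage deduplication of Sec. 2.2 can be sketched as follows. The actual pipeline computes MinHash fingerprints and buckets them with LSH for scale; this illustrative sketch instead computes exact Jaccard similarity on word 3-gram shingles, which is equivalent in spirit but feasible only for small corpora.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Word n-gram shingle set used for Jaccard similarity."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def deduplicate(docs: list[str], threshold: float = 0.7) -> list[str]:
    kept, hashes, kept_shingles = [], set(), []
    for doc in docs:
        # stage 1: exact dedup via SHA256 hash of the document content
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h in hashes:
            continue
        # stage 2: fuzzy dedup, drop docs with Jaccard similarity >= threshold
        s = shingles(doc)
        is_dup = any(
            len(s & t) / len(s | t) >= threshold
            for t in kept_shingles if s | t
        )
        if not is_dup:
            hashes.add(h)
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

MinHash signatures approximate exactly this pairwise Jaccard computation with fixed-size fingerprints, and LSH restricts comparisons to documents that share a bucket, which is what makes the step tractable at trillion-token scale.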
2.4 Natural Language Datasets

In addition to collecting code data for model training, we curate several publicly available high-quality natural language datasets for improving the model's proficiency in language understanding and mathematical reasoning. Representative datasets in this category include web documents (Stackexchange, CommonCrawl), mathematical web text (OpenWebMath (Paster et al., 2023), StackMathQA (Zhang, 2024)), academic text (Arxiv, Wikipedia), and instruction tuning datasets (FLAN (Longpre et al., 2023), HelpSteer (Wang et al., 2023)). We do not deduplicate these already preprocessed natural language datasets.

3 Model Architecture

We train a series of code models of varying sizes based on the transformer decoder architecture (Vaswani et al., 2017). The model hyperparameters for these models are given in Table 1. For all model architectures, we use pre-normalization (Xiong et al., 2020): normalization applied to the input of the attention and MLP blocks.

3B: The smallest model in the Granite Code model family is trained with RoPE embeddings (Su et al., 2023) and Multi-Head Attention (Vaswani et al., 2017). This model uses the swish activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) for the MLP, a combination commonly referred to as swiglu. For normalization, we use RMSNorm (Zhang & Sennrich, 2019), since it is computationally more efficient than LayerNorm (Ba et al., 2016). The 3B model is trained with a context length of 2048 tokens.

8B: The 8B model has a similar architecture to the 3B model, with the exception of using Grouped-Query Attention (GQA) (Ainslie et al., 2023), which offers a better tradeoff between model performance and inference efficiency at this scale. We train the 8B model with a context length of 4096 tokens.

20B: The 20B code model is trained with learned absolute position embeddings.
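The swish-gated GLU (swiglu) MLP and RMSNorm used in the 3B and 8B models can be sketched in numpy as follows; the toy dimensions and the standalone formulation (no learned gains or biases) are illustrative assumptions, not the training code.

```python
import numpy as np

def swish(x):
    # swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def rmsnorm(x, eps=1e-6):
    # RMSNorm: rescale each vector to unit root-mean-square (learned gain omitted);
    # cheaper than LayerNorm since it skips the mean-centering step
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def swiglu_mlp(x, w_gate, w_up, w_down):
    # swiglu MLP: the up-projection is gated elementwise by a
    # swish-activated gate projection, then projected back down
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                               # toy sizes, not the 3B config
x = rmsnorm(rng.standard_normal((4, d_model)))      # 4 pre-normalized token vectors
w_gate = rng.standard_normal((d_model, d_ff)) * 0.1
w_up = rng.standard_normal((d_model, d_ff)) * 0.1
w_down = rng.standard_normal((d_ff, d_model)) * 0.1
out = swiglu_mlp(x, w_gate, w_up, w_down)           # shape (4, d_model)
```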
We use Multi-Query Attention (Shazeer, 2019) during training for efficient downstream inference. For the MLP block, we use the GELU activation function (Hendrycks & Gimpel, 2023). For normalizing the activations, we use LayerNorm (Ba et al., 2016). This model is trained with a context length of 8192 tokens.

34B: To train the 34B model, we follow the approach of Kim et al. for depth upscaling of the 20B model. Specifically, we first duplicate the 20B code model (52 layers), then remove the final 8 layers from the original model and the initial 8 layers from its duplicate, forming two 44-layer models. Finally, we concatenate both models to form the Granite-34B-Code model with 88 layers (see Figure 2 for an illustration). After the depth upscaling, we observe that the drop in performance compared to the 20B model is quite small, contrary to what is observed by Kim et al., and this performance is recovered quickly once we continue pretraining the upscaled 34B model. As with the 20B model, we use an 8192-token context during pretraining.

4 Pretraining

In this section, we provide details on the two-phase training (Sec. 4.1), training objectives (Sec. 4.2), optimization (Sec. 4.3), and infrastructure (Sec. 4.4) used in pretraining the models.

4.1 Two-Phase Training

Granite Code models are trained on 3.5T to 4.5T tokens of code data and natural language datasets related to code. Data is tokenized via byte pair encoding (BPE) (Sennrich et al., 2015), employing the same tokenizer as StarCoder (Li et al., 2023a). Following (Shen et al., 2024; Hu et al., 2024), we utilize high-quality data with two phases of training, as follows.

• Phase 1 (code-only training): During phase 1, both the 3B and 8B models are trained on 4 trillion tokens of code data comprising 116 languages. The 20B parameter model is trained on 3 trillion tokens of code. The 34B model is trained on 1.4T tokens after the depth upscaling, which is done on the 1.6T checkpoint of the 20B model.
• Phase 2 (code + language training): In phase 2, we include additional high-quality publicly available data from various domains, including technical, mathematics, and web documents, to further improve the model's performance in reasoning and problem-solving skills, which are essential for code generation. We train all our models on 500B tokens (80% code and 20% language data) in phase 2.

4.2 Training Objectives

For training all our models, we use the causal language modeling objective and the Fill-In-the-Middle (FIM) objective (Bavarian et al., 2022). The FIM objective tasks the model with predicting inserted tokens given the preceding and subsequent context. We train our models to work with both PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle) modes, with the relevant formatting control tokens, same as StarCoder (Li et al., 2023a). The overall loss is computed as a weighted combination of the two objectives:

L = α · L_CLM + (1 − α) · L_FIM

We empirically set α = 0.5 during training and find that this works well in practice, leading to state-of-the-art performance on both code completion and code infilling tasks. Note that the FIM objective is only used during pretraining; we drop it during instruction finetuning, i.e., we set α = 1.

4.3 Optimization

We use the AdamW optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.95, and a weight decay of 0.1 for training all our Granite Code models. For phase-1 pretraining, the learning rate follows a cosine schedule starting from 3×10⁻⁴ and decaying to 3×10⁻⁵, with an initial linear warmup of 2k steps. For phase-2 pretraining, we start from 3×10⁻⁴ (1.5×10⁻⁴ for the 20B and 34B models) and adopt an exponential decay schedule to anneal the learning rate to 10% of its initial value. We use a batch size of 4M-5M tokens, depending on the model size, during both phases of pretraining.
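The phase-1 schedule above (2k linear warmup steps, then cosine decay from 3×10⁻⁴ to 3×10⁻⁵) can be sketched as follows; the total step count is an assumption for illustration, since the text specifies the token budget rather than a step count.

```python
import math

def phase1_lr(step, peak=3e-4, floor=3e-5, warmup=2000, total=250_000):
    """Linear warmup to `peak`, then cosine decay to `floor` by `total` steps."""
    if step < warmup:
        return peak * step / warmup                    # linear warmup
    progress = (step - warmup) / (total - warmup)      # 0 at peak, 1 at the end
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

The learning rate rises to 3×10⁻⁴ at step 2000 and anneals to 3×10⁻⁵ by the final step.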
To accelerate training, we use FlashAttention 2 (Dao et al., 2022; Dao, 2023), a persistent LayerNorm kernel, a fused RMSNorm kernel (depending on the model), and the fused Adam kernel available in NVIDIA's Apex library. We use a custom fork of NVIDIA's Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) for distributed training of all our models. We train with a mix of 3D parallelism: tensor parallel, pipeline parallel, and data parallel. We also use sequence parallelism (Korthikanti et al., 2023) to reduce the activation memory consumption of large context lengths during training. We use Megatron's distributed optimizer with mixed precision training (Micikevicius et al., 2018) in BF16 (Kalamkar et al., 2019), with gradient all-reduce and gradient accumulation in FP32 for training stability.

4.4 Infrastructure

We train the Granite Code models using IBM's two supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. In the Vela A100 cluster, each node has 2 Intel Xeon Scalable processors and 8× 80GB A100 GPUs connected to each other by NVLink and NVSwitch. The Vela cluster adopts RoCE (RDMA over Converged Ethernet) and GDR (GPU-direct RDMA) for high-performance networking. Similarly, each node in the Blue Vela cluster consists of dual 48-core Intel processors with 8× 80GB H100 GPUs. Blue Vela employs a 3.2 Tbps InfiniBand interconnect, known for high throughput and low latency, to facilitate seamless communication between nodes. In addition, Blue Vela employs a separate, dedicated InfiniBand storage fabric providing 800 Gbps per compute node, backed by multiple ESS6000 storage appliances. Both clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.
We estimate the carbon emissions from pretraining the Granite Code models to be 455 tCO2eq, computed from the total energy used to train the models and the US national average carbon intensity factor of 0.423 kg CO2eq/kWh, without taking the location of the data centers into consideration. The Blue Vela cluster runs on 100% renewable energy to minimize the environmental impact.

5 Instruction Tuning

Finetuning code LLMs on a variety of tasks explained via instructions has been shown to improve model usability and general performance. While there has been much progress in code instruction tuning, most works adopt synthetically generated data from OpenAI models, which limits model use in many enterprise applications. Thus, following OctoCoder (Muennighoff et al., 2023), we use only a combination of permissively licensed data, with the aim of enhancing the instruction-following capabilities of our models, including logical reasoning and problem-solving skills. Specifically, Granite Code Instruct models are trained on the following types of data.

• Code Commits Dataset: CommitPackFT (Muennighoff et al., 2023), a filtered version of the full CommitPack dataset across 92 programming languages;
• Math Datasets: MathInstruct (Yue