Authors: Mayank Mishra⋆, Matt Stallone⋆, Gaoyuan Zhang⋆, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri†, Rameswar Panda† (all IBM)

Abstract

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases.
Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g., code generation, fixing, and explanation), making it a versatile all-around code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use. https://github.com/ibm-granite/granite-code-models

1 Introduction

Over the last decade, software has become woven into every aspect of our society. As the demand for software development grows, it is more critical than ever to increase software productivity, and LLMs provide a promising path for augmenting human programmers. Prominent enterprise use cases for LLMs in software development include code generation, code explanation, code fixing, unit test and documentation generation, application modernization, vulnerability detection, code translation, and more. Recent years have seen rapid progress in LLMs' ability to generate and manipulate code, and a range of models with impressive coding abilities is available today. Models range from a few billion parameters (e.g., Llama-7B (Touvron et al., 2023), Gemma-7B (Gemma-Team et al., 2024), etc.) to tens of billions: DBRX (Databricks), Arctic (Snowflake), Grok, Mixtral 8x22B (MistralAI), Command R+ (Cohere). They also vary in their intended scope: some aim to cover a range of applications beyond code, while others focus primarily on code-related tasks (e.g., StarCoder (Li et al., 2023a; Lozhkov et al., 2024), CodeGen (Nijkamp et al., 2023), CodeLlama (Rozière et al., 2023), and CodeGemma (CodeGemma Team et al., 2024)).

However, important challenges remain in today's landscape of code LLMs, especially in enterprise software development settings. First, while truly large, general-purpose LLMs can achieve strong coding performance, their size makes them expensive to deploy. Smaller code-focused models (Li et al., 2023a; Lozhkov et al., 2024; Nijkamp et al., 2023; Rozière et al., 2023; CodeGemma Team et al., 2024) can achieve excellent code generation performance at a smaller, more flexible size, but their performance on coding tasks beyond generation (e.g., fixing and explanation) can lag behind their code generation performance.

In many enterprise settings, the adoption of a code LLM can also be hindered by factors beyond model performance. For instance, open models sometimes suffer from a lack of transparency about the data sources and data processing methods used to build the model, which can diminish trust in the model in highly sensitive and regulated scenarios. Moreover, the licensing terms of today's open LLMs can also complicate and limit an enterprise's ability to use a model. Here, we introduce the Granite Code models, a series of highly capable code LLMs designed to support enterprise software development across a full range of coding tasks.
Granite Code models come in two main variants, released in four sizes (3B, 8B, 20B, and 34B):

Granite Code Base: base foundation models for code-related tasks;
Granite Code Instruct: instruction-following models fine-tuned using a combination of Git commits paired with human instructions and open-source synthetically generated code instruction data.

The base models in the series are trained from scratch with a two-phase training strategy. In phase 1, our models are trained on 3 to 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, our models are further trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural-language domains to improve the models' ability to reason. We use a language modeling objective to train the base models in both training phases. The instruct models are derived by further fine-tuning the base models on a combination of a filtered variant of CommitPack (Muennighoff et al., 2023), natural-language instruction datasets (OASST (Köpf et al., 2023), HelpSteer (Wang et al., 2023)), and open-source math datasets (MathInstruct (Yue et al., 2023) and MetaMathQA (Yu et al., 2023)), together with synthetically generated code instruction datasets to improve instruction-following and reasoning abilities.

We conduct extensive evaluations of our code models on a comprehensive set of benchmarks, including HumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu et al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more.
This benchmark suite encompasses many different kinds of coding tasks beyond just Python code generation, e.g., code fixing, code explanation, code editing, code translation, etc., across most major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.). Our findings reveal that, among open-source models, Granite Code models overall show very strong performance across all model sizes and benchmarks (often outperforming open code models twice the size of Granite). As an example, Figure 1 (top) shows a comparison of Granite-8B-Code-Base against other open-source base code LLMs, including recent high-performing general-purpose LLMs such as Mistral (Jiang et al., 2023b) and Llama-3 (AI@Meta, 2024), on HumanEvalPack (Muennighoff et al., 2023). While CodeGemma and StarCoder2 perform well at code generation, they perform significantly worse on the code fixing and code explanation variants of HumanEvalPack. On average, Granite-8B-Code-Base outperforms the highly competitive CodeGemma-8B model by almost 12 points on HumanEvalPack (33.2% vs. 21.3%), despite being trained on a significantly smaller number of tokens (4.5T vs. 7.5T tokens). Beyond the base models, the instruction-tuned variants of our Granite Code models also show strong performance on HumanEvalPack, outperforming other open-source (code) instruction models and demonstrating benefits across a wide variety of coding tasks specified with natural-language instructions (see Figure 1 (bottom)).

Furthermore, since reasoning is critical for solving complicated problems and tasks, we also evaluate our Granite-8B-Code-Base model on several math benchmarks, including MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and problem solving with access to computational tools, where our Granite 8B model achieves better performance than most state-of-the-art 7B or 8B LLMs. For example, Granite-8B-Code-Base outperforms Llama-3-8B-Base by a comparable margin on GSM8K and by 6 points on MATH (see Table 15).

The key advantages of Granite Code models include:

All-rounded Code LLM: Granite Code models achieve strong or state-of-the-art performance on all kinds of code-related tasks, including code generation, explanation, fixing, editing, translation, etc., demonstrating their ability to solve diverse coding tasks;

Trustworthy Enterprise-Grade LLM: All our models are trained on data collected under licenses that permit such use, following IBM's AI Ethics principles, and with guidance from IBM's Corporate Legal team for trustworthy enterprise usage. All Granite Code models are released under an Apache 2.0 license.

We describe our entire pipeline for data collection, filtering, and curation in Section 2. Section 3 presents details of the model architecture, followed by training details in Section 4. Section 5 provides details on instruction tuning, and Section 6 describes the experiments and results comparing Granite Code models with other open-source LLMs.

2 Data Collection

In this section, we describe the crawling and filtering process (Section 2.1), deduplication (Section 2.2), and HAP/PII filtering (Section 2.3) applied to prepare the code data for model training. We also provide an overview of the high-quality natural-language data used to improve the model's language understanding and mathematical reasoning abilities.

2.1 Data Crawling and Filtering

The pretraining code data was sourced from a combination of publicly available datasets such as Github Code Clean and StarCoderdata, together with additional public code repositories and issues from GitHub. We filter the raw data to retain a list of 116 programming languages out of the 300+ present, as listed in Appendix A. Data is assigned to programming languages based solely on file extension, as in StarCoder (Li et al., 2023a). After language filtering, we apply four key filtering rules to remove low-quality code (Li et al., 2023a): (1) remove files with fewer than 25% alphabetic characters; (2) except for the XSLT language, filter out files where the string "<?xml version=" appears within the first 100 characters; (3) for HTML files, keep only files where visible text makes up at least 20% of the HTML code and is at least 100 characters long; (4) for JSON and YAML files, keep only files with a character count between 50 and 5000. We also filter GitHub issues using a set of quality metrics that include removing auto-generated text, filtering out non-English issues, excluding comments from bots, and using the number of users engaged in the conversation as an indicator of quality. Finally, we annotate each code file with the license information of its associated repository, obtained via the Github APIs, and keep only files whose licenses permit model training.
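The four quality-filter rules above can be expressed as a single file-level predicate. The sketch below is an illustrative re-implementation with the thresholds taken from the text, not the authors' actual code; `extract_visible_text` is a naive, hypothetical stand-in for a real HTML text extractor.

```python
import re

def extract_visible_text(html: str) -> str:
    """Naive visible-text extraction: strip tags (illustrative only)."""
    return re.sub(r"<[^>]+>", "", html)

def passes_quality_filters(path: str, content: str) -> bool:
    """Apply the four low-quality-code filtering rules to one file."""
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""

    # (1) drop files with fewer than 25% alphabetic characters
    if content and sum(c.isalpha() for c in content) / len(content) < 0.25:
        return False

    # (2) except for XSLT, drop files with an XML header near the top
    if ext not in ("xsl", "xslt") and "<?xml version=" in content[:100]:
        return False

    # (3) HTML: keep only if visible text is >=20% of the code and >=100 chars
    if ext in ("html", "htm"):
        visible = extract_visible_text(content)
        if len(visible) < 100 or len(visible) < 0.2 * len(content):
            return False

    # (4) JSON/YAML: keep only files between 50 and 5000 characters
    if ext in ("json", "yaml", "yml") and not (50 <= len(content) <= 5000):
        return False

    return True
```

Note that the rules are cheap, purely lexical checks, which is what makes them practical to run over hundreds of millions of files before the more expensive deduplication and HAP/PII stages.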
2.2 Exact and Fuzzy Deduplication

We adopt an aggressive deduplication strategy to remove documents with identical or near-identical code content from our training set. For exact deduplication, we first compute the SHA256 hash of each document's content and remove records with identical hashes. After exact deduplication, we apply fuzzy deduplication with the goal of removing code files that may differ only by slight variations, thereby further de-biasing the data. We apply a two-step method: (1) compute MinHashes of all documents and then use Locality Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints; (2) measure the Jaccard similarity between each pair of documents in the same bucket and, using a similarity threshold of 0.7, mark all but one document in each group as duplicates. We apply this near-deduplication process across all programming languages, including GitHub issues, to enhance the richness and diversity of the training data.

2.3 HAP, PII, Malware Filtering

To reduce the likelihood of the models generating hateful, abusive, or profane (HAP) language, we make diligent efforts to filter HAP content from the training set. We first create a dictionary of HAP keywords and then annotate each code document with the number of occurrences of such keywords in the content, including comments. We filter out documents that exceed a HAP threshold, computed based on distribution analysis and manual inspection of code files. Furthermore, to protect privacy, we follow StarCoder (Li et al., 2023a) and make diligent efforts to redact Personally Identifiable Information (PII) from the training set. Specifically, we leverage the StarPII model to detect IP addresses, keys, email addresses, names, user names, and passwords found in the content.
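The exact and fuzzy deduplication pipeline of Section 2.2 can be sketched as follows. This is a simplified, dependency-free illustration rather than the production system: salted MD5 hashes stand in for true MinHash permutations, and the signature size (32) and band count (8) are arbitrary choices; only the 0.7 Jaccard threshold comes from the text.

```python
import hashlib
from itertools import combinations

def exact_key(text: str) -> str:
    """Exact-duplicate key: SHA256 of the raw file content."""
    return hashlib.sha256(text.encode()).hexdigest()

def shingles(text: str, k: int = 5) -> set:
    """Character k-gram shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(sh: set, num_perm: int = 32) -> list:
    """MinHash signature: minimum salted hash per 'permutation'."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh)
        for seed in range(num_perm)
    ]

def lsh_buckets(sigs: dict, bands: int = 8) -> dict:
    """Group documents whose signatures agree on any full band of rows."""
    rows = len(next(iter(sigs.values()))) // bands
    buckets = {}
    for doc, sig in sigs.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc)
    return buckets

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def near_duplicates(docs: dict, threshold: float = 0.7) -> set:
    """Return doc ids flagged as near-duplicates (all but one per group)."""
    sh = {d: shingles(t) for d, t in docs.items()}
    sigs = {d: minhash(s) for d, s in sh.items()}
    dups = set()
    for cand in lsh_buckets(sigs).values():
        for a, b in combinations(sorted(cand), 2):
            if jaccard(sh[a], sh[b]) >= threshold:
                dups.add(b)  # keep the first document, drop the rest
    return dups

# Example: two identical files and one unrelated file
docs = {
    "a.py": "def add(x, y):\n    return x + y\n" * 4,
    "b.py": "def add(x, y):\n    return x + y\n" * 4,
    "c.py": "for i in range(10):\n    print(i * i)\n" * 4,
}
dupes = near_duplicates(docs)
```

The LSH bucketing keeps the pairwise Jaccard comparisons tractable: only documents that collide in at least one band are compared, instead of all O(n²) pairs.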
The PII redaction step replaces detected PII text with the corresponding tokens <NAME>, <EMAIL>, <KEY>, <PASSWORD>, and replaces detected IP addresses with synthetically generated ones, as in Li et al. (2023a). We also scan our datasets to detect and remove instances of malware in the code content.

2.4 Natural Language Datasets

In addition to collecting code data for model training, we curate several publicly available, high-quality natural-language datasets to improve the model's proficiency in language understanding and mathematical reasoning. The datasets in this category include web documents (Stackexchange, CommonCrawl), mathematical web text (OpenWebMath (Paster et al., 2023); StackMathQA (Zhang, 2024)), academic text (Arxiv, Wikipedia), and instruction tuning datasets (FLAN (Longpre et al., 2023); HelpSteer (Wang et al., 2023)). We do not deduplicate these already-preprocessed natural-language datasets.

3 Model Architecture

We train a series of code models of varying sizes based on the transformer decoder architecture (Vaswani et al., 2017). The model hyperparameters for these models are given in Table 1. For all model architectures, we use pre-normalization (Xiong et al., 2020): normalization applied to the inputs of the attention and MLP blocks.

3B: The smallest model in the Granite Code family is trained with RoPE embeddings (Su et al., 2023) and Multi-Head Attention (Vaswani et al., 2017). The model uses the swish activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) for the MLP, also commonly referred to as swiglu. For normalization, we use RMSNorm (Zhang & Sennrich, 2019), since it is computationally more efficient than LayerNorm (Ba et al., 2016). The 3B model is trained with a context length of 2048 tokens.
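The two components named above, RMSNorm and the swiglu MLP, can be illustrated with a scalar-level sketch. Plain Python lists stand in for tensors here, and the weight shapes and function names are illustrative, not the paper's implementation.

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm: scale by the reciprocal root-mean-square.
    Unlike LayerNorm, there is no mean subtraction and no bias,
    which is what makes it computationally cheaper."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def swish(v):
    """swish / SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( swish(gate(x)) * up(x) ).
    Weight matrices are nested lists of rows (illustrative layout)."""
    gate = [swish(sum(wi * xi for wi, xi in zip(row, x))) for row in w_gate]
    up = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_up]
    hidden = [g * u for g, u in zip(gate, up)]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w_down]

# Tiny usage example with hand-picked weights
normed = rmsnorm([3.0, 4.0], [1.0, 1.0])
out = swiglu_mlp([1.0], w_gate=[[1.0]], w_up=[[2.0]], w_down=[[1.0]])
```

In pre-normalization (as used here), `rmsnorm` is applied to the block input before the attention and MLP sublayers, rather than after the residual addition.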
8B: The 8B model has a similar architecture to the 3B model, with the exception of using Grouped-Query Attention (GQA) (Ainslie et al., 2023). Using GQA offers a better tradeoff between model performance and inference efficiency at this scale.
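The idea behind GQA can be illustrated by how query heads map onto shared key/value heads, and by the resulting KV-cache savings at inference time. The head counts and cache formula below are illustrative assumptions, not Granite's actual configuration.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """In GQA, consecutive query heads share one key/value head.
    num_kv_heads == num_q_heads recovers MHA; num_kv_heads == 1 is MQA.
    Assumes num_q_heads is divisible by num_kv_heads."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """Per-sequence KV-cache footprint (K and V) in bytes.
    The footprint scales with the number of KV heads, which is the
    main inference saving from GQA over MHA."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elt

# Example: 32 query heads sharing 8 KV heads -> groups of 4
mapping = [kv_head_for_query_head(h, num_q_heads=32, num_kv_heads=8) for h in range(32)]
```

With 8 KV heads instead of 32, the KV cache shrinks by 4x while the number of query projections (and hence most of the model's expressiveness) is unchanged, which is the tradeoff the text refers to.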