Authors: Mayank Mishra⋆, Matt Stallone⋆, Gaoyuan Zhang⋆, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri†, Rameswar Panda† (all IBM)

Abstract

Large language models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generation tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device, memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family is optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g., code generation, fixing, and explanation), making it a versatile, all-around code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use. https://github.com/ibm-granite/granite-code-models

1 Introduction

Over the last decade, software has become part of nearly every aspect of our society. As demand for software development surges, it is more critical than ever to increase software development productivity, and LLMs offer a promising path for augmenting human programmers. Prominent enterprise use cases for LLMs in software development productivity include code generation, code explanation, code fixing, unit test and documentation generation, application modernization, vulnerability detection, code translation, and more. Recent years have seen rapid progress in LLMs' ability to generate and manipulate code, and a range of models with impressive coding abilities is available today. Models vary in size, from a few billion parameters (e.g., Llama-7B (Touvron et al., 2023), Gemma-7B (Gemma-Team et al., 2024), etc.) to hundreds of billions of parameters: DBRX (Databricks), Arctic (Snowflake), Grok, Mixtral 8x22B (MistralAI), and Command R+ (Cohere).
Models also vary in their intended scope: some aim to cover a broad range of uses beyond coding, while others focus primarily on code-related tasks (e.g., StarCoder (Li et al., 2023a; Lozhkov et al., 2024), CodeGen (Nijkamp et al., 2023), CodeLlama (Rozière et al., 2023), and CodeGemma (CodeGemma Team et al., 2024)).

However, important gaps remain in the current landscape of code LLMs, especially in the context of enterprise software development. First, while large, general-purpose LLMs achieve strong performance on coding tasks, their size makes them expensive to deploy. Smaller code-focused models (Li et al., 2023a; Lozhkov et al., 2024; Nijkamp et al., 2023; Rozière et al., 2023; CodeGemma Team et al., 2024) can achieve excellent code generation performance in a smaller, more flexible package, but their performance on coding tasks beyond generation (e.g., fixing and explanation) can lag behind their code generation performance.

In many enterprise settings, adoption of a code LLM can also be complicated by factors beyond raw model performance. For example, even open models are often opaque about the data sources and data processing methods that went into the model, and these unknowns can undermine trust in the models for critical vulnerability-sensitive and regulated scenarios. Furthermore, the licensing terms of today's open LLMs can restrict and complicate an enterprise's ability to use a model.

Here, we introduce the Granite Code models, a series of highly capable code LLMs designed to support enterprise software development across a wide range of coding tasks. The Granite Code models come in two variants, which we release in four sizes (3B, 8B, 20B, and 34B):

• Granite Code Base: base foundation models for code-related tasks;
• Granite Code Instruct: instruction-following models fine-tuned on data combining Git commits paired with human instructions and open-source code instruction datasets.

The base models in the series are trained from scratch with a two-phase training strategy. In phase 1, our models are trained on 3 to 4 trillion tokens drawn from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, our models are further trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural-language domains to improve the models' ability to reason. We use a causal language modeling objective to train the base models in both phases of training.
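Concretely, that objective is the standard next-token cross-entropy. As a point of reference, in our own notation (the text itself does not give one): for a token sequence x = (x_1, ..., x_T) drawn from the training corpus D,

\mathcal{L}_{\mathrm{CLM}}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}}\!\left[\frac{1}{T}\sum_{t=1}^{T}\log p_{\theta}\bigl(x_t \mid x_{<t}\bigr)\right],

minimized over the model parameters θ in both pretraining phases.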
The instruct models are derived by further fine-tuning the base models on a combination of a filtered variant of CommitPack (Muennighoff et al., 2023), natural-language instruction datasets (OASST (Köpf et al., 2023), HelpSteer (Wang et al., 2023)), and open-source math datasets (MathInstruct (Yue et al., 2023) and MetaMathQA (Yu et al., 2023)), including synthetically generated code instruction datasets, to improve instruction-following and reasoning capabilities.

We conduct extensive evaluations of our code LLMs on a comprehensive set of benchmarks, including HumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu et al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more. This set of benchmarks covers many kinds of coding tasks beyond code synthesis in Python, e.g., code fixing, code explanation, code editing, code translation, etc., across most major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.).

Our results show that, among open-source models, the Granite Code models overall deliver strong performance across all benchmarks and model sizes (sometimes matching open code models twice their size). As an example, Figure 1 (top) compares Granite-8B-Code-Base with other open base code LLMs, including recent high-performing general-purpose base LLMs such as Mistral (Jiang et al., 2023b) and Llama-3 (AI@Meta, 2024), on HumanEvalPack (Muennighoff et al., 2023). While CodeGemma and StarCoder2 perform well at code generation, they perform significantly worse on the code fixing and code explanation portions of HumanEvalPack. On average, Granite-8B-Code-Base outperforms CodeGemma-8B by almost 12 points on HumanEvalPack (33.2% vs. 21.3%), despite being trained on far fewer tokens (4.5T vs. 7.5T). Beyond the base models, our instruction-tuned Granite Code variants also show strong performance on HumanEvalPack, outperforming open instruction-tuned models of similar size and demonstrating gains on a wide variety of coding tasks as well as natural-language instruction following (see Figure 1 (bottom)).

Moreover, since reasoning is crucial for solving complicated problems and tasks, we also test Granite-8B-Code-Base on four math benchmarks, including MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and problem solving with access to computational tools, where our Granite 8B model achieves better performance than most state-of-the-art 7B or 8B LLMs. For example, Granite-8B-Code-Base outperforms Llama-3-8B-Base by 12 points on GSM8K and 6 points on MATH (see Table 15).
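The benchmark scores quoted above are pass rates; benchmarks in the HumanEval family conventionally report pass@k, with k = 1 in most of the comparisons here. As background on the metric (not code from this work), a minimal sketch of the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: number of sampled completions per problem,
    c: number of those completions that pass the unit tests,
    k: the k in pass@k.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 200 samples of which 37 pass:
# pass_at_k(200, 37, 1) == 0.185, i.e. c / n when k = 1
```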
The key advantages of the Granite Code models include:

• All-rounder Code LLM: Granite Code models achieve competitive or state-of-the-art performance on different kinds of code-related tasks, including code generation, explanation, fixing, editing, translation, and more, demonstrating their ability to solve diverse coding tasks.
• Trustworthy Enterprise-Grade LLM: All our models are trained on license-permissible data collected following IBM's AI Ethics principles and guided by IBM's Corporate Legal team for trustworthy enterprise usage. All Granite Code models are released under an Apache 2.0 license.

We describe our entire data collection, filtering, and preprocessing pipeline in Section 2. Section 3 describes the model architecture, followed by the training details in Section 4. Section 5 provides details about instruction tuning, and Section 6 describes the experiments and results comparing the Granite Code models with other open LLMs.

2 Data Collection

In this section, we describe the crawling and filtering process (Section 2.1), deduplication (Section 2.2), and HAP/PII filtering (Section 2.3) used to prepare the code data for model training. We also summarize the natural-language data used to improve the model's language understanding and mathematical reasoning capabilities.

2.1 Crawling and Filtering

The code data for pretraining was sourced from publicly available datasets such as GitHub Code Clean and StarCoderdata, together with additional public code repositories and issues from GitHub. We filter the raw data to retain 116 programming languages out of more than 300, as listed in Appendix A. Data is assigned to a programming language based solely on file extension, following StarCoder (Li et al., 2023a). After language filtering, we apply four key rules to filter out lower-quality code (Li et al., 2023a), as sketched in code after Section 2.2: (1) remove files with fewer than 25% alphabetic characters; (2) except for the XSLT language, remove files where the string "<?xml version=" appears within the first 100 characters; (3) for HTML files, keep only files whose visible text makes up at least 20% of the HTML code and is at least 100 characters long; (4) for JSON and YAML files, keep only files with a character count between 50 and 5,000.

We also filter GitHub issues using a set of quality heuristics that include removing auto-generated text, filtering out non-English issues, excluding comments from bots, and using the number of users engaged in the conversation as an indicator of quality. Finally, we annotate each code file with the license information of its repository, obtained via the GitHub APIs, and keep only files with licenses permissive enough for model training.

2.2 Exact and Fuzzy Deduplication

We adopt an aggressive deduplication strategy, combining exact and fuzzy deduplication, to remove documents with identical (or near-identical) code content from our training data. For exact deduplication, we first compute the SHA256 hash of each document's content and remove records with identical hashes. After exact deduplication, we apply fuzzy deduplication to remove code files that differ only slightly, thereby deduplicating the data further. We use a two-step method: (1) compute MinHashes of all documents and then use Locality Sensitive Hashing (LSH) to group documents by their MinHash fingerprints; (2) measure the Jaccard similarity between each pair of documents in the same bucket and mark all but one as duplicates when their similarity exceeds a threshold of 0.7. We apply this near-deduplication process to all programming languages, including GitHub issues, to enhance the richness and diversity of the training data.
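A minimal sketch of the four Section 2.1 heuristics referenced above. The function name, the extension handling, and the crude tag-stripping used to approximate "visible text" are our own illustrative choices; only the thresholds come from the text:

```python
import re

def keep_file(path: str, content: str) -> bool:
    """Return True if a code file survives the four Section 2.1 heuristics."""
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    # (1) drop files with fewer than 25% alphabetic characters
    if not content or sum(ch.isalpha() for ch in content) / len(content) < 0.25:
        return False
    # (2) outside XSLT, drop files with "<?xml version=" in the first 100 chars
    if ext != "xslt" and "<?xml version=" in content[:100]:
        return False
    # (3) HTML: keep only if visible text is >= 20% of the code and >= 100 chars
    if ext in ("html", "htm"):
        visible = re.sub(r"<[^>]*>", "", content)  # crude tag stripping, illustrative
        if len(visible) < 100 or len(visible) / len(content) < 0.20:
            return False
    # (4) JSON/YAML: keep only files between 50 and 5,000 characters
    if ext in ("json", "yaml", "yml") and not (50 <= len(content) <= 5000):
        return False
    return True
```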
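And a minimal sketch of the two-step near-deduplication of Section 2.2, here built on the open-source datasketch library (the text does not name its tooling). For brevity, the LSH index's query stands in for the explicit pairwise Jaccard check of step (2), greedily keeping one representative per near-duplicate cluster:

```python
import hashlib
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Fingerprint a document from its whitespace tokens (tokenization is illustrative)."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str], threshold: float = 0.7) -> set[str]:
    """Return the keys of documents to keep after exact + fuzzy deduplication."""
    # Exact dedup: drop records whose SHA256 content hash was already seen.
    seen, unique = set(), {}
    for key, text in docs.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique[key] = text
    # Fuzzy dedup: MinHash + LSH bucketing at the 0.7 similarity threshold.
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    keep = set()
    for key, text in unique.items():
        m = minhash(text)
        if not lsh.query(m):  # no near-duplicate already kept
            lsh.insert(key, m)
            keep.add(key)
    return keep
```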
2.3 HAP, PII, and Malware Filtering

To reduce the likelihood that the models generate hateful, abusive, or profane (HAP) language, we make diligent efforts to remove HAP content from the training data. We first create a dictionary of HAP keywords and then annotate each code document with the number of occurrences of such keywords in its content, including comments. We remove documents that exceed a HAP threshold, determined through distributional analysis as well as manual inspection of code files. Furthermore, to protect privacy, we follow StarCoder (Li et al., 2023a) and make diligent efforts to redact personally identifiable information (PII) from the training data. Specifically, we use the StarPII model to detect IP addresses, keys, emails, names, user names, and passwords found in the content. The PII redaction step replaces PII text with the corresponding tokens ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩ and replaces IP addresses with synthetically generated addresses, as in Li et al. (2023a). We also scan our datasets to detect and remove instances of malware in the source code.

2.4 Natural Language Datasets

In addition to collecting code data for model training, we curate several publicly available, high-quality natural-language datasets to improve the model's proficiency in language understanding and mathematical reasoning. Representative datasets in this category include web documents (StackExchange, CommonCrawl), mathematical web text (OpenWebMath (Paster et al., 2023), StackMathQA (Zhang, 2024)), academic text (Arxiv, Wikipedia), and instruction-tuning datasets (FLAN (Longpre et al., 2023), HelpSteer (Wang et al., 2023)). We do not deduplicate these already-preprocessed natural-language datasets.

3 Model Architecture

We train a series of code models of varying sizes based on the transformer decoder architecture (Vaswani et al., 2017). The hyperparameters of our models are given in Table 1. For all model architectures, we use pre-normalization (Xiong et al., 2020): normalization applied to the inputs of the attention and MLP blocks.

3B: The smallest model in the Granite Code family is trained with RoPE embeddings (Su et al., 2023) and Multi-Head Attention (Vaswani et al., 2017). The model uses the swish activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) for the MLP, commonly referred to as swiglu. For normalization, we use RMSNorm (Zhang & Sennrich, 2019), since it is computationally more efficient than LayerNorm (Ba et al., 2016). The 3B model is trained with a context length of 2048 tokens.

8B: The 8B model has an architecture similar to the 3B model, except that it uses Grouped-Query Attention (GQA) (Ainslie et al., 2023). Using GQA offers a better tradeoff between model performance and inference efficiency at this scale. We train the 8B model with a context length of 4096 tokens.

20B: The 20B code model is trained with learned absolute position embeddings. We use Multi-Query Attention (Shazeer, 2019).
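The family thus spans all three attention variants: Multi-Head (3B), Grouped-Query (8B), and Multi-Query (20B). A minimal sketch of how GQA generalizes the other two; the shapes are illustrative (not the Granite configurations), causal masking is omitted for brevity, and the PyTorch phrasing is ours:

```python
import torch

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor,
                            v: torch.Tensor) -> torch.Tensor:
    """q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).

    n_kv_heads == n_heads recovers Multi-Head Attention (the 3B model),
    1 < n_kv_heads < n_heads is Grouped-Query Attention (the 8B model),
    n_kv_heads == 1 is Multi-Query Attention (the 20B model).
    """
    _, n_heads, _, d = q.shape
    n_kv_heads = k.shape[1]
    assert n_heads % n_kv_heads == 0
    # Each group of (n_heads // n_kv_heads) query heads shares one KV head.
    k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)
    v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, n_heads, seq, seq)
    return torch.softmax(scores, dim=-1) @ v      # (batch, n_heads, seq, head_dim)
```

Shrinking n_kv_heads shrinks the KV cache proportionally, which is the inference-efficiency tradeoff cited above for the 8B and 20B models.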