Authors (all IBM): Mayank Mishra⋆, Matt Stallone⋆, Gaoyuan Zhang⋆, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri†, Rameswar Panda†

Abstract

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases.
Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family is optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g. code generation, fixing and explanation), making it a versatile "all around" code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use. https://github.com/ibm-granite/granite-code-models

1 Introduction

Over the last several decades, software has been woven into the fabric of every aspect of our society. As demand for software development grows, it is more important than ever to increase software development productivity, and LLMs provide a promising avenue for augmenting human programmers. Prominent enterprise use cases for LLMs in software development include code generation, code explanation, code fixing, unit test and documentation generation, application modernization, vulnerability detection, code translation, and more. Recent years have seen rapid progress in LLMs' ability to generate and manipulate code, and a range of models with impressive coding abilities is available today. These models range in size from a few billion parameters (e.g. Llama-7B (Touvron et al., 2023), Gemma-7B (Gemma-Team et al., 2024), etc.) to hundreds of billions (DBRX (Databricks), Arctic (Snowflake), Grok, Mixtral 8x22B (MistralAI), Command R+ (Cohere)) and differ in their suitability for deployment, while other models focus primarily on code-related tasks (e.g.
StarCoder (Li et al., 2023a; Lozhkov et al., 2024), CodeGen (Nijkamp et al., 2023), CodeLlama (Rozière et al., 2023), and CodeGemma (CodeGemma Team et al., 2024)). Nevertheless, important gaps remain in today's landscape of code LLMs, especially in the context of enterprise software development. First, while large, general-purpose LLMs can achieve excellent coding performance, their size makes them expensive to deploy. Smaller models focused primarily on code (Li et al., 2023a; Lozhkov et al., 2024; Nijkamp et al., 2023; Rozière et al., 2023; CodeGemma Team et al., 2024) can achieve excellent code generation performance in a smaller, more deployable package, but their performance on coding tasks other than generation (e.g. fixing and explanation) can lag behind their code generation performance. In many enterprise contexts, adoption of a code LLM may also be hindered by factors beyond model performance. For instance, even open models sometimes suffer from a lack of transparency about the data sources and data processing methods that went into a model, which can erode trust in critical and regulated settings. Moreover, license terms in today's open LLMs can complicate or preclude enterprise use of a model. Here, we introduce the Granite Code models, a series of highly capable code LLMs, built to support enterprise software development across a broad range of coding tasks.
The Granite Code models come in two main variants, which we release in four sizes (3B, 8B, 20B, and 34B):

- Granite Code Base: base foundation models for code-related tasks;
- Granite Code Instruct: instruction-following models fine-tuned on a combination of Git commits paired with human instructions and open-source synthetically generated code instruction datasets.

The base models of the series are trained from scratch with a two-phase training strategy. In phase 1, our models are trained on 3 to 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, our models are trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural-language domains to improve the models' ability to reason. We use a causal language modeling objective to train the base models in both phases of training. The Instruct models are derived by further finetuning the pretrained base models on a filtered variant of CommitPack (Muennighoff et al., 2023), natural-language instruction datasets (OASST (Köpf et al., 2023), HelpSteer (Wang et al., 2023)), and open-source math datasets (MathInstruct (Yue et al., 2023) and MetaMathQA (Yu et al., 2023)), including synthetically generated code instruction datasets, to improve instruction-following and reasoning capabilities.

We conduct an extensive evaluation of our code LLMs on a comprehensive set of benchmarks, including HumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu et al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more. This set of benchmarks covers many different kinds of coding tasks beyond code generation in Python, e.g. code fixing, code explanation, code editing, code translation, etc.
across most major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.).

Our evaluations reveal that, among open-source models, the Granite Code models overall show very strong performance across all model sizes and benchmarks (often outperforming other open-source code models twice their size). As an illustration, Figure 1 (top) shows a comparison of Granite-8B-Code-Base with other open-source code LLMs, including recent high-performing general-purpose LLMs like Mistral (Jiang et al., 2023b) and LLama-3 (AI@Meta, 2024), on HumanEvalPack (Muennighoff et al., 2023). While CodeGemma and StarCoder2 perform well at code generation, they perform significantly worse on the fixing and explanation variants of HumanEvalPack. On average, Granite-8B-Code-Base outperforms the larger CodeGemma-8B model by 12 points on HumanEvalPack (33.2% vs. 21.3%), despite being trained on far fewer tokens (4.5T vs. 7.5T tokens). Beyond the base models, the instruction-tuned variants of our Granite Code models also show strong performance on HumanEvalPack, outperforming other open-source (code) instruction models and showing benefits across a wide range of coding tasks as well as natural-language instructions (see Figure 1 (bottom)). Moreover, since reasoning is critical for solving complicated problems and tasks, we also evaluate our Granite-8B-Code-Base model on six mathematical benchmarks, including MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and problem solving with access to computational tools, where our Granite 8B model achieves better performance than the best-performing 7B or 8B code models.
For example, Granite-8B-Code-Base outperforms Llama-3-8B-Base by 12 points on GSM8K and by 6 points on MATH (see Table 15).

The key advantages of the Granite Code models include:

- All-around Code LLM: Granite Code models achieve competitive or state-of-the-art performance on different kinds of code-related tasks, including code generation, explanation, fixing, editing, translation, and more, demonstrating their ability to solve diverse coding tasks;
- Trustworthy Enterprise-Grade LLM: All our models are trained on license-permissible data collected following IBM's AI Ethics principles and guided by IBM's Corporate Legal team for trustworthy enterprise usage. All Granite Code models are released under an Apache 2.0 license.

The remainder of the paper is organized as follows. Section 2 describes our full process of data collection, filtering, and preprocessing. Section 3 describes the model architecture, followed by training details in Section 4. Section 5 provides details on instruction tuning, and Section 6 describes our experiments and results comparing the Granite Code models with other open LLMs.

2 Data Collection

In this section, we describe the process of crawling and filtering (Section 2.1), deduplication (Section 2.2), and HAP/PII filtering (Section 2.3) used to prepare the code data for model training. We also provide an overview of the high-quality natural-language data used to improve the model's language understanding and mathematical reasoning capabilities.

2.1 Data Crawling and Filtering

The code data for pretraining was sourced from a combination of publicly available datasets such as Github Code Clean and StarCoderdata, together with additional public code repositories and GitHub issues.
We filter the raw data to retain a list of 116 programming languages out of 300+ languages, as listed in Appendix A. The assignment of data to programming languages is performed based solely on file extension, similar to StarCoder (Li et al., 2023a). After language filtering, we apply four key filtering rules to filter out lower-quality code (Li et al., 2023a): (1) remove files with fewer than 25% alphabetic characters, (2) except for the XSLT language, filter out files where the string "<?xml version=" appears within the first 100 characters, (3) for HTML files, keep only files where the visible text makes up at least 20% of the HTML code and has a minimum length of 100 characters, and (4) for JSON and YAML files, keep only files that have between 50 and 5000 characters. We also filter GitHub issues using a set of quality metrics that include removing auto-generated text, filtering out non-English issues, excluding comments from bots, and using the number of users engaged in the conversation as an indicator of quality. We further annotate each code file with the license information associated with its repository, obtained via the GitHub APIs, and keep only files with permissive licenses for model training.

2.2 Exact and Fuzzy Deduplication

We adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents with (near-)identical code content from our training set. For exact deduplication, we first compute the SHA256 hash of each document's content and remove records with identical hashes. After exact deduplication, we apply fuzzy deduplication with the goal of removing code files that may differ only in slight variations, and thereby unbias the data.
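The four quality-filter rules described above can be expressed as simple per-file heuristics. A minimal Python sketch follows; the thresholds (25% alphabetic, 100-character XML window, 20% visible text, 50 to 5000 characters) come from the text, while the helper names and the crude HTML tag stripping are our own simplifications:

```python
import re

def alphabetic_fraction(text: str) -> float:
    """Fraction of characters in the file that are alphabetic."""
    if not text:
        return 0.0
    return sum(c.isalpha() for c in text) / len(text)

def keep_file(text: str, extension: str) -> bool:
    """Return True if a source file passes the four quality heuristics."""
    # Rule 1: drop files with fewer than 25% alphabetic characters.
    if alphabetic_fraction(text) < 0.25:
        return False
    # Rule 2: outside XSLT, drop files with an XML declaration
    # within the first 100 characters.
    if extension != "xslt" and "<?xml version=" in text[:100]:
        return False
    # Rule 3: for HTML, require visible text to be at least 20% of the
    # code and at least 100 characters (tag stripping is deliberately crude).
    if extension == "html":
        visible = re.sub(r"<[^>]*>", "", text)
        visible = re.sub(r"\s+", " ", visible).strip()
        if len(visible) < 100 or len(visible) < 0.2 * len(text):
            return False
    # Rule 4: for JSON/YAML, keep only files of 50-5000 characters.
    if extension in ("json", "yaml", "yml") and not (50 <= len(text) <= 5000):
        return False
    return True

print(keep_file('{"a": 1, "b": "hello world, this is a json file with text"}', "json"))  # True
print(keep_file("0101010101", "txt"))  # False: almost no alphabetic characters
```

In a real pipeline these checks would run per file over the crawled corpus, with the language/extension assignment performed beforehand as described above.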
For fuzzy deduplication, we use a two-step method: (1) compute MinHashes of all documents and then use Locality Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints, and (2) measure the Jaccard similarity between each pair of documents in the same bucket and annotate all but one of them as duplicates based on a similarity threshold of 0.7. We apply this near-deduplication process to all programming languages, including GitHub issues, to enhance the richness and diversity of the training set.

2.3 HAP, PII, Malware Filtering

To reduce the likelihood of the models generating hateful, abusive, or profane (HAP) language, we make dedicated efforts to filter HAP content from the training set. We first create a dictionary of HAP keywords and then annotate each code document with the number of occurrences of such keywords in its content, including comments. We filter out documents exceeding a HAP threshold, determined based on a large-scale analysis and manual inspection of code files. Furthermore, to protect privacy, we follow StarCoder (Li et al., 2023a) and make efforts to redact Personally Identifiable Information (PII) from the training set. Specifically, we use the StarPII model to detect IP addresses, keys, email addresses, names, usernames, and passwords in the content. PII redaction replaces detected PII with the placeholder tokens NAME, EMAIL, KEY, PASSWORD, and replaces an IP address with a synthetically generated IP address, as in Li et al. (2023a). We also scan our datasets to identify and remove instances of malware in the source code.

2.4 Natural Language Datasets

In addition to collecting the code data for model training, we curate several publicly available high-quality natural-language datasets to improve the model's proficiency in language understanding and mathematical reasoning.
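The two-stage deduplication described above (exact SHA256 matching, then MinHash/LSH candidate generation followed by a true Jaccard check) can be sketched in pure Python. The 0.7 Jaccard threshold is from the text; the number of permutations, the word-shingle unit, and the banding scheme are illustrative assumptions, not the paper's configuration:

```python
import hashlib
import re
from collections import defaultdict

NUM_PERM = 128   # hash permutations per MinHash signature (assumed)
BANDS = 32       # LSH bands; rows per band = NUM_PERM // BANDS (assumed)

def shingles(text, k=5):
    """Set of k-word shingles used as the unit of similarity."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text):
    """MinHash signature: per permutation, the minimum hash over shingles."""
    sig = []
    for p in range(NUM_PERM):
        seed = str(p).encode()
        sig.append(min(
            int.from_bytes(hashlib.sha256(seed + s.encode()).digest()[:8], "big")
            for s in shingles(text)))
    return tuple(sig)

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def deduplicate(docs, threshold=0.7):
    """Return indices of documents kept after exact + fuzzy dedup."""
    # Stage 1: exact dedup on the SHA256 hash of the content.
    seen, survivors = set(), []
    for i, d in enumerate(docs):
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            survivors.append(i)
    # Stage 2: LSH buckets documents by bands of the MinHash signature;
    # pairs sharing a bucket are verified with true Jaccard similarity.
    sigs = {i: minhash(docs[i]) for i in survivors}
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for i in survivors:
        for b in range(BANDS):
            buckets[(b, sigs[i][b * rows:(b + 1) * rows])].append(i)
    dropped = set()
    for members in buckets.values():
        for j in members[1:]:
            if j in dropped:
                continue
            if any(k not in dropped and jaccard(docs[k], docs[j]) >= threshold
                   for k in members if k < j):
                dropped.add(j)
    return [i for i in survivors if i not in dropped]

a = "the quick brown fox jumps over the lazy dog and runs far away"
b = a + " today"
c = "completely unrelated text about pretraining language models on code"
print(deduplicate([a, a, b, c]))  # exact duplicate of a and near-duplicate b are removed
```

Production deduplication pipelines typically use an optimized MinHash/LSH library rather than per-shingle SHA256 hashing, but the bucketing-then-verification structure is the same.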
Notable datasets under this category include web documents (Stackexchange, CommonCrawl), mathematical web text (OpenWeb-Math (Paster et al., 2023); StackMathQA (Zhang, 2024)), academic text (Arxiv, Wikipedia), and instruction tuning datasets (FLAN (Longpre et al., 2023), HelpSteer (Wang et al., 2023)). We do not deduplicate these already-preprocessed natural-language datasets.

3 Model Architecture

We train a series of code models of varying sizes based on the transformer decoder architecture (Vaswani et al., 2017). The model hyperparameters for these models are given in Table 1. For all model architectures, we use pre-normalization (Xiong et al., 2020): normalization applied to the input of the attention and MLP blocks.

3B: The smallest model in the Granite-code model family is trained with RoPE embeddings (Su et al., 2023) and Multi-Head Attention (Vaswani et al., 2017). This model uses the swish activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) for the MLP, a combination also commonly referred to as swiglu. For normalization, we use RMSNorm (Zhang & Sennrich, 2019), since it is computationally more efficient than LayerNorm (Ba et al., 2016). The 3B model is trained with a context length of 2048 tokens.

8B: The 8B model has a similar architecture to the 3B model, except that it uses Grouped-Query Attention (GQA) (Ainslie et al., 2023). Using GQA offers a better tradeoff between model performance and inference efficiency at this scale. We train the 8B model with a context length of 4096 tokens.

20B: The 20B code model is trained with learned absolute position embeddings. We use Multi-Query Attention (Shazeer, 2019) during training for efficient
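The attention variants used across the family (Multi-Head, Grouped-Query, Multi-Query) differ only in how many key/value heads the query heads share, which is what drives the inference-efficiency tradeoff mentioned above: fewer KV heads means a smaller KV cache. A toy numpy sketch, where the head counts and dimensions are illustrative and not the Granite configurations:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention for one head: (T, d) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

def multi_head(q_heads, k_heads, v_heads):
    """q_heads: (H, T, d); k_heads/v_heads: (H_kv, T, d) with H_kv dividing H.

    H_kv == H     -> Multi-Head Attention (one KV head per query head)
    1 < H_kv < H  -> Grouped-Query Attention (KV head shared per group)
    H_kv == 1     -> Multi-Query Attention (single KV head for all queries)
    """
    H, H_kv = q_heads.shape[0], k_heads.shape[0]
    group = H // H_kv  # number of query heads sharing each KV head
    outs = [attention(q_heads[h], k_heads[h // group], v_heads[h // group])
            for h in range(H)]
    return np.stack(outs)

rng = np.random.default_rng(0)
T, d, H = 6, 8, 8                  # toy sequence length, head dim, query heads
q = rng.normal(size=(H, T, d))
for h_kv in (8, 2, 1):             # MHA, GQA, MQA
    k = rng.normal(size=(h_kv, T, d))
    v = rng.normal(size=(h_kv, T, d))
    out = multi_head(q, k, v)
    kv_entries = 2 * h_kv * T * d  # KV-cache entries for this toy sequence
    print(f"H_kv={h_kv}: output {out.shape}, KV cache entries {kv_entries}")
```

The output shape is identical in all three cases; only the stored keys and values shrink, which is why GQA and MQA reduce inference memory without changing the interface of the attention block.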