Authors: Mayank Mishra⋆, Matt Stallone⋆, Gaoyuan Zhang⋆, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parmeswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri†, Rameswar Panda† (all IBM)

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve programmer productivity, and LLM-based agents are beginning to show promise in handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generation tasks, trained with code written in 116 programming languages. The Granite Code model family includes models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases.
A comprehensive evaluation on a wide range of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g., code generation, fixing and explanation), making it a versatile "all-around" code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use. https://github.com/ibm-granite/granite-code-models

1 Introduction

Over the last few decades, software has been woven into every aspect of our lives. As the demand for software development grows, it has become ever more important to increase software development productivity, and LLMs provide a promising path to augmenting programmers. Prominent enterprise use cases for LLMs in software development include code generation, code explanation, code fixing, unit test and documentation generation, application modernization, vulnerability detection, code translation, and more. The past five years have seen rapid progress in LLMs' ability to generate and manipulate code, and a range of models with impressive coding abilities are available today. Models range from a few billion parameters (e.g., Llama-7B (Touvron et al., 2023), Gemma-7B (Gemma-Team et al., 2024), etc.)
to hundreds of billions (DBRX (Databricks), Arctic (Snowflake), Grok, Mixtral 8x22B (MistralAI), Command R+ (Cohere)), and they differ in their intended use: some target a broad range of tasks beyond code, while others focus primarily on code-related tasks (e.g., StarCoder (Li et al., 2023a; Lozhkov et al., 2024), CodeGen (Nijkamp et al., 2023), CodeLlama (Rozière et al., 2023), and CodeGemma (CodeGemma Team et al., 2024)). However, important gaps remain in the landscape of code LLMs, especially in the context of enterprise software development. First, while large general-purpose LLMs can reach good coding performance, their size makes them expensive to deploy. Smaller code-focused models (Li et al., 2023a; Lozhkov et al., 2024; Nijkamp et al., 2023; Rozière et al., 2023; CodeGemma Team et al., 2024) can achieve good code generation performance in a smaller and more flexible package, but their performance on coding tasks beyond generation (e.g., fixing and explanation) can lag behind their code generation performance. In many enterprise settings, adoption of a code LLM can be limited by concerns beyond performance. For instance, even openly available models are sometimes opaque about the data sources and data processing methods that went into the model, which can undermine trust in the model in mission-critical and regulated settings. Moreover, licensing terms in today's open LLMs can complicate and restrict enterprise use of the models. Here, we introduce the Granite Code models, a series of highly capable code LLMs designed to support enterprise software development across a range of coding tasks.
The Granite Code models come in two main variants, which we release in four sizes (3B, 8B, 20B, and 34B):

Granite Code Base: base foundation models for code-related tasks;
Granite Code Instruct: instruction-following models fine-tuned using a mixture of Git commits paired with human instructions and open-source synthetically generated code instruction datasets.

The base models in this series were trained from scratch with a two-phase training strategy. In phase 1, the models are trained on 3 to 4 trillion tokens drawn from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, the models are further trained on 500 billion tokens with a carefully designed mixture of high-quality code and natural language data to improve the models' reasoning abilities. We use a causal language modeling objective to train the base models in all training phases. The instruct models are derived by further fine-tuning the base models on a filtered variant of the CommitPack dataset (Muennighoff et al., 2023), natural language instruction-following datasets (OASST (Köpf et al., 2023), HelpSteer (Wang et al., 2023)), and open-source math datasets (MathInstruct (Yue et al., 2023) and MetaMathQA (Yu et al., 2023)), including synthetically generated code instruction datasets, to enhance instruction-following and reasoning abilities. We conduct an extensive evaluation of our code LLMs on a comprehensive set of benchmarks, including HumanEvalPack (Muennighoff et al., 2023), MBPP+ (Austin et al., 2021; Liu et al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more. This set of benchmarks covers many different kinds of coding tasks beyond just code generation in Python, e.g., code fixing, code explanation, code editing, code translation, etc.,
across most major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.). Our results show that, among open-source models, the Granite Code models overall show very strong performance across all model sizes and benchmarks (often outperforming open-source code models twice their size). As an example, the figure (top) shows a comparison of Granite-8B-Code-Base against other open-source base code LLMs, including recent high-performing general-purpose LLMs such as Mistral (Jiang et al., 2023b) and Llama-3 (AI@Meta, 2024), on HumanEvalPack (Muennighoff et al., 2023). While CodeGemma and StarCoder2 perform well at generating code, they perform significantly worse on the code fixing and code explanation portions of HumanEvalPack. On average, Granite-8B-Code-Base outperforms the competitive CodeGemma-8B model by almost 12 points on HumanEvalPack (33.2% vs 21.3%), despite being trained on far fewer tokens (4.5T vs 7.5T tokens). Beyond the base models, the instruct variants of our Granite Code models also show strong performance on HumanEvalPack, outperforming open-source instruction-tuned (code) models and demonstrating gains on a broad set of coding tasks as well as natural language instructions (see figure (bottom)). Moreover, since reasoning is essential for solving complicated problems and tasks, we also test our Granite-8B-Code-Base model on six math benchmarks, including MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and problem solving with access to computational tools, where our Granite 8B model achieves better performance than most state-of-the-art 7B or 8B LLMs.
For example, Granite-8B-Code-Base outperforms Llama-3-8B-Base by 12 points on GSM8K and by 6 points on MATH (see Table 15).

The key advantages of the Granite Code LLMs include:

All-Around Code LLM: Granite Code models achieve competitive or state-of-the-art performance on many different kinds of code-related tasks, including code generation, explanation, fixing, editing, translation, etc., demonstrating their ability to solve diverse coding tasks.
Trustworthy Enterprise-Grade LLM: All our models are trained on license-permissible data, collected in adherence with IBM's AI Ethics principles and under the guidance of IBM's Corporate Legal team, for trustworthy enterprise use. All Granite Code LLMs are released under an Apache 2.0 license.

We describe our full process for data collection, filtering, and preprocessing in Section 2. Section 3 describes the details of the model architecture, followed by training details in Section 4. Section 5 provides details on instruction tuning, and Section 6 describes the evaluation and results comparing the Granite Code models with other open LLMs.

2 Data Collection

In this section, we describe the process of crawling and filtering (Section 2.1), deduplication (Section 2.2), and HAP/PII filtering (Section 2.3) used to prepare the code data for model training. We also provide a summary of the high-quality natural language data used to improve the model's language understanding and mathematical reasoning abilities.

2.1 Data Crawling and Filtering

The pretraining code data was sourced from a combination of publicly available datasets like Github Code Clean and StarCoderdata, together with additional public code repositories and issues from GitHub.
We filter the raw data to retain a list of 116 programming languages out of 300+ languages, as listed in Appendix A. The assignment of files to programming languages is done based solely on the file extension, as in StarCoder (Li et al., 2023a). After language filtering, we apply four filtering rules to remove low-quality code (Li et al., 2023a): (1) remove files with fewer than 25% alphabetic characters; (2) except for the XSLT language, filter out files where the string "<?xml version=" appears within the first 100 characters; (3) for HTML files, keep only files where the visible text makes up at least 20% of the HTML code and is at least 100 characters long; (4) for JSON and YAML files, keep only files that have between 50 and 5,000 characters. We also filter GitHub issues using a set of quality criteria that include removing auto-generated content, filtering out non-English issues, excluding comments from bots, and using the number of comments as a quality signal. We annotate each code file with license information obtained from the corresponding repository via the GitHub APIs and keep only files with permissive licenses for model training.

2.2 Exact and Fuzzy Deduplication

We adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents with (near-)identical code content from our training set. For exact deduplication, we first compute SHA256 hashes of the document contents and remove records with identical hashes. After exact deduplication, we apply fuzzy deduplication to remove code files that may differ only by slight variations, further reducing redundancy in the data.
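The four quality filters from Section 2.1 can be sketched as a simple predicate. This is a minimal illustration, not the actual pipeline: the function name is ours, file types are guessed from extensions, and the HTML visible-text extraction is deliberately crude.

```python
import re

def keep_code_file(path: str, text: str) -> bool:
    """Return True if a code file passes the four quality filters.

    Thresholds follow the rules in Section 2.1; helper logic (e.g. the
    visible-text extraction for HTML) is simplified for illustration.
    """
    ext = path.rsplit(".", 1)[-1].lower()

    # (1) drop files with fewer than 25% alphabetic characters
    if not text or sum(c.isalpha() for c in text) / len(text) < 0.25:
        return False

    # (2) except for XSLT, drop files with "<?xml version=" in the first 100 chars
    if ext not in ("xsl", "xslt") and "<?xml version=" in text[:100]:
        return False

    # (3) HTML: visible text must be >= 20% of the code and >= 100 chars
    if ext in ("html", "htm"):
        visible = re.sub(r"<[^>]+>", "", text)  # crude tag stripping
        if len(visible) < 100 or len(visible) < 0.2 * len(text):
            return False

    # (4) JSON/YAML: keep only files with 50 to 5000 characters
    if ext in ("json", "yaml", "yml") and not (50 <= len(text) <= 5000):
        return False

    return True
```

In a real pipeline these checks would run after language identification and before deduplication, so that obviously low-quality files never reach the more expensive stages.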
We use a two-step method for this: (1) compute MinHashes of all documents and then use Locality Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints; (2) measure the Jaccard similarity between all pairs of documents in the same bucket and mark all but one of them as duplicates based on a similarity threshold of 0.7. We apply this near-deduplication process to all programming languages, including GitHub issues, to improve the richness and diversity of the training set.

2.3 HAP, PII, and Malware Filtering

To reduce the likelihood of the models generating hateful, abusive, or profane (HAP) language, we make diligent efforts to filter HAP content from the training set. We first create a dictionary of HAP keywords and then annotate each code document with the number of occurrences of such keywords in its content, including comments. We filter out documents that exceed a HAP threshold, calibrated based on distribution analysis and manual inspection of code files. Furthermore, to protect privacy, we follow StarCoder (Li et al., 2023a) and make diligent efforts to redact Personally Identifiable Information (PII) from the training set. Specifically, we use the StarPII model to detect IP addresses, keys, emails, names, usernames, and passwords found in the content. The PII redaction step replaces the PII text with corresponding tokens ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩ and replaces IP addresses with synthetically generated IPs, as in Li et al. (2023a). We also scan our datasets to identify and remove instances of malware in the source code.

2.4 Natural Language Datasets

In addition to collecting code data for model training, we curate several publicly available high-quality natural language datasets to improve the model's proficiency in language understanding and mathematical reasoning.
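The two-step fuzzy deduplication from Section 2.2 can be sketched as follows. This is a toy illustration under our own assumptions: the shingle size, permutation count, and banding parameters are illustrative, and all function names are ours; a production pipeline would use many more permutations and a tuned banding scheme.

```python
import hashlib
import random

# --- exact deduplication: identical content -> identical SHA256 digest ---

def sha256_key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# --- fuzzy deduplication: toy MinHash + LSH (illustrative parameters) ---

NUM_PERM = 64            # length of each MinHash signature
BANDS, ROWS = 16, 4      # LSH banding: BANDS * ROWS == NUM_PERM
_rng = random.Random(0)
_SALTS = [_rng.getrandbits(32) for _ in range(NUM_PERM)]

def shingles(text: str, k: int = 5) -> set:
    # character k-grams as the set representation of a document
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(text: str) -> list:
    # one minimum hash value per salted permutation
    sh = shingles(text)
    return [min(hash((salt, s)) & 0xFFFFFFFF for s in sh) for salt in _SALTS]

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def find_duplicates(docs: dict, threshold: float = 0.7) -> set:
    """Bucket documents by LSH bands, then confirm candidate pairs with
    Jaccard similarity; all but one document in a duplicate pair are marked."""
    buckets = {}
    for name, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            band = (b, tuple(sig[b * ROWS:(b + 1) * ROWS]))
            buckets.setdefault(band, []).append(name)
    duplicates = set()
    for names in buckets.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                a, b = names[i], names[j]
                if b not in duplicates and jaccard(docs[a], docs[b]) >= threshold:
                    duplicates.add(b)
    return duplicates
```

The point of the LSH banding is that Jaccard similarity only needs to be computed within buckets, so the quadratic pair comparison is restricted to a small candidate set rather than the whole corpus.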
The datasets in this category include web documents (Stackexchange, CommonCrawl), mathematical web text (OpenWebMath (Paster et al., 2023), StackMathQA (Zhang, 2024)), academic text (Arxiv, Wikipedia), and instruction tuning datasets (FLAN (Longpre et al., 2023), HelpSteer (Wang et al., 2023)). We do not deduplicate these already-filtered natural language datasets.

3 Model Architecture

We train a series of code models of varying sizes based on the transformer decoder architecture (Vaswani et al., 2017). The model hyperparameters for these models are given in Table 1. For all model architectures, we use pre-normalization (Xiong et al., 2020): normalization applied to the input of the attention and MLP blocks.

3B: The smallest model in the Granite Code model family is trained with RoPE embeddings (Su et al., 2023) and Multi-Head Attention (Vaswani et al., 2017). The model uses the swish activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) in the MLP, commonly referred to as swiglu. For normalization, we use RMSNorm (Zhang & Sennrich, 2019), as it is computationally more efficient than LayerNorm (Ba et al., 2016). The 3B model is trained with a context length of 2048 tokens.

8B: The 8B model has a similar architecture to the 3B model, except that it uses Grouped-Query Attention (GQA) (Ainslie et al., 2023). Using GQA offers a better tradeoff between model performance and inference efficiency at this scale.

20B: The 20B code model
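Two of the building blocks named above, RMSNorm and the swiglu MLP, can be sketched in plain Python. This is a minimal sketch under our own assumptions: the helper names and toy dimensions are ours, and a real implementation would operate on tensors in a deep learning framework.

```python
import math

def rms_norm(x: list, gain: list, eps: float = 1e-6) -> list:
    # RMSNorm: rescale by the root mean square; unlike LayerNorm there is
    # no mean subtraction and no bias term, which saves computation
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]

def swish(v: float) -> float:
    # swish / SiLU activation: x * sigmoid(x)
    return v / (1.0 + math.exp(-v))

def matvec(w: list, x: list) -> list:
    # w is a list of rows; returns the product w @ x
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_mlp(x: list, w_gate: list, w_up: list, w_down: list) -> list:
    # swiglu MLP: the up-projection is gated elementwise by a
    # swish-activated parallel projection before projecting back down
    gate = [swish(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])
```

With pre-normalization, `rms_norm` would be applied to the block input, so a transformer layer computes roughly `x + swiglu_mlp(rms_norm(x, gain), ...)` rather than normalizing the output.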