Authors (all IBM): Mayank Mishra⋆, Matt Stallone⋆, Gaoyuan Zhang⋆, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien de Bayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Yi Zhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, David D. Cox, Ruchir Puri†, Rameswar Panda†

Abstract

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code model family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device, memory-constrained use cases.
Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family is optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g., code generation, fixing, and explanation), making it a versatile "all-around" code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use. https://github.com/ibm-granite/granite-code-models

1 Introduction

Over the last few decades, software has been woven into every aspect of our society. As demand for software development grows, it is more critical than ever to increase software development productivity, and LLMs offer a promising path for augmenting human developers. Prominent use cases for LLMs in software development productivity include code generation, code explanation, code fixing, unit test and documentation generation, application modernization, vulnerability detection, code translation, and more. Recent years have seen rapid progress in LLMs' ability to generate and manipulate code, and a range of models with impressive coding abilities are available today. Models range in size from a few billion parameters (e.g., Llama-7B (Touvron et al., 2023), Gemma-7B (Gemma Team et al., 2024), etc.)
to hundreds of billions: DBRX (Databricks), Arctic (Snowflake), Grok, Mixtral 8x22B (MistralAI), Command R+ (Cohere); and they vary in the generality of their intended use, with some models aiming to cover a range of applications beyond code, while others focus heavily on code-related tasks (e.g., StarCoder (Li et al., 2023a; Lozhkov et al., 2024), CodeGen (Nijkamp et al., 2023), CodeLlama (Rozière et al., 2023), and CodeGemma (CodeGemma Team et al., 2024)). However, important gaps remain in the current landscape of code LLMs, especially in the context of enterprise software development. First, while large, general-purpose LLMs can achieve strong code performance, their size makes them expensive to deploy. Smaller code-focused models (Li et al., 2023a; Lozhkov et al., 2024; Nijkamp et al., 2023; Rozière et al., 2023; CodeGemma Team et al., 2024) can achieve strong code generation performance in a smaller and more cost-efficient package, but their performance on code tasks beyond generation (e.g., fixing and explanation) can lag behind their code generation performance. In many enterprise contexts, the adoption of a code LLM may also hinge on factors beyond model performance. For instance, even open models are sometimes criticized for a lack of transparency about the data sources and data processing methods that went into the model, which can erode trust in critical, compliance-sensitive use cases. Moreover, licensing terms in today's open LLMs can complicate or limit an enterprise's ability to use a model. Here, we introduce the Granite Code models, a series of highly capable code LLMs designed to support enterprise software development across a wide range of coding tasks.
The Granite Code models come in two main variants, released in four different sizes (3B, 8B, 20B, and 34B):

Granite Code Base: base foundation models for code-related tasks;
Granite Code Instruct: instruction-following models finetuned using a combination of Git commits paired with human instructions and open-source synthetically generated code instruction datasets.

The base models in the series are trained from scratch with a two-phase training strategy. In phase 1, our models are trained on 3 to 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, our models are further trained on 500 billion tokens with a carefully designed mixture of high-quality data from the code and natural-language domains to improve the models' reasoning abilities. We use a causal language modeling objective to train the base models in both phases. The Instruct models are derived by further finetuning the trained base models on a combination of a filtered variant of CommitPack (Muennighoff et al., 2023), natural-language instruction-following datasets (OASST (Köpf et al., 2023), HelpSteer (Wang et al., 2023)), and open-source math datasets (MathInstruct (Yue et al., 2023) and MetaMathQA (Yu et al., 2023)), including synthetically generated code instruction datasets, to improve instruction-following and reasoning capabilities. We conduct an extensive evaluation of our code LLMs on a comprehensive set of benchmarks, including HumanEvalPack (Muennighoff et al., 2023), MBPP(+) (Austin et al., 2021; Liu et al., 2023a), RepoBench (Liu et al., 2023b), ReCode (Wang et al., 2022), and more.
This benchmark suite encompasses many kinds of coding tasks beyond just Python code synthesis, e.g., code fixing, code explanation, code editing, code translation, etc., across most major programming languages (Python, JavaScript, Java, Go, C++, Rust, etc.). Our findings reveal that, among open-source models, Granite Code models show strong performance across model sizes and benchmarks (often outperforming open code models nearly twice the size of Granite). As an example, Figure 1 (top) shows a comparison of Granite-8B-Code-Base against other open-source base code LLMs, including recent high-performing general-purpose LLMs like Mistral (Jiang et al., 2023b) and Llama-3 (AI@Meta, 2024), on HumanEvalPack (Muennighoff et al., 2023). While CodeGemma and StarCoder2 perform well at generating code, they perform significantly worse on the fixing and explanation variants of HumanEvalPack. On average, Granite-8B-Code-Base outperforms the most competitive CodeGemma-8B model across almost all 12 settings of HumanEvalPack (33.2% vs. 21.3%), despite being trained on far fewer tokens (4.5T vs. 7.5T tokens). Beyond the base models, the instruction-tuned variants of our Granite Code models also show strong performance on HumanEvalPack, outperforming comparable open-source instruction-tuned models and demonstrating benefits across a wider range of coding tasks posed with natural-language instructions (see Figure 1 (bottom)).
Moreover, as reasoning is increasingly important for solving complex problems and tasks, we also evaluate our Granite-8B-Code-Base model on six mathematical benchmarks, including MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and problem solving with access to computational tools, where our Granite 8B model achieves better performance than leading 7B and 8B LLMs. For example, Granite-8B-Code-Base outperforms Llama-3-8B-Base by ∼12 points on GSM8K and ∼6 points on MATH (see Table 15).

The key advantages of Granite Code models include:

All-around Code LLM: Granite Code models achieve competitive or state-of-the-art performance on a diverse set of code-related tasks, including code generation, explanation, fixing, editing, translation, and more, demonstrating their ability to solve diverse coding tasks.
Trustworthy Enterprise-Grade LLM: All our models are trained on license-permissible data collected following IBM's AI Ethics principles and guided by IBM's Corporate Legal team for trustworthy enterprise usage. All Granite Code models are released under an Apache 2.0 license.

We describe our full data collection, filtering, and preprocessing pipeline in Section 2. Section 3 describes the model architecture details, followed by training details in Section 4. Section 5 provides details on instruction tuning, and Section 6 describes the experiments and results comparing Granite Code models with other open LLMs.

2 Data Collection

In this section, we describe the process of crawling and filtering (Sec. 2.1), deduplication (Sec. 2.2), and HAP/PII filtering (Sec. 2.3) used in preparing the code data for model training.
We also provide details of the high-quality natural-language data used to improve the model's language understanding and mathematical reasoning abilities (Sec. 2.4).

2.1 Data Crawling and Filtering

The pretraining code data was sourced from a combination of publicly available datasets like GitHub Code Clean and StarCoderdata, together with additional public code repositories and issues from GitHub. We filter the raw data to retain 116 programming languages out of 300+, as listed in Appendix A. Data is assigned to programming languages based solely on file extension, as in StarCoder (Li et al., 2023a). After language filtering, we apply four key filtering rules to remove lower-quality code (Li et al., 2023a): (1) remove files with fewer than 25% alphabetic characters; (2) except for the XSLT language, filter out files where the string "<?xml version=" appears within the first 100 characters; (3) for HTML files, keep only files where the visible text makes up at least 20% of the HTML code and has a minimum length of 100 characters; (4) for JSON and YAML files, keep only files with a character count between 50 and 5000 characters. We also filter GitHub issues using a set of quality metrics that include removing auto-generated text, filtering out non-English issues, excluding comments from bots, and using the number of users engaged in the conversation as an indicator of quality. Finally, we annotate each code file with the license information of its repository, available via the GitHub APIs, and keep only permissively licensed files for model training.
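Assuming the heuristics operate on raw file text, the four quality rules can be sketched as a simple keep/drop predicate. The thresholds follow the description above; the extension handling and HTML tag stripping are simplified for illustration, and `keep_file`/`alpha_fraction` are hypothetical helper names, not the production pipeline:

```python
import re

def alpha_fraction(text: str) -> float:
    """Fraction of characters in the file that are alphabetic."""
    return sum(c.isalpha() for c in text) / max(len(text), 1)

def keep_file(path: str, text: str) -> bool:
    """Apply the four quality heuristics from Section 2.1 to one file."""
    ext = path.rsplit(".", 1)[-1].lower()

    # (1) drop files with fewer than 25% alphabetic characters
    if alpha_fraction(text) < 0.25:
        return False

    # (2) outside XSLT, drop files with "<?xml version=" in the first 100 chars
    if ext != "xslt" and "<?xml version=" in text[:100]:
        return False

    # (3) HTML: visible text must be >= 20% of the code and >= 100 chars
    if ext in ("html", "htm"):
        visible = re.sub(r"<[^>]*>", "", text)  # crude tag stripping
        if len(visible) < 100 or len(visible) < 0.2 * len(text):
            return False

    # (4) JSON/YAML: keep only files of 50-5000 characters
    if ext in ("json", "yaml", "yml") and not (50 <= len(text) <= 5000):
        return False

    return True
```

In a real pipeline this predicate would run after extension-based language assignment, so rule (2) can exempt XSLT files without parsing their content.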
2.2 Exact and Fuzzy Deduplication

We adopt an aggressive deduplication strategy, including both exact and fuzzy deduplication, to remove documents with (near-)identical code content from our training set. For exact deduplication, we first compute the SHA256 hash of the document content and remove records with identical hashes. After exact deduplication, we apply fuzzy deduplication with the goal of removing code files that may differ only by slight variations, thereby debiasing the data further. We apply a two-step method: (1) compute MinHashes of all documents and then use Locality Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints; (2) measure the Jaccard similarity between each pair of documents in the same bucket and annotate all but one as duplicates based on a similarity threshold of 0.7. We apply this near-deduplication process to all programming languages, including GitHub issues, to improve the richness and diversity of the training set.

2.3 HAP, PII, and Malware Filtering

To reduce the likelihood of the models generating hateful, abusive, or profane (HAP) language, we make diligent efforts to scrub HAP content from the training set. We first create a dictionary of HAP keywords and then annotate each code document with the number of occurrences of such keywords in the content, including comments. We then filter out documents that exceed a HAP threshold, developed based on a distributional analysis as well as manual inspection of code files. Moreover, to protect privacy, we follow StarCoder (Li et al., 2023a) and make diligent efforts to redact Personally Identifiable Information (PII) from the training set.
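The exact-plus-fuzzy deduplication of Section 2.2 can be sketched with a minimal, dependency-free implementation: SHA256 for exact duplicates, then hand-rolled MinHash signatures with banded LSH and a Jaccard check at the 0.7 threshold. The signature size (`NUM_PERM`), band count, and 5-token shingling here are illustrative assumptions, not the paper's production settings:

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

NUM_PERM, BANDS = 128, 16          # 16 bands x 8 rows per signature
JACCARD_THRESHOLD = 0.7

def shingles(text, k=5):
    """Set of k-token shingles used as the document's feature set."""
    toks = re.findall(r"\w+", text.lower())
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def h(shingle, seed):
    """Seeded 64-bit hash standing in for a family of hash functions."""
    return int.from_bytes(hashlib.sha256(f"{seed}:{shingle}".encode()).digest()[:8], "big")

def minhash(sh):
    """MinHash signature: per-seed minimum over all shingles."""
    return tuple(min(h(s, seed) for s in sh) for seed in range(NUM_PERM))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def dedup(docs):
    """Return the set of doc ids flagged as exact or near duplicates."""
    # step 0: exact dedup on identical SHA256 content hashes
    seen, dupes = {}, set()
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            dupes.add(doc_id)
        else:
            seen[digest] = doc_id
    survivors = {i: shingles(t) for i, t in docs.items() if i not in dupes}
    sigs = {i: minhash(s) for i, s in survivors.items()}
    # step 1: LSH — docs sharing any full band of their signature share a bucket
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for i, sig in sigs.items():
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(i)
    # step 2: verify candidate pairs with true Jaccard similarity
    for bucket in buckets.values():
        for a, b in combinations(sorted(bucket), 2):
            if b not in dupes and jaccard(survivors[a], survivors[b]) >= JACCARD_THRESHOLD:
                dupes.add(b)
    return dupes
```

With 16 bands of 8 rows each, two documents become a candidate pair whenever their signatures agree on at least one full band; the explicit Jaccard verification in step 2 then filters out LSH false positives before anything is marked a duplicate.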
Specifically, we leverage the StarPII model to detect IP addresses, keys, email addresses, names, user names, and passwords found in the content. The PII redaction step replaces detected PII text with the corresponding tokens NAME, EMAIL, KEY, PASSWORD, and replaces IP addresses with synthetically generated ones, as in Li et al. (2023a). We also scan our datasets to identify and remove instances of malware in the source code.

2.4 Natural-Language Datasets

In addition to collecting code data for model training, we curate several publicly available, high-quality natural-language datasets to improve the model's proficiency in language understanding and mathematical reasoning. Representative datasets in this category include web documents (StackExchange, CommonCrawl), mathematical web text (OpenWebMath (Paster et al., 2023); StackMathQA (Zhang, 2024)), academic text (arXiv, Wikipedia), and instruction-tuning datasets (FLAN (Longpre et al., 2023), HelpSteer (Wang et al., 2023)). We do not further deduplicate these already-preprocessed natural-language datasets.

3 Model Architecture

We train a series of code models of varying sizes based on the transformer decoder architecture (Vaswani et al., 2017). The model hyperparameters for these models are given in Table 1. For all model architectures, we use pre-normalization (Xiong et al., 2020): normalization applied to the inputs of the attention and MLP blocks.

3B: The smallest model in the Granite Code family is trained with RoPE embeddings (Su et al., 2023) and Multi-Head Attention (Vaswani et al., 2017). The model uses the swish activation function (Ramachandran et al., 2017) with GLU (Shazeer, 2020) in the MLP, commonly referred to as swiglu. For normalization, we use RMSNorm (Zhang & Sennrich, 2019), since it is computationally more efficient than LayerNorm (Ba et al., 2016). The 3B model is trained with a context length of 2048 tokens.
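The two components named for the 3B model above, RMSNorm and the swiglu MLP, can be sketched minimally in plain Python rather than a tensor library. The single-vector inputs and nested-list weights are illustrative assumptions, not the production implementation:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm: scale by the reciprocal root-mean-square of x.
    Unlike LayerNorm, there is no mean subtraction and no bias,
    which is what makes it computationally cheaper."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def swish(v):
    """Swish / SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP block: down( swish(gate(x)) * up(x) ).
    Weights are nested lists; matvec keeps the sketch dependency-free."""
    matvec = lambda W, v: [sum(wi * vi for wi, vi in zip(row, v)) for row in W]
    gate = [swish(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])
```

In the pre-normalization scheme described above, `rmsnorm` would be applied to the block input before the attention and MLP sublayers, with the residual added to the unnormalized stream.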
8B: The 8B model has a similar architecture to the 3B model, with the exception of using Grouped-Query Attention (GQA) (Ainslie et al.,