Authors: Yue Wang, wang.y@salesforce.com (Salesforce Research Asia); Weishi Wang, weishi.wang@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore); Shafiq Joty, sjoty@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore); Steven C.H. Hoi, shoi@salesforce.com (Salesforce Research Asia)

Abstract

Pre-trained models for natural languages (NL) such as BERT and GPT have recently been shown to transfer well to programming languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks, or process a code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed in developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and on generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5

1 Introduction

Pre-trained language models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and T5 (Raffel et al., 2020) have greatly boosted performance on a wide spectrum of natural language processing (NLP) tasks, and recent work (Kanade et al., 2020; Feng et al., 2020) shows that they also transfer well to programming languages and benefit a broad set of code-related tasks. However, despite their success, most of these models rely on either an encoder-only architecture like BERT or a decoder-only architecture like GPT, which is suboptimal for generation and understanding tasks, respectively. For example, CodeBERT (Feng et al., 2020) requires an additional decoder when applied to the code summarization task, and this decoder cannot benefit from the pre-training. Besides, most existing methods simply apply conventional NLP pre-training techniques to source code by regarding it as a sequence of tokens like NL. This largely ignores the rich structural information in code, which is vital to fully comprehend code semantics.

In this work, we present CodeT5, a pre-trained encoder-decoder model that considers the token type information in code. Our CodeT5 builds on the T5 architecture (Raffel et al., 2020), which employs denoising sequence-to-sequence (Seq2Seq) pre-training and has been shown to benefit both understanding and generation tasks in natural language. In addition, we propose to leverage the developer-assigned identifiers in code.
When writing programs, developers tend to employ informative identifiers to make the code more understandable, so these identifiers generally preserve rich code semantics, e.g., the "binarySearch" identifier in Figure 2 directly indicates its functionality. To fuse such code-specific knowledge, we propose a novel identifier-aware objective that trains the model to distinguish which code tokens are identifiers and to recover them when they are masked.

Furthermore, we propose to leverage the NL-PL pairs that are naturally available in source code. Developers often provide documentation for programs to support better software maintenance (de Souza et al., 2005), so such PL-NL pairs are widely available in most source code. Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them to improve the NL-PL alignment.

We pre-train CodeT5 on the CodeSearchNet data (Husain et al., 2019), which consists of both unimodal (PL-only) and bimodal (PL-NL) data in six PLs. On top of that, we additionally collect data for C/C# from open-source Github repositories. We fine-tune CodeT5 on most tasks in the CodeXGLUE benchmark (Lu et al., 2021), including two understanding tasks, code defect detection and clone detection, and generation tasks such as code summarization, generation, translation, and refinement. As shown in Figure 1, we also explore multi-task learning to fine-tune CodeT5 on multiple tasks at a time, using a task control code as the source prompt.

In summary, we make the following contributions:

• We present CodeT5, one of the first unified encoder-decoder models to support both code-related understanding and generation tasks, which also allows for multi-task learning.

• We propose a novel identifier-aware pre-training objective that considers the crucial token type information (identifiers) in code. Besides, we propose to leverage the NL-PL pairs that are naturally available in source code to learn a better cross-modal alignment.

• Extensive experiments show that CodeT5 yields state-of-the-art results on fourteen sub-tasks in CodeXGLUE. Further analysis shows that our CodeT5 better captures code semantics thanks to the identifier-aware pre-training, and that the bimodal dual generation primarily benefits NL↔PL tasks.

2 Related Work

Pre-training on Natural Language. Pre-trained models based on Transformer architectures (Vaswani et al., 2017) have led to state-of-the-art performance on a broad set of NLP tasks.
They can be generally categorized into three groups: encoder-only models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ELECTRA (Clark et al., 2020); decoder-only models such as GPT (Radford et al., 2019); and encoder-decoder models such as MASS (Song et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Compared to encoder-only and decoder-only models, which respectively favor understanding and generation tasks, encoder-decoder models can support both types of tasks well. In this work, we extend T5 to the programming language domain.

Pre-training on Programming Language. Pre-training on programming languages is a nascent field in which much recent work extends NLP pre-training methods to source code. CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020) are two pioneering models: CuBERT employs BERT's masked language modeling objective to derive generic code-specific representations, while CodeBERT additionally incorporates a replaced token detection task (Clark et al., 2020) to learn NL-PL cross-modal representations. Beyond the BERT-style models, Svyatkovskiy et al. (2020) and Liu et al. (2020) employ GPT and UniLM (Dong et al., 2019), respectively, for the code completion task, and TransCoder (Rozière et al., 2020) explores programming language translation in an unsupervised setting.

Some emerging work (Clement et al., 2020; Mastropaolo et al., 2021; Elnaggar et al., 2021) applies the T5 framework to code, but covers only a limited subset of generation tasks and does not support understanding tasks as we do. Apart from these, PLBART (Ahmad et al., 2021), based on another encoder-decoder model, BART, can support both understanding and generation tasks. However, all of this prior work processes code in the same way as natural language and largely ignores code-specific characteristics; we instead propose to leverage the identifier information in code during pre-training.

Some comparable recent work leverages code structure: GraphCodeBERT (Guo et al., 2021) incorporates data flow extracted from the code structure into CodeBERT, and DOBF (Rozière et al., 2021) proposes a deobfuscation objective to exploit the structural aspect of PL. These models, however, only aim to train a better code-specific encoder. Zügner et al. (2021) propose to capture relative distances between code tokens over the code structure. By contrast, we specifically focus on the identifiers, which preserve rich code semantics, and fuse such information into a Seq2Seq model via two novel identifier tagging and prediction tasks.
3 CodeT5

Our CodeT5 builds on an encoder-decoder framework with the same architecture as T5 (Raffel et al., 2020). It aims to derive generic representations for programming language (PL) and natural language (NL) via pre-training on unlabeled source code. As illustrated in Figure 2, we extend T5's denoising Seq2Seq objective with two identifier tagging and prediction tasks, so that the model can better leverage the token type information from PL, namely the identifiers assigned by developers. To improve the NL-PL alignment, we further propose a bimodal dual learning objective for bidirectional conversion between NL and PL.

In the following, we first present how CodeT5 encodes PL and NL inputs (§3.1) and our proposed identifier-aware pre-training tasks (§3.2), followed by fine-tuning with task-specific transfer learning and multi-task training (§3.3).

3.1 Encoding NL and PL

At the pre-training stage, our model receives either PL-only or NL-PL inputs, depending on whether the code snippet has accompanying NL descriptions. For NL-PL bimodal inputs, we concatenate the two parts into a sequence with a delimiter token [SEP] and represent the whole input in the format x = ([CLS], w_1, ..., w_n, [SEP], c_1, ..., c_m, [SEP]), where n and m denote the number of NL word tokens and PL code tokens, respectively. The NL word sequence is empty for PL-only unimodal inputs.

To capture more code-specific features, we propose to leverage token type information from code. We focus on the identifier type (e.g., function names and variables), as identifiers are among the most PL-agnostic features and preserve rich code semantics. Specifically, we convert the PL segment into an Abstract Syntax Tree (AST) and extract the node type of each code token. Finally, we construct a sequence of binary labels y ∈ {0, 1}^m for the PL segment, where each y_i ∈ {0, 1} indicates whether the code token c_i is an identifier or not.
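To make the labeling step concrete, below is a minimal sketch of how such binary identifier labels could be derived with the tree-sitter parser named later in §4.1. The grammar library path, the abridged keyword list, and the leaf-node traversal are illustrative assumptions (using the pre-0.22 py-tree-sitter API); the paper does not prescribe an exact implementation.

```python
# Sketch: derive binary identifier labels for a code snippet via tree-sitter.
from tree_sitter import Language, Parser

# Assumes the Python grammar was built beforehand, e.g. with
# Language.build_library("build/langs.so", ["vendor/tree-sitter-python"]).
PY_LANGUAGE = Language("build/langs.so", "python")
parser = Parser()
parser.set_language(PY_LANGUAGE)

# Per-PL reserved keyword list (abridged here), mirroring the keyword
# filtering described in Section 4.1.
RESERVED = {"def", "return", "if", "else", "for", "while", "in", "not"}

def identifier_labels(code: str):
    """Return (token, label) pairs, where label=1 marks an identifier token."""
    tree = parser.parse(code.encode("utf8"))
    labels = []

    def visit(node):
        if node.child_count == 0:  # a leaf node corresponds to one code token
            token = code[node.start_byte:node.end_byte]
            is_id = node.type == "identifier" and token not in RESERVED
            labels.append((token, int(is_id)))
        for child in node.children:
            visit(child)

    visit(tree.root_node)
    return labels

# identifier_labels("def add(a, b):\n    return a + b")
# -> [('def', 0), ('add', 1), ('(', 0), ('a', 1), ..., ('b', 1)]
```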
3.2 Pre-training Tasks

We now introduce our proposed pre-training tasks, which enable CodeT5 to learn useful patterns from either PL-only or NL-PL bimodal data.

Identifier-aware Denoising Pre-training. Denoising Seq2Seq pre-training has proven quite effective in a broad set of NLP tasks (Song et al., 2019; Raffel et al., 2020; Lewis et al., 2020). A denoising objective typically first corrupts the source sequence with some noising functions and then requires the decoder to recover the original text. In this work, we utilize a span masking objective similar to T5 (Raffel et al., 2020), which randomly masks spans of arbitrary length; the masked spans, combined with sentinel tokens, are then predicted at the decoder. We refer to this task as Masked Span Prediction (MSP), illustrated in Figure 2 (a). Specifically, we employ the same 15% corruption rate as T5 and ensure an average span length of 3 by uniformly sampling spans of 1 to 5 tokens. We additionally employ whole-word masking by sampling spans before subword tokenization, which avoids masking partial sub-tokens and has been shown to be helpful (Sun et al., 2019). Notably, we pre-train a shared model on multiple PLs to learn robust cross-lingual representations. We formulate the masked span prediction loss as:

L_MSP(θ) = Σ_{t=1}^{k} −log P_θ(x_t^mask | x^{\mask}, x_{<t}^mask),

where θ are the model parameters, x^{\mask} is the masked input, x^mask is the masked sequence to be predicted by the decoder, k denotes the number of tokens in x^mask, and x_{<t}^mask is the span sequence generated so far.

To fuse more code-specific structural information (the identifier node type in the AST) into the model, we propose two additional tasks, Identifier Tagging (IT) and Masked Identifier Prediction (MIP), to complement the denoising pre-training.

• Identifier Tagging (IT): This task notifies the model whether each code token is an identifier or not, in a similar spirit to the syntax highlighting in some developer-aid tools. As shown in Figure 2 (b), we map the final hidden states of the PL segment at the CodeT5 encoder into a sequence of probabilities p = (p_1, ..., p_m) and compute a binary cross-entropy loss for sequence labeling:

L_IT(θ_e) = Σ_{i=1}^{m} −[ y_i log p_i + (1 − y_i) log(1 − p_i) ],

where θ_e are the encoder parameters. Note that by casting the task as a sequence labeling problem, the model is expected to capture the code syntax and the data flow structure of the code.

• Masked Identifier Prediction (MIP): Different from the random span masking in MSP, we mask all identifiers in the PL segment and employ a unique sentinel token for all occurrences of one specific identifier. In the field of software engineering, this is called obfuscation, since changing identifier names does not impact the code semantics. Inspired by Rozière et al. (2021), we arrange the unique identifiers together with their sentinel tokens into a target sequence I, as shown in Figure 2 (c), and predict it in an auto-regressive manner:

L_MIP(θ) = Σ_{t=1}^{|I|} −log P_θ(I_t | x^{\I}, I_{<t}),

where x^{\I} is the masked input. Note that deobfuscation is a more challenging task that requires the model to comprehend the code semantics based on obfuscated code and to link the occurrences of the same identifiers together.

We alternately optimize these three losses with equal probability, which constitutes our proposed identifier-aware denoising pre-training.

Bimodal Dual Generation. In the pre-training phase, the decoder only sees discrete masked spans and identifiers, which is disparate from the downstream tasks where the decoder needs to generate either fluent NL texts or syntactically correct code snippets. To close the gap between pre-training and fine-tuning, we propose to leverage the NL-PL bimodal data to train the model for a bidirectional conversion, as shown in Figure 2 (d). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them. For each NL-PL bimodal datapoint, we construct two training instances with reverse directions and add language ids (e.g., <java> and <en> for Java PL and English NL, respectively). This operation can also be seen as a special case of T5's span masking, masking either the full NL or the full PL segment of the bimodal input. The task aims to improve the alignment between the NL and PL counterparts.
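As a concrete illustration, a bimodal datapoint can be unrolled into its two dual training instances as in the sketch below. The exact placement of the language-id tags is our assumption; the paper only specifies the <java>/<en> ids themselves.

```python
# Sketch: construct the two reverse-direction instances used by the
# bimodal dual-generation objective (Section 3.2).
def dual_generation_instances(nl: str, pl: str, nl_lang="en", pl_lang="java"):
    """One NL-PL bimodal datapoint yields an NL->PL and a PL->NL instance."""
    nl_tagged = f"<{nl_lang}> {nl}"
    pl_tagged = f"<{pl_lang}> {pl}"
    return [
        (nl_tagged, pl_tagged),  # NL -> PL generation
        (pl_tagged, nl_tagged),  # PL -> NL generation
    ]

# dual_generation_instances("return the sum of two numbers",
#                           "int add(int a, int b) { return a + b; }")
```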
3.3 Fine-tuning CodeT5

After pre-training on large-scale unlabeled data, we adapt CodeT5 to downstream tasks via either task-specific transfer learning or multi-task learning.

Task-specific Transfer Learning: Generation vs. Understanding Tasks. Code-related tasks can be categorized into generation and understanding tasks. For the former, our CodeT5 can be naturally adapted with its Seq2Seq framework. For understanding tasks, we investigate two approaches: either generating the label as a unigram target sequence (Raffel et al., 2020), or predicting it from the vocabulary of class labels based on the last decoder hidden state, following Lewis et al. (2020).

Multi-task Learning. We also explore a multi-task learning setting by training one shared model on multiple tasks simultaneously. Multi-task learning reduces computation cost by reusing most of the model weights across tasks and has been shown to improve model generalization in NL pre-training (Liu et al., 2019a). We follow Raffel et al. (2020) in employing the same unified model for all tasks without adding any task-specific networks, while allowing different best checkpoints to be selected for different tasks. To notify the model which task it is dealing with, we design a unified format of task control codes and prepend them to the source inputs, as shown in Figure 1. For instance, we employ "Translate Java to CSharp:" as the source prompt for the code-to-code translation task from Java to CSharp.

As different tasks have datasets of different sizes, we follow Conneau and Lample (2019) in employing a balanced sampling strategy. For N datasets (or tasks), we define the following multinomial distribution {q_i}_{i=1}^{N} to sample from:

q_i = n_i^α / Σ_{j=1}^{N} n_j^α,

where n_i is the number of examples for the i-th task and α is set to 0.7. This balanced sampling aims to alleviate the bias towards high-resource tasks.
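The distribution is straightforward to compute; a minimal sketch with a worked example follows (the two dataset sizes are hypothetical).

```python
# Sketch: balanced multinomial sampling over tasks (Section 3.3), alpha = 0.7.
def task_sampling_probs(sizes, alpha=0.7):
    """sizes[i] = number of examples for the i-th task; returns {q_i}."""
    weights = [n ** alpha for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]

# Two tasks with 1,000,000 and 10,000 examples:
# task_sampling_probs([1_000_000, 10_000]) -> [~0.96, ~0.04]
# versus [~0.99, ~0.01] under proportional sampling, so the
# low-resource task is sampled noticeably more often.
```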
4 Experimental Setup

4.1 Pre-training Dataset

We follow Feng et al. (2020) in employing CodeSearchNet (Husain et al., 2019) to pre-train CodeT5; it consists of six PLs with both unimodal and bimodal data. Apart from that, we additionally collect two datasets of C/CSharp from BigQuery to ensure that all downstream tasks have PLs that overlap with the pre-training data. In total, we employ around 8.35 million instances for pre-training; Table 1 shows some basic statistics. To obtain the identifier labels from code, we leverage tree-sitter to convert the PL into an abstract syntax tree and then extract its node type information, filtering out each PL's reserved keywords from its identifier list. We observe that PLs have different identifier rates: Go has the lowest at 19% and Ruby the highest at 32%.

4.2 Code-specific Tokenizer

Tokenization is a key ingredient in the success of pre-trained language models like BERT and GPT, which often employ a Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) to alleviate Out-of-Vocabulary (OoV) issues. Specifically, we train a byte-level BPE tokenizer following Radford et al. (2019) and set the vocabulary size to 32,000, as in T5. We add the special tokens ([PAD], [CLS], [SEP], [MASK0], ..., [MASK99]). This tokenizer is trained on all of our pre-training data, with non-printable characters and low-frequency tokens (occurring fewer than 3 times) filtered out. Compared with T5's default tokenizer, ours reduces the length of tokenized code sequences by 30%-45% on downstream tasks. This accelerates training and especially benefits generation tasks thanks to the shorter target sequences. We also spot a severe problem when applying T5's default tokenizer to source code: it encodes some common code tokens such as the brackets '{' and '}' into unknown tokens.
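A sketch of how such a tokenizer could be trained with the HuggingFace tokenizers library is shown below. The corpus file name is hypothetical, and mapping the paper's "occurring <3 times" filter onto min_frequency is our assumption.

```python
# Sketch: train the code-specific byte-level BPE tokenizer of Section 4.2.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pretrain_corpus.txt"],  # hypothetical dump of the pre-training code
    vocab_size=32000,               # same vocabulary size as T5
    min_frequency=3,                # roughly mirrors dropping tokens seen <3 times
    special_tokens=["[PAD]", "[CLS]", "[SEP]"]
                   + [f"[MASK{i}]" for i in range(100)],  # sentinel tokens
)
tokenizer.save_model("codet5_tokenizer")

# Sanity check: unlike T5's default tokenizer, brackets such as '{' and '}'
# should not map to unknown tokens.
print(tokenizer.encode("if (x) { return y; }").tokens)
```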
4.3 Downstream Tasks and Metrics

We cover most generation and understanding tasks in the CodeXGLUE benchmark (Lu et al., 2021) and use the provided public datasets and the same data splits for all of them.

We first consider two cross-modal generation tasks. Code summarization aims to summarize a function-level code snippet into an English description. The dataset consists of six PLs, namely Ruby, JavaScript, Go, Python, Java, and PHP, from CodeSearchNet (Husain et al., 2019); we employ smoothed BLEU-4 (Lin and Och, 2004) to evaluate this task. Code generation is the task of generating a code snippet from an NL description. We employ the Concode dataset (Iyer et al., 2018) in Java, where the input contains both an NL description and class environment contexts, and the output is a function. We evaluate with BLEU-4, exact match (EM) accuracy, and CodeBLEU (Ren et al., 2020), which considers syntactic and semantic matches based on the code structure in addition to the n-gram match.

Besides, we consider two code-to-code generation tasks. Code translation aims to migrate legacy software from one PL to another; we focus on translating functions from Java to CSharp and vice versa. Code refinement aims to convert a buggy function into a correct one. We employ two Java datasets provided by Tufano et al. (2019) with different function lengths: small (fewer than 50 tokens) and medium (50-100 tokens). We use BLEU-4 and exact match to evaluate both tasks.

We also investigate how CodeT5 performs on two understanding-based tasks. The first is defect detection, which aims to predict whether a piece of code is vulnerable to software systems; we use the C dataset provided by Zhou et al. (2019). The second is clone detection, which aims to measure the similarity between two code snippets and predict whether they share the same functionality; we experiment with the Java dataset provided by Wang et al. (2020). We employ F1 and accuracy to evaluate these two tasks, respectively. In total, our CodeT5 covers six tasks and fourteen sub-tasks in CodeXGLUE with a single encoder-decoder model.

4.4 Comparison Models

We compare CodeT5 with state-of-the-art (SOTA) pre-trained models, which fall into three categories: encoder-only, decoder-only, and encoder-decoder models. As encoder-only models, we consider RoBERTa (Liu et al., 2019b), RoBERTa (code) trained with masked language modeling (MLM) on code, CodeBERT (Feng et al., 2020) trained with both MLM and replaced token detection (Clark et al., 2020), GraphCodeBERT (Guo et al., 2021), which uses data flow from code, and DOBF (Rozière et al., 2021), trained with the identifier deobfuscation objective. Note that although DOBF employs a Seq2Seq model during pre-training, it only aims to train a better encoder for downstream tasks, without exploring the potential benefit of the pre-trained decoder.

As decoder-only models, we compare GPT-2 (Radford et al., 2019) and its adaptations to the code domain, CodeGPT-2 and CodeGPT-adapted; the latter is initialized from a GPT-2 checkpoint, while the former is trained from scratch. As encoder-decoder models, the current SOTA model on the CodeXGLUE benchmark is PLBART (Ahmad et al., 2021), based on the BART (Lewis et al., 2020) architecture. Regarding pre-training data, most of these models use CodeSearchNet (Husain et al., 2019), except DOBF and PLBART: DOBF is pre-trained on 7.9M Java and 3.6M Python files from BigQuery, while PLBART employs much larger data, with 470M Python and 210M Java functions plus 47M NL posts from StackOverflow.

4.5 Model Configurations

We build CodeT5 on Huggingface's T5 (Raffel et al., 2020) implementation in PyTorch and employ two sizes, CodeT5-small (60M) and CodeT5-base (220M). We set the maximum source and target sequence lengths to 512 and 256, respectively, and use FP16 mixed precision to accelerate pre-training. We set the batch size to 1024 and employ a peak learning rate of 2e-4 with linear decay. We pre-train the model with the denoising objective for 100 epochs, followed by bimodal dual training for a further 50 epochs, on a cluster of 16 NVIDIA A100 GPUs with 40G memory. The total training time for CodeT5-small and CodeT5-base is 5 and 12 days, respectively.

In the fine-tuning phase, we find that the tasks in CodeXGLUE (Lu et al., 2021) are quite sensitive to some hyperparameters such as learning rate, number of training steps, and batch size. We conduct a grid search and select the best configuration on the validation set. In multi-task learning, we cover all downstream tasks except clone detection.
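For reference, a minimal sketch of loading the model through the same Huggingface stack, assuming the released checkpoints are published on the HuggingFace hub under Salesforce/codet5-small and Salesforce/codet5-base (the GitHub repository linked in the abstract hosts the released models); the summarization behavior assumes a checkpoint fine-tuned on that task.

```python
# Sketch: load a CodeT5 checkpoint and run Seq2Seq generation.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

code = "def greet(name):\n    print('Hello, ' + name)"
input_ids = tokenizer(code, return_tensors="pt").input_ids

# Generate, e.g., an English summary (after fine-tuning on code summarization).
outputs = model.generate(input_ids, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```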
5 Results and Analysis

In this section, we compare CodeT5 with SOTA models on a broad set of CodeXGLUE downstream tasks (§5.1) and investigate the effects of our bimodal dual generation and multi-task learning (§5.2), followed by a detailed analysis of the proposed identifier-aware pre-training (§5.3).

5.1 CodeXGLUE Downstream Tasks

We evaluate two sizes of our model, CodeT5-small and CodeT5-base, both pre-trained with identifier-aware denoising. In addition, we consider the variant that continues training with bimodal dual generation (dual-gen) and report results with multi-task fine-tuning. The results of all comparison models are obtained from their original papers and from the CodeXGLUE paper (Lu et al., 2021).

Code Summarization. We report code summarization results in terms of smoothed BLEU-4 on six PL datasets in Table 2. All our model variants significantly outperform prior work with either an encoder-only framework (RoBERTa, CodeBERT, DOBF) or an encoder-decoder framework (PLBART). The salient performance gap between these two groups of models confirms that encoder-only frameworks are suboptimal for generation tasks. Compared to the SOTA encoder-decoder model PLBART, even our CodeT5-small yields better overall scores (also on Python and Java), despite being much smaller (60M vs. 140M) and despite PLBART being pre-trained on much larger Python and Java data (more than 100 times larger). We attribute this improvement to our identifier-aware denoising pre-training and better use of the bimodal training data. By increasing the model size, our CodeT5-base boosts the overall performance further, by over 1.2 absolute points over PLBART.

Code Generation. We compare CodeT5 with GPT-style models and PLBART in Table 3. Our CodeT5-small outperforms all decoder-only models as well as the SOTA PLBART, which again confirms the superiority of encoder-decoder models at generating code snippets. Moreover, our CodeT5-base significantly pushes the SOTA results further across all three metrics. In particular, it achieves around 4.7 points improvement in CodeBLEU over PLBART, indicating that CodeT5 can better comprehend code syntax and semantics with the help of identifier-aware pre-training.

Code-to-Code Generation Tasks. We report results on two code-to-code generation tasks, code translation and code refinement, in Table 4, and additionally consider a naive copy baseline that simply copies the source input as the target prediction. On code translation, our CodeT5-small outperforms most baselines and obtains results comparable to PLBART, which shows the advantage of encoder-decoder models in the code-to-code generation setting. Our CodeT5-base further achieves consistent improvements over PLBART across metrics for translating from Java to C# and vice versa.

Figure 3 shows one of CodeT5's outputs when translating C# to Java. Despite the poor BLEU score, CodeT5 generates a function that preserves the same functionality and even has better readability than the ground truth. This reveals that CodeT5 generalizes well instead of memorizing and repeating what it has seen before. On the other hand, it also suggests that BLEU is not a perfect evaluation metric for code generation, where a higher score can sometimes reflect the problematic copying behavior of neural models.

The other code-to-code generation task, code refinement, is challenging: it requires detecting which parts of the code are buggy and fixing them by generating a bug-free code sequence. Due to the large overlap between source and target code, even the naive copy approach yields a very high BLEU score but zero exact matches. We therefore focus on the exact match (EM) metric for this task. As shown in Table 4, EM scores on the small dataset are consistently higher than on the medium one, indicating that it is harder to fix bugs in longer code snippets. Our CodeT5-base significantly outperforms all baselines on EM and especially boosts it by 4.8 absolute points on the more challenging medium task (13.96 vs. GraphCodeBERT's 9.10), reflecting its strong code comprehension ability.

Understanding Tasks. We compare with prior work on two understanding tasks, defect detection and clone detection, in Table 5. Specifically, we generate the binary labels as a unigram sequence from the decoder for defect detection, while for clone detection we first obtain a sequence embedding of each code snippet using the last decoder state, following Lewis et al. (2020), and then predict the labels by measuring the similarity of the two embeddings. Both CodeT5-small and CodeT5-base outperform all baselines on defect detection, with CodeT5-base yielding a 2.6 accuracy point improvement over PLBART. On clone detection, our CodeT5 models achieve results comparable to the SOTA GraphCodeBERT and PLBART models. These results demonstrate that, with an encoder-decoder framework, our CodeT5 can also be adapted well to understanding tasks.
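A sketch of these two understanding-task adaptations, under the same checkpoint assumption as above; the pad-token decoder start and the cosine-similarity scoring are illustrative choices, not the exact fine-tuning recipe.

```python
# Sketch: the two understanding-task adaptations from Section 5.1.
import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

def predict_defect(code: str) -> str:
    # Defect detection: a fine-tuned model emits the label ("true"/"false")
    # as a unigram target sequence from the decoder.
    input_ids = tokenizer(code, return_tensors="pt").input_ids
    out = model.generate(input_ids, max_length=3)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def sequence_embedding(code: str) -> torch.Tensor:
    # Clone detection: take the last decoder hidden state as the snippet
    # embedding, mirroring the BART-style adaptation (Lewis et al., 2020).
    enc = tokenizer(code, return_tensors="pt")
    out = model(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        decoder_input_ids=torch.tensor([[tokenizer.pad_token_id]]),
        output_hidden_states=True,
    )
    return out.decoder_hidden_states[-1][:, -1, :]  # shape (1, hidden_size)

def clone_score(code_a: str, code_b: str) -> float:
    a, b = sequence_embedding(code_a), sequence_embedding(code_b)
    return torch.cosine_similarity(a, b).item()
```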
5.2 Effects of Bimodal Dual Generation and Multi-task Learning

We examine the effects of bimodal dual generation at the pre-training stage and of multi-task learning at the fine-tuning stage. Bimodal dual-generation pre-training brings consistent improvements on code summarization and code generation for both CodeT5-small and CodeT5-base. However, this pre-training task does not help, and sometimes even slightly hurts, performance on PL-PL generation and understanding tasks. We anticipate this is because bimodal dual generation learns a better alignment between PL and NL, which naturally benefits the former tasks involving both PL and NL. As a side effect, this objective could bias the model towards PL-NL tasks and affect its performance on PL-PL tasks.

With multi-task learning, we observe improvements on most downstream tasks except code translation and defect detection. In particular, it largely boosts performance on code summarization, which is not surprising since code summarization accounts for the largest share of sub-tasks (seven out of thirteen) and therefore benefits most from multi-task training. Besides, we observe that multi-task learning consistently improves code refinement, which may benefit from the joint training on both the small and the medium refinement data. Another possible reason is that multi-task training together with defect detection enables the model to better comprehend code semantics for bug detection, an essential intermediate step for code refinement.

5.3 Analyzing Identifier-aware Pre-training

We provide an ablation study to examine the contribution of each component in our identifier-aware objective. Specifically, we compare the performance of our CodeT5-small on four selected tasks when ablating each of the three objectives: masked span prediction (MSP), identifier tagging (IT), and masked identifier prediction (MIP). As shown in Table 6, removing any one of the objectives generally reduces performance on all tasks, indicating that all objectives contribute to better code understanding in CodeT5. However, the effect of each objective differs across tasks. Removing MSP largely reduces performance on all generation tasks but instead increases defect detection performance, showing that masked span prediction is more crucial for capturing the syntactic information needed for generation. By contrast, removing MIP hurts the defect detection task the most, indicating that it focuses more on code semantic understanding. By combining these objectives, our CodeT5 can better capture both syntactic and semantic information from code.

In Figure 4, we further show outputs from CodeT5 and its variant without MIP and IT on code generation. CodeT5 correctly generates the exact function, while the model without MIP and IT fails to recover the identifiers "s2" and "hasField". This shows that our identifier-aware denoising pre-training helps the model better distinguish and leverage identifier information.

We also investigate the identifier tagging performance and find that it achieves over 99% F1 for all PLs, showing that our CodeT5 can confidently distinguish identifiers in code. We then check whether the MSP and MIP tasks conflict, as they employ the same sentinel tokens for masking. In identifier masking, all occurrences of one unique identifier are replaced with the same sentinel token, resulting in a many-to-one mapping, compared to the one-to-one mapping in span prediction. In Table 7, we compare models pre-trained with MSP only, with MIP only, and with both, on these two tasks. We report the prediction accuracy and also the ratio of how often the models generate the same number of predictions as there are sentinel tokens. We observe that pre-training with only MIP or only MSP biases the model towards that task, yielding poor accuracy and a higher mismatch in the number of predictions when applied to the other task.
Interestingly, we find that the MIP-only objective recovers the correct number of predictions in the MSP task better than the MSP-only objective does in the MIP task, meaning that it is easier to adapt from a many-to-one mapping to a one-to-one mapping than the other way around. Finally, combining the two helps our model achieve a good trade-off on both tasks.

6 Conclusion

We have presented CodeT5, a pre-trained encoder-decoder model that incorporates token type information from code. We propose a novel identifier-aware pre-training objective to better leverage identifiers, and a bimodal dual generation task to learn a better NL-PL alignment from code and its comments. Our unified model supports both code understanding and generation tasks and allows for multi-task learning. Experiments show that CodeT5 significantly outperforms prior work on most CodeXGLUE tasks. Further analysis reveals its better code comprehension capability across various programming languages.

Broader Impact and Ethical Consideration

Our work generally belongs to NLP applications for software intelligence. With the goal of improving software development productivity with machine learning methods, software intelligence research has attracted increasing attention in both academia and industry over the last decade. Software code intelligence techniques can help developers reduce tedious repetitive workloads, enhance programming quality, and improve overall software development productivity. This would considerably decrease their working time and could also reduce computation and operational costs, as a bug might degrade system performance or even crash an entire system. Our work addresses the fundamental challenge of pre-training on software code; our study covers a wide range of code intelligence applications in the software development lifecycle, and the proposed CodeT5 method achieves state-of-the-art performance on many of the benchmark tasks, showing its great potential toward this goal. We further discuss the ethical considerations of training CodeT5 and the potential risks of applying it in real-world downstream applications:

Dataset bias. The training datasets in our study are source code, including user-written comments, from open-source Github repositories; they are publicly available and do not tie to any specific application. However, it is possible that these datasets encode stereotypes such as race and gender, whether in the text comments or in the source code itself (e.g., in variable, function, and class names). As such, social biases could be intrinsically embedded in models trained on them. As suggested by Chen et al. (2021), interventions such as filtration or modulation of generated outputs may help mitigate these biases in code corpora.

Computational cost. Model pre-training requires non-trivial computational resources, though we have tried to carefully design our experiments and improve their efficiency to save computational cost. In fact, compared to the recent large-scale language model Codex (Chen et al., 2021), our CodeT5-base has a much smaller model size, 220M versus their 12B (around 55 times smaller). In addition, we ran our experiments on Google Cloud Platform, which purchases carbon credits to reduce its carbon footprint; training CodeT5-base produced around 49.25 kg of CO2, which was fully offset by the provider.
Furthermore, we release our pre-trained models publicly to avoid repeated training for the code intelligence research community.

Automation bias. As CodeT5 can be deployed to provide coding assistance such as code generation, the automation bias of machine learning systems should be carefully considered, especially for developers who tend to over-rely on model-generated outputs. Such systems sometimes produce functions that superficially appear correct but do not actually align with the developer's intent. If developers unintentionally adopt these incorrect suggestions, debugging may take much longer, and significant safety issues may even arise. We suggest that practitioners using CodeT5 always bear in mind that its outputs should be taken only as references that require domain experts to check further for correctness and security.

Security implications. CodeT5 is trained on existing code corpora including CodeSearchNet (Husain et al., 2019) and a small fraction of Google BigQuery, both of which were originally collected from public Github repositories. The pre-trained model might encode sensitive information (e.g., personal addresses or identification numbers) from the training data. Though we conducted multiple rounds of data cleaning to mitigate this before training our models, it is still possible that some sensitive information cannot be completely removed. Besides, due to the non-deterministic nature of generation models like CodeT5, it might produce vulnerable code that harmfully affects software, and it could even benefit more advanced malware development when deliberately misused.

Acknowledgements

We thank Akhilesh Deepak Gotmare, Amrita Saha, Junnan Li, and Chen Xing for valuable discussions. We thank Kathy Baxter for the ethical review. We also thank our anonymous reviewers for their insightful feedback on our paper.

References

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of NAACL-HLT 2021, pages 2655–2668. Association for Computational Linguistics.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR 2020. OpenReview.net.
Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: Multi-mode translation of natural language and Python code with transformers. In Proceedings of EMNLP 2020, pages 9052–9065. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 7057–7067.

Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia Marçal de Oliveira. 2005. A study of the documentation essential to software maintenance. In Proceedings of SIGDOC 2005, pages 68–75. ACM.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 13042–13054.

Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. CodeTrans: Towards cracking the language of silicon's code through self-supervised deep learning and high performance computing. CoRR, abs/2104.02443.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Findings of EMNLP 2020, pages 1536–1547. Association for Computational Linguistics.

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training code representations with data flow. In ICLR 2021. OpenReview.net.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. In Proceedings of EMNLP 2018, pages 1643–1652. Association for Computational Linguistics.

Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In Proceedings of ICML 2020, volume 119 of Proceedings of Machine Learning Research, pages 5110–5121. PMLR.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of ACL 2020, pages 7871–7880. Association for Computational Linguistics.

Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: A method for evaluating automatic evaluation metrics for machine translation. In COLING 2004.

Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In Proceedings of ASE 2020, pages 473–485. IEEE.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of ACL 2019, pages 4487–4496. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664.

Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In Proceedings of ICSE 2021, pages 336–347. IEEE.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: A method for automatic evaluation of code synthesis. CoRR, abs/2009.10297.

Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
Baptiste Rozière, Marie-Anne Lachaux, Marc Szafraniec, and Guillaume Lample. 2021. DOBF: A deobfuscation pre-training objective for programming languages. CoRR, abs/2102.07492.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL 2016. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of ICML 2019, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.

Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. CoRR, abs/1904.09223.

Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code generation using transformer. In ESEC/FSE 2020, pages 1433–1443. ACM.

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Trans. Softw. Eng. Methodol., 28(4):19:1–19:29.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 5998–6008.

Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In Proceedings of SANER 2020, pages 261–271. IEEE.

Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 10197–10207.

Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. 2021. Language-agnostic representation learning of source code from structure and context. In ICLR 2021. OpenReview.net.

This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.