Salesforce's CodeT5 Could Change How AI Writes and Understands Code

Authors:
Yue Wang, wang.y@salesforce.com (Salesforce Research Asia)
Weishi Wang, weishi.wang@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore)
Shafiq Joty, sjoty@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore)
Steven C.H. Hoi, shoi@salesforce.com (Salesforce Research Asia)

Abstract

Pre-trained models for Natural Languages (NL) like BERT and GPT have recently been shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment.
Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5.

1 Introduction

Pre-trained language models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and T5 (Raffel et al., 2020) have greatly boosted performance on a wide spectrum of natural language processing (NLP) tasks. They typically employ a pre-train then fine-tune paradigm that aims to derive generic language representations by self-supervised training on large-scale unlabeled data, which can be transferred to benefit multiple downstream tasks, especially those with limited data annotation. Inspired by their success, there have been many recent attempts to adapt these pre-training methods to programming languages (PL) (Svyatkovskiy et al., 2020; Kanade et al., 2020; Feng et al., 2020), showing promising results on code-related tasks.

However, despite their success, most of these models rely on either an encoder-only model similar to BERT or a decoder-only model like GPT, which is suboptimal for generation and understanding tasks, respectively. For example, CodeBERT (Feng et al., 2020) requires an additional decoder when applied to the code summarization task, where this decoder cannot benefit from the pre-training. Besides, most existing methods simply apply conventional NLP pre-training techniques to source code by regarding it as a sequence of tokens like NL.
In this work, we present CodeT5, a pre-trained encoder-decoder model that considers the token type information in code. Our CodeT5 builds on the T5 architecture (Raffel et al., 2020), which employs denoising sequence-to-sequence (Seq2Seq) pre-training and has been shown to benefit both understanding and generation tasks in natural language. In addition, we propose to leverage the developer-assigned identifiers in code (e.g., the "binarySearch" identifier in Figure 2). To fuse such code-specific knowledge, we propose a novel identifier-aware objective that trains the model to distinguish which tokens are identifiers and to recover them when they are masked.

Furthermore, we propose to leverage the code and its accompanying comments to learn a better NL-PL alignment. Developers often provide documentation for programs to facilitate better software maintenance (de Souza et al., 2005). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them.
We pre-train CodeT5 on the CodeSearchNet data (Husain et al., 2019) following Feng et al. (2020), and fine-tune it on most tasks in the CodeXGLUE benchmark (Lu et al., 2021), including two understanding tasks: code defect detection and clone detection, and generation tasks such as code summarization, generation, translation, and refinement. As shown in Figure 1, we also explore multi-task learning to fine-tune CodeT5 on multiple tasks at a time using a task control code as the source prompt.

In summary, we make the following contributions:

• We present one of the first unified encoder-decoder models, CodeT5, to support both code-related understanding and generation tasks, and it also allows for multi-task learning.

• We propose a novel identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code. Besides, we propose to leverage the NL-PL pairs that are naturally available in source code to learn a better cross-modal alignment.

• Extensive experiments show that CodeT5 yields state-of-the-art results on the fourteen sub-tasks in CodeXGLUE. Further analysis shows that CodeT5 can better capture the code semantics with the proposed identifier-aware pre-training, and that bimodal dual generation primarily benefits NL↔PL tasks.
2 Related Work

Pre-training on Natural Language. Pre-trained models based on Transformer architectures (Vaswani et al., 2017) have led to state-of-the-art performance on a broad set of NLP tasks. They can be generally categorized into three groups: encoder-only models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ELECTRA (Clark et al., 2020), decoder-only models like GPT (Radford et al., 2019), and encoder-decoder models such as MASS (Song et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Compared to encoder-only and decoder-only models that respectively favor understanding and generation tasks, encoder-decoder models can well support both types of tasks. They often employ denoising sequence-to-sequence pre-training objectives that corrupt the source input and require the decoder to recover it.

Pre-training on Programming Language. Pre-training on programming languages is a nascent field where much recent work attempts to extend NLP pre-training methods to source code. CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020) are the two pioneer models. CuBERT employs BERT's masked language modeling objective to derive generic code-specific representations, and CodeBERT further adds a replaced token detection (Clark et al., 2020) task to learn NL-PL cross-modal representation. Besides the BERT-style models, Svyatkovskiy et al. (2020) and Liu et al. (2020) respectively employ GPT and UniLM (Dong et al., 2019) for the code completion task.
Transcoder (Rozière et al., 2020) explores programming language translation in an unsupervised setting. Different from them, we explore encoder-decoder models based on T5 for programming language pre-training and support a more comprehensive set of tasks.

Some emerging work (Clement et al., 2020; Mastropaolo et al., 2021; Elnaggar et al., 2021) in the recent literature also explores the T5 framework on code, but it focuses only on a limited subset of generation tasks and does not support understanding tasks as ours does. Apart from these, PLBART (Ahmad et al., 2021), based on another encoder-decoder model, BART, can also support both understanding and generation tasks. However, all the above prior work simply processes code in the same way as natural language and largely ignores the code-specific characteristics.

Recently, GraphCodeBERT (Guo et al., 2021) incorporates the data flow extracted from the code structure into CodeBERT, while Rozière et al. (2021) propose a deobfuscation objective to leverage the structural aspect of PL. These models focus only on training a better code-specific encoder. Zügner et al. (2021) propose to capture the relative distances between code tokens over the code structure. By contrast, we specifically focus on the identifiers that preserve rich code semantics and fuse such information into a Seq2Seq model via two novel identifier tagging and prediction tasks.
3 CodeT5

Our CodeT5 builds on an encoder-decoder framework with the same architecture as T5 (Raffel et al., 2020). It aims to derive generic representations for programming language (PL) and natural language (NL) via pre-training on unlabeled source code. As illustrated in Figure 2, we extend the denoising Seq2Seq objective in T5 by proposing two identifier tagging and prediction tasks that enable the model to better leverage the token type information from PL, namely the identifiers assigned by developers.

In the following, we introduce how CodeT5 encodes PL and NL inputs (§3.1) and our proposed identifier-aware pre-training tasks (§3.2), followed by fine-tuning with task-specific transfer learning and multi-task training (§3.3).

3.1 Encoding NL and PL

At the pre-training stage, our model receives either PL-only or NL-PL inputs, depending on whether the code snippet has an accompanying NL description or not. For the NL-PL bimodal inputs, we concatenate them into a sequence with a delimiter token [SEP] and represent the whole input sequence in the format $x = ([\mathrm{CLS}], w_1, \ldots, w_n, [\mathrm{SEP}], c_1, \ldots, c_m, [\mathrm{SEP}])$, where $n$ and $m$ denote the number of NL word tokens and PL code tokens, respectively. The NL word sequence is empty for PL-only unimodal inputs.

In order to capture more code-specific features, we propose to leverage token type information from code. We focus on the type of identifiers (e.g., function names and variables) as they are one of the most PL-agnostic features and preserve rich code semantics. Specifically, we convert the PL segment into an Abstract Syntax Tree (AST) and extract the node type for each code token.
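The input layout of Section 3.1 and the per-token identifier flags derived from AST node types can be illustrated with a small sketch. This is not the authors' implementation: the helper names `build_input` and `identifier_labels`, the toy token list, and the node-type strings (`"identifier"`, `"keyword"`, `"punct"`) are assumptions made for illustration, and the node types are supplied by hand rather than produced by a real parser.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"

def build_input(nl_tokens: List[str], code_tokens: List[str]) -> List[str]:
    """Lay out the sequence as [CLS] w_1..w_n [SEP] c_1..c_m [SEP]."""
    return [CLS, *nl_tokens, SEP, *code_tokens, SEP]

def identifier_labels(typed_tokens: List[Tuple[str, str]]) -> List[int]:
    """Return 1 where the (hand-assigned, hypothetical) node type is 'identifier', else 0."""
    return [1 if node_type == "identifier" else 0 for _, node_type in typed_tokens]

# Toy PL segment with node types written by hand (a parser would supply these).
typed_code = [
    ("def", "keyword"), ("add", "identifier"), ("(", "punct"),
    ("a", "identifier"), (",", "punct"), ("b", "identifier"),
    (")", "punct"), (":", "punct"),
]
code_tokens = [tok for tok, _ in typed_code]

x_bimodal = build_input(["add", "two", "numbers"], code_tokens)
x_pl_only = build_input([], code_tokens)  # empty NL segment for PL-only inputs
y = identifier_labels(typed_code)         # one binary flag per code token
```

In this sketch the PL-only case simply passes an empty NL list, so the sequence keeps both delimiters and the code segment stays in the same position.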
Finally, we construct a sequence of binary labels $y \in \{0, 1\}^m$ for the PL segment, where each $y_i \in \{0, 1\}$ represents whether the code token $c_i$ is an identifier or not.

3.2 Pre-training Tasks

We now introduce our proposed pre-training tasks that enable CodeT5 to learn useful patterns from either PL-only or NL-PL bimodal data.

Identifier-aware Denoising Pre-training. Denoising sequence-to-sequence (Seq2Seq) pre-training has been shown to be quite effective in a broad set of NLP tasks (Song et al., 2019; Raffel et al., 2020; Lewis et al., 2020). This denoising objective typically first corrupts the source sequence with some noising functions and then requires the decoder to recover the original text. In this work, we utilize a span masking objective similar to T5 (Raffel et al., 2020) that randomly masks spans with arbitrary lengths and then predicts these masked spans combined with some sentinel tokens at the decoder. We refer to this task as Masked Span Prediction (MSP), as illustrated in Figure 2 (a).

Specifically, we employ the same 15% corruption rate as T5 and ensure that the average span length is 3 by uniformly sampling spans of 1 to 5 tokens.
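A minimal sketch of this span corruption, under stated assumptions: the function name `mask_spans` is hypothetical, sentinels are written `[MASK0]`, `[MASK1]`, ... following the special tokens listed in Section 4.2, the target lists each sentinel followed by the tokens it hides (a T5-style layout), and spans are kept non-adjacent so that each one maps to its own sentinel. Sampling over whole words before subword tokenization is not modeled here.

```python
import random
from typing import List, Tuple

def mask_spans(tokens: List[str], rate: float = 0.15, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Corrupt about `rate` of the tokens using spans of uniform length 1..5."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * rate))  # how many tokens to mask in total
    masked = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 1000:
        attempts += 1
        length = min(rng.randint(1, 5), budget)  # uniform over 1..5, mean 3
        start = rng.randrange(0, len(tokens) - length + 1)
        # Reject spans that overlap or touch an earlier span.
        lo, hi = max(0, start - 1), min(len(tokens), start + length + 1)
        if any(masked[lo:hi]):
            continue
        for i in range(start, start + length):
            masked[i] = True
        budget -= length

    source, target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if masked[i]:
            tag = f"[MASK{sentinel}]"
            sentinel += 1
            source.append(tag)   # the span collapses to one sentinel in the input
            target.append(tag)   # the decoder target repeats the sentinel ...
            while i < len(tokens) and masked[i]:
                target.append(tokens[i])  # ... followed by the hidden tokens
                i += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target

tokens = [f"t{i}" for i in range(40)]
source, target = mask_spans(tokens)
```

Substituting each sentinel in `source` with the tokens that follow it in `target` recovers the original sequence, which is the recovery the decoder is trained to perform.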
Moreover, we employ whole word masking by sampling spans before subword tokenization, which aims to avoid masking partial sub-tokens and has been shown to be helpful (Sun et al., 2019). Notably, we pre-train a shared model for various PLs to learn robust cross-lingual representations. We describe the masked span prediction loss as:

$$\mathcal{L}_{MSP}(\theta) = \sum_{t=1}^{k} -\log P_\theta\left(x^{mask}_t \mid x^{\backslash mask}, x^{mask}_{<t}\right)$$

where $\theta$ are the model parameters, $x^{\backslash mask}$ is the masked input, $x^{mask}$ is the masked sequence to predict from the decoder with $k$ denoting the number of tokens in $x^{mask}$, and $x^{mask}_{<t}$ is the span sequence generated so far.

To fuse more code-specific structural information (the identifier node type in the AST) into the model, we propose two additional tasks, Identifier Tagging (IT) and Masked Identifier Prediction (MIP), to complement the denoising pre-training.

• Identifier Tagging (IT). It aims to notify the model with the knowledge of whether a code token is an identifier or not, which shares a similar spirit with syntax highlighting in some developer-aided tools. As shown in Figure 2 (b), we map the final hidden states of the PL segment at the CodeT5 encoder into a sequence of probabilities $p = (p_1, \ldots, p_m)$, and compute a binary cross entropy loss for sequence labeling:

$$\mathcal{L}_{IT}(\theta_e) = \sum_{i=1}^{m} -\left[y_i \log p_i + (1 - y_i) \log (1 - p_i)\right]$$

where $\theta_e$ are the encoder parameters. Note that by casting the task as a sequence labeling problem, the model is expected to capture the code syntax and the data flow structures of the code.

• Masked Identifier Prediction (MIP). Different from the random span masking in MSP, we mask all identifiers in the PL segment and employ a unique sentinel token for all occurrences of one specific identifier. In software engineering, this is called obfuscation, where changing identifier names does not impact the code semantics.
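The identifier masking just described can be sketched as follows. This is an illustrative sketch rather than the paper's code: the function name `mask_identifiers` is hypothetical, the binary labels are assumed to come from the identifier flags of Section 3.1, sentinels are written `[MASK0]`, `[MASK1]`, ..., and the target layout (each sentinel followed by the identifier it hides, in order of first appearance) is an assumption modeled on T5-style targets.

```python
from typing import Dict, List, Tuple

def mask_identifiers(code_tokens: List[str], labels: List[int]) -> Tuple[List[str], List[str]]:
    """Replace every identifier by a sentinel; repeated identifiers share one sentinel."""
    sentinel_of: Dict[str, str] = {}
    source: List[str] = []
    for tok, is_identifier in zip(code_tokens, labels):
        if is_identifier:
            if tok not in sentinel_of:
                sentinel_of[tok] = f"[MASK{len(sentinel_of)}]"
            source.append(sentinel_of[tok])
        else:
            source.append(tok)
    # Assumed target layout: sentinel, then the unique identifier it stands for.
    target: List[str] = []
    for name, tag in sentinel_of.items():
        target += [tag, name]
    return source, target

code = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
labels = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
obfuscated, target = mask_identifiers(code, labels)
```

Both occurrences of `a` map to the same sentinel, so recovering the name requires the model to link them, which is the many-to-one mapping that distinguishes this task from span masking.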
Inspired by Rozière et al. (2021), we arrange the unique identifiers with the sentinel tokens into a target sequence $I$ as shown in Figure 2 (c). We then predict it in an auto-regressive manner:

$$\mathcal{L}_{MIP}(\theta) = \sum_{j=1}^{|I|} -\log P_\theta\left(I_j \mid x^{\backslash I}, I_{<j}\right)$$

where $x^{\backslash I}$ is the masked input. Note that this deobfuscation is a more challenging task that requires the model to comprehend the code semantics based on the obfuscated code and to link the occurrences of the same identifiers together.

We alternately optimize these three losses with an equal probability, which constitutes our proposed identifier-aware denoising pre-training.

Bimodal Dual Generation. In the pre-training phase, the decoder only sees discrete masked spans and identifiers, which differs from the downstream tasks where the decoder needs to generate either fluent NL text or syntactically correct code snippets. To close the gap between pre-training and fine-tuning, we propose to leverage the NL-PL bimodal data to train the model for a bidirectional conversion, as shown in Figure 2 (d). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them. For each NL-PL bimodal datapoint, we construct two training instances with reverse directions and add language ids (e.g., one for Java PL and one for English NL). This operation can also be seen as a special case of T5's span masking by masking either the full NL or the full PL segment from the bimodal inputs. This task aims to improve the alignment between the NL and PL counterparts.

3.3 Fine-tuning CodeT5

After pre-training on large-scale unlabeled data, we adapt CodeT5 to downstream tasks via either task-specific transfer learning or multi-task learning.

Task-specific Transfer Learning: Generation vs. Understanding Tasks. Code-related tasks can be categorized into generation and understanding tasks.
For the former, our CodeT5 can be naturally adapted with its Seq2Seq framework. For understanding tasks, we investigate two ways: either generating the label as a unigram target sequence (Raffel et al., 2020), or predicting it from the vocabulary of class labels based on the last decoder hidden state, following Lewis et al. (2020).

Multi-task Learning. We also explore a multi-task learning setting by training a shared model on multiple tasks at a time. Multi-task learning is able to reduce computation cost by reusing most of the model weights for many tasks and has been shown to improve the model generalization capability in NL pre-training (Liu et al., 2019a). We follow Raffel et al. (2020) to employ the same unified model for all tasks without adding any task-specific networks, but allow selecting different best checkpoints for different tasks. To notify the model which task it is dealing with, we design a unified format of task control codes and prepend it to the source inputs, as shown in Figure 1. For instance, we employ "Translate Java to CSharp:" as the source prompt for the code-to-code translation task from Java to CSharp.

As different tasks have different dataset sizes, we follow Conneau and Lample (2019) to employ a balanced sampling strategy. For $N$ datasets (or tasks), with probabilities $\{q_i\}_{i=1}^{N}$, we define the following multinomial distribution to sample from:

$$q_i = \frac{r_i^{\alpha}}{\sum_{j=1}^{N} r_j^{\alpha}}, \quad \text{where } r_i = \frac{n_i}{\sum_{k=1}^{N} n_k},$$

where $n_i$ is the number of examples for the $i$-th task and $\alpha$ is set to 0.7. This balanced sampling aims to alleviate the bias towards high-resource tasks.

4 Experimental Setup

4.1 Pre-training Dataset

We follow Feng et al. (2020) to employ CodeSearchNet (Husain et al., 2019) to pre-train CodeT5, which consists of six PLs with both unimodal and bimodal data.
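As a worked illustration of the balanced sampling strategy from the multi-task learning paragraph of Section 3.3, the sketch below assumes the exponent-smoothed form of Conneau and Lample (2019), in which each task's share of the examples is raised to the power α = 0.7 and then renormalized. The function name `balanced_sampling_probs`, the task names, and the dataset sizes are hypothetical.

```python
from typing import Dict

def balanced_sampling_probs(sizes: Dict[str, int], alpha: float = 0.7) -> Dict[str, float]:
    """q_i proportional to (n_i / sum_k n_k) ** alpha, renormalized to sum to 1."""
    total = sum(sizes.values())
    weights = {task: (n / total) ** alpha for task, n in sizes.items()}
    norm = sum(weights.values())
    return {task: w / norm for task, w in weights.items()}

# Hypothetical dataset sizes, chosen only to show the flattening effect.
sizes = {"summarization": 100_000, "defect_detection": 10_000}
probs = balanced_sampling_probs(sizes)
raw_share = sizes["defect_detection"] / sum(sizes.values())
```

With α below 1, the smaller task is drawn more often than its raw share of the data (`probs["defect_detection"]` exceeds `raw_share`), which is how the strategy reduces the bias towards high-resource tasks.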
Apart from that, we additionally collect two datasets of C/CSharp from BigQuery to ensure that all downstream tasks have overlapping PLs with the pre-training data. In total, we employ around 8.35 million instances for pre-training. Table 1 shows some basic statistics. To obtain the identifier labels from code, we leverage tree-sitter to convert the PL into an abstract syntax tree and then extract its node type information. We filter out reserved keywords for each PL from its identifier list. We observe that PLs have different identifier rates, where Go has the lowest rate of 19% and Ruby has the highest rate of 32%.

4.2 Code-specific Tokenizer

Tokenization is a key ingredient for the success of pre-trained language models like BERT and GPT. They often employ a Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) to alleviate the Out-of-Vocabulary (OoV) issue. Specifically, we train a Byte-level BPE tokenizer following Radford et al. (2019) and set the vocabulary size to 32,000 as in T5. We add additional special tokens ([PAD], [CLS], [SEP], [MASK0], ..., [MASK99]). This tokenizer is trained on all of our pre-training data with non-printable characters and low-frequency tokens (occurring <3 times) filtered. We compare it with T5's default tokenizer and find that our tokenizer largely reduces the length of the tokenized code sequence by 30% - 45% on downstream tasks. This accelerates training and especially benefits generation tasks due to the shorter sequence to predict. We also spot a severe problem when applying T5's default tokenizer to source code, where it would encode some common code tokens such as brackets [‘{’, ‘}’] into unknown tokens.

4.3 Downstream Tasks and Metrics

We cover most generation and understanding tasks in the CodeXGLUE benchmark (Lu et al., 2021) and employ the provided public datasets and the same data splits for all these tasks.
We first consider two cross-modal generation tasks. Code summarization aims to summarize a function-level code snippet into English descriptions. The dataset consists of six PLs including Ruby, JavaScript, Go, Python, Java, and PHP from CodeSearchNet (Husain et al., 2019). We employ the smoothed BLEU-4 (Lin and Och, 2004) to evaluate this task. Code generation is the task of generating a code snippet based on NL descriptions. We employ the Concode data (Iyer et al., 2018) in Java, where the input contains both NL texts and class environment contexts, and the output is a function. We evaluate it with BLEU-4, exact match (EM) accuracy, and CodeBLEU (Ren et al., 2020), which considers syntactic and semantic matches based on the code structure in addition to the n-gram match.

Besides, we consider two code-to-code generation tasks. Code translation aims to migrate legacy software from one PL to another, where we focus on translating functions from Java to CSharp and vice versa. Code refinement aims to convert a buggy function into a correct one. We employ two Java datasets provided by Tufano et al. (2019) with various function lengths: small (fewer than 50 tokens) and medium (50-100 tokens). We use BLEU-4 and exact match to evaluate them.

We also investigate how CodeT5 performs on two understanding-based tasks. The first one is defect detection, which aims to predict whether a piece of code is vulnerable to software systems or not. We use the C dataset provided by Zhou et al. (2019) for this experiment. The second task is clone detection, which aims to measure the similarity between two code snippets and predict whether they have the same functionality. We experiment with the Java data provided by Wang et al. (2020). In total, our CodeT5 supports six tasks and fourteen sub-tasks in CodeXGLUE with a unified encoder-decoder model.
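Of the metrics named in Section 4.3, exact match is the simplest to state precisely. The sketch below is an assumption-laden illustration and not the benchmark's official scorer: the function name `exact_match` is hypothetical, and collapsing runs of whitespace before comparison is a choice made here that may differ from the reference evaluation scripts.

```python
from typing import List

def exact_match(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions identical to their reference after whitespace collapsing."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(snippet: str) -> str:
        # Assumption: only whitespace differences are forgiven.
        return " ".join(snippet.split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

score = exact_match(
    ["int a = 1 ;", "return x ;"],
    ["int  a = 1 ;", "return y ;"],
)
```

Because a single differing token makes the whole prediction count as a miss, this metric is far stricter than n-gram overlap, which matters for tasks where source and target share most of their tokens.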
4.4 Comparison Models

We compare CodeT5 with state-of-the-art (SOTA) pre-trained models that can be categorized into three types: encoder-only, decoder-only, and encoder-decoder models. As encoder-only models, we consider RoBERTa (Liu et al., 2019b), RoBERTa (code) trained with masked language modeling (MLM) on code, CodeBERT (Feng et al., 2020) trained with both MLM and replaced token detection (Clark et al., 2020), GraphCodeBERT (Guo et al., 2021) using data flow from code, and DOBF (Rozière et al., 2021) trained with the identifier deobfuscation objective. Note that although DOBF employs a Seq2Seq model during pre-training, it only aims to train a better encoder for downstream tasks without exploring the potential benefit of the pre-trained decoder.

For decoder-only models, we compare GPT-2 (Radford et al., 2019) and its adaptations to the code domain, including CodeGPT-2 and CodeGPT-adapted. The difference is that the latter utilizes a GPT-2 checkpoint for model initialization while the former is trained from scratch. As encoder-decoder models, the current SOTA model for the CodeXGLUE benchmark is PLBART (Ahmad et al., 2021), based on the BART (Lewis et al., 2020) architecture. For pre-training data, most of these models employ CodeSearchNet (Husain et al., 2019) except DOBF and PLBART. DOBF is pre-trained on 7.9M Java and 3.6M Python files from BigQuery, while PLBART employs much larger data with 470M Python and 210M Java functions, and 47M NL posts from StackOverflow.

4.5 Model Configurations

We build CodeT5 based on Huggingface's T5 (Raffel et al., 2020) PyTorch implementation and employ two sizes: CodeT5-small (60M) and CodeT5-base (220M). We set the maximum source and target sequence lengths to 512 and 256, respectively. We use mixed precision (FP16) to accelerate the pre-training. We set the batch size to 1024 and employ a peak learning rate of 2e-4 with linear decay.
We pre-train the model with the denoising objective for 100 epochs and bimodal dual training for a further 50 epochs on a cluster of 16 NVIDIA A100 GPUs with 40G memory. The total training time for CodeT5-small and CodeT5-base is 5 and 12 days, respectively.

In the fine-tuning phase, we find that the tasks in CodeXGLUE (Lu et al., 2021) are quite sensitive to some hyperparameters such as learning rate, training steps, and batch size. We conduct a grid search and select the best parameters based on the validation set. In multi-task learning, we cover all downstream tasks except clone detection.

5 Results and Analysis

In this section, we compare CodeT5 with SOTA models on a broad set of CodeXGLUE downstream tasks (§5.1), and investigate the effects of our bimodal dual generation and multi-task learning (§5.2), followed by a detailed analysis of the proposed identifier-aware pre-training (§5.3).

5.1 CodeXGLUE Downstream Tasks

We evaluate two sizes of our model, CodeT5-small and CodeT5-base, that are pre-trained with identifier-aware denoising. In addition, we consider the model that continues to train with bimodal dual generation (dual-gen) and show the results with multi-task fine-tuning. The results of all comparison models are obtained from their original papers and also the CodeXGLUE paper (Lu et al., 2021).

Code Summarization. We show code summarization results of smoothed BLEU-4 on six PL datasets in Table 2. We observe that all our model variants significantly outperform prior work with either an encoder-only (RoBERTa, CodeBERT, DOBF) or an encoder-decoder framework (PLBART). Moreover, the salient performance gap between these two groups of models confirms that encoder-only frameworks are suboptimal for generation tasks. Compared to the SOTA encoder-decoder model PLBART, we find that even our CodeT5-small yields better overall scores (also on Python and Java) given that our model is much smaller (60M vs.
140M) and PLBART is pre-trained with much larger Python and Java data (> 100 times). By increasing the model size, our CodeT5-base boosts the overall performance by over 1.2 absolute points over PLBART.

Code Generation. We compare CodeT5 with GPT-style models and PLBART in Table 3. Our CodeT5-small outperforms all decoder-only models and also the SOTA PLBART, which again confirms the superiority of encoder-decoder models at generating code snippets. Moreover, our CodeT5-base further significantly pushes the SOTA results across three metrics. Particularly, it achieves around 4.7 points improvement on CodeBLEU over PLBART, indicating that CodeT5 can better comprehend the code syntax and semantics with the help of identifier-aware pre-training.

Code-to-Code Generation Tasks. We compare two code-to-code generation tasks, code translation and code refinement, in Table 4 and further consider one naive copy baseline that copies the source input as the target prediction. In the code translation task, our CodeT5-small outperforms most of the baselines and obtains comparable results with PLBART, which shows the advantages of encoder-decoder models in the code-to-code generation setting. Our CodeT5-base further achieves consistent improvements over PLBART across various metrics for translating from Java to C# and vice versa.

Here we show one CodeT5 output of translating C# to Java in Figure 3. In this case, despite the poor BLEU score, CodeT5 is able to generate a function that preserves the same functionality and even has better readability compared to the ground truth. This reveals that CodeT5 has a good generalization ability instead of memorizing and repeating what it has seen before.
On the other hand, it also suggests that the BLEU score is not a perfect evaluation metric for code generation tasks, where sometimes a higher score can instead reflect the problematic copy issues of neural models.

Another code-to-code generation task is code refinement, a challenging task that requires detecting which parts of the code are buggy and fixing them by generating a bug-free code sequence. Due to the large overlap of source and target code, even the naive copy approach yields very high BLEU scores but zero exact matches. Therefore, we focus on the exact match (EM) metric to evaluate this task. As shown in Table 4, we observe that EM scores for the small data are consistently higher than for the medium one, indicating that it is harder to fix bugs in a longer code snippet. Our CodeT5-base significantly outperforms all baselines on EM and especially boosts over 4.8 points for the more challenging medium task (13.96 vs. GraphCodeBERT's 9.10), reflecting its strong code understanding capability.

Understanding Tasks. We compare with two understanding tasks of defect detection and clone detection in Table 5. Specifically, we generate the binary labels as a unigram sequence from the decoder for the defect detection task, while for the clone detection task, we first obtain the sequence embedding of each code snippet using the last decoder state following Lewis et al. (2020) and then predict the labels by measuring their similarity. Both CodeT5-small and CodeT5-base outperform all baselines on the defect detection task, while CodeT5-base yields a 2.6 accuracy score improvement over PLBART. For the clone detection task, our CodeT5 models achieve results comparable to the SOTA GraphCodeBERT and PLBART models. These results demonstrate that with an encoder-decoder framework, our CodeT5 can still be adapted well for understanding tasks.
5.2 Effects of Bimodal Dual Generation and Multi-task Learning

Bimodal pre-training brings consistent improvements for code summarization and generation tasks on both CodeT5-small and CodeT5-base. However, this pre-training task does not help, and sometimes even slightly hurts, the performance on PL-PL generation and understanding tasks. We anticipate that this is because bimodal dual generation learns a better alignment between PL and NL, which naturally benefits the former tasks involving both PL and NL. As a side effect, this objective could bias the model towards PL-NL tasks and affect its performance on PL-PL tasks.

Multi-task learning generally improves most downstream tasks except code translation and defect detection. In particular, it largely boosts the performance on code summarization, which is not surprising as code summarization takes up the largest portion of sub-tasks (six out of thirteen) and thereby benefits the most from multi-task learning. Besides, we observe that multi-task learning consistently improves the performance of code refinement, which might benefit from the joint training on both small and medium refinement data. Another possible reason is that multi-task training with defect detection enables the model to better comprehend the code semantics for bug detection, which is also a necessary intermediate step for code refinement.

5.3 Analyzing Identifier-aware Pre-training

We provide an ablation study to examine the contribution of each component in our identifier-aware objective. Specifically, we compare the performance of our CodeT5-small on four selected tasks by ablating each of the three objectives: masked span prediction (MSP), identifier tagging (IT), and masked identifier prediction (MIP).
As shown in Table 6, we observe that removing any one of the objectives generally reduces the performance on all tasks, indicating that all objectives contribute to the better code understanding of our CodeT5. However, the effect of each objective differs across tasks. Specifically, removing MSP largely reduces the performance on all generation tasks but instead increases the defect detection performance. This shows that masked span prediction is more crucial for capturing syntactic information for generation tasks. On the contrary, removing MIP hurts the defect detection task the most, indicating that it might focus more on code semantic understanding. By combining these objectives, our CodeT5 can better capture both syntactic and semantic information from code.

We further provide outputs from CodeT5 and its variant without MIP and IT on code generation in Figure 4. We observe that CodeT5 can correctly generate the exact function, while the model without MIP and IT fails to recover the identifiers "s2" and "hasField". This shows that our identifier-aware denoising pre-training can better distinguish and leverage the identifier information.

We also investigate the identifier tagging performance and find that it achieves over 99% F1 for all PLs, showing that our CodeT5 can confidently distinguish identifiers in code. We then check whether the MSP and MIP tasks would conflict, as they employ the same sentinel tokens for masking. In identifier masking, all occurrences of one unique identifier are replaced with the same sentinel token, resulting in a many-to-one mapping compared to the one-to-one mapping in span prediction. We compare models pre-trained with either MSP or MIP, and with both, on these two tasks in Table 7. We observe that pre-training with only MIP or only MSP biases the model towards that task, leading to poor accuracy and a higher mismatch in the number of predictions when applied to the other task.
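The difference between the many-to-one mapping of identifier masking and the one-to-one mapping of span masking can be made concrete with a small sketch. This is an illustration under stated assumptions rather than the pre-training code itself: the function names are hypothetical, the `<extra_id_N>` sentinel format is borrowed from the T5 convention, and the identifier set and span positions are supplied by hand instead of being extracted by a parser or sampled at random.

```python
def mask_identifiers(tokens, identifiers):
    """MIP-style masking: all occurrences of one identifier share one sentinel."""
    sentinel_of = {}
    masked = []
    for tok in tokens:
        if tok in identifiers:
            if tok not in sentinel_of:
                sentinel_of[tok] = f"<extra_id_{len(sentinel_of)}>"
            masked.append(sentinel_of[tok])
        else:
            masked.append(tok)
    # Target: each sentinel once, followed by the identifier it hides.
    target = []
    for name, sentinel in sentinel_of.items():
        target += [sentinel, name]
    return masked, target


def mask_spans(tokens, spans):
    """MSP-style masking: each (start, end) span gets its own fresh sentinel.

    Spans are assumed to be non-overlapping, with `end` exclusive.
    """
    masked, target = [], []
    pos = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        masked += tokens[pos:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        pos = end
    masked += tokens[pos:]
    return masked, target


tokens = "def add ( a , b ) : return a + b".split()

mip_input, mip_target = mask_identifiers(tokens, {"add", "a", "b"})
msp_input, msp_target = mask_spans(tokens, [(3, 6), (9, 12)])

print(" ".join(mip_input))   # sentinels repeat wherever an identifier repeats
print(" ".join(mip_target))
print(" ".join(msp_input))   # every sentinel appears exactly once
print(" ".join(msp_target))
```

In the identifier-masked input, five masked positions map onto three sentinels, so a model must emit fewer predictions than there are masked positions; in the span-masked input, the number of predictions equals the number of sentinels in the input.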
Interestingly, we find that the MIP-only objective can better recover the correct number of predictions in the MSP task than MSP-only does in the MIP task (Table 7), meaning that it is easier to adapt from a many-to-one mapping to a one-to-one mapping than the opposite.

6 Conclusion

We have presented CodeT5, a pre-trained encoder-decoder model that incorporates the token type information from code. We propose a novel identifier-aware pre-training objective to better leverage the identifiers and propose a bimodal dual generation task to learn a better NL-PL alignment using code and its comments. Our unified model can support both code understanding and generation tasks and allows for multi-task learning. Experiments show that CodeT5 significantly outperforms all prior work in most CodeXGLUE tasks. Further analysis also reveals its better code comprehension capability across various programming languages.

Broader Impact and Ethical Consideration

Our work generally belongs to NLP applications for software intelligence. With the goal of improving the development productivity of software with machine learning methods, software intelligence research has attracted increasing attention in both academia and industry over the last decade. Software code intelligence techniques can help developers reduce tedious repetitive workloads, enhance programming quality and improve overall software development productivity. This would considerably decrease their working time and could also potentially reduce the computation and operational cost, as a bug might degrade the system performance or even crash the entire system.
Our work addresses the fundamental challenge of software code pre-training, our study covers a wide range of code intelligence applications in the software development lifecycle, and the proposed CodeT5 method achieves state-of-the-art performance on many of the benchmark tasks, showing its great potential benefit towards this goal.

We further discuss the ethical considerations of training CodeT5 and the potential risks when applying it to real-world downstream applications:

Dataset bias. The training datasets in our study are source code, including user-written comments, from publicly available open source GitHub repositories, which are not tied to any specific application. However, it is possible that these datasets encode stereotypes such as race and gender from the text comments, or even from the source code such as variable, function and class names. As such, social biases would be intrinsically embedded into the models trained on them. As suggested by (Chen et al., 2021), interventions such as filtration or modulation of generated outputs may help to mitigate these biases in the code corpus.

Computational cost. Our model pre-training requires non-trivial computational resources, though we have tried our best to carefully design our experiments to save unnecessary computation costs. In fact, compared to the recent large-scale language model Codex (Chen et al., 2021), our CodeT5-base has a much smaller model size of 220M than theirs of 12B (∼ 55×). In addition, we experiment on Google Cloud Platform, which purchases carbon credits to reduce its carbon footprint; training CodeT5 produced around 49.25 kg of CO2, which was totally offset by the provider. Furthermore, we release our pre-trained models publicly to avoid repeated training for the code intelligence research community.
Automation bias. As CodeT5 can be deployed to provide coding assistance such as code generation for aiding developers, the automation bias of machine learning systems should be carefully considered, especially for developers who tend to over-rely on model-generated outputs. Sometimes these systems might produce functions that superficially appear correct but do not actually align with the developer's intents. If developers unintentionally adopt these incorrect code suggestions, they might spend much longer on debugging, and the suggestions might even lead to significant safety issues. We suggest that practitioners using CodeT5 always bear in mind that its generated outputs should only be taken as references, which require domain experts for further correctness and security checking.

Security implications. We train CodeT5 on existing code corpora including CodeSearchNet (Husain et al., 2019) and a small fraction of Google BigQuery, both of which were originally collected from public GitHub repositories. Pre-trained models might encode some sensitive information (e.g., personal addresses or identification numbers) from the training data. Though we have conducted multiple rounds of data cleaning to mitigate this before training our models, it is still possible that some sensitive information cannot be completely removed. Besides, due to the non-deterministic nature of generation models like CodeT5, it might produce vulnerable code that harmfully affects software, and it could even benefit more advanced malware development when deliberately misused.

Acknowledgements

We thank Akhilesh Deepak Gotmare, Amrita Saha, Junnan Li, and Chen Xing for valuable discussions. We thank Kathy Baxter for the ethical review. We also thank our anonymous reviewers for their insightful feedback on our paper.

References

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021.
Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2655–2668. Association for Computational Linguistics.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: multi-mode translation of natural language and python code with transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9052–9065. Association for Computational Linguistics.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7057–7067.
Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia Marçal de Oliveira. 2005. A study of the documentation essential to software maintenance. In Proceedings of the 23rd Annual International Conference on Design of Communication: documenting & Designing for Pervasive Information, SIGDOC 2005, Coventry, UK, September 21-23, 2005, pages 68–75. ACM.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042–13054.
Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. CodeTrans: Towards cracking the language of silicone's code through self-supervised deep learning and high performance computing. CoRR, abs/2104.02443.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 1536–1547. Association for Computational Linguistics.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: pre-training code representations with data flow. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1643–1652. Association for Computational Linguistics.
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5110–5121. PMLR.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland.
Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, September 21-25, 2020, pages 473–485. IEEE.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4487–4496. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664.
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021, pages 336–347. IEEE.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a method for automatic evaluation of code synthesis. CoRR, abs/2009.10297.
Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Baptiste Rozière, Marie-Anne Lachaux, Marc Szafraniec, and Guillaume Lample. 2021. DOBF: A deobfuscation pre-training objective for programming languages. CoRR, abs/2102.07492.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.
Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: enhanced representation through knowledge integration. CoRR, abs/1904.09223.
Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: code generation using transformer. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, pages 1433–1443. ACM.
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Trans. Softw. Eng. Methodol., 28(4):19:1–19:29.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, pages 261–271. IEEE.
Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 10197–10207.
Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. 2021. Language-agnostic representation learning of source code from structure and context. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.