Salesforce's CodeT5 Could Change How AI Writes and Understands Code

Authors:
Yue Wang, wang.y@salesforce.com (Salesforce Research Asia);
Weishi Wang, weishi.wang@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore);
Shafiq Joty, sjoty@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore);
Steven C.H. Hoi, shoi@salesforce.com (Salesforce Research Asia).

Abstract

Pre-trained models for Natural Languages (NL) like BERT and GPT have recently been shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (or understanding) tasks, or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed by developer-assigned identifiers.

Code: https://github.com/salesforce/CodeT5

1 Introduction

Pre-trained language models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and T5 (Raffel et al., 2020) have greatly boosted performance in a wide spectrum of natural language processing (NLP) tasks. They typically employ a pre-train then fine-tune paradigm that aims to derive generic language representations by self-supervised training on large-scale unlabeled data, which can be transferred to benefit multiple downstream tasks, especially those with limited data annotation.
Inspired by their success, many recent attempts adapt these pre-training methods to programming languages (PL) (Svyatkovskiy et al., 2020; Kanade et al., 2020; Feng et al., 2020), showing promising results on code-related tasks.

However, despite their success, most of these models rely on either an encoder-only model similar to BERT (Svyatkovskiy et al., 2020; Feng et al., 2020) or a decoder-only model like GPT (Kanade et al., 2020), which is suboptimal for generation and understanding tasks, respectively. For example, CodeBERT (Feng et al., 2020) requires an additional decoder when applied to the code summarization task, where this decoder cannot benefit from the pre-training. Besides, most existing methods simply apply conventional NLP pre-training techniques to source code by regarding it as a sequence of tokens like NL.

In this work, we present CodeT5, a pre-trained encoder-decoder model that considers the token type information in code. Our CodeT5 builds on the T5 architecture (Raffel et al., 2020), which employs denoising sequence-to-sequence (Seq2Seq) pre-training and has been shown to benefit both understanding and generation tasks in natural language. In addition, we propose to leverage the developer-assigned identifiers in code (e.g., the “BinarySearch” identifier in Figure 2). To fuse such code-specific knowledge, we propose a novel identifier-aware objective that trains the model to distinguish which tokens are identifiers and to recover them when they are masked.

Furthermore, we propose to leverage the code and its accompanying comments to learn a better NL-PL alignment. Developers often document their programs to facilitate software maintenance (de Souza et al., 2005), so such PL-NL pairs are widely available in most source code. Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them.
We pre-train CodeT5 on the CodeSearchNet data (Husain et al., 2019), following Feng et al. (2020). In addition, we collect extra C/C# data from open-source Github repositories. We fine-tune CodeT5 on most tasks in the CodeXGLUE benchmark (Lu et al., 2021), including two understanding tasks, code defect detection and clone detection, and generation tasks such as code summarization, generation, translation, and refinement. As shown in Figure 1, we also explore multi-task learning to fine-tune CodeT5 on multiple tasks at a time using a task control code as the source prompt. In summary, we make the following contributions:

• We present one of the first unified encoder-decoder models, CodeT5, which supports both code-related understanding and generation tasks and also allows for multi-task learning.
• We propose a novel identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code.
• Extensive experiments show that CodeT5 yields state-of-the-art results on the fourteen sub-tasks in CodeXGLUE. Further analysis shows that CodeT5 can better capture the code semantics with the proposed identifier-aware pre-training and bimodal dual generation.

2 Related Work

Pre-training on Natural Language. Pre-trained models based on Transformer architectures (Vaswani et al., 2017) can be generally categorized into three groups: encoder-only models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ELECTRA (Clark et al., 2020); decoder-only models like GPT (Radford et al., 2019); and encoder-decoder models such as MASS (Song et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Compared to encoder-only and decoder-only models, which respectively favor understanding and generation tasks, encoder-decoder models can support both types of tasks well.
They often employ denoising sequence-to-sequence pre-training objectives that corrupt the source input and require the decoder to recover it.

Pre-training on Programming Language. Pre-training on programming languages is a nascent field in which much recent work attempts to extend NLP pre-training methods to source code. CuBERT (Kanade et al., 2020) employs BERT's masked language modeling objective to derive generic code-specific representations, and CodeBERT (Feng et al., 2020) further adds a replaced token detection task (Clark et al., 2020) to learn NL-PL cross-modal representations. Besides the BERT-style models, Svyatkovskiy et al. (2020) and Liu et al. (2020) respectively employ GPT and UniLM (Dong et al., 2019) for the code completion task. Transcoder (Rozière et al., 2020) explores programming language translation in an unsupervised setting. Different from them, we explore encoder-decoder models based on T5 for programming language pre-training and support a more comprehensive set of tasks.

Some emerging work (Clement et al., 2020; Mastropaolo et al., 2021; Elnaggar et al., 2021) also explores the T5 framework on code, but focuses only on a limited subset of generation tasks and does not support understanding tasks as we do. PLBART (Ahmad et al., 2021), based on another encoder-decoder model, BART, can also support both understanding and generation tasks. However, all of the above prior work simply processes code in the same way as natural language and largely ignores code-specific characteristics.
Recently, GraphCodeBERT (Guo et al., 2021) incorporates the data flow extracted from the code structure into CodeBERT, while Rozière et al. (2021) propose a deobfuscation objective to leverage the structural aspect of PL. These models focus only on training a better code-specific encoder. By contrast, we specifically focus on the identifiers, which carry rich code semantics, and fuse such information into a Seq2Seq model via two novel identifier tagging and prediction tasks.

3 CodeT5

Our CodeT5 builds on an encoder-decoder framework with the same architecture as T5 (Raffel et al., 2020). It aims to derive generic representations for programming language (PL) and natural language (NL) via pre-training on unlabeled source code. As illustrated in Figure 2, we extend the denoising Seq2Seq objective in T5 by proposing two identifier tagging and prediction tasks that enable the model to better leverage the token type information from PL, namely the identifiers assigned by developers.

In the following, we introduce how CodeT5 encodes PL and NL inputs (§3.1) and our proposed identifier-aware pre-training tasks (§3.2), followed by fine-tuning with task-specific transfer learning and multi-task training (§3.3).

3.1 Encoding NL and PL

At the pre-training stage, our model receives either PL-only or NL-PL inputs, depending on whether the code snippet has an accompanying NL description. For the NL-PL bimodal inputs, we concatenate them into one sequence with a delimiter token [SEP] and represent the whole input sequence in the format $x = (\text{[CLS]}, w_1, \dots, w_n, \text{[SEP]}, c_1, \dots, c_m, \text{[SEP]})$, where $n$ and $m$ denote the number of NL word tokens and PL code tokens, respectively. The NL word sequence is empty for PL-only unimodal inputs.
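The input format above can be illustrated with a minimal sketch. The function name `build_input` and the toy token lists are hypothetical, chosen for illustration only; the sketch works on word-level tokens and leaves out subword tokenization.

```python
from typing import List, Sequence

CLS, SEP = "[CLS]", "[SEP]"


def build_input(code_tokens: Sequence[str], nl_tokens: Sequence[str] = ()) -> List[str]:
    """Build ([CLS], w_1..w_n, [SEP], c_1..c_m, [SEP]).

    For PL-only (unimodal) inputs the NL segment is simply left empty.
    """
    return [CLS, *nl_tokens, SEP, *code_tokens, SEP]


# Bimodal input: an NL description followed by the code tokens.
bimodal = build_input(["return", "a", "+", "b"], ["add", "two", "numbers"])
# Unimodal input: code only, so the NL segment is empty.
unimodal = build_input(["return", "a", "+", "b"])
print(bimodal)
print(unimodal)
```

Keeping the delimiter in place even when the NL segment is empty means the PL segment always sits between the same two [SEP] tokens, which is one simple way to give both input kinds a single layout.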
To capture more code-specific features, we propose to leverage token type information from code. We focus on the type of identifiers (e.g., function names and variables), as they are among the most PL-agnostic features and carry rich code semantics. Specifically, we convert the PL segment into an Abstract Syntax Tree (AST) and extract the node type of each code token. Finally, we construct a sequence of binary labels $y \in \{0, 1\}^m$ for the PL segment, where each $y_i \in \{0, 1\}$ represents whether the code token $c_i$ is an identifier or not.

3.2 Pre-training Tasks

We now introduce our proposed pre-training tasks, which enable CodeT5 to learn useful patterns from either PL-only or NL-PL bimodal data.

Identifier-aware Denoising Pre-training. Denoising Seq2Seq pre-training (Song et al., 2019; Raffel et al., 2020; Lewis et al., 2020) first corrupts the source sequence and then requires the decoder to recover the original text. In this work, we use a span masking objective similar to T5 (Raffel et al., 2020) that randomly masks spans of arbitrary lengths and then predicts these masked spans, combined with some sentinel tokens, at the decoder. We refer to this task as Masked Span Prediction (MSP), as illustrated in Figure 2 (a).

Specifically, we employ the same 15% corruption rate as T5 and ensure that the average span length is 3 by uniformly sampling spans of 1 to 5 tokens.
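The span corruption just described can be sketched as follows. This is a simplified illustration, not the released implementation: the function name `mask_spans` is hypothetical, the sketch works on whole-word tokens, it keeps spans non-adjacent so each span gets its own sentinel, and it truncates the last span to fit the corruption budget, which pulls the average span length slightly below 3. The `[MASK0]`, `[MASK1]`, ... sentinel naming follows the special tokens listed in Section 4.2.

```python
import random
from typing import List, Tuple


def mask_spans(tokens: List[str], rate: float = 0.15, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Corrupt about `rate` of the tokens with spans of 1 to 5 tokens.

    Returns (source, target): the source has one sentinel per masked span,
    the target lists each sentinel followed by the tokens it replaced.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * rate))  # number of tokens to mask
    masked = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 1000:
        attempts += 1
        length = min(rng.randint(1, 5), budget)  # uniform in 1..5, capped by budget
        start = rng.randrange(len(tokens) - length + 1)
        # Reject spans that overlap or touch an existing span.
        lo, hi = max(0, start - 1), min(len(tokens), start + length + 1)
        if any(masked[lo:hi]):
            continue
        masked[start:start + length] = [True] * length
        budget -= length

    source, target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if masked[i]:
            tag = f"[MASK{sentinel}]"
            source.append(tag)
            target.append(tag)
            while i < len(tokens) and masked[i]:
                target.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target


code = "def binary_search ( items , key ) : lo = 0 ; hi = len ( items ) - 1".split()
src, tgt = mask_spans(code, seed=3)
print(src)
print(tgt)
```

The decoder is trained to emit the target sequence, so substituting each sentinel in the source by the tokens that follow it in the target reproduces the original input.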
Moreover, we employ whole word masking by sampling spans before subword tokenization, which avoids masking partial sub-tokens and has been shown to be helpful (Sun et al., 2019). Notably, we pre-train a shared model for various PLs to learn robust cross-lingual representations. We describe the masked span prediction loss as:

$$\mathcal{L}_{MSP}(\theta) = \sum_{t=1}^{k} -\log P_\theta\left(x^{\text{mask}}_t \mid x^{\backslash\text{mask}}, x^{\text{mask}}_{<t}\right),$$

where $\theta$ are the model parameters, $x^{\backslash\text{mask}}$ is the masked input, $x^{\text{mask}}$ is the masked sequence to predict from the decoder, $k$ denotes the number of tokens in $x^{\text{mask}}$, and $x^{\text{mask}}_{<t}$ is the span sequence generated so far.

To fuse more code-specific structural information (the identifier node type in the AST) into the model, we propose two additional tasks, Identifier Tagging (IT) and Masked Identifier Prediction (MIP), to complement the denoising pre-training.

• Identifier Tagging (IT). It aims to notify the model of whether a code token is an identifier or not, in a spirit similar to syntax highlighting in some developer tools. As shown in Figure 2 (b), we map the final hidden states of the PL segment at the CodeT5 encoder into a sequence of probabilities $p = (p_1, \dots, p_m)$ and compute a binary cross entropy loss for sequence labeling:

$$\mathcal{L}_{IT}(\theta_e) = \sum_{i=1}^{m} -\left[y_i \log p_i + (1 - y_i) \log (1 - p_i)\right],$$

where $\theta_e$ are the encoder parameters. Note that by casting the task as a sequence labeling problem, the model is expected to capture the code syntax and the data flow structures of the code.

• Masked Identifier Prediction (MIP). Different from the random span masking in MSP, we mask all identifiers in the PL segment and employ a unique sentinel token for all occurrences of one specific identifier. This is known as obfuscation, where changing identifier names does not impact the code semantics. Inspired by Rozière et al. (2021), we arrange the unique identifiers with the sentinel tokens into a target sequence $I$, as shown in Figure 2 (c). We then predict it in an auto-regressive manner:

$$\mathcal{L}_{MIP}(\theta) = \sum_{j=1}^{|I|} -\log P_\theta\left(I_j \mid x^{\backslash I}, I_{<j}\right),$$

where $x^{\backslash I}$ is the masked input.
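The two identifier tasks can be sketched on a toy Python snippet. This is a hedged illustration under two assumptions: Python's standard `tokenize` and `keyword` modules stand in for the AST-based node types used in the paper, and the helper names `code_tokens_and_labels` and `mask_identifiers` are hypothetical.

```python
import io
import keyword
import tokenize
from typing import Dict, List, Tuple

# Layout-only token types that carry no code token of their own.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
        tokenize.ENDMARKER, tokenize.COMMENT}


def code_tokens_and_labels(source: str) -> Tuple[List[str], List[int]]:
    """Return code tokens c_1..c_m and binary labels y_i (1 = identifier)."""
    tokens, labels = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        tokens.append(tok.string)
        # Stand-in rule: a NAME token that is not a reserved keyword.
        is_identifier = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        labels.append(int(is_identifier))
    return tokens, labels


def mask_identifiers(tokens: List[str], labels: List[int]) -> Tuple[List[str], List[str]]:
    """Replace every occurrence of an identifier with its own sentinel.

    Returns (masked_input, target), where the target lists each unique
    identifier once, preceded by its sentinel.
    """
    sentinel_of: Dict[str, str] = {}
    masked, target = [], []
    for tok, is_id in zip(tokens, labels):
        if is_id:
            if tok not in sentinel_of:
                sentinel_of[tok] = f"[MASK{len(sentinel_of)}]"
                target += [sentinel_of[tok], tok]
            masked.append(sentinel_of[tok])
        else:
            masked.append(tok)
    return masked, target


snippet = "def add(a, b):\n    total = a + b\n    return total\n"
tokens, labels = code_tokens_and_labels(snippet)
masked, target = mask_identifiers(tokens, labels)
print(" ".join(masked))
# def [MASK0] ( [MASK1] , [MASK2] ) : [MASK3] = [MASK1] + [MASK2] return [MASK3]
print(" ".join(target))
# [MASK0] add [MASK1] a [MASK2] b [MASK3] total
```

Here `labels` is the IT target, while `masked` and `target` form one MIP training pair. Because `a`, `b`, and `total` each reuse a single sentinel, the model has to link repeated occurrences of the same name.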
Note that deobfuscation is a more challenging task that requires the model to comprehend the code semantics from obfuscated code and to link the occurrences of the same identifier together.

We alternately optimize these three losses with equal probability, which constitutes our proposed identifier-aware denoising pre-training.

Bimodal Dual Generation. In the pre-training phase, the decoder only sees discrete masked spans and identifiers, which differs from the downstream tasks, where the decoder needs to generate either fluent NL text or syntactically correct code snippets. To close the gap between pre-training and fine-tuning, we propose to leverage the NL-PL bimodal data to train the model for a bidirectional conversion, as shown in Figure 2 (d). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them. For each NL-PL bimodal datapoint, we construct two training instances with reverse directions and add language ids (e.g., <java> and <en> for Java PL and English NL, respectively). This operation can also be seen as a special case of T5's span masking in which either the full NL segment or the full PL segment is masked from the bimodal inputs.

3.3 Fine-tuning CodeT5

After pre-training on large-scale unlabeled data, we adapt CodeT5 to downstream tasks via either task-specific transfer learning or multi-task learning.

Task-specific Transfer Learning: Generation vs. Understanding Tasks. Code-related tasks can be categorized into generation and understanding tasks. For the former, our CodeT5 can be naturally adapted with its Seq2Seq framework. For understanding tasks, we investigate two approaches: generating the label as a unigram target sequence (Raffel et al., 2020), or predicting it from the vocabulary of class labels based on the last decoder hidden state, following Lewis et al. (2020).
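As a small illustration of the bimodal dual generation step in Section 3.2, the sketch below builds the two reverse-direction training instances from one NL-PL pair. The function name `dual_generation_pairs` is hypothetical, and prefixing each segment with its own language id is an assumption made for this example; the text does not specify exactly where the ids are placed.

```python
from typing import List, Tuple


def dual_generation_pairs(nl: str, code: str,
                          nl_id: str = "<en>", pl_id: str = "<java>") -> List[Tuple[str, str]]:
    """Return two (source, target) instances: NL->PL and PL->NL.

    Assumption: each segment is prefixed with its own language id.
    """
    tagged_nl = f"{nl_id} {nl}"
    tagged_pl = f"{pl_id} {code}"
    return [(tagged_nl, tagged_pl), (tagged_pl, tagged_nl)]


pairs = dual_generation_pairs(
    "Returns the sum of two integers.",
    "int add(int a, int b) { return a + b; }",
)
for source, target in pairs:
    print(source, "=>", target)
```

Each bimodal datapoint therefore contributes one instance per direction, so the model sees NL→PL and PL→NL equally often.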
Multi-task Learning. We also explore a multi-task learning setting by training a shared model on multiple tasks at a time. Multi-task learning can reduce computation cost by reusing most of the model weights across tasks and has been shown to improve model generalization in NL pre-training (Liu et al., 2019a). We follow Raffel et al. (2020) in employing the same unified model for all tasks without adding any task-specific networks, but we allow different best checkpoints to be selected for different tasks. To notify the model of which task it is dealing with, we design a unified format of task control codes and prepend it to the source inputs, as shown in Figure 1. For instance, we employ “Translate Java to CSharp:” as the source prompt for the code-to-code translation task from Java to CSharp.

As different tasks have different dataset sizes, we follow Conneau and Lample (2019) in employing a balanced sampling strategy. For $N$ datasets (or tasks), with probabilities $\{q_i\}_{i=1}^{N}$, we define the following multinomial distribution to sample from:

$$q_i = \frac{r_i^{\alpha}}{\sum_{j=1}^{N} r_j^{\alpha}}, \quad \text{where } r_i = \frac{n_i}{\sum_{k=1}^{N} n_k},$$

where $n_i$ is the number of examples for the $i$-th task and $\alpha$ is set to 0.7. This balanced sampling aims to alleviate the bias towards high-resource tasks.

4 Experimental Setup

4.1 Pre-training Dataset

We follow Feng et al. (2020) in employing CodeSearchNet (Husain et al., 2019) to pre-train CodeT5, which consists of six PLs with both unimodal and bimodal data. Apart from that, we collect two additional datasets of C/CSharp from BigQuery to ensure that all downstream tasks have PLs that overlap with the pre-training data. In total, we employ around 8.35 million instances for pre-training. Table 1 shows some basic statistics. To obtain the identifier labels from code, we use tree-sitter to convert the PL into an abstract syntax tree and then extract its node type information. We filter out the reserved keywords of each PL from its identifier list.
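The balanced sampling rule from the multi-task learning paragraph in Section 3.3 can be checked numerically with a short sketch. The function name `sampling_probs` and the dataset sizes are hypothetical, chosen only to show the effect of the exponent.

```python
from typing import Dict


def sampling_probs(sizes: Dict[str, int], alpha: float = 0.7) -> Dict[str, float]:
    """Compute q_i = r_i^alpha / sum_j r_j^alpha with r_i = n_i / sum_k n_k."""
    total = sum(sizes.values())
    weights = {task: (n / total) ** alpha for task, n in sizes.items()}
    norm = sum(weights.values())
    return {task: w / norm for task, w in weights.items()}


# Hypothetical dataset sizes: one high-resource and one low-resource task.
sizes = {"summarize": 900_000, "translate": 10_000}
probs = sampling_probs(sizes)            # alpha = 0.7, as in the text
proportional = sampling_probs(sizes, alpha=1.0)
for task in sizes:
    print(f"{task}: proportional={proportional[task]:.4f} balanced={probs[task]:.4f}")
```

With an exponent below 1, the low-resource task is sampled more often than its raw share of the data, while an exponent of 1 recovers plain proportional sampling.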
We observe that PLs have different identifier rates: Go has the lowest rate at 19% and Ruby the highest at 32%.

4.2 Code-specific Tokenizer

Tokenization is a key ingredient in the success of pre-trained language models. They often employ a Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) to alleviate Out-of-Vocabulary (OoV) issues. Specifically, we train a Byte-level BPE tokenizer following Radford et al. (2019) and set the vocabulary size to 32,000, as in T5. We add additional special tokens ([PAD], [CLS], [SEP], [MASK0], ..., [MASK99]). This tokenizer is trained with non-printable characters and low-frequency tokens (occurring fewer than 3 times) filtered out. We compare it with T5's default tokenizer and find that our tokenizer reduces the length of the tokenized code sequence by 30% to 45% on downstream tasks. This accelerates training and especially benefits generation tasks, since the sequence to predict is shorter.

4.3 Downstream Tasks and Metrics

We cover most generation and understanding tasks in the CodeXGLUE benchmark (Lu et al., 2021) and, following it, employ the provided public datasets and the same data splits for all these tasks.

We first consider two cross-modal generation tasks. Code summarization uses a dataset of six PLs (Ruby, JavaScript, Go, Python, Java, and PHP) from CodeSearchNet (Husain et al., 2019). We employ the smoothed BLEU-4 (Lin and Och, 2004) to evaluate this task. Code generation is the task of generating a code snippet based on NL descriptions.
We employ the Concode data (Iyer et al., 2018) in Java, where the input contains both NL texts and class environment contexts, and the output is a function. We evaluate it with BLEU-4, exact match (EM) accuracy, and CodeBLEU (Ren et al., 2020), which considers syntactic and semantic matches based on the code structure in addition to the n-gram match.

Besides, we consider two code-to-code generation tasks. Code translation aims to migrate legacy software from one PL to another, where we focus on translating functions from Java to CSharp and vice versa. Code refinement aims to convert a buggy function into a correct one. We employ the datasets provided by Tufano et al. (2019) with various function lengths: small (fewer than 50 tokens) and medium (50-100 tokens). We use BLEU-4 and exact match to evaluate them.

We also investigate how CodeT5 performs on two understanding-based tasks. The first is defect detection, which aims to predict whether a piece of code is vulnerable to software systems or not. We use the C dataset provided by Zhou et al. (2019) for this experiment. The second is clone detection, which aims to measure the similarity between two code snippets and predict whether they have the same functionality. We experiment with the Java data provided by Wang et al. (2020). We employ accuracy and F1 score for evaluating these two tasks, respectively. In total, our CodeT5 supports six tasks and fourteen sub-tasks in CodeXGLUE with a unified encoder-decoder model.

4.4 Comparison Models

We compare CodeT5 with state-of-the-art (SOTA) pre-trained models that can be categorized into three types: encoder-only, decoder-only, and encoder-decoder models.
As encoder-only models, we consider RoBERTa (Liu et al., 2019b), RoBERTa (code) trained with masked language modeling (MLM) on code, CodeBERT (Feng et al., 2020) trained with both MLM and replaced token detection (Clark et al., 2020), GraphCodeBERT (Guo et al., 2021) using data flow from code, and DOBF (Rozière et al., 2021) trained with a deobfuscation objective. Note that although DOBF employs a Seq2Seq model during pre-training, it only aims to train a better encoder for downstream tasks without exploring the potential benefit of the pre-trained decoder.

For decoder-only models, we compare GPT-2 (Radford et al., 2019) and its adaptations to the code domain, CodeGPT-2 and CodeGPT-adapted. The difference is that the latter uses a GPT-2 checkpoint for model initialization while the former is trained from scratch. As encoder-decoder models, the current SOTA model for the CodeXGLUE benchmark is PLBART (Ahmad et al., 2021), based on the BART (Lewis et al., 2020) architecture. For pre-training data, most of these models employ CodeSearchNet (Husain et al., 2019), except DOBF and PLBART. DOBF is pre-trained on 7.9M Java and 3.6M Python files from BigQuery, while PLBART employs much larger data with 470M Python and 210M Java functions, and 47M NL posts from StackOverflow.

4.5 Model Configurations

We build CodeT5 on a PyTorch implementation of T5 (Raffel et al., 2020) and employ two sizes: CodeT5-small (60M) and CodeT5-base (220M). We set the maximum source and target sequence lengths to 512 and 256, respectively. We use FP16 mixed precision to accelerate pre-training. We set the batch size to 1024 and employ a peak learning rate of 2e-4 with linear decay. We pre-train the model with the denoising objective for 100 epochs and with bimodal dual training for a further 50 epochs on a cluster of 16 NVIDIA A100 GPUs with 40G memory. The total training time for CodeT5-small and CodeT5-base is 5 and 12 days, respectively.
In the fine-tuning phase, we find that the tasks in CodeXGLUE (Lu et al., 2021) are quite sensitive to some hyperparameters such as learning rate, training steps, and batch size. We conduct a grid search and select the best parameters based on the validation set. In multi-task learning, we cover all downstream tasks except clone detection.

5 Results and Analysis

In this section, we compare CodeT5 with SOTA models on a broad set of CodeXGLUE downstream tasks (§5.1), investigate the effects of our bimodal dual generation and multi-task learning (§5.2), and then provide a detailed analysis of the proposed identifier-aware pre-training (§5.3).

5.1 CodeXGLUE Downstream Tasks

We evaluate two sizes of our model, CodeT5-small and CodeT5-base, which are pre-trained with identifier-aware denoising. In addition, we consider the model that continues to train with bimodal dual generation (dual-gen) and show the results with multi-task fine-tuning.

Code Summarization. We show code summarization results of smoothed BLEU-4 on six PL data in Table 2. We observe that all our model variants significantly outperform prior work with either an encoder-only (RoBERTa, CodeBERT, DOBF) or an encoder-decoder framework (PLBART). Moreover, the salient performance gap between these two groups of models confirms that encoder-only frameworks are suboptimal for generation tasks. Compared to the SOTA encoder-decoder model PLBART, we find that even our CodeT5-small yields better overall scores (also on Python and Java), even though our model is much smaller (60M vs. 140M) and PLBART is pre-trained with much larger Python and Java data (> 100 times). By increasing the model size, our CodeT5-base boosts the overall performance by over 1.2 absolute points over PLBART.

Code Generation. We compare CodeT5 with GPT-style models and PLBART in Table 3.
Our CodeT5-small outperforms all decoder-only models and also the SOTA PLBART, which again confirms the superiority of encoder-decoder models at generating code snippets. Moreover, our CodeT5-base further pushes the SOTA results significantly across three metrics. In particular, it achieves around 4.7 points of improvement on CodeBLEU over PLBART, indicating that our CodeT5 can better comprehend the code syntax and semantics with the help of identifier-aware pre-training.

Code-to-Code Generation Tasks. We compare two code-to-code generation tasks, code translation and code refinement, in Table 4. In the code translation task, our CodeT5-small outperforms most of the baselines and obtains results comparable to PLBART, which shows the advantages of encoder-decoder models in the code-to-code generation setting.

Here we show one of CodeT5's outputs when translating C# to Java in Figure 3. In this case, despite the poor BLEU score, CodeT5 is able to generate a function that preserves the same functionality and even has better readability than the ground truth. This reveals that CodeT5 generalizes well instead of memorizing and repeating what it has seen before. On the other hand, it also suggests that the BLEU score is not a perfect evaluation metric for code generation tasks, where a higher score can sometimes reflect the problematic copy issues of neural models.

Another code-to-code generation task is code refinement, a challenging task that requires detecting which parts of the code are buggy and fixing them by generating a bug-free code sequence. Due to the large overlap between source and target code, even the naive copy approach yields very high BLEU scores but zero exact matches. Therefore, we focus on the exact match (EM) metric to evaluate this task. As shown in Table 4, we observe that EM scores for the small data are consistently higher than for the medium data, indicating that it is harder to fix bugs in a longer code snippet.
Our CodeT5-base significantly outperforms all baselines on EM and in particular gains over 4.8 points on the more challenging medium task (13.96 vs. GraphCodeBERT's 9.10), reflecting its strong code understanding capability.

Understanding Tasks. We compare the two understanding tasks of defect detection and clone detection in Table 5. Specifically, we generate the binary labels as a unigram sequence from the decoder for the defect detection task, while for the clone detection task, we first obtain the sequence embedding of each code snippet using the last decoder state, following Lewis et al. (2020), and then predict the labels by measuring their similarity. Both CodeT5-small and CodeT5-base outperform all baselines on the defect detection task, with CodeT5-base yielding a 2.6-point accuracy improvement over PLBART. For the clone detection task, our CodeT5 models achieve results comparable to the SOTA GraphCodeBERT and PLBART models. These results demonstrate that, even with an encoder-decoder framework, our CodeT5 can still be adapted well to understanding tasks.

5.2 Effects of Bimodal Dual Generation and Multi-task Learning

We examine the effects of bimodal dual generation at pre-training and multi-task learning at fine-tuning. The bimodal pre-training brings consistent improvements for code summarization and generation tasks on both CodeT5-small and CodeT5-base. However, this pre-training task does not help, and sometimes even slightly hurts, the performance on PL-PL generation and understanding tasks. We anticipate this is because bimodal dual generation learns a better alignment between PL and NL, which naturally benefits the former tasks involving both PL and NL. As a side effect, this objective could bias the model towards the PL-NL tasks and affect its performance on PL-PL tasks.

Multi-task learning generally improves most downstream tasks except code translation and defect detection.
In particular, it largely boosts the performance on code summarization, which is not surprising, as code summarization takes up the largest portion of sub-tasks (six out of thirteen) and thereby benefits the most from multi-task learning. Besides, we observe that multi-task learning consistently improves the performance of code refinement, which might benefit from the joint training on both small and medium refinement data. Another possible reason is that multi-task training with defect detection enables the model to better comprehend the code semantics for bug detection, which is also a necessary intermediate step for code refinement.

5.3 Analyzing Identifier-aware Pre-training

We provide an ablation study to examine the contribution of each component in our identifier-aware objective. Specifically, we compare the performance of our CodeT5-small on four selected tasks by ablating each of the three objectives: masked span prediction (MSP), identifier tagging (IT), and masked identifier prediction (MIP). As shown in Table 6, we observe that removing any one of the objectives generally reduces the performance on all tasks, indicating that all objectives contribute to the better code understanding of our CodeT5. However, the effect of each objective differs across tasks. Specifically, removing MSP largely reduces the performance on all generation tasks but instead increases the defect detection performance. This shows that masked span prediction is more crucial for capturing syntactic information for generation tasks. On the contrary, removing MIP hurts the defect detection task the most, indicating that it might focus more on code semantic understanding. By combining these objectives, our CodeT5 can better capture both syntactic and semantic information from code.

We further provide outputs from CodeT5 and its variant without MIP and IT on code generation in Figure 4.
We observe that CodeT5 can correctly generate the exact function, while the model without MIP and IT fails to recover the identifiers “s2” and “hasField”. This shows that our identifier-aware denoising pre-training can better distinguish and leverage the identifier information.

We also investigate the identifier tagging performance and find that it achieves over 99% F1 for all PLs, showing that our CodeT5 can confidently distinguish identifiers in code. We then check whether the MSP and MIP tasks conflict, as they employ the same sentinel tokens for masking. In identifier masking, all occurrences of one unique identifier are replaced with the same sentinel token, resulting in a many-to-one mapping, compared to the one-to-one mapping in span prediction. For models pre-trained with MSP only, MIP only, or both, we report the prediction accuracy on the two tasks and the ratio of how often they generate the same number of predictions as there are sentinel tokens (Table 7). We observe that pre-training with only MIP or only MSP biases the model towards that task, leading to poor accuracy and a higher mismatch in the number of predictions when applied to the other task. Interestingly, we find that the MIP-only objective can better recover the correct number of predictions in the MSP task than MSP-only does for the MIP task, meaning that it is easier to adapt from a many-to-one mapping to a one-to-one mapping than the reverse. Finally, combining them helps our model make a good trade-off on both tasks.

6 Conclusion

We have presented CodeT5, a pre-trained encoder-decoder model that incorporates the token type information from code. We propose a novel identifier-aware pre-training objective to better leverage the identifiers and a bimodal dual generation task to learn a better NL-PL alignment using code and its comments.
Our unified model supports both code understanding and generation tasks and allows for multi-task learning. Experiments show that CodeT5 significantly outperforms all prior work on most CodeXGLUE tasks. Further analysis also reveals its better code comprehension capability across various programming languages.

Broader Impact and Ethical Consideration

Our work generally belongs to NLP applications for software intelligence. With the goal of improving software development productivity through machine learning methods, software intelligence research has attracted increasing attention in both academia and industry over the last decade. Software code intelligence techniques can help developers reduce tedious, repetitive workloads, enhance programming quality, and improve overall software development productivity. This would considerably decrease their working time and could also potentially reduce computation and operational costs, as a bug might degrade system performance or even crash the entire system. Our work addresses software code ...

We further discuss the ethical considerations of training CodeT5 and the potential risks of applying it to real-world downstream applications:

Dataset bias. The training datasets in our study are source code, including user-written comments, from open-source GitHub repositories; they are publicly available and are not tied to any specific application. However, it is possible that these datasets encode stereotypes, such as those about race and gender, in the text comments or even in the source code itself, for example in variable, function, and class names. As such, social biases would be intrinsically embedded in the models trained on them. As suggested by Chen et al. (2021), interventions such as filtration or modulation of generated outputs may help to mitigate these biases in the code corpus.
Computational cost. Our model pre-training requires non-trivial computational resources, though we have tried our best to design our experiments carefully and to avoid unnecessary computation costs. In fact, compared to the recent large-scale language model Codex (Chen et al., 2021), our CodeT5-base has a much smaller model size of 220M, compared with 12B (∼ 55×). In addition, we run our experiments on Google Cloud Platform, which purchases carbon credits to reduce its carbon footprint; e.g., training CodeT5-base produced around 49.25 kg of CO2, which was fully offset by the provider. Furthermore, we release our pre-trained models publicly so that the code intelligence research community can avoid repeated training.

Automation bias. As CodeT5 can be deployed to provide coding assistance, such as code generation to aid developers, the automation bias of machine learning systems should be carefully considered, especially for developers who tend to over-rely on model-generated outputs. Sometimes these systems might produce functions that superficially appear correct but do not actually align with the developer’s intent. If developers unintentionally adopt these incorrect code suggestions, they might spend much longer on debugging, and significant safety issues could even result. We suggest that practitioners using CodeT5 always bear in mind that its generated outputs should be taken only as references, which require domain experts to further check for correctness and security.

Security implications. We train CodeT5 on existing code corpora, including CodeSearchNet (Husain et al., 2019) and a small fraction of Google BigQuery, both of which were originally collected from public GitHub repositories. Pre-trained models might encode some sensitive information (e.g., personal addresses or identification numbers) from the training data. Though we have conducted multiple rounds of data cleaning to mitigate this before training our models, it is still possible that some sensitive information cannot be completely removed. Besides, due to the non-deterministic nature of generation models like CodeT5, it might produce vulnerable code that harms the software, and it could even benefit more advanced malware development when deliberately misused.

Acknowledgements

We thank Akhilesh Deepak Gotmare, Amrita Saha, Junnan Li, and Chen Xing for valuable discussions. We thank Kathy Baxter for the ethical review.

References

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2655–2668. Association for Computational Linguistics.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: multi-mode translation of natural language and Python code with transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9052–9065. Association for Computational Linguistics.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7057–7067.
Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia Marçal de Oliveira. 2005. A study of the documentation essential to software maintenance. In Proceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information, SIGDOC 2005, Coventry, UK, September 21-23, 2005, pages 68–75. ACM.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042–13054.
Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. CodeTrans: Towards cracking the language of silicone’s code through self-supervised deep learning and high performance computing. CoRR, abs/2104.02443.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 1536–1547. Association for Computational Linguistics.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training code representations with data flow. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1643–1652. Association for Computational Linguistics.
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5110–5121. PMLR.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland.
Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, September 21-25, 2020, pages 473–485.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4487–4496. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664.
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021, pages 336–347. IEEE.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a method for automatic evaluation of code synthesis. CoRR, abs/2009.10297.
Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Baptiste Rozière, Marie-Anne Lachaux, Marc Szafraniec, and Guillaume Lample. 2021. DOBF: A deobfuscation pre-training objective for programming languages. CoRR, abs/2102.07492.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.
Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: enhanced representation through knowledge integration. CoRR, abs/1904.09223.
Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: code generation using transformer. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, pages 1433–1443.
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Trans. Softw. Eng. Methodol., 28(4):19:1–19:29.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, pages 261–271. IEEE.
Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 10197–10207.
Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. 2021. Language-agnostic representation learning of source code from structure and context. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

This paper is available under the CC BY 4.0 Deed (Attribution 4.0 International) license.