Salesforce CodeT5 Could Change How AI Writes and Understands Code

Authors:
Yue Wang, wang.y@salesforce.com (Salesforce Research Asia)
Weishi Wang, weishi.wang@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore)
Shafiq Joty, sjoty@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore)
Steven C.H. Hoi, shoi@salesforce.com (Salesforce Research Asia)

Abstract

Despite the success of pre-trained models, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (respectively, understanding) tasks, or process the code snippet in the same way as natural language (NL), neglecting the special characteristics of programming languages (PL) such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed by developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks, and allows for multi-task learning. In addition, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked.
https://github.com/salesforce/CodeT5

1 Introduction

Pre-trained language models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and T5 (Raffel et al., 2020) have been highly successful in natural language processing (NLP). They typically employ a pre-train then fine-tune paradigm that aims to derive generic language representations by self-supervised training on large-scale unlabeled data, which can be transferred to benefit multiple downstream tasks, especially those with limited data annotation. Inspired by their success, there have been many recent attempts to adapt these pre-training methods to programming languages (PL) (Svyatkovskiy et al., 2020; Kanade et al., 2020; Feng et al., 2020), showing promising results on code-related tasks.

However, despite their success, most of these models rely on either an encoder-only model similar to BERT (Kanade et al., 2020; Feng et al., 2020) or a decoder-only model like GPT (Svyatkovskiy et al., 2020), which is suboptimal for generation and understanding tasks, respectively. For example, CodeBERT (Feng et al., 2020) requires an additional decoder when applied to the code summarization task, and this decoder cannot benefit from the pre-training. Besides, most existing methods simply apply conventional NLP pre-training techniques to source code, treating it as a sequence of tokens like NL.
In this work, we present CodeT5, a pre-trained encoder-decoder model that considers the token type information in code. CodeT5 builds on the T5 architecture (Raffel et al., 2020), which employs denoising sequence-to-sequence (Seq2Seq) pre-training and has been shown to benefit both understanding and generation tasks in natural language. In addition, we propose to leverage the developer-assigned identifiers in code, which generally preserve rich code semantics; one example is the "binarySearch" identifier in Figure 2. To fuse such code-specific knowledge, we propose a novel identifier-aware objective that trains the model to distinguish which tokens are identifiers and to recover them when they are masked.

Furthermore, we propose to leverage code and its accompanying comments to learn a better NL-PL alignment. Developers often provide documentation for programs to facilitate better software maintenance (de Souza et al., 2005). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them.

Following Feng et al. (2020), we pre-train CodeT5 on the CodeSearchNet data (Husain et al., 2019). We fine-tune CodeT5 on most tasks in the CodeXGLUE benchmark (Lu et al., 2021), including two understanding tasks, code defect detection and clone detection, and generation tasks such as code summarization, generation, translation, and refinement. As shown in Figure 1, we also explore multi-task learning to fine-tune CodeT5 on multiple tasks at a time, using a task control code as the source prompt.

We present one of the first unified encoder-decoder models, CodeT5, which supports both code-related understanding and generation tasks and also allows for multi-task learning.
We propose a novel identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code.

Extensive experiments show that CodeT5 yields state-of-the-art results on the fourteen sub-tasks in CodeXGLUE. Further analysis shows that CodeT5 can better capture the code semantics with the proposed identifier-aware pre-training, and that bimodal dual generation primarily benefits NL↔PL tasks.

2 Related Work

Pre-training on Natural Language. Pre-trained models based on the Transformer architecture (Vaswani et al., 2017) can be generally categorized into three types: encoder-only models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ELECTRA (Clark et al., 2020); decoder-only models like GPT (Radford et al., 2019); and encoder-decoder models such as MASS (Song et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Compared to encoder-only and decoder-only models, which respectively favor understanding and generation tasks, encoder-decoder models can support both types of tasks well. They often employ denoising sequence-to-sequence pre-training objectives that corrupt the source input and require the decoder to recover it.

Pre-training on Programming Language. Pre-training on programming languages is a nascent field in which much recent work attempts to extend NLP pre-training methods to source code.
Two early models are CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020). CuBERT employs BERT's masked language modeling objective to derive a generic code-specific representation, and CodeBERT further adds a replaced token detection task (Clark et al., 2020) to learn an NL-PL cross-modal representation. Besides the BERT-style models, Svyatkovskiy et al. (2020) and Liu et al. (2020) respectively employ GPT and UniLM (Dong et al., 2019) for the code completion task, and Transcoder (Rozière et al., 2020) explores programming language translation in an unsupervised setting. Different from them, we explore encoder-decoder models based on T5 for programming language pre-training and support a more comprehensive set of tasks.

Some emerging work in the recent literature (Clement et al., 2020; Mastropaolo et al., 2021; Elnaggar et al., 2021) also explores the T5 framework on code, but it focuses only on a limited subset of generation tasks and does not support understanding tasks as we do. Apart from these, PLBART (Ahmad et al., 2021), based on another encoder-decoder model, BART, can also support both understanding and generation tasks. However, all of the above prior work simply processes code in the same way as natural language and largely ignores code-specific characteristics.

Recently, GraphCodeBERT (Guo et al., 2021) incorporates the data flow extracted from the code structure into CodeBERT, while Rozière et al. (2021) propose a deobfuscation objective to leverage the structural aspect of PL.
These models only focus on training a better code-specific encoder. Zügner et al. (2021) propose to capture the relative distances between code tokens over the code structure. By contrast, we specifically focus on the identifiers that reserve rich code semantics and fuse such information into a Seq2Seq model via two novel identifier tagging and prediction tasks.

3 CodeT5

Our CodeT5 builds on an encoder-decoder framework with the same architecture as T5 (Raffel et al., 2020). It aims to derive generic representations for programming language (PL) and natural language (NL) via pre-training on unlabeled source code. As illustrated in Figure 2, we extend the denoising Seq2Seq objective in T5 by proposing two identifier tagging and prediction tasks, so that the model can better leverage the token type information from PL, namely the identifiers assigned by developers.

In the following, we introduce how CodeT5 encodes PL and NL inputs (§3.1) and our proposed pre-training tasks (§3.2), followed by the fine-tuning with task-specific transfer learning and multi-task training (§3.3).

3.1 Encoding NL and PL

At the pre-training stage, our model receives either PL-only or NL-PL inputs, depending on whether the code snippet has accompanying NL descriptions or not. For NL-PL bimodal inputs, we concatenate the two segments with a delimiter token [SEP] and represent the whole input sequence as $x = ([CLS], w_1, ..., w_n, [SEP], c_1, ..., c_m, [SEP])$, where $n$ and $m$ denote the number of NL word tokens and PL code tokens, respectively. The NL word sequence is empty for PL-only unimodal inputs.

In order to capture more code-specific features, we propose to leverage token type information from code. We focus on the type of identifiers (e.g., function names and variables), as they are one of the most PL-agnostic features and reserve rich code semantics. Specifically, we convert the PL segment into an Abstract Syntax Tree (AST) and extract the node types for each code token. Finally, we construct a sequence of binary labels $y \in \{0, 1\}^m$ for the PL segment, where each $y_i \in \{0, 1\}$ represents whether the code token $c_i$ is an identifier or not.
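To make the labeling step concrete, here is a minimal sketch that builds the binary label sequence y for a Python snippet. It is an illustration under stated assumptions, not the authors' pipeline: the paper derives token types from an AST, whereas this sketch substitutes Python's standard `tokenize` and `keyword` modules, and the helper name `identifier_labels` is hypothetical.

```python
import io
import keyword
import tokenize


def identifier_labels(source: str):
    """Return (tokens, labels); labels[i] is 1 if tokens[i] is an identifier.

    Assumption: a token counts as an identifier when the tokenizer reports it
    as NAME and it is not a reserved keyword. This stands in for the AST node
    types used in the paper.
    """
    # Layout-only tokens carry no code content, so they are skipped.
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}
    tokens, labels = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue
        tokens.append(tok.string)
        is_identifier = (tok.type == tokenize.NAME
                         and not keyword.iskeyword(tok.string))
        labels.append(1 if is_identifier else 0)
    return tokens, labels


code = "def add(a, b):\n    return a + b\n"
tokens, labels = identifier_labels(code)
```

Here `add`, `a`, and `b` receive label 1, while keywords such as `def` and `return` and all punctuation receive label 0, which mirrors the filtering of reserved keywords from the identifier list.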
3.2 Pre-training Tasks

We now introduce our proposed pre-training tasks, which enable CodeT5 to learn useful patterns from either PL-only or NL-PL bimodal data.

Identifier-aware Denoising Pre-training. Denoising sequence-to-sequence (Seq2Seq) pre-training has been shown to be quite effective in a broad set of NLP tasks (Song et al., 2019; Raffel et al., 2020; Lewis et al., 2020). This kind of objective corrupts the source sequence and then requires the decoder to recover the original text. In this work, we use a span masking objective similar to T5 (Raffel et al., 2020) that randomly masks spans of arbitrary lengths and then predicts these masked spans, combined with some sentinel tokens, at the decoder. We refer to this task as Masked Span Prediction (MSP), as illustrated in Figure 2 (a).

Specifically, we employ the same 15% corruption rate as T5 and ensure that the average span length is 3 by uniformly sampling spans of 1 to 5 tokens. Moreover, we employ whole word masking by sampling spans before subword tokenization, which aims to avoid masking partial sub-tokens and has been shown to be helpful (Sun et al., 2019). Notably, we pre-train a shared model for various PLs to learn robust cross-lingual representations. The masked span prediction loss is:

$$\mathcal{L}_{MSP}(\theta) = \sum_{t=1}^{k} -\log P_\theta\left(x^{\text{mask}}_t \mid x^{\backslash \text{mask}}, x^{\text{mask}}_{<t}\right)$$

where $\theta$ are the model parameters, $x^{\backslash \text{mask}}$ is the masked input, $x^{\text{mask}}$ is the masked sequence to predict from the decoder with $k$ denoting the number of tokens in $x^{\text{mask}}$, and $x^{\text{mask}}_{<t}$ is the span sequence generated so far.

To fuse more code-specific structural information (the identifier node type in the AST) into the model, we propose two additional tasks, Identifier Tagging (IT) and Masked Identifier Prediction (MIP), to complement the denoising pre-training.

Identifier Tagging (IT). This task aims to inform the model whether a given code token is an identifier or not, which is similar in spirit to the syntax highlighting found in some developer-aided tools.
As shown in Figure 2 (b), we map the final hidden states of the PL segment at the CodeT5 encoder into a sequence of probabilities $p = (p_1, ..., p_m)$ and compute a binary cross entropy loss for sequence labeling:

$$\mathcal{L}_{IT}(\theta_e) = \sum_{i=1}^{m} -\left[y_i \log p_i + (1 - y_i) \log (1 - p_i)\right]$$

where $\theta_e$ are the encoder parameters. Note that by casting the task as a sequence labeling problem, the model is expected to capture the code syntax and the data flow structures of the code.

Masked Identifier Prediction (MIP). Different from the random span masking in MSP, we mask all identifiers in the PL segment and employ a unique sentinel token for all occurrences of one specific identifier. This corresponds to obfuscation, where changing identifier names does not affect the code semantics. Inspired by Rozière et al. (2021), we arrange the unique identifiers with the sentinel tokens into a target sequence $I$, as shown in Figure 2 (c). We then predict it in an auto-regressive manner:

$$\mathcal{L}_{MIP}(\theta) = \sum_{j=1}^{|I|} -\log P_\theta\left(I_j \mid x^{\backslash I}, I_{<j}\right)$$

where $x^{\backslash I}$ is the masked input. Note that deobfuscation is a more challenging task that requires the model to comprehend the code semantics based on obfuscated code and to link the occurrences of the same identifiers together.

We alternately optimize these three losses with equal probability, which constitutes our proposed identifier-aware denoising pre-training.

Bimodal Dual Generation. In the pre-training phase, the decoder only sees discrete masked spans and identifiers, which differs from the downstream tasks, where the decoder needs to generate either fluent NL text or syntactically correct code snippets. To close the gap between pre-training and fine-tuning, we propose to leverage the NL-PL bimodal data to train the model for a bidirectional conversion, as shown in Figure 2 (d). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them.
For each NL-PL bimodal datapoint, we construct two training instances with reverse directions and add language ids for the PL and the NL (for example, one for Java and one for English). This operation can also be seen as a special case of T5's span masking, obtained by masking either the full NL or the full PL segment of the bimodal input. This task aims to improve the alignment between the NL and PL counterparts.

3.3 Fine-tuning CodeT5

After pre-training on large-scale unlabeled data, we adapt CodeT5 to downstream tasks via either task-specific transfer learning or multi-task learning.

Task-specific Transfer Learning: Generation vs. Understanding Tasks. Code-related tasks can be categorized into generation and understanding tasks. For generation tasks, CodeT5 can be naturally adapted with its Seq2Seq framework. For understanding tasks, we investigate two ways of producing the label: either generating it as a unigram target sequence (Raffel et al., 2020), or predicting it from the vocabulary of class labels based on the last decoder hidden state, following Lewis et al. (2020).

Multi-task Learning. We also explore a multi-task learning setting by training a shared model on multiple tasks at a time. Multi-task learning can reduce computation cost by reusing most of the model weights for many tasks, and it has been shown to improve the generalization capability of models in NL pre-training (Liu et al., 2019a).
We follow Raffel et al. (2020) and employ the same unified model for all tasks without adding any task-specific networks, but we allow different best checkpoints to be selected for different tasks. To tell the model which task it is handling, we prepend a task control code to the source input, as shown in Figure 1. For instance, we employ "Translate Java to CSharp:" as the source prompt for the code-to-code translation task from Java to CSharp.

As different tasks have different dataset sizes, we follow Conneau and Lample (2019) and employ a balanced sampling strategy. For $N$ datasets (or tasks), with probabilities $\{q_i\}_{i=1}^{N}$, we define the following multinomial distribution to sample from:

$$q_i = \frac{r_i^{\alpha}}{\sum_{j=1}^{N} r_j^{\alpha}}, \quad \text{where } r_i = \frac{n_i}{\sum_{k=1}^{N} n_k}$$

where $n_i$ is the number of examples for the $i$-th task and $\alpha$ is set to 0.7. This balanced sampling aims to alleviate the bias towards high-resource tasks.

4 Experimental Setup

4.1 Pre-training Dataset

We follow Feng et al. (2020) and employ CodeSearchNet (Husain et al., 2019) to pre-train CodeT5. It consists of six PLs with both unimodal and bimodal data. In addition, we collect two datasets of C/CSharp from BigQuery to ensure that all downstream tasks have PLs that overlap with the pre-training data. In total, we employ around 8.35 million instances for pre-training. Table 1 shows some basic statistics. To obtain the identifier labels from code, we use tree-sitter to convert the PL into an abstract syntax tree and then extract its node type information. We filter out the reserved keywords of each PL from its identifier list. We observe that PLs have different identifier rates: Go has the lowest rate at 19% and Ruby has the highest at 32%.

4.2 Code-specific Tokenizer

Pre-trained language models often employ a Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) to alleviate the Out-of-Vocabulary (OoV) issues.
Specifically, we train a byte-level BPE tokenizer following Radford et al. (2019) and set the vocabulary size to 32,000, as in T5. We add additional special tokens ([PAD], [CLS], [SEP], [MASK0], ..., [MASK99]). This tokenizer is trained on all of our pre-training data, with non-printable characters and low-frequency tokens (occurring fewer than 3 times) filtered out. We compare it with T5's default tokenizer and find that our tokenizer reduces the length of the tokenized code sequence by 30% to 45% on downstream tasks. This accelerates training and especially benefits generation tasks, because the sequence to predict is shorter. We also spot a severe problem when applying T5's default tokenizer to source code: it encodes some common code tokens, such as the brackets '{' and '}', as unknown tokens.

4.3 Downstream Tasks and Metrics

We cover most generation and understanding tasks in the CodeXGLUE benchmark (Lu et al., 2021) and employ the provided public datasets with the same data splits for all these tasks.

We first consider two cross-modal generation tasks. Code summarization aims to summarize a function-level code snippet into an English description. The dataset consists of six PLs, namely Ruby, JavaScript, Go, Python, Java, and PHP, from CodeSearchNet (Husain et al., 2019). We employ the smoothed BLEU-4 (Lin and Och, 2004) to evaluate this task. Code generation is the task of generating a code snippet based on NL descriptions. We employ the Concode data (Iyer et al., 2018) in Java, where the input contains both NL text and class environment contexts, and the output is a function. We evaluate it with BLEU-4, exact match (EM) accuracy, and CodeBLEU (Ren et al., 2020), which considers syntactic and semantic matches based on the code structure in addition to the n-gram match.

Besides, we consider two code-to-code generation tasks.
Code translation aims to migrate legacy software from one PL to another; we focus on translating functions from Java to CSharp and vice versa. Code refinement aims to convert a buggy function into a correct one. We employ two Java datasets provided by Tufano et al. (2019) with different function lengths: small (fewer than 50 tokens) and medium (50-100 tokens). We use BLEU-4 and exact match to evaluate them.

We also investigate how CodeT5 performs on two understanding-based tasks. The first is defect detection, which aims to predict whether a piece of code is vulnerable to software systems or not. We use the C dataset provided by Zhou et al. (2019) for this experiment. The second is clone detection, which aims to measure the similarity between two code snippets and predict whether they have the same functionality. We experiment with the Java data provided by Wang et al. (2020). We evaluate defect detection with accuracy and clone detection with F1 score. In total, our CodeT5 supports six tasks and fourteen sub-tasks in CodeXGLUE with a unified encoder-decoder model.

4.4 Comparison Models

We compare CodeT5 with state-of-the-art (SOTA) pre-trained models, which can be categorized into three types: encoder-only, decoder-only, and encoder-decoder models. As encoder-only models, we consider RoBERTa (Liu et al., 2019b), RoBERTa (code) trained with masked language modeling (MLM) on code, CodeBERT (Feng et al., 2020) trained with both MLM and replaced token detection (Clark et al., 2020), GraphCodeBERT (Guo et al., 2021) using data flow from code, and DOBF (Rozière et al., 2021). Note that although DOBF employs a Seq2Seq model during pre-training, it only aims to train a better encoder for downstream tasks, without exploring the potential benefit of the pre-trained decoder.

For decoder-only models, we compare GPT-2 (Radford et al., 2019) and its adaptations to the code domain, CodeGPT-2 and CodeGPT-adapted.
The difference is that the latter uses a GPT-2 checkpoint for model initialization, while the former is trained from scratch. As encoder-decoder models, we compare with PLBART (Ahmad et al., 2021), which is based on BART (Lewis et al., 2020). For pre-training data, most of these models employ CodeSearchNet (Husain et al., 2019), except DOBF and PLBART. DOBF is pre-trained on 7.9M Java and 3.6M Python files from BigQuery, while PLBART employs much larger data with 470M Python and 210M Java functions, and 47M NL posts from StackOverflow.

4.5 Model Configurations

We build CodeT5 on Huggingface's PyTorch implementation of T5 (Raffel et al., 2020) and employ two sizes, CodeT5-small (60M) and CodeT5-base (220M). We set the maximum source and target sequence lengths to 512 and 256, respectively. We use FP16 mixed precision to accelerate the pre-training. We set the batch size to 1024 and employ a peak learning rate of 2e-4 with linear decay. We pre-train the model with the denoising objective for 100 epochs and with bimodal dual training for a further 50 epochs on a cluster of 16 NVIDIA A100 GPUs with 40G memory. The total training time for CodeT5-small and CodeT5-base is 5 and 12 days, respectively.

In the fine-tuning phase, we find that the tasks in CodeXGLUE (Lu et al., 2021) are quite sensitive to some hyperparameters, such as learning rate, training steps, and batch size. We conduct a grid search and select the best parameters based on the validation set.
In multi-task learning, we cover all downstream tasks except clone detection.

5 Results and Analysis

In this section, we compare CodeT5 with SOTA models on a broad set of CodeXGLUE downstream tasks (§5.1) and investigate the effects of our bimodal dual generation and multi-task learning (§5.2), followed by a detailed analysis of the proposed identifier-aware pre-training (§5.3).

5.1 CodeXGLUE Downstream Tasks

We evaluate two sizes of our model, CodeT5-small and CodeT5-base, both pre-trained with identifier-aware denoising. In addition, we consider the model that continues to train with bimodal dual generation (dual-gen) and show the results with multi-task fine-tuning. The results of all comparison models are taken from their original papers and from the CodeXGLUE paper (Lu et al., 2021).

Code Summarization. We show code summarization results in smoothed BLEU-4 on the data of six PLs in Table 2. We observe that all our model variants significantly outperform prior work with either an encoder-only (RoBERTa, CodeBERT, DOBF) or an encoder-decoder framework (PLBART). Moreover, the salient performance gap between these two groups of models confirms that encoder-only frameworks are suboptimal for generation tasks. Compared to the SOTA encoder-decoder model PLBART, we find that even our CodeT5-small yields better overall scores (also on Python and Java), even though our model is much smaller (60M vs. 140M) and PLBART is pre-trained with much larger Python and Java data (more than 100 times). We attribute this improvement to our identifier-aware denoising pre-training and better use of the bimodal training data. By increasing the model size, our CodeT5-base boosts the overall performance by over 1.2 absolute points over PLBART.

Code Generation. We compare CodeT5 with GPT-style models and PLBART in Table 3. Our CodeT5-small outperforms all decoder-only models and also the SOTA PLBART, which again confirms the superiority of encoder-decoder models at generating code snippets.
Moreover, our CodeT5-base further significantly pushes the SOTA results across the three metrics. In particular, it achieves an improvement of around 4.7 points on CodeBLEU over PLBART, indicating that CodeT5 can better comprehend the code syntax and semantics with the help of identifier-aware pre-training.

Code-to-Code Generation Tasks. We compare two code-to-code generation tasks, code translation and code refinement, in Table 4, and further consider a naive copy baseline that copies the source input as the target prediction. In the code translation task, our CodeT5-small outperforms most of the baselines and obtains results comparable to PLBART, which shows the advantages of encoder-decoder models in the code-to-code generation setting. Our CodeT5-base further achieves consistent improvements over PLBART across various metrics for translating from Java to C# and vice versa.

We show one of CodeT5's outputs for translating C# to Java in Figure 3. In this case, despite the poor BLEU score, CodeT5 is able to generate a function that preserves the same functionality and is even more readable than the ground truth. This reveals that CodeT5 has a good generalization ability instead of memorizing and repeating what it has seen before. On the other hand, it also suggests that the BLEU score is not a perfect evaluation metric for code generation tasks, where a higher score can sometimes instead reflect the problematic copy issues of neural models.

Another code-to-code generation task is code refinement, a challenging task that requires detecting which parts of the code are buggy and fixing them by generating a bug-free code sequence. Due to the large overlap between source and target code, even the naive copy approach yields very high BLEU scores but zero exact matches. Therefore, we focus on the exact match (EM) metric to evaluate this task.
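The contrast between an overlap-based score and exact match for the naive copy baseline can be illustrated with a small sketch. It assumes whitespace tokenization and uses a simple token-overlap ratio as a rough stand-in for BLEU (it is not BLEU-4); the function names and the toy buggy/fixed pair are hypothetical and not taken from the benchmark.

```python
def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after whitespace normalization."""
    hits = sum(" ".join(p.split()) == " ".join(r.split())
               for p, r in zip(predictions, references))
    return hits / len(references)


def token_overlap(prediction, reference):
    """Share of reference tokens that also occur in the prediction.

    A rough stand-in for an n-gram overlap metric, not an implementation of BLEU.
    """
    pred_tokens = set(prediction.split())
    ref_tokens = reference.split()
    return sum(t in pred_tokens for t in ref_tokens) / len(ref_tokens)


# Toy refinement pair: the bug is a single wrong operator.
buggy = ["int add ( int a , int b ) { return a - b ; }"]
fixed = ["int add ( int a , int b ) { return a + b ; }"]

# Naive copy baseline: predict the buggy input unchanged.
copy_em = exact_match(buggy, fixed)
copy_overlap = token_overlap(buggy[0], fixed[0])
```

In this toy case the copied prediction shares 15 of the 16 reference tokens yet scores zero on exact match, which is the pattern that motivates reporting EM for refinement.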
As shown in Table 4, we observe that EM scores for the small data are consistently higher than for the medium data, indicating that it is harder to fix bugs in a longer code snippet. Our CodeT5-base significantly outperforms all baselines on EM and, in particular, gains over 4.8 points on the more challenging medium task (13.96 vs. GraphCodeBERT's 9.10), reflecting its strong code understanding capability.

Understanding Tasks. We compare results on the two understanding tasks, defect detection and clone detection, in Table 5. Specifically, we generate the binary labels as a unigram sequence from the decoder for the defect detection task, while for the clone detection task, we first obtain the sequence embedding of each code snippet using the last decoder state, following Lewis et al. (2020), and then predict the labels by measuring their similarity. Both CodeT5-small and CodeT5-base outperform all baselines on the defect detection task, and CodeT5-base improves accuracy by 2.6 points over PLBART. For the clone detection task, our CodeT5 models achieve results comparable to the SOTA GraphCodeBERT and PLBART models. These results demonstrate that, with an encoder-decoder framework, CodeT5 can still be adapted well to understanding tasks.

5.2 Effects of Bimodal Dual Generation and Multi-task Learning

We examine the effects of bimodal dual generation at pre-training and multi-task learning at fine-tuning. The bimodal pre-training brings consistent improvements for code summarization and generation tasks on both CodeT5-small and CodeT5-base. However, this pre-training task does not help, and sometimes even slightly hurts, the performance on PL-PL generation and understanding tasks. We anticipate this is because bimodal dual generation learns a better alignment between PL and NL, which naturally benefits the former tasks involving both PL and NL. As a side effect, this objective could bias the model towards the PL-NL tasks and affect its performance on PL-PL tasks.
Multi-task learning generally improves most downstream tasks, except code translation and defect detection. In particular, it largely boosts the performance on code summarization, which is not surprising, as code summarization takes up the largest portion of sub-tasks (six out of thirteen) and thereby benefits the most from multi-task learning. Besides, we observe that multi-task learning consistently improves the performance of code refinement, which might benefit from the joint training on both small and medium refinement data. Another possible reason is that multi-task training with defect detection enables the model to better comprehend the code semantics for bug detection, which is also a necessary intermediate step for code refinement.

5.3 Analyzing Identifier-aware Pre-training

We provide an ablation study to examine the contribution of each component of our identifier-aware objective. Specifically, we compare the performance of our CodeT5-small on four selected tasks by ablating each of the three objectives: masked span prediction (MSP), identifier tagging (IT), and masked identifier prediction (MIP). As shown in Table 6, we observe that removing one of the objectives generally reduces the performance on all tasks, indicating that all objectives contribute to the better code understanding of our CodeT5. However, the effect of each objective differs across tasks. Specifically, removing MSP largely reduces the performance on all generation tasks but instead increases the defect detection performance. This shows that masked span prediction is more crucial for capturing syntactic information for generation tasks. By contrast, removing MIP hurts the defect detection task the most, indicating that it might focus more on code semantic understanding. By combining these objectives, our CodeT5 can better capture both syntactic and semantic information from code.
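The ablation can be pictured as dropping one objective from the alternating schedule described in Section 3.2. The sketch below assumes that one objective is drawn uniformly at random for each training step; the function name `sample_objective` and the per-step granularity are assumptions, since the text only states that the three losses are optimized alternately with equal probability.

```python
import random
from collections import Counter

# The three identifier-aware denoising objectives named in the text.
OBJECTIVES = ("MSP", "IT", "MIP")


def sample_objective(rng, ablate=None):
    """Pick the objective for one training step, uniformly over the active ones.

    Passing `ablate` removes that objective from the pool, which corresponds
    to one row of the ablation study.
    """
    active = [name for name in OBJECTIVES if name != ablate]
    return rng.choice(active)


rng = random.Random(0)
full_schedule = Counter(sample_objective(rng) for _ in range(3000))
no_mip_schedule = Counter(sample_objective(rng, ablate="MIP") for _ in range(3000))
```

With the full pool each objective is chosen for roughly a third of the steps; with MIP ablated, the remaining two objectives split the steps between them.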
We further provide outputs from CodeT5 and its variant without MIP and IT on code generation in Figure 4. We observe that CodeT5 can correctly generate the exact function, while the model without MIP and IT fails to recover the identifiers “s2” and “hasField”. This shows that our identifier-aware denoising pre-training can better distinguish and leverage identifier information.
We also investigate the identifier tagging performance and find that it achieves over 99% F1 for all PLs, showing that our CodeT5 can confidently distinguish identifiers in code. We then check whether the MSP and MIP tasks conflict, as they employ the same sentinel tokens for masking. In identifier masking, all occurrences of one unique identifier are replaced with the same sentinel token, resulting in a many-to-one mapping, compared to the one-to-one mapping in span prediction. We compare models pre-trained with only MSP, only MIP, and both, on these two tasks in Table 7. We report the prediction accuracy and the ratio of how often they generate the same number of predictions as there are sentinel tokens. We observe that pre-training with only MIP or only MSP biases the model towards that task, resulting in poor accuracy and a higher mismatch in the number of predictions when applied to the other task. Interestingly, we find that the MIP-only objective recovers the correct number of predictions in the MSP task better than the MSP-only objective does in the MIP task, meaning that it is easier to adapt from a many-to-one mapping to a one-to-one mapping than the other way around. Finally, combining them helps our model strike a good trade-off on both tasks.
6 Conclusion
We have presented CodeT5, a pre-trained encoder-decoder model that incorporates token type information from code.
We propose a novel identifier-aware pre-training objective to better leverage identifiers, and a bimodal dual generation task to learn a better NL-PL alignment using code and its comments. Our unified model can support both code understanding and generation tasks and allows for multi-task learning. Experiments show that CodeT5 significantly outperforms all prior work on most CodeXGLUE tasks.
Broader Impact and Ethical Considerations
Our work generally belongs to NLP applications for software intelligence. With the goal of improving software development productivity with machine learning methods, software intelligence research has attracted increasing attention in both academia and industry over the last decade. Software code intelligence techniques can help developers reduce tedious repetitive workloads, enhance programming quality, and improve overall software development productivity. This would considerably decrease their working time and could also reduce computation and operational costs, as a bug might degrade system performance or even crash the entire system. Our work addresses the fundamental challenge of software code pre-training, covers a wide range of code intelligence applications in the software development lifecycle, and achieves state-of-the-art performance on many of the benchmark tasks, showing its great potential benefit towards this goal.
We further discuss the ethical considerations of training CodeT5 and the potential risks when applying it to real-world downstream applications:
Dataset bias. The training datasets in our study are source code, including user-written comments, from open-source Github repositories; they are publicly available and are not tied to any specific application. However, it is possible that these datasets encode stereotypes, such as those about race and gender, in the text comments or even in the source code, for example in variable, function, and class names. As such, social biases would be intrinsically embedded into the models trained on them. As suggested by Chen et al. (2021), interventions such as filtration or modulation of generated outputs may help to mitigate these biases in code corpora.
Computational cost. Our model pre-training requires non-trivial computational resources, though we have tried our best to design our experiments carefully and avoid unnecessary computation costs. In fact, compared to the recent large-scale language model Codex (Chen et al., 2021), our CodeT5-base has a much smaller model size of 220M versus their 12B (∼55×). Training CodeT5-base produced around 49.25 kg of CO2, which was fully offset by the provider. Furthermore, we release our pre-trained models publicly so that the code intelligence research community can avoid repeating the training.
Automation bias. As CodeT5 can be deployed to provide coding assistance, such as code generation for aiding developers, the automation bias of machine learning systems should be carefully considered, especially for developers who tend to over-rely on model-generated outputs. Sometimes these systems might produce functions that superficially appear correct but do not actually align with the developer’s intent. If developers unintentionally adopt these incorrect code suggestions, they might spend much longer on debugging, and the suggestions could even lead to significant safety issues. We suggest that practitioners using CodeT5 always bear in mind that its generated outputs should be taken only as references, which require domain experts to further check for correctness and security.
Security implications. We train CodeT5 on existing code corpora including CodeSearchNet (Husain et al., 2019) and a small fraction of Google BigQuery, both of which are originally collected from public Github repositories. Pre-trained models might encode some sensitive information (e.g., personal addresses or identification numbers) from the training data. Though we have conducted multiple rounds of data cleaning to mitigate this before training our models, it is still possible that some sensitive information cannot be completely removed. Besides, due to the non-deterministic nature of generation models like CodeT5, it might produce vulnerable code that harms the software, and it could even benefit more advanced malware development when deliberately misused.
Acknowledgements
We thank Akhilesh Deepak Gotmare, Amrita Saha, Junnan Li, and Chen Xing for valuable discussions. We thank Kathy Baxter for the ethical review. We also thank our anonymous reviewers for their insightful feedback on our paper.
References
Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2655–2668. Association for Computational Linguistics.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Michael Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, et al. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. Pymt5: multi-mode translation of natural language and python code with transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9052–9065. Association for Computational Linguistics.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7057–7067.
Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia Marçal de Oliveira. 2005. A study of the documentation essential to software maintenance. In Proceedings of the 23rd Annual International Conference on Design of Communication: documenting & Designing for Pervasive Information, SIGDOC 2005, Coventry, UK, September 21-23, 2005, pages 68–75. ACM.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, pages 4171–4186.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042–13054.
Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. Codetrans: Towards cracking the language of silicone’s code through self-supervised deep learning and high performance computing. CoRR, abs/2104.02443.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Codebert: A pre-trained model for programming and natural languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 1536–1547. Association for Computational Linguistics.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. Graphcodebert: Pre-training code representations with data flow. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1643–1652. Association for Computational Linguistics.
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5110–5121. PMLR.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland.
Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, September 21-25, 2020, pages 473–485. IEEE.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4487–4496. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664.
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021, pages 336–347. IEEE.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. CoRR, abs/2009.10297.
Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Baptiste Rozière, Marie-Anne Lachaux, Marc Szafraniec, and Guillaume Lample. 2021. DOBF: A deobfuscation pre-training objective for programming languages. CoRR, abs/2102.07492.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936.
Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: enhanced representation through knowledge integration. CoRR, abs/1904.09223.
Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: code generation using transformer. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, pages 1433–1443. ACM.
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Trans. Softw. Eng. Methodol., 28(4):19:1–19:29.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, pages 261–271.
Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 10197–10207.
Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. 2021. Language-agnostic representation learning of source code from structure and context. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.