Salesforce's CodeT5 Could Change How AI Writes and Understands Code

Authors:

Yue Wang, wang.y@salesforce.com (Salesforce Research Asia)
Weishi Wang, weishi.wang@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore)
Shafiq Joty, sjoty@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore)
Steven C.H. Hoi, shoi@salesforce.com (Salesforce Research Asia)

Abstract

Pre-trained models for natural languages (NL) such as BERT and GPT have recently been shown to transfer well to programming languages (PL) and to benefit a broad set of code-related tasks. Despite their success, most current methods either rely on encoder-only (or decoder-only) pre-training, which is suboptimal for generation (or understanding) tasks, or process the code snippet in the same way as NL, neglecting special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed by developer-assigned identifiers.

https://github.com/salesforce/CodeT5

1 Introduction

Pre-trained language models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and T5 (Raffel et al., 2020) have greatly boosted performance on a wide range of natural language processing (NLP) tasks. They typically follow a pre-train then fine-tune paradigm that derives generic language representations through self-supervised training on large-scale unlabeled data. These representations can then be transferred to benefit multiple downstream tasks, especially those with limited data annotation. Inspired by this success, many recent attempts adapt these pre-training methods to programming languages (Svyatkovskiy et al., 2020; Kanade et al., 2020; Feng et al., 2020), showing promising results on code-related tasks.
However, despite their success, most of these models rely on either an encoder-only model similar to BERT (Svyatkovskiy et al., 2020; Feng et al., 2020) or a decoder-only model like GPT (Kanade et al., 2020), which is suboptimal for generation and understanding tasks, respectively. For example, CodeBERT (Feng et al., 2020) requires an additional decoder when applied to the code summarization task, and this decoder cannot benefit from the pre-training. In addition, most existing methods simply apply conventional NLP pre-training techniques to source code, treating it as a sequence of tokens like NL. This largely ignores the rich structural information in code, which is vital for fully comprehending the code semantics.

In this work, we present CodeT5, a pre-trained encoder-decoder model that considers the token type information in code. CodeT5 builds on the T5 architecture (Raffel et al., 2020), which employs denoising sequence-to-sequence (Seq2Seq) pre-training and has been shown to benefit both understanding and generation tasks in natural language. In addition, we propose to leverage the developer-assigned identifiers in code. When writing programs, developers tend to use informative identifiers to make the code more understandable, so these identifiers generally preserve rich code semantics; for example, the "binarySearch" identifier in Figure 2 directly indicates its functionality. To fuse such code-specific knowledge, we propose a novel identifier-aware objective that trains the model to distinguish which tokens are identifiers and to recover them when they are masked.

Furthermore, we propose to leverage the code and its accompanying comments to learn a better NL-PL alignment.
Developers often provide documentation for programs to facilitate better software maintenance (de Souza et al., 2005), so such PL-NL pairs are widely available in most source code. Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them.

We pre-train CodeT5 on the CodeSearchNet data (Husain et al., 2019) following Feng et al. (2020). In addition, we collect extra C/C# data from open-source Github repositories. We fine-tune CodeT5 on most tasks in the CodeXGLUE benchmark (Lu et al., 2021), including two understanding tasks, code defect detection and clone detection, as well as generation tasks such as code summarization, generation, translation, and refinement. As shown in Figure 1, we also explore multi-task learning to fine-tune CodeT5 on multiple tasks at a time, using a task control code as the source prompt.

In summary, we make the following contributions:

• We present CodeT5, one of the first unified encoder-decoder models to support both code-related understanding and generation tasks; it also allows for multi-task learning.

• We propose a novel identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code. In addition, we propose to leverage the NL-PL pairs that are naturally available in source code to learn a better cross-modal alignment.

• Extensive experiments show that CodeT5 yields state-of-the-art results on the fourteen sub-tasks in CodeXGLUE. Further analysis shows that CodeT5 better captures the code semantics with the proposed identifier-aware pre-training, and that bimodal dual generation primarily benefits NL↔PL tasks.
2 Related Work

Pre-training on Natural Language. Pre-trained models based on Transformer architectures (Vaswani et al., 2017) have led to state-of-the-art performance on a broad set of NLP tasks. They can be generally categorized into three types: encoder-only models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ELECTRA (Clark et al., 2020); decoder-only models like GPT (Radford et al., 2019); and encoder-decoder models such as MASS (Song et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Compared to encoder-only and decoder-only models, which respectively favor understanding and generation tasks, encoder-decoder models can support both types of tasks well. They often employ denoising sequence-to-sequence pre-training objectives that corrupt the source input and require the decoder to recover it.

Pre-training on Programming Language. Pre-training on programming languages is a nascent field in which much recent work attempts to extend NLP pre-training methods to source code. CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020) are the two pioneering models. CuBERT employs BERT's masked language modeling objective to derive generic code-specific representations, and CodeBERT further adds a replaced token detection (Clark et al., 2020) task to learn NL-PL cross-modal representations. Besides the BERT-style models, Svyatkovskiy et al. (2020) and Liu et al. (2020) respectively employ GPT and UniLM (Dong et al., 2019) for the code completion task. Transcoder (Rozière et al., 2020) explores programming language translation. In contrast, we explore encoder-decoder models based on T5 for programming language pre-training and support a more comprehensive set of tasks.
Some emerging work in the recent literature (Clement et al., 2020; Mastropaolo et al., 2021; Elnaggar et al., 2021) also explores the T5 framework for code, but it focuses only on a limited subset of generation tasks and, unlike ours, does not support understanding tasks. Apart from these, PLBART (Ahmad et al., 2021), based on another encoder-decoder model, BART, can also support both understanding and generation tasks. However, all of this prior work simply processes code in the same way as natural language and largely ignores code-specific characteristics.

Recently, GraphCodeBERT (Guo et al., 2021) incorporates the data flow extracted from the code structure into CodeBERT, while Rozière et al. (2021) propose a deobfuscation objective to leverage the structural aspect of PL. These models focus only on training a better code-specific encoder. Zügner et al. (2021) propose to capture the relative distances between code tokens over the code structure. By contrast, we specifically focus on the identifiers, which preserve rich code semantics, and fuse this information into a Seq2Seq model via two novel identifier tagging and prediction tasks.

3 CodeT5

CodeT5 builds on an encoder-decoder framework with the same architecture as T5 (Raffel et al., 2020). It aims to derive generic representations for programming language (PL) and natural language (NL) via pre-training on unlabeled source code. As illustrated in Figure 2, we extend the denoising Seq2Seq objective in T5 with two identifier tagging and prediction tasks, which enable the model to better leverage the token type information from PL, namely the identifiers assigned by developers.
In the following, we describe how CodeT5 encodes PL and NL inputs (§3.1) and our proposed identifier-aware pre-training tasks (§3.2), followed by fine-tuning with task-specific transfer learning and multi-task training (§3.3).

3.1 Encoding NL and PL

At the pre-training stage, our model receives either PL-only or NL-PL inputs, depending on whether the code snippet has an accompanying NL description. For the NL-PL bimodal inputs, we concatenate the two segments with a delimiter token [SEP] and represent the whole input sequence as $x = (\text{[CLS]}, w_1, \ldots, w_n, \text{[SEP]}, c_1, \ldots, c_m, \text{[SEP]})$, where $n$ and $m$ denote the number of NL word tokens and PL code tokens, respectively. The NL word sequence is empty for PL-only unimodal inputs.

To capture more code-specific features, we propose to leverage token type information from code. We focus on identifiers (e.g., function names and variables), as they are one of the most PL-agnostic features and preserve rich code semantics. Specifically, we convert the PL segment into an Abstract Syntax Tree (AST) and extract the node type for each code token. Finally, we construct a sequence of binary labels $y \in \{0, 1\}^m$ for the PL segment, where each $y_i \in \{0, 1\}$ indicates whether the code token $c_i$ is an identifier or not.

3.2 Pre-training Tasks

We now introduce our proposed pre-training tasks, which enable CodeT5 to learn useful patterns from either PL-only or NL-PL bimodal data.

Identifier-aware Denoising Pre-training. Denoising Seq2Seq pre-training has been shown to be quite effective in a broad set of NLP tasks (Song et al., 2019; Raffel et al., 2020; Lewis et al., 2020). This denoising objective typically first corrupts the source sequence with some noising functions and then requires the decoder to recover the original text. In this work, we use a span masking objective similar to T5 (Raffel et al., 2020) that randomly masks spans of arbitrary length and then predicts these masked spans, combined with some sentinel tokens, at the decoder. We refer to this task as Masked Span Prediction (MSP), as illustrated in Figure 2 (a).
Specifically, we employ the same 15% corruption rate as T5 and ensure that the average span length is 3 by uniformly sampling spans of 1 to 5 tokens. Moreover, we employ whole word masking by sampling spans before subword tokenization, which avoids masking partial sub-tokens and has been shown to be helpful (Sun et al., 2019). Notably, we pre-train a shared model for the various PLs to learn robust cross-lingual representations. We describe the masked span prediction loss as:

$$\mathcal{L}_{MSP}(\theta) = \sum_{t=1}^{k} -\log P_\theta\left(x^{mask}_t \mid x^{\backslash mask}, x^{mask}_{<t}\right)$$

where $\theta$ are the model parameters, $x^{\backslash mask}$ is the masked input, $x^{mask}$ is the masked sequence to be predicted by the decoder, with $k$ denoting the number of tokens in $x^{mask}$, and $x^{mask}_{<t}$ is the span sequence generated so far.

To fuse more code-specific structural information (the identifier node type in the AST) into the model, we propose two additional tasks, Identifier Tagging (IT) and Masked Identifier Prediction (MIP), to complement the denoising pre-training.

• Identifier Tagging (IT). This task aims to inform the model whether a given code token is an identifier or not, in a spirit similar to syntax highlighting in some developer tools. As shown in Figure 2 (b), we map the final hidden states of the PL segment at the CodeT5 encoder into a sequence of probabilities $p = (p_1, \ldots, p_m)$ and compute a binary cross-entropy loss for sequence labeling:

$$\mathcal{L}_{IT}(\theta_e) = \sum_{i=1}^{m} -\left[y_i \log p_i + (1 - y_i) \log (1 - p_i)\right]$$

where $\theta_e$ are the encoder parameters. Note that by casting the task as a sequence labeling problem, the model is expected to capture the code syntax and the data flow structures of the code.

• Masked Identifier Prediction (MIP). Unlike the random span masking in MSP, we mask all identifiers in the PL segment and employ a unique sentinel token for all occurrences of one specific identifier. In the field of software engineering, this is called obfuscation, where changing identifier names does not affect the code semantics.
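The two masking schemes described so far, random span masking for MSP and per-identifier masking for MIP, can be illustrated with a minimal Python sketch. It assumes the code has already been split into whole-word tokens and that the binary identifier labels y are given. The function names, the per-token `start_prob` knob, and the "sentinel followed by content" target layout are assumptions made for illustration; this is not the released CodeT5 implementation.

```python
import random

SENTINEL = "[MASK{}]"  # sentinel naming assumed for illustration


def mask_spans(tokens, start_prob=0.059, max_span=5, seed=0):
    """MSP-style corruption over whole-word tokens.

    Span lengths are drawn uniformly from 1..max_span (mean 3).
    start_prob is a hypothetical knob chosen so that roughly 15% of
    the tokens end up masked on long inputs.
    """
    rng = random.Random(seed)
    source, target, n_spans, i = [], [], 0, 0
    while i < len(tokens):
        if rng.random() < start_prob:
            length = rng.randint(1, max_span)
            sentinel = SENTINEL.format(n_spans)
            source.append(sentinel)              # one sentinel replaces the span
            target.append(sentinel)              # target: sentinel, then the span
            target.extend(tokens[i:i + length])
            n_spans += 1
            i += length
            if i < len(tokens):                  # keep one visible token between spans
                source.append(tokens[i])
                i += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target


def mask_identifiers(tokens, labels):
    """MIP-style obfuscation: one sentinel per unique identifier."""
    sentinel_of, source = {}, []
    for tok, is_identifier in zip(tokens, labels):
        if is_identifier:
            if tok not in sentinel_of:
                sentinel_of[tok] = SENTINEL.format(len(sentinel_of))
            source.append(sentinel_of[tok])      # every occurrence shares the sentinel
        else:
            source.append(tok)
    target = []
    for name, sentinel in sentinel_of.items():   # first-occurrence order
        target += [sentinel, name]
    return source, target


def restore(source, target):
    """Undo either masking by substituting each sentinel's content back in."""
    spans, current = {}, None
    for tok in target:
        if tok.startswith("[MASK"):
            current = tok
            spans[current] = []
        else:
            spans[current].append(tok)
    restored = []
    for tok in source:
        restored.extend(spans.get(tok, [tok]))
    return restored


tokens = "def add ( a , b ) : return a + b".split()
labels = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]    # 1 marks an identifier
src, tgt = mask_identifiers(tokens, labels)
print(" ".join(src))  # def [MASK0] ( [MASK1] , [MASK2] ) : return [MASK1] + [MASK2]
print(" ".join(tgt))  # [MASK0] add [MASK1] a [MASK2] b
```

The `restore` helper is only a sanity check: it shows that the MSP mapping is one-to-one (each sentinel stands for one span) while the MIP mapping is many-to-one (each sentinel stands for every occurrence of one identifier).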
Inspired by Rozière et al. (2021), we arrange the unique identifiers with the sentinel tokens into a target sequence $I$, as shown in Figure 2 (c). We then predict it in an auto-regressive manner:

$$\mathcal{L}_{MIP}(\theta) = \sum_{j=1}^{|I|} -\log P_\theta\left(I_j \mid x^{\backslash I}, I_{<j}\right)$$

where $x^{\backslash I}$ is the masked input. Note that deobfuscation is a more challenging task that requires the model to comprehend the code semantics based on the obfuscated code and to link the occurrences of the same identifier together.

We alternately optimize these three losses with equal probability, which constitutes our proposed identifier-aware denoising pre-training.

Bimodal Dual Generation. In the pre-training phase, the decoder only sees discrete masked spans and identifiers, which differs from the downstream tasks where the decoder needs to generate either fluent NL text or syntactically correct code snippets. To close this gap between pre-training and fine-tuning, we propose to leverage the NL-PL bimodal data to train the model for bidirectional conversion, as shown in Figure 2 (d). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and simultaneously optimize the model on them. For each NL-PL bimodal data point, we construct two training instances with reverse directions and add language ids (e.g., one id for Java PL and another for English NL). This operation can also be seen as a special case of T5's span masking, obtained by masking either the full NL segment or the full PL segment of the bimodal input.

3.3 Fine-tuning CodeT5

After pre-training on large-scale unlabeled data, we adapt CodeT5 to downstream tasks via either task-specific transfer learning or multi-task learning.

Task-specific Transfer Learning: Generation vs. Understanding Tasks. Code-related tasks can be categorized into generation and understanding tasks. For the former, CodeT5 can be naturally adapted with its Seq2Seq framework. For understanding tasks, we investigate two approaches: either generating the label as a unigram target sequence (Raffel et al., 2020), or predicting it from the vocabulary of class labels based on the last decoder hidden state following Lewis et al. (2020).

Multi-task Learning. We also explore a multi-task learning setting by training a shared model on multiple tasks at a time. Multi-task learning can reduce computation cost by reusing most of the model weights for many tasks and has been shown to improve model generalization in NL pre-training (Liu et al., 2019a). We follow Raffel et al. (2020) in employing the same unified model for all tasks without adding any task-specific networks, but we allow different best checkpoints to be selected for different tasks. To tell the model which task it is dealing with, we design a unified format of task control codes and prepend it to the source inputs, as shown in Figure 1. For instance, we employ "Translate Java to CSharp:" as the source prompt for the code-to-code translation task from Java to CSharp.

As different tasks have different dataset sizes, we follow Conneau and Lample (2019) and employ a balanced sampling strategy. For $N$ datasets (or tasks), with probabilities $\{q_i\}_{i=1}^{N}$, we define the following multinomial distribution to sample from:

$$q_i = \frac{r_i^{\alpha}}{\sum_{j=1}^{N} r_j^{\alpha}}, \quad \text{where } r_i = \frac{n_i}{\sum_{k=1}^{N} n_k}$$

where $n_i$ is the number of examples for the $i$-th task and $\alpha$ is set to 0.7. This balanced sampling aims to alleviate the bias towards high-resource tasks.

4 Experimental Setup

4.1 Pre-training Dataset

We follow Feng et al. (2020) and employ CodeSearchNet (Husain et al., 2019) to pre-train CodeT5; it consists of six PLs with both unimodal and bimodal data. Apart from that, we additionally collect two datasets of C/CSharp from BigQuery to ensure that all downstream tasks have PLs that overlap with the pre-training data. In total, we employ around 8.35 million instances for pre-training. Table 1 shows some basic statistics.
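As a rough illustration of the balanced sampling strategy from Section 3.3, the sketch below computes task sampling probabilities under the assumption that the distribution takes the exponentiated form used by Conneau and Lample (2019), that is, q_i proportional to (n_i / Σ n_k)^α with α = 0.7. The task names and dataset sizes are hypothetical and chosen only to make the effect visible.

```python
import random


def balanced_probs(sizes, alpha=0.7):
    """Return sampling probabilities q_i proportional to (n_i / sum(n)) ** alpha."""
    total = sum(sizes)
    ratios = [n / total for n in sizes]
    norm = sum(r ** alpha for r in ratios)
    return [r ** alpha / norm for r in ratios]


# Hypothetical dataset sizes, for illustration only.
sizes = {"summarize": 100_000, "translate": 10_000, "defect": 1_000}
probs = balanced_probs(list(sizes.values()))

for task, n, q in zip(sizes, sizes.values(), probs):
    raw_share = n / sum(sizes.values())
    print(f"{task:10s} raw share {raw_share:.3f} -> sampling probability {q:.3f}")

# Draw the task for each of 8 training batches with the smoothed probabilities.
rng = random.Random(0)
batch_tasks = rng.choices(list(sizes), weights=probs, k=8)
```

With α below 1, low-resource tasks are sampled more often than their raw share of the data and high-resource tasks less often; α = 1 recovers plain proportional sampling.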
To obtain the identifier labels from code, we use tree-sitter to convert the PL into an abstract syntax tree and then extract its node type information. We filter out reserved keywords for each PL from its identifier list. We observe that PLs have different identifier rates: Go has the lowest rate of 19% and Ruby has the highest rate of 32%.

4.2 Code-specific Tokenizer

Tokenization is a key ingredient in the success of pre-trained language models like BERT and GPT. They often employ a Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) to alleviate Out-of-Vocabulary (OoV) issues. Specifically, we train a Byte-level BPE tokenizer following Radford et al. (2019) and set the vocabulary size to 32,000, as in T5. We add additional special tokens ([PAD], [CLS], [SEP], [MASK0], ..., [MASK99]). This tokenizer is trained on all of our pre-training data, with non-printable characters and low-frequency tokens (occurring fewer than 3 times) filtered out. We compare it with T5's default tokenizer and find that our tokenizer reduces the length of the tokenized code sequence by 30% to 45% on downstream tasks. This accelerates training and especially benefits generation tasks, since the sequence to predict is shorter. We also spot a severe problem when applying T5's default tokenizer to source code: it encodes some common code tokens, such as the brackets '{' and '}', as unknown tokens.

4.3 Downstream Tasks and Metrics

We cover most generation and understanding tasks in the CodeXGLUE benchmark (Lu et al., 2021) and, following it, employ the provided public datasets and the same data splits for all these tasks.

We first consider two cross-modal generation tasks. Code summarization aims to summarize a function-level code snippet into an English description. The dataset consists of six PLs, namely Ruby, JavaScript, Go, Python, Java, and PHP, from CodeSearchNet (Husain et al., 2019). We employ the smoothed BLEU-4 (Lin and Och, 2004) to evaluate this task.
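Returning to the identifier labels of Section 4.1, the labeling step can be sketched for a single language. The example below swaps in Python's standard `tokenize` and `keyword` modules for tree-sitter, so it only handles Python source and only approximates AST node types: any name token that is not a reserved keyword is labeled as an identifier. The function names are hypothetical.

```python
import io
import keyword
import tokenize

# Layout-only token types that carry no code content.
SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def identifier_labels(source):
    """Return (token, label) pairs; label 1 marks a non-keyword name."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        # Reserved keywords are filtered out of the identifier list;
        # builtins such as len would still count as identifiers here.
        is_identifier = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        pairs.append((tok.string, int(is_identifier)))
    return pairs


def identifier_rate(pairs):
    """Fraction of code tokens that are identifiers."""
    return sum(label for _, label in pairs) / len(pairs)


SNIPPET = "def add(a, b):\n    return a + b\n"
pairs = identifier_labels(SNIPPET)
print(pairs)
print(f"identifier rate: {identifier_rate(pairs):.2f}")
```

The resulting label sequence plays the role of y from Section 3.1, and averaging it over a corpus gives a per-language identifier rate of the kind reported above.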
Code generation is the task of generating a code snippet based on NL descriptions. We employ the Concode data (Iyer et al., 2018) in Java, where the input contains both NL text and class environment context, and the output is a function. We evaluate it with BLEU-4, exact match (EM) accuracy, and CodeBLEU (Ren et al., 2020), which considers syntactic and semantic matches based on the code structure in addition to the n-gram match.

In addition, we consider two code-to-code generation tasks. Code translation aims to migrate legacy software from one PL to another; we focus on translating functions from Java to CSharp and vice versa. Code refinement aims to convert a buggy function into a correct one. We employ two Java datasets provided by Tufano et al. (2019) with different function lengths: small (fewer than 50 tokens) and medium (50-100 tokens). We use BLEU-4 and exact match to evaluate them.

We also investigate how CodeT5 performs on two understanding-based tasks. The first is defect detection, which aims to predict whether a piece of code is vulnerable to software systems or not. We use the C dataset provided by Zhou et al. (2019) for this experiment. The second is clone detection, which aims to measure the similarity between two code snippets and predict whether they have the same functionality. We experiment with the Java data provided by Wang et al. (2020). We use accuracy to evaluate defect detection and F1 score to evaluate clone detection. In total, CodeT5 supports six tasks and fourteen sub-tasks in CodeXGLUE with a unified encoder-decoder model.

4.4 Comparison Models

We compare CodeT5 with state-of-the-art (SOTA) pre-trained models, which can be categorized into three types: encoder-only, decoder-only, and encoder-decoder models.
As encoder-only models, we consider RoBERTa (Liu et al., 2019b), RoBERTa (code) trained with masked language modeling (MLM) on code, CodeBERT (Feng et al., 2020) trained with both MLM and replaced token detection (Clark et al., 2020), GraphCodeBERT (Guo et al., 2021) using data flow from code, and DOBF (Rozière et al., 2021). Note that although DOBF employs a Seq2Seq model during pre-training, it only aims to train a better encoder for downstream tasks, without exploring the potential benefit of the pre-trained decoder.

For decoder-only models, we compare GPT-2 (Radford et al., 2019) and its adaptations to the code domain, CodeGPT-2 and CodeGPT-adapted. The difference is that the latter uses a GPT-2 checkpoint for model initialization while the former is trained from scratch. As an encoder-decoder model, we consider PLBART (Ahmad et al., 2021), the current SOTA model on the CodeXGLUE benchmark, which is based on the BART (Lewis et al., 2020) architecture. For pre-training data, most of these models employ CodeSearchNet (Husain et al., 2019), except DOBF and PLBART. DOBF is pre-trained on 7.9M Java and 3.6M Python files from BigQuery, while PLBART uses much larger data with 470M Python and 210M Java functions and 47M NL posts from StackOverflow.

4.5 Model Configurations

We build CodeT5 on Huggingface's PyTorch implementation of T5 (Raffel et al., 2020) and employ two sizes, CodeT5-small (60M) and CodeT5-base (220M). We set the maximum source and target sequence lengths to 512 and 256, respectively. We use FP16 mixed precision to accelerate pre-training. We set the batch size to 1024 and employ a peak learning rate of 2e-4 with linear decay. We pre-train the model with the denoising objective for 100 epochs and with bimodal dual training for a further 50 epochs on a cluster of 16 NVIDIA A100 GPUs with 40G memory.
The total training time for CodeT5-small and CodeT5-base is 5 and 12 days, respectively.

In the fine-tuning phase, we find that the tasks in CodeXGLUE (Lu et al., 2021) are quite sensitive to some hyperparameters such as learning rate, training steps, and batch size. We conduct a grid search and select the best parameters based on the validation set. In multi-task learning, we cover all downstream tasks except clone detection.

5 Results and Analysis

In this section, we compare CodeT5 with SOTA models on a broad set of CodeXGLUE downstream tasks (§5.1) and investigate the effects of our bimodal dual generation and multi-task learning (§5.2), followed by a detailed analysis of the proposed identifier-aware pre-training (§5.3).

5.1 CodeXGLUE Downstream Tasks

We evaluate two sizes of our model, CodeT5-small and CodeT5-base, both pre-trained with identifier-aware denoising. In addition, we consider the model that continues training with bimodal dual generation (dual-gen), and we show the results with multi-task fine-tuning. The results of all comparison models are taken from their original papers and from the CodeXGLUE paper (Lu et al., 2021).

Code Summarization. We show code summarization results of smoothed BLEU-4 on six PLs in Table 2. We observe that all our model variants significantly outperform prior work with either an encoder-only (RoBERTa, CodeBERT, DOBF) or an encoder-decoder framework (PLBART). Moreover, the salient performance gap between these two groups of models confirms that encoder-only frameworks are suboptimal for generation tasks. Compared to the SOTA encoder-decoder model PLBART, we find that even our CodeT5-small yields better overall scores (also on Python and Java), even though our model is much smaller (60M vs. 140M) and PLBART is pre-trained with much larger Python and Java data (> 100 times).
By increasing the model size, our CodeT5-base boosts the overall performance by over 1.2 absolute points over PLBART.

Code Generation. We compare CodeT5 with GPT-style models and PLBART in Table 3. Our CodeT5-small outperforms all decoder-only models and also the SOTA PLBART, which again confirms the superiority of encoder-decoder models at generating code snippets. Moreover, our CodeT5-base significantly pushes the SOTA results further across all three metrics. In particular, it achieves an improvement of around 4.7 points on CodeBLEU over PLBART, indicating that CodeT5 can better comprehend the code syntax and semantics with the help of identifier-aware pre-training.

Code-to-Code Generation Tasks. We compare two code-to-code generation tasks, code translation and code refinement, in Table 4. In the code translation task, our CodeT5-small outperforms most baselines and obtains results comparable to PLBART, which shows the advantages of encoder-decoder models in the code-to-code generation setting.

We show one CodeT5 output for translating C# to Java in Figure 3. In this case, despite the poor BLEU score, CodeT5 is able to generate a function that preserves the same functionality and even has better readability than the ground truth. This reveals that CodeT5 has good generalization ability instead of memorizing and repeating what it has seen before. On the other hand, it also suggests that the BLEU score is not a perfect evaluation metric for code generation tasks, where a higher score can sometimes instead reflect the problematic copying behavior of neural models.

Another code-to-code generation task is code refinement, a challenging task that requires detecting which parts of the code are buggy and fixing them by generating a bug-free code sequence.
Due to the large overlap between source and target code, even the naive copy approach yields very high BLEU scores but zero exact matches. As shown in Table 4, we observe that EM scores for the small data are consistently higher than for the medium data, indicating that it is harder to fix bugs in a longer code snippet. Our CodeT5-base significantly outperforms all baselines on EM and in particular gains over 4.8 points on the more challenging medium task (13.96 vs. GraphCodeBERT's 9.10), reflecting its strong code understanding capability.

Understanding Tasks. We compare on the two understanding tasks, defect detection and clone detection, in Table 5. Specifically, we generate the binary labels as a unigram sequence from the decoder for the defect detection task, while for the clone detection task, we first obtain the sequence embedding of each code snippet using the last decoder state following Lewis et al. (2020) and then predict the labels by measuring their similarity. Both CodeT5-small and CodeT5-base outperform all baselines on the defect detection task, and CodeT5-base yields a 2.6-point accuracy improvement over PLBART. For the clone detection task, our CodeT5 models achieve results comparable to the SOTA GraphCodeBERT and PLBART models. These results demonstrate that, with an encoder-decoder framework, CodeT5 can still be adapted well to understanding tasks.

5.2 Effects of Bimodal Dual Generation and Multi-task Learning

We examine the effects of bimodal dual generation at pre-training and multi-task learning at fine-tuning. The bimodal pre-training brings consistent improvements for code summarization and generation tasks on both CodeT5-small and CodeT5-base. However, this pre-training task does not help, and sometimes even slightly hurts, the performance on PL-PL generation and understanding tasks.
We suspect this is because bimodal dual generation learns a better alignment between PL and NL, which naturally benefits the former tasks, as they involve both PL and NL. As a side effect, this objective could bias the model towards the PL-NL tasks and affect its performance on PL-PL tasks.

Multi-task learning generally improves most downstream tasks, except code translation and defect detection. In particular, it largely boosts the performance on code summarization, which is not surprising: code summarization accounts for the largest portion of the sub-tasks (six out of thirteen) and thereby benefits the most from multi-task learning. Multi-task training with defect detection may additionally enable the model to better comprehend the code semantics for bug detection, which is also a necessary intermediate step for code refinement.

5.3 Analyzing Identifier-aware Pre-training

We provide an ablation study to examine the contribution of each component in our identifier-aware objective. Specifically, we compare the performance of our CodeT5-small on four selected tasks by ablating each of the three objectives: masked span prediction (MSP), identifier tagging (IT), and masked identifier prediction (MIP). As shown in Table 6, we observe that removing any one of the objectives generally reduces the performance on all tasks, indicating that all objectives contribute to the better code understanding of CodeT5. However, the effect of each objective differs across tasks. Specifically, removing MSP largely reduces the performance on all generation tasks but instead increases the defect detection performance. This shows that masked span prediction is more crucial for capturing syntactic information for generation tasks.
By contrast, removing MIP hurts the defect detection task the most, indicating that it may focus more on code semantic understanding.

We further provide outputs from CodeT5 and its variant without MIP and IT on code generation in Figure 4. We observe that CodeT5 correctly generates the exact function, while the model without MIP and IT fails to recover the identifiers "s2" and "hasField".

We also investigate the identifier tagging performance and find that it achieves over 99% F1 for all PLs, showing that CodeT5 can confidently distinguish identifiers in code. We then check whether the MSP and MIP tasks conflict, as they employ the same sentinel tokens for masking. In identifier masking, all occurrences of one unique identifier are replaced with the same sentinel token, resulting in a many-to-one mapping, compared to the one-to-one mapping in span prediction. We compare models pre-trained with only MSP, only MIP, and both, on these two tasks in Table 7. We report the prediction accuracy and the ratio of how often the models generate the same number of predictions as there are sentinel tokens. We observe that pre-training with only MIP or only MSP biases the model towards that task, leading to poor accuracy and a higher mismatch in the number of predictions when applied to the other task. Interestingly, we find that the MIP-only objective recovers the correct number of predictions in the MSP task better than the MSP-only objective does in the MIP task, meaning that it is easier to adapt from a many-to-one mapping to a one-to-one mapping than the reverse. Finally, combining the two helps our model strike a good trade-off on both tasks.

6 Conclusion

We have presented CodeT5, a pre-trained encoder-decoder model that incorporates the token type information from code.
We propose a novel identifier-aware pre-training objective to better leverage the identifiers, and a bimodal dual generation task to learn a better NL-PL alignment using code and its comments. Our unified model can support both code understanding and generation tasks and allows for multi-task learning. Experiments show that CodeT5 significantly outperforms all prior work in most CodeXGLUE tasks. Further analysis also reveals its better code comprehension capability across various programming languages.

Broader Impact and Ethical Consideration

Our work generally belongs to NLP applications for software intelligence. With the goal of improving software development productivity with machine learning methods, software intelligence research has attracted increasing attention in both academia and industry over the last decade. Software code intelligence techniques can help developers reduce tedious repetitive workloads, enhance programming quality and improve overall software development productivity. This would considerably decrease their working time and could also potentially reduce computation and operational costs, as a bug might degrade system performance or even crash the entire system. Our work addresses the fundamental challenge of software code pre-training; our study covers a wide range of code intelligence applications in the software development lifecycle; and the proposed CodeT5 method achieves state-of-the-art performance on many of the benchmark tasks, showing its great potential benefit towards this goal.

We further discuss the ethical considerations of training CodeT5 and the potential risks when applying it in real-world downstream applications:

Dataset bias. The training datasets in our study are source code, including user-written comments, from open-source GitHub repositories; they are publicly available and are not tied to any specific application. However, it is possible that these datasets encode some stereotypes, such as those about race and gender, from the text comments or even from the source code, for example in variable, function and class names. As such, social biases would be intrinsically embedded into the models trained on them. As suggested by Chen et al. (2021), interventions such as filtration or modulation of generated outputs may help to mitigate these biases in code corpus.

Computational cost. Our model pre-training requires non-trivial computational resources, though we have tried our best to carefully design and refine our experiments to avoid unnecessary computation costs. In fact, compared to the recent large-scale language model Codex (Chen et al., 2021), our CodeT5-base has a much smaller model size of 220M compared to their 12B (about 55× smaller). In addition, we experiment on Google Cloud Platform, which purchases carbon credits to reduce its carbon footprint; e.g., training CodeT5-base produced around 49.25 kg CO2, which was totally offset by the provider. Furthermore, we release our pre-trained models publicly to avoid repeated training for the code intelligence research community.

Automation bias. As CodeT5 can be deployed to provide coding assistance, such as code generation to aid developers, the automation bias of machine learning systems should be carefully considered, especially for developers who tend to over-rely on model-generated outputs. Sometimes these systems might produce functions that superficially appear correct but do not actually align with the developer's intents. If developers unintentionally adopt these incorrect code suggestions, they might spend much longer on debugging, and the suggestions might even lead to significant safety issues. We suggest that practitioners using CodeT5 always bear in mind that its generated outputs should only be taken as references, which require domain experts for further correctness and security checking.
Security implications. We train CodeT5 on existing code corpora including CodeSearchNet (Husain et al., 2019) and a small fraction of Google BigQuery, both of which are originally collected from public GitHub repositories. Pre-trained models might encode some sensitive information (e.g., personal addresses or identification numbers) from the training data. Though we have conducted multiple rounds of data cleaning to mitigate this before training our models, it is still possible that some sensitive information cannot be completely removed. Besides, due to the non-deterministic nature of generation models like CodeT5, it might produce some vulnerable code that harms the software, and it could even benefit more advanced malware development when deliberately misused.

Acknowledgements

We thank Akhilesh Deepak Gotmare, Amrita Saha, Junnan Li, and Chen Xing for valuable discussions. We thank Kathy Baxter for the ethical review. We also thank our anonymous reviewers for their insightful feedback on our paper.

References

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2655–2668. Association for Computational Linguistics.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. Pymt5: multi-mode translation of natural language and python code with transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9052–9065. Association for Computational Linguistics.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7057–7067.
Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia Marçal de Oliveira. 2005. A study of the documentation essential to software maintenance. In Proceedings of the 23rd Annual International Conference on Design of Communication: documenting & Designing for Pervasive Information, SIGDOC 2005, Coventry, UK, September 21-23, 2005, pages 68–75. ACM.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042–13054.
Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. Codetrans: Towards cracking the language of silicone's code through self-supervised deep learning and high performance computing. CoRR, abs/2104.02443.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Codebert: A pre-trained model for programming and natural languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 1536–1547. Association for Computational Linguistics.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. Graphcodebert: Pre-training code representations with data flow. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1643–1652. Association for Computational Linguistics.
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5110–5121. PMLR.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics.
Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland.
Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, September 21-25, 2020, pages 473–485.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4487–4496. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664.
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the usage of text-to-text transfer transformer to support code-related tasks. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021, pages 336–347. IEEE.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. CoRR, abs/2009.10297.
Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Baptiste Rozière, Marie-Anne Lachaux, Marc Szafraniec, and Guillaume Lample. 2021. DOBF: A deobfuscation pre-training objective for programming languages. CoRR, abs/2102.07492.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936.
Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: enhanced representation through knowledge integration. CoRR, abs/1904.09223.
Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: code generation using transformer. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, pages 1433–1443. ACM.
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Trans. Softw. Eng. Methodol., 28(4):19:1–19:29.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, pages 261–271.
Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 10197–10207.
Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. 2021. Language-agnostic representation learning of source code from structure and context. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

This paper is available under the CC BY 4.0 Deed (Attribution 4.0 International) license.