Salesforce's CodeT5 Could Change How Code Is Written and Understood

Authors:
Yue Wang, wang.y@salesforce.com (Salesforce Research Asia)
Weishi Wang, weishi.wang@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore)
Shafiq Joty, sjoty@salesforce.com (Salesforce Research Asia; Nanyang Technological University, Singapore)
Steven C.H. Hoi, shoi@salesforce.com (Salesforce Research Asia)

Abstract

Pre-trained models for natural languages (NL) such as BERT and GPT have recently been shown to transfer well to programming languages (PL) and to benefit a broad set of code-related tasks. Despite their success, most current methods either rely on encoder-only (or decoder-only) pre-training, which is suboptimal for generation (respectively, understanding) tasks, or process a code snippet in the same way as NL, neglecting PL-specific characteristics such as token types. We present CodeT5, a unified pre-trained encoder-decoder model that better leverages the code semantics conveyed by developer-assigned identifiers. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5.

1 Introduction

Pre-trained language models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and T5 (Raffel et al., 2020) have greatly boosted performance across a wide range of natural language processing (NLP) tasks. They typically employ a pre-train then fine-tune paradigm that aims to derive generic language representations through self-supervised training on large-scale unlabeled data. These representations can be transferred to benefit many downstream tasks, especially those with limited data annotation. Inspired by this success, many recent attempts adapt these pre-training methods to programming languages (Svyatkovskiy et al., 2020; Kanade et al., 2020; Feng et al., 2020), showing promising results on code-related tasks.

However, despite their success, most of these models rely on either an encoder-only model similar to BERT (Svyatkovskiy et al., 2020; Feng et al., 2020) or a decoder-only model like GPT (Kanade et al., 2020),
which is suboptimal for generation and understanding tasks, respectively. For example, CodeBERT (Feng et al., 2020) requires an additional decoder when applied to the code summarization task, and this decoder cannot benefit from the pre-training. In addition, most existing methods simply apply conventional NLP pre-training techniques to source code by treating it as a sequence of tokens like NL.

In this work, we present CodeT5, a pre-trained encoder-decoder model that takes the token type information in code into account. CodeT5 builds on the T5 architecture (Raffel et al., 2020), which employs denoising sequence-to-sequence (Seq2Seq) pre-training and has been shown to benefit both understanding and generation tasks in natural language. In addition, we propose to leverage the developer-assigned identifiers in code. When writing programs, developers tend to use informative identifiers to make the code more understandable, so these identifiers generally preserve rich code semantics, e.g., the "BinarySearch" identifier in Figure 2. To fuse such code-specific knowledge, we propose a novel identifier-aware objective that trains the model to distinguish which tokens are identifiers and to recover them when they are masked.

Furthermore, we propose to leverage code and its accompanying comments to learn a better NL-PL alignment. Developers often provide documentation for programs to facilitate software maintenance (de Souza et al., 2005), so such PL-NL pairs are widely available in most source code. Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and optimize the model on both simultaneously.

We pre-train CodeT5 on the CodeSearchNet data (Husain et al., 2019) following Feng et al. (2020), and fine-tune it on most tasks in the CodeXGLUE benchmark (Lu et al., 2021), including two understanding tasks (code defect detection and clone detection) and generation tasks such as code summarization, generation, translation, and refinement.
As shown in Figure 1, we also explore multi-task learning to fine-tune CodeT5 on multiple tasks at a time, using a task control code as the source prompt.

We present CodeT5, one of the first unified encoder-decoder models to support both code understanding and generation tasks, which also allows for multi-task learning. We propose a novel identifier-aware pre-training objective that considers the crucial token type information (identifiers) in code. In addition, we propose to leverage the NL-PL pairs that are naturally available in source code to learn a better cross-modal alignment. Extensive experiments show that CodeT5 yields state-of-the-art results on the fourteen sub-tasks in CodeXGLUE. Further analysis shows that CodeT5 can better capture code semantics with the proposed identifier-aware pre-training, and that bimodal dual generation primarily benefits NL↔PL tasks.

2 Related Work

Pre-training on Natural Language. Pre-trained models based on Transformer architectures (Vaswani et al., 2017) can be generally categorized into three groups: encoder-only models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ELECTRA (Clark et al., 2020); decoder-only models like GPT (Radford et al., 2019); and encoder-decoder models such as MASS (Song et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). Compared with encoder-only and decoder-only models, which respectively favor understanding and generation tasks, encoder-decoder models can support both types of tasks well.
Pre-training on Programming Language. CuBERT (Kanade et al., 2020) employs BERT's masked language modeling objective to derive generic code-specific representations, and CodeBERT (Feng et al., 2020) further adds a replaced token detection task (Clark et al., 2020) to learn NL-PL cross-modal representations. Besides the BERT-style models, Svyatkovskiy et al. (2020) and Liu et al. (2020) respectively employ GPT and UniLM (Dong et al., 2019) for the code completion task. TransCoder (Rozière et al., 2020) explores programming language translation in an unsupervised setting. Different from them, we explore encoder-decoder models based on T5 for programming language pre-training and support a more comprehensive set of tasks.

Some emerging work in the recent literature (Clement et al., 2020; Mastropaolo et al., 2021; Elnaggar et al., 2021) also explores the T5 framework on code, but it focuses only on a limited subset of generation tasks and does not support understanding tasks as we do. Apart from these, PLBART (Ahmad et al., 2021), based on another encoder-decoder model, BART, can also support both understanding and generation tasks.
Recently, GraphCodeBERT (Guo et al., 2021) incorporates the data flow extracted from the code structure into CodeBERT, while Rozière et al. (2021) propose a deobfuscation objective to leverage the structural aspect of PL. These models focus only on training a better code-specific encoder. Zügner et al. (2021) propose to capture the relative distances between code tokens over the code structure. By contrast, we specifically focus on the identifiers, which preserve rich code semantics, and fuse this information into a Seq2Seq model via two novel identifier tagging and prediction tasks.

3 CodeT5

CodeT5 builds on an encoder-decoder framework with the same architecture as T5 (Raffel et al., 2020). It aims to derive generic representations for programming language (PL) and natural language (NL) via pre-training on unlabeled source code. As illustrated in Figure 2, we extend the denoising Seq2Seq objective in T5 with two identifier tagging and prediction tasks that enable the model to better leverage the token type information in PL, namely the identifiers assigned by developers. To improve the NL-PL alignment, we further propose a bimodal dual learning objective for bidirectional conversion between NL and PL.

In the following, we introduce how CodeT5 encodes PL and NL inputs (§3.1) and our proposed identifier-aware pre-training tasks (§3.2), followed by fine-tuning with task-specific transfer learning and multi-task training (§3.3).

3.1 Encoding NL and PL

At the pre-training stage, our model receives either PL-only or NL-PL inputs, depending on whether the code snippet has an accompanying NL description. For NL-PL bimodal inputs, we concatenate the two segments with a delimiter token [SEP] and represent the whole input sequence as $x = (\text{[CLS]}, w_1, \ldots, w_n, \text{[SEP]}, c_1, \ldots, c_m, \text{[SEP]})$, where $n$ and $m$ denote the number of NL word tokens and PL code tokens, respectively. The NL word sequence is empty for PL-only unimodal inputs.

To capture more code-specific features, we propose to leverage token type information from code. We focus on identifiers (e.g., function names and variables) because they are among the most PL-agnostic features and preserve rich code semantics. Specifically, we convert the PL segment into an Abstract Syntax Tree (AST) and extract the node type of each code token.
Finally, we construct a sequence of binary labels $y \in \{0, 1\}^m$ for the PL segment, where each $y_i \in \{0, 1\}$ indicates whether the code token $c_i$ is an identifier or not.

3.2 Pre-training Tasks

We now introduce our proposed pre-training tasks, which enable CodeT5 to learn useful patterns from either PL-only or NL-PL bimodal data.

Identifier-aware Denoising Pre-training. Denoising Seq2Seq pre-training has been shown to be quite effective in a broad set of NLP tasks (Song et al., 2019; Raffel et al., 2020; Lewis et al., 2020). This denoising objective typically first corrupts the source sequence with some noising functions and then requires the decoder to recover the original text. In this work, we use a span masking objective similar to T5 (Raffel et al., 2020) that randomly masks spans of arbitrary length and then predicts these masked spans, combined with some sentinel tokens, at the decoder. We refer to this task as Masked Span Prediction (MSP), as illustrated in Figure 2 (a).

Specifically, we employ the same 15% corruption rate as T5 and ensure an average span length of 3 by uniformly sampling spans of 1 to 5 tokens. Moreover, we employ whole word masking by sampling spans before subword tokenization, which avoids masking partial sub-tokens and has been shown to be helpful (Sun et al., 2019). Notably, we pre-train a shared model for various PLs to learn robust cross-lingual representations. We describe the masked span prediction loss as:

$$\mathcal{L}_{MSP}(\theta) = \sum_{t=1}^{k} -\log P_\theta\left(x^{mask}_{t} \mid x^{\backslash mask}, x^{mask}_{<t}\right),$$

where $\theta$ are the model parameters, $x^{\backslash mask}$ is the masked input, $x^{mask}$ is the masked sequence to predict from the decoder with $k$ denoting the number of tokens in $x^{mask}$, and $x^{mask}_{<t}$ is the span sequence generated so far.

To fuse more code-specific structural information (the identifier node type in the AST) into the model, we propose two additional tasks, Identifier Tagging (IT) and Masked Identifier Prediction (MIP), to complement the denoising pre-training.

• Identifier Tagging (IT). It aims to give the model the knowledge of whether a code token is an identifier or not, in a similar spirit to syntax highlighting in some developer tools. As shown in Figure 2 (b), we map the final hidden states of the PL segment at the CodeT5 encoder into a sequence of probabilities $p = (p_1, \ldots, p_m)$ and compute a binary cross-entropy loss for sequence labeling:

$$\mathcal{L}_{IT}(\theta_e) = \sum_{i=1}^{m} -\left[y_i \log p_i + (1 - y_i) \log (1 - p_i)\right],$$

where $\theta_e$ are the encoder parameters. Note that by casting the task as a sequence labeling problem, the model is expected to capture the code syntax and the data flow structures of the code.

• Masked Identifier Prediction (MIP). Different from the random span masking in MSP, we mask all identifiers in the PL segment and employ a unique sentinel token for all occurrences of one specific identifier. In software engineering this is called obfuscation, where changing identifier names does not affect the code semantics. Inspired by Rozière et al. (2021), we arrange the unique identifiers with the sentinel tokens into a target sequence $I$, as shown in Figure 2 (c). We then predict it in an auto-regressive manner:

$$\mathcal{L}_{MIP}(\theta) = \sum_{j=1}^{|I|} -\log P_\theta\left(I_j \mid x^{\backslash I}, I_{<j}\right),$$

where $x^{\backslash I}$ is the masked input. Note that deobfuscation is a more challenging task that requires the model to comprehend the code semantics based on obfuscated code and to link the occurrences of the same identifier together.

We alternately optimize these three losses with equal probability, which constitutes our proposed identifier-aware denoising pre-training.

Bimodal Dual Generation. In the pre-training phase, the decoder only sees discrete masked spans and identifiers, which differs from the downstream tasks, where the decoder needs to generate either fluent NL text or syntactically correct code snippets. To close the gap between pre-training and fine-tuning, we propose to leverage the NL-PL bimodal data to train the model for bidirectional conversion, as shown in Figure 2 (d). Specifically, we regard NL→PL generation and PL→NL generation as dual tasks and optimize the model on both simultaneously. For each NL-PL bimodal datapoint, we construct two training instances with reverse directions and add language ids (e.g., <java> and <en> for Java PL and English NL, respectively). This operation can also be seen as a special case of T5's span masking, in which either the full NL or the full PL segment of the bimodal input is masked. This task aims to improve the alignment between the NL and PL counterparts.
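The identifier tagging labels and the masked identifier prediction input/target pair described above can be illustrated with a small sketch. This is only an approximation under stated assumptions: it swaps in Python's standard `tokenize` module for the AST-based extraction described in the text, it handles Python source only, it treats every non-keyword name (including built-ins) as an identifier, and the function names `tag_identifiers` and `mask_identifiers` are hypothetical. The `[MASK0]`, `[MASK1]`, ... sentinel naming is borrowed from the tokenizer's special tokens; the exact target layout used by the authors may differ.

```python
import io
import keyword
import tokenize

# Token types that carry no code content for this illustration.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def tag_identifiers(source):
    """Return (tokens, labels); labels[i] is 1 if tokens[i] is an identifier.

    Assumption: an identifier is any NAME token that is not a Python keyword.
    """
    tokens, labels = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        tokens.append(tok.string)
        is_identifier = (
            tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        )
        labels.append(1 if is_identifier else 0)
    return tokens, labels


def mask_identifiers(tokens, labels):
    """Build an obfuscated input and a deobfuscation target.

    Every occurrence of the same identifier is replaced by the same
    sentinel token; the target lists each sentinel followed by its name.
    """
    sentinel_of = {}
    masked = []
    for token, label in zip(tokens, labels):
        if label:
            if token not in sentinel_of:
                sentinel_of[token] = f"[MASK{len(sentinel_of)}]"
            masked.append(sentinel_of[token])
        else:
            masked.append(token)
    target = []
    for name, sentinel in sentinel_of.items():
        target.extend([sentinel, name])
    return masked, target


if __name__ == "__main__":
    code = "def add(a, b):\n    total = a + b\n    return total\n"
    toks, y = tag_identifiers(code)
    x_masked, target = mask_identifiers(toks, y)
    print(" ".join(x_masked))
    print(" ".join(target))
```

In this toy run, both occurrences of `total` map to the same sentinel, which is the many-to-one behaviour that distinguishes identifier masking from random span masking.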
3.3 Fine-tuning CodeT5

After pre-training on large-scale unlabeled data, we adapt CodeT5 to downstream tasks via either task-specific transfer learning or multi-task learning.

Task-specific Transfer Learning: Generation vs. Understanding Tasks. Code-related tasks can be categorized into generation and understanding tasks. For the former, CodeT5 can be naturally adapted with its Seq2Seq framework. For understanding tasks, we investigate two approaches: generating the label as a unigram target sequence (Raffel et al., 2020), or predicting it from the vocabulary of class labels based on the last decoder hidden state, following Lewis et al. (2020).

Multi-task Learning. We also explore a multi-task learning setting by training a shared model on multiple tasks at a time. Multi-task learning can reduce computation cost by reusing most of the model weights for many tasks, and it has been shown to improve model generalization in NL pre-training (Liu et al., 2019a). We follow Raffel et al. (2020) and employ the same unified model for all tasks without adding any task-specific networks, but allow different best checkpoints to be selected for different tasks. To inform the model which task it is dealing with, we design a unified format of task control codes and prepend it to the source inputs, as shown in Figure 1. For instance, we use "Translate Java to CSharp:" as the source prompt for the code-to-code translation task from Java to CSharp.

As different tasks have different dataset sizes, we follow Conneau and Lample (2019) and employ a balanced sampling strategy. For $N$ datasets (or tasks), with probabilities $\{q_i\}_{i=1}^{N}$, we define the following multinomial distribution to sample from:

$$q_i = \frac{r_i^{\alpha}}{\sum_{j=1}^{N} r_j^{\alpha}}, \qquad r_i = \frac{n_i}{\sum_{k=1}^{N} n_k},$$

where $n_i$ is the number of examples for the $i$-th task and $\alpha$ is set to 0.7. This balanced sampling aims to alleviate the bias towards high-resource tasks.

4 Experimental Setup

4.1 Pre-training Dataset

We follow Feng et al.
(2020) and employ CodeSearchNet (Husain et al., 2019) to pre-train CodeT5, which consists of six PLs with both unimodal and bimodal data. In addition, we collect two datasets of C/CSharp from BigQuery to ensure that all downstream tasks have PLs that overlap with the pre-training data. In total, we employ around 8.35 million instances for pre-training. Table 1 shows some basic statistics. To obtain the identifier labels from code, we use tree-sitter to convert the PL into an abstract syntax tree and then extract its node type information.

4.2 Code-specific Tokenizer

Tokenization is a key ingredient in the success of pre-trained language models like BERT and GPT. They often employ a Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) to alleviate Out-of-Vocabulary (OoV) issues. Specifically, we train a byte-level BPE tokenizer following Radford et al. (2019) and set the vocabulary size to 32,000, as in T5. We add additional special tokens ([PAD], [CLS], [SEP], [MASK0], ..., [MASK99]). This tokenizer is trained on all of our pre-training data, with non-printable characters and low-frequency tokens (occurring fewer than 3 times) filtered out. We compare it with T5's default tokenizer and find that our tokenizer reduces the length of the tokenized code sequence by 30% to 45% on downstream tasks. This accelerates training and especially benefits generation tasks, because the sequence to predict is shorter. We also spot a severe problem when applying T5's default tokenizer to source code: it encodes some common code tokens, such as the brackets ['{', '}'], as unknown tokens.

4.3 Downstream Tasks and Metrics

We cover most generation and understanding tasks in the CodeXGLUE benchmark (Lu et al., 2021) and employ the provided public datasets and the same data splits for all these tasks.

We first consider two cross-modal generation tasks. Code summarization aims to summarize a function-level code snippet into English descriptions.
The dataset consists of six PLs, including Ruby, JavaScript, Go, Python, Java, and PHP, from CodeSearchNet (Husain et al., 2019). We employ smoothed BLEU-4 (Lin and Och, 2004) to evaluate this task. Code generation is the task of generating a code snippet based on NL descriptions. We employ the Concode data (Iyer et al., 2018) in Java, where the input contains both NL texts and class environment contexts, and the output is a function. We evaluate it with BLEU-4, exact match (EM) accuracy, and CodeBLEU (Ren et al., 2020), which considers syntactic and semantic matches based on the code structure in addition to the n-gram match.

Besides, we consider two code-to-code generation tasks. Code translation aims to migrate legacy software from one PL to another, where we focus on translating functions from Java to CSharp and vice versa. Code refinement aims to convert a buggy function into a correct one. We employ two Java datasets provided by Tufano et al. (2019) with various function lengths: small (fewer than 50 tokens) and medium (50-100 tokens). We use BLEU-4 and exact match to evaluate them.

We also investigate how CodeT5 performs on two understanding-based tasks. The first one is defect detection, which aims to predict whether a piece of code is vulnerable to software systems or not. We use the C dataset provided by Zhou et al. (2019) for this experiment. The second task is clone detection, which aims to measure the similarity between two code snippets and predict whether they have the same functionality. We experiment with the Java data provided by Wang et al. (2020). We employ F1 score and accuracy for evaluating these two tasks, respectively. In total, CodeT5 supports six tasks and fourteen sub-tasks in CodeXGLUE with a unified encoder-decoder model.

4.4 Comparison Models

We compare CodeT5 with state-of-the-art (SOTA) pre-trained models that can be categorized into three types: encoder-only, decoder-only, and encoder-decoder models.
As encoder-only models, we consider RoBERTa (Liu et al., 2019b), RoBERTa (code) trained with masked language modeling (MLM) on code, CodeBERT (Feng et al., 2020) trained with both MLM and replaced token detection (Clark et al., 2020), GraphCodeBERT (Guo et al., 2021) using data flow from code, and DOBF (Rozière et al., 2021) trained with the identifier deobfuscation objective. Note that although DOBF employs a Seq2Seq model during pre-training, it only aims to train a better encoder for downstream tasks without exploring the potential benefit of the pre-trained decoder.

For decoder-only models, we compare GPT-2 (Radford et al., 2019) and its adaptations to the code domain, including CodeGPT-2 and CodeGPT-adapted. The difference is that the latter uses a GPT-2 checkpoint for model initialization while the former is trained from scratch. As encoder-decoder models, the current SOTA model for the CodeXGLUE benchmark is PLBART (Ahmad et al., 2021), based on the BART (Lewis et al., 2020) architecture. For pre-training data, most of these models employ CodeSearchNet (Husain et al., 2019), except DOBF and PLBART. DOBF is pre-trained on 7.9M Java and 3.6M Python files from BigQuery, while PLBART employs much larger data with 470M Python and 210M Java functions, and 47M NL posts from StackOverflow.

4.5 Model Configurations

We build CodeT5 based on Huggingface's T5 (Raffel et al., 2020) PyTorch implementation and employ two sizes, CodeT5-small (60M) and CodeT5-base (220M). We set the maximum source and target sequence lengths to 512 and 256, respectively. We use FP16 mixed precision to accelerate pre-training. We set the batch size to 1024 and employ a peak learning rate of 2e-4 with linear decay. We pre-train the model with the denoising objective for 100 epochs and with bimodal dual training for a further 50 epochs on a cluster of 16 NVIDIA A100 GPUs with 40G memory. The total training time for CodeT5-small and CodeT5-base is 5 and 12 days, respectively.
In the fine-tuning phase, we find that the tasks in CodeXGLUE (Lu et al., 2021) are quite sensitive to some hyperparameters such as learning rate, training steps, and batch size. We conduct a grid search and select the best parameters based on the validation set. In multi-task learning, we cover all downstream tasks except clone detection.

5 Results and Analysis

In this section, we compare CodeT5 with SOTA models on a broad set of CodeXGLUE downstream tasks (§5.1), investigate the effects of our bimodal dual generation and multi-task learning (§5.2), and then give a detailed analysis of the proposed identifier-aware pre-training (§5.3).

5.1 CodeXGLUE Downstream Tasks

We evaluate two sizes of our model, CodeT5-small and CodeT5-base, that are pre-trained with identifier-aware denoising. In addition, we consider the model that continues to train with bimodal dual generation (dual-gen) and show the results with multi-task fine-tuning. The results of all comparison models are obtained from their original papers and from the CodeXGLUE paper (Lu et al., 2021).

Code Summarization. We report code summarization results in smoothed BLEU-4 on the six PL datasets in Table 2. We observe that all our model variants significantly outperform prior work with either an encoder-only (RoBERTa, CodeBERT, DOBF) or an encoder-decoder framework (PLBART). Moreover, the salient performance gap between these two groups of models confirms that encoder-only frameworks are suboptimal for generation tasks. Compared to the SOTA encoder-decoder model PLBART, even our CodeT5-small yields better overall scores (also on Python and Java), although our model is much smaller (60M vs. 140M) and PLBART is pre-trained with much larger Python and Java data (more than 100 times). We attribute this improvement to our identifier-aware denoising pre-training and better use of the bimodal training data. By increasing the model size, our CodeT5-base boosts the overall performance by over 1.2 absolute points over PLBART.
Code Generation. We compare CodeT5 with GPT-style models and PLBART in Table 3. Our CodeT5-small outperforms all decoder-only models as well as the SOTA PLBART, which again confirms the superiority of encoder-decoder models at generating code snippets. Moreover, our CodeT5-base significantly pushes the SOTA results further across the three metrics.

Code-to-Code Generation Tasks. We compare two code-to-code generation tasks, code translation and code refinement, in Table 4, and further consider a naive copy baseline that copies the source input as the target prediction. In the code translation task, our CodeT5-small outperforms most of the baselines and obtains results comparable with PLBART, which shows the advantages of encoder-decoder models in the code-to-code generation setting. Our CodeT5-base further achieves consistent improvements over PLBART across various metrics for translating from Java to C# and vice versa.

Here we show one of CodeT5's outputs when translating C# to Java in Figure 3. In this case, despite the poor BLEU score, CodeT5 is able to generate a function that preserves the same functionality and even has better readability than the ground truth. This reveals that CodeT5 has good generalization ability instead of memorizing and repeating what it has seen before. On the other hand, it also suggests that BLEU is not a perfect evaluation metric for code generation tasks, where a higher score can sometimes instead reflect the problematic copy issues of neural models.

Another code-to-code generation task is code refinement, a challenging task that requires detecting which parts of the code are buggy and fixing them by generating a bug-free code sequence. Due to the large overlap between the source and target code, even the naive copy approach yields very high BLEU scores but zero exact matches. Therefore, we focus on the exact match (EM) metric to evaluate this task.
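The gap between overlap-based scores and exact match for a naive copy baseline can be seen in a toy calculation. The sketch below is an illustration only: it computes a clipped 4-gram precision, which is one ingredient of BLEU rather than the smoothed BLEU-4 used in the paper, and the buggy/fixed token sequences and the function names `ngram_precision` and `exact_match` are hypothetical.

```python
from collections import Counter


def ngram_precision(pred, ref, n=4):
    """Clipped n-gram precision of pred against ref.

    This is a single BLEU ingredient (no brevity penalty, no smoothing,
    no averaging over n), used here only to show high surface overlap.
    """
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in pred_ngrams.items())
    return matched / total


def exact_match(pred, ref):
    """1.0 only if the prediction equals the reference token for token."""
    return 1.0 if pred == ref else 0.0


# Hypothetical refinement pair: the fix changes a single token.
buggy = "if ( x = null ) { return 0 ; } return x . size ( ) ;".split()
fixed = "if ( x == null ) { return 0 ; } return x . size ( ) ;".split()

# Naive copy baseline: predict the buggy source unchanged.
copy_prediction = list(buggy)

if __name__ == "__main__":
    print(ngram_precision(copy_prediction, fixed))
    print(exact_match(copy_prediction, fixed))
```

Copying the input keeps most 4-grams intact (11 of 15 in this toy pair) while the exact match is zero, which is why exact match is the more informative metric for refinement.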
As shown in Table 4, we observe that EM scores for the small data are consistently higher than for the medium data, indicating that it is harder to fix bugs in longer code snippets. Our CodeT5-base significantly outperforms all baselines on EM and especially gains over 4.8 points on the more challenging medium task (13.96 vs. GraphCodeBERT's 9.10), reflecting its strong code understanding capability.

Understanding Tasks. We compare results on the two understanding tasks, defect detection and clone detection, in Table 5. Specifically, we generate the binary labels as a unigram sequence from the decoder for the defect detection task, while for the clone detection task, we first obtain the sequence embedding of each code snippet using the last decoder state, following Lewis et al. (2020), and then predict the labels by measuring their similarity. Both CodeT5-small and CodeT5-base outperform all baselines on the defect detection task, with CodeT5-base yielding a 2.6-point accuracy improvement over PLBART. For the clone detection task, our CodeT5 models achieve results comparable to the SOTA GraphCodeBERT and PLBART models. These results demonstrate that, with an encoder-decoder framework, CodeT5 can still be adapted well for understanding tasks.

5.2 Effects of Bimodal Dual Generation and Multi-task Learning

We examine the effects of bimodal dual generation at pre-training and multi-task learning at fine-tuning. The bimodal pre-training brings consistent improvements for code summarization and generation tasks on both CodeT5-small and CodeT5-base. However, this pre-training task does not help, and sometimes even slightly hurts, the performance on PL-PL generation and understanding tasks. We anticipate this is because bimodal dual generation learns a better alignment between PL and NL, which naturally benefits the former tasks that involve both PL and NL. As a side effect, this objective could bias the model towards PL-NL tasks and affect its performance on PL-PL tasks.
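For reference, the multi-task fine-tuning examined in this section relies on two mechanisms from Section 3.3: a task control code prepended to the source input, and balanced sampling over tasks with α = 0.7. The sketch below is a minimal illustration under assumptions: it takes the exponentiated form of Conneau and Lample (2019), q_i ∝ r_i^α with r_i = n_i / Σ_k n_k; the dataset sizes are made up; the function names are hypothetical; and joining the prompt and source with a single space is an assumption about formatting.

```python
import random


def balanced_sampling_probs(sizes, alpha=0.7):
    """Map task name -> sampling probability q_i.

    r_i = n_i / sum_k n_k is the proportional share of task i, and
    q_i = r_i**alpha / sum_j r_j**alpha flattens the distribution
    when alpha < 1, which up-weights low-resource tasks.
    """
    total = sum(sizes.values())
    weights = {task: (n / total) ** alpha for task, n in sizes.items()}
    norm = sum(weights.values())
    return {task: w / norm for task, w in weights.items()}


def sample_task(probs, rng):
    """Draw one task name according to the multinomial distribution."""
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]


def add_control_code(task_prompt, source):
    """Prepend the task control code to the source input (space-joined by assumption)."""
    return f"{task_prompt} {source}"


if __name__ == "__main__":
    # Hypothetical dataset sizes, for illustration only.
    sizes = {"summarize": 100_000, "translate": 10_000, "defect": 1_000}
    probs = balanced_sampling_probs(sizes)
    for task, q in probs.items():
        print(f"{task}: proportional={sizes[task] / sum(sizes.values()):.4f} balanced={q:.4f}")
    print(sample_task(probs, random.Random(0)))
    print(add_control_code("Translate Java to CSharp:", "int x = 0;"))
```

With α = 1 the distribution reduces to proportional sampling; lowering α shifts probability mass from the largest task toward the smaller ones.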
Multi-task learning generally improves most downstream tasks, except code translation and defect detection. In particular, it largely boosts the performance on code summarization, which is not surprising, as code summarization accounts for the largest portion of the sub-tasks (six out of thirteen) and thereby benefits the most from multi-task learning. Multi-task learning also consistently improves code refinement, which might benefit from joint training on both the small and medium refinement data. Another possible reason is that multi-task training with defect detection enables the model to better comprehend the code semantics for bug detection, which is also a necessary intermediate step for code refinement.

5.3 Analyzing Identifier-aware Pre-training

We provide an ablation study to examine the contribution of each component in our identifier-aware objective. Specifically, we compare the performance of CodeT5-small on four selected tasks by ablating each of the three objectives: masked span prediction (MSP), identifier tagging (IT), and masked identifier prediction (MIP). As shown in Table 6, removing one of the objectives generally reduces the performance on all tasks, indicating that all objectives contribute to the better code understanding of CodeT5. However, the effect of each objective differs across tasks. Specifically, removing MSP largely reduces the performance on all generation tasks but instead increases the defect detection performance. This shows that masked span prediction is more crucial for capturing syntactic information for generation tasks. By contrast, removing MIP hurts the defect detection task the most, indicating that it might focus more on code semantic understanding. By combining these objectives, CodeT5 can better capture both syntactic and semantic information from code.

We further provide outputs from CodeT5 and its variant without MIP and IT on code generation in Figure 4. We observe that CodeT5 can correctly generate the exact function, while the model without MIP and IT fails to recover the identifiers "s2" and "hasField".
This shows that our identifier-aware denoising pre-training can better distinguish and leverage the identifier information.

We also investigate the identifier tagging performance and find that it achieves over 99% F1 for all PLs, showing that CodeT5 can confidently distinguish identifiers in code. We then check whether the MSP and MIP tasks conflict, as they employ the same sentinel tokens for masking. In identifier masking, all occurrences of one unique identifier are replaced with the same sentinel token, resulting in a many-to-one mapping, compared to the one-to-one mapping in span prediction. We compare models pre-trained with only MSP, only MIP, and both, on these two tasks in Table 7. We report the prediction accuracy and the ratio of how often they generate the same number of predictions as there are sentinel tokens. We observe that pre-training with only MIP or only MSP biases the model towards that task, leading to poor accuracy and a higher mismatch in the number of predictions when applied to the other task. Interestingly, we find that the MIP-only objective recovers the correct number of predictions in the MSP task better than the MSP-only objective does in the MIP task, meaning that it is easier to adapt from a many-to-one mapping to a one-to-one mapping than the other way around. Finally, combining them helps our model make a good trade-off on both tasks.

6 Conclusion

We have presented CodeT5, a pre-trained encoder-decoder model that incorporates the token type information from code. We propose a novel identifier-aware pre-training objective to better leverage the identifiers, and a bimodal dual generation task to learn a better NL-PL alignment using code and its comments. Our unified model supports both code understanding and generation tasks and allows for multi-task learning. Experiments show that CodeT5 significantly outperforms all prior work on most CodeXGLUE tasks.
Further analysis also reveals its better code comprehension capability across various programming languages.

Broader Impact and Ethical Consideration

Our work generally belongs to NLP applications for software intelligence. With the goal of improving software development productivity with machine learning methods, software intelligence research has attracted increasing attention in both academia and industry over the last decade. Software code intelligence techniques can help developers reduce tedious repetitive workloads, enhance programming quality, and improve overall software development productivity. This would considerably decrease their working time and could also reduce computation and operational costs, as a bug might degrade system performance or even crash the entire system. Our work addresses the fundamental challenge of software code pre-training.

We further discuss the ethical considerations of training CodeT5 and the potential risks when applying it to real-world downstream applications:

Dataset bias. The training datasets in our study are source code, including user-written comments, from publicly available open-source Github repositories, and are not tied to any specific application. However, it is possible that these datasets encode stereotypes, such as those related to race and gender, in the text comments or even in the source code, for example in variable, function, and class names. As such, social biases would be intrinsically embedded in the models trained on them. As suggested by Chen et al. (2021), interventions such as filtration or modulation of generated outputs may help to mitigate these biases in the code corpus.

Computational cost. Our model pre-training requires non-trivial computational resources, though we have tried our best to design our experiments carefully and avoid unnecessary computation costs. In fact, compared to the recent large-scale language model Codex (Chen et al., 2021), our CodeT5-base has a much smaller model size of 220M than their 12B (∼ 55×).
In addition, we experiment on Google Cloud Plat-form which purchases carbon credits to reduce its carbon footprint, training CodeT5-base produced around 49.25 kg CO2 which was totally off-set by the provider. Furthermore, we release our pre-trained models publicly to avoid repeated training for the code intelligence research community. Computational cost. تشان et al. 2021 أ.ج ، As CodeT5 can be deployed to provide coding assistance such as code generation for aiding developers, automation bias of machine learning systems should be carefully considered, especially for developers who tend to over-rely on the model-generated outputs. Sometimes these systems might produce functions that superficially appear correct but do not actually align with the developer’s intents. If developers unintentionally adopt these incorrect code suggestions, it might cause them much longer time on debugging and even lead to some significant safety issues. We suggest practitioners using CodeT5 should always bear in mind that its generation outputs should be only taken as references which require domain experts for further correctness and security checking. Automation bias. تداول الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيارات الثنائية الخيار ( , ) and a small fraction of Google BigQuery, both of which are originally collected from public Github repositories. Pre-trained mod-els might encode some sensitive information ( على الرغم من أننا أجريت عدداً كبيراً من إزالة البيانات من بيانات التدريب لتقليل هذا قبل تدريب نموذجنا، إلا أنه من الممكن أيضاً أن لا يمكن إزالة بعض المعلومات ذات الصلة بالكامل. بالإضافة إلى ذلك، بسبب طبيعة غير تقليدية من نموذج التوليد مثل CodeT5، قد تنتج بعض الكود الذي يؤثر سلباً على البرمجيات وحتى تكون قادرة على الاستفادة من تطوير البرمجيات الخبيثة أكثر تطوراً عندما يتم استخدامه بشكل خاطئ. 
Security implications. إسماعيل و آل 2019 e.g., Acknowledgements نحن ننسى آكليش ديباك غوتماير، أرميتا ساوا، جونان لي، و تشان جينغ لالقاءات القيمة. نحن ننسى كاتي باكسير للقياس الأخلاقي. نحن ننسى أيضًا المراجعين غير المسموح لهم لعودتهم المفاجئة على رسالتنا. التقارير Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. . In , pages 2655–2668. Association for Computational Linguistics. Unified pre-training for program understanding and generation ورشات المؤتمر 2021 من الفصل الشمالي الأمريكي للاتحاد لللغات الحاسوبية: تقنيات اللغة البشرية، NAACL-HLT 2021, على الانترنت، 6-11 يونيو، 2021 مارك تشين، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جيري توريك، جي . . , abs/2107.03374. Evaluating large language models trained on code كوريا كيفين كلارك، مونغ تانغ لونغ، كوك V. ل، وكريستوفر دي مينيغ. في - OpenReview.net ELECTRA: pre-training text encoders as discriminators rather than generators 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. في , pages 9052–9065. Association for Computational Linguistics. Pymt5: ترجمة متعددة الأدوات للغة الطبيعية كود Python مع التحويلات Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020 Alexis Conneau وGuillaume Lample 2019. . 
In صفحة 7057 - 7067 Cross-lingual language model pretraining Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia Marçal de Oliveira. 2005. في , pages A study of the documentation essential to software maintenance Proceedings of the 23rd Annual International Conference on Design of Communication: documenting & Designing for Pervasive Information, SIGDOC 2005, Coventry, UK, September 21-23, 2005 68 - 75 م . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. في , pages 4171–4186. BERT: pre-training of deep bidirectional transformers for language understanding Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xi-aodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. . In , pages 13042–13054. Unified language نموذج التدريب المبكر لتفهم اللغة الطبيعية والإنتاج التقدم في أنظمة معالجة المعلومات العصبية 32: المؤتمر السنوي حول أنظمة معالجة المعلومات العصبية 2019، NeurIPS 2019، 8-14 ديسمبر، 2019، فانكوفر، BC، كندا Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. . , abs/2104.02443. كود-ترانس: نحو إلغاء لغة السيليكون الكمبيوتر من خلال التعلم العميق والقدرة الذاتية performance computing كوريا Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xi-aocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. . In , pages 1536–1547. Association for Computational Linguistics. 
Code-bert: A pre-trained model for programming and natural languages Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020 دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي، دبي . In - OpenReview.net Graphcodebert: Pre-training code representations with data flow المؤتمر الدولي التاسع حول ممثلات التعلم، ICLR 2021, الحدث الافتراضي، أستراليا، 3-7 مايو 2021 Hamel Husain ، Ho-Hsiang Wu ، Tiferet Gazit ، Miltiadis Allamanis ، و Marc Brockschmidt. . , abs/1909.09436. Code-searchnet challenge: Evaluating the state of semantic code search CoRR Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. . In , pages 1643–1652. Association for Computational Linguistics. كتابة اللغة إلى الكود in programmatic context ورشات المؤتمر 2018 حول الطرق الإمبراطورية في معالجة اللغة الطبيعية، بروكسل، بلجيكا، 31 أكتوبر - 4 نوفمبر 2018 Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. . In رقم 119 من , pages 5110–5121. PMLR. التعلم والتقييم contextual embedding of source code Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event Proceedings of Machine Learning Research Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. في , pages 7871–7880. Association for Computational Linguistics. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020 Chin-Yew Lin و Franz Josef Och. . In . 
ORANGE: a method for evaluating automatic evaluation metrics for machine translation COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. . In صفحة 473 - 485. نموذج اللغة المدرسية المتخصصة بالمهارات المتعددة كود التكامل 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, September 21-25, 2020 Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-feng Gao. 2019a. . In , pages 4487–4496. Association for Computational Linguistics. Multi-task deep neural networks معرفة اللغة الطبيعية ورشات المؤتمر 57 للاتحاد لللغات الحاسوبية، ACL 2019، فلوريدا، إيطاليا، 28 يوليو إلى 2 أغسطس، 2019، رقم 1: رسائل طويلة يانغ لوي، مايكل أوت، نانغ غالي، جينغفيا دو، ماني دارد جوشي، دانكي تشين، أوير ليفي، مايك لويس، لوك زيتلماير، وسميرين ستيوانوف. . . , abs/1907.11692. Roberta: A robustly optimized BERT pretraining approach CoRR Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Li-dong Zhou, Linjun Shou, Long Zhou, Michele Tu-fano, Ming Gong, Ming Zhou, Nan Duan, Neel Sun-daresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. . , abs/2102.04664. Codexglue: A machine learning benchmark dataset for code understanding and generation كوريا Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. . In , pages 336–347. IEEE. Studying the usage of text-to-text transfer transformer to support code-related tasks 43th IEEE/ACM International Conference on Software Engineering, ICSE 2021, مدريد, إسبانيا, 22-30 مايو 2021 Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. . 
1 ( 8 ) : 9 Language الطلاب غير المختصين بالمهارات المتعددة OpenAI blog كولين رافيل، نوام شزير، آدم روبرتس، كاترين لي، شارتن ناريغ، مايكل ماتيانا، يانكي زو، ويي لي، و بيتر جون لوي. . , 21:140:1–140:67. Exploring the limits تكنولوجيا التعلم عن طريق تحويل النص إلى النص J. Mach. Learn. Res. Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Am-brosio Blanco, and Shuai Ma. 2020. . أوبتيون 2009/10297 Codebleu: a method for automatic evaluation of code synthesis كوريا Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. . In ترجمة غير مراقبة لغات البرمجة Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December . 6-12, 2020, virtual Baptiste Rozière, Marie-Anne Lachaux, Marc Szafraniec, and Guillaume Lample. 2021. . . , abs/2102.07492. DOBF: أهداف التدريب المبكر لتحديث لغات البرمجة CoRR Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. في . The Association for Computer Linguistics. ترجمة آلة عصبية لغات نادرة الوحدات Subword Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers كايتا سوغ، سو تان، تاو تشين، جيانفينغ لو، وتيان جاين ليو. في رقم 97 من , صفحة 5926-5936. MASS: الترتيب المحموم إلى الترتيب قبل التدريب لإنتاج لغات Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA دراسة دراسة التعلم الآلي Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. . , abs/1904.09223. ERNIE: enhanced representation through knowledge integration CoRR Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. في , صفحة 1433-1443 . 
Intellicode يتكون: code generation using transformer ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, 8-13 نوفمبر 2020 مايكل توفانو، كودي واتسون، جيريلي بيبوتا، ماسيميليانو دي بونا، مارتن ويتي، ودييس بوش-واينيك. . , 28(4):19:1–19:29. An empirical study on learning bug-fixing patches in the wild via neural machine translation أ.م.م.م.م.م.م.م.م Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. . In صفحة 5998 - 6008 Attention is all تحتاج إلى Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. . In صفحة 261 - 271. Detecting code clones with graph neural network and flow-augmented abstract syntax شجرة مؤتمر IEEE الدولي الثاني عشر حول تحليل البرمجيات والتطور والتصنيع الجديد، SANER 2020، لندن، ON، كندا، 18-21 فبراير، 2020 Yaqin Zhou، Shangqing Liu، Jing Kai Siow، Xiaon-ing Du، و Yang Liu. في صفحات 10197-10207 Devign: Effective تحديد الضعف عن طريق التعلم عن بعد عن برنامج واسع عبر الشبكات العصبية الخلفية Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada Daniel Zügner ، Tobias Kirschstein ، Michele Catasta ، Jure Leskovec ، و Stephan Günnemann. في - OpenReview.net التعرف على مصدر التمثيل الإنجليزي الكود من الهيكل والموضوع المؤتمر الدولي التاسع حول ممثلات التعلم، ICLR 2021, الحدث الافتراضي، أستراليا، 3-7 مايو 2021 هذه المقالة متوفرة في archiv تحت سياسة CC by 4.0 Deed (Attribution 4.0 International). هذه المقالة متوفرة في archiv تحت سياسة CC by 4.0 Deed (Attribution 4.0 International).