
Textbooks are All You Need: N-gram Overlap and Embedding and Syntax-based Similarity Analysis

by Knapsack Technology, September 12th, 2024

Too Long; Didn't Read

In this study, researchers from Microsoft introduce phi-1, a new large language model for code that is significantly smaller than competing models.

Authors:

(1) Suriya Gunasekar, Microsoft Research;

(2) Yi Zhang, Microsoft Research;

(3) Jyoti Aneja, Microsoft Research;

(4) Caio César Teodoro Mendes, Microsoft Research;

(5) Allie Del Giorno, Microsoft Research;

(6) Sivakanth Gopi, Microsoft Research;

(7) Mojan Javaheripi, Microsoft Research;

(8) Piero Kauffmann, Microsoft Research;

(9) Gustavo de Rosa, Microsoft Research;

(10) Olli Saarikivi, Microsoft Research;

(11) Adil Salim, Microsoft Research;

(12) Shital Shah, Microsoft Research;

(13) Harkirat Singh Behl, Microsoft Research;

(14) Xin Wang, Microsoft Research;

(15) Sébastien Bubeck, Microsoft Research;

(16) Ronen Eldan, Microsoft Research;

(17) Adam Tauman Kalai, Microsoft Research;

(18) Yin Tat Lee, Microsoft Research;

(19) Yuanzhi Li, Microsoft Research.

5.1 N-gram overlap

N-gram overlap measures the similarity of text segments based on shared n-word sequences. We compute the n-gram overlap between the docstring of each HumanEval question and the docstring of each exercise in the generated CodeExercises dataset. We found four HumanEval questions with a 13-gram overlap with at least one entry in our dataset. Upon further investigation, all four 13-gram overlap cases turned out to be false positives, such as the example below. Our n-gram overlap analysis thus shows that our dataset has minimal letter-by-letter overlap with HumanEval.
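
As a rough illustration, such a check can be implemented as a word-level n-gram set intersection between docstrings. The sketch below is not the authors' implementation: the whitespace tokenization, lowercasing, and the helper names (ngrams, has_ngram_overlap) are simplifying assumptions for illustration.

# Minimal sketch of a word-level n-gram overlap check between two docstrings.
# Tokenization and matching rules here are assumptions, not the paper's exact setup.

def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def has_ngram_overlap(docstring_a: str, docstring_b: str, n: int = 13) -> bool:
    """True if the two docstrings share at least one n-gram."""
    return bool(ngrams(docstring_a, n) & ngrams(docstring_b, n))

# Hypothetical usage: flag HumanEval questions whose docstring shares a
# 13-gram with any generated exercise.
# flagged = [q for q in humaneval_docstrings
#            if any(has_ngram_overlap(q, ex) for ex in exercise_docstrings)]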


5.2 Embedding and syntax-based similarity analysis

As we just saw, the n-gram analysis is not refined enough to find similar code snippets between HumanEval and CodeExercises. Instead, we use a combination of embedding-based and syntax-based distances. For the embedding distance, we compute the L2 distance between the embeddings of the code snippets, where the embedding is derived from a pre-trained CodeGen-Mono 350M model [NPH+23]. We observe that the embedding distance successfully captures code pairs whose overall semantics are similar, as inferred from the Python docstrings, function/class names, and code structure. For the syntax-based distance, we compute the (string) edit distance between the abstract syntax trees (ASTs) of two given code snippets. The AST distance successfully identifies overlapping sections between code pairs while remaining agnostic to non-syntactic text such as variable/function names, comments, and Python docstrings. For our pruning of CodeExercises, we fix a threshold on the embedding distance and test several match rates τ for the AST distance. See Appendix C for examples of code pairs captured by the embedding distance and various AST match rates τ. We vary τ between 0.95 and 0.8, which corresponds to removing between 42.5K and 354K of the 879.5K total problems in CodeExercises.
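
The two distances can be sketched as follows. The paper does not specify how embeddings are pooled from CodeGen-Mono 350M or exactly how the AST edit distance is normalized; the Hugging Face checkpoint name "Salesforce/codegen-350M-mono", the mean pooling over last hidden states, and the use of SequenceMatcher as a stand-in for a normalized string edit distance over serialized ASTs are all assumptions made for illustration.

import ast
from difflib import SequenceMatcher

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the paper only says "pre-trained CodeGen-Mono 350M".
_CHECKPOINT = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(_CHECKPOINT)
model = AutoModel.from_pretrained(_CHECKPOINT)

def embedding_l2_distance(code_a: str, code_b: str) -> float:
    """L2 distance between embeddings of two code snippets.
    Mean pooling over the last hidden states is an assumed choice."""
    def embed(code: str) -> torch.Tensor:
        inputs = tokenizer(code, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
        return hidden.mean(dim=1).squeeze(0)
    return torch.dist(embed(code_a), embed(code_b)).item()

def ast_match_rate(code_a: str, code_b: str) -> float:
    """Similarity in [0, 1] between serialized ASTs (1.0 = identical trees).
    SequenceMatcher.ratio() stands in for a normalized string edit distance;
    unlike the paper's metric, this simplified version does not strip
    identifiers or docstring constants from the trees."""
    dump_a = ast.dump(ast.parse(code_a))
    dump_b = ast.dump(ast.parse(code_b))
    return SequenceMatcher(None, dump_a, dump_b).ratio()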


Table 3: Percentage of similar versus non-similar HumanEval problems correctly solved by different models. Similarity is determined by whether or not the corresponding HumanEval problem has any close matches inside the CodeExercises dataset (for a given τ). The problem count denotes the number of HumanEval problems within each subset. Here, τ is the threshold on the AST-based match rate between code pairs used for the similarity check.


Table 3 summarizes the performance of our retrained phi-1 on pruned datasets (with τ = 0.95, 0.9, 0.85, and 0.8) versus the original phi-1 trained on the full CodeExercises and the 15.5B-parameter StarCoder-Prompted. We divide the HumanEval problems into two subsets ("similar" and "non-similar") based on whether or not they have at least one close match (for the given τ) inside the original CodeExercises dataset. We then report the accuracy of the models on each subset of HumanEval separately. As one can see, even after heavily pruning our dataset, phi-1 still outperforms StarCoder-Prompted by a large margin, which validates that our performance boost is not due to dataset "contamination", even when the latter term is understood loosely. Note also that the accuracy of all models is lower on the non-similar subset of HumanEval than on the similar one.
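
A minimal sketch of this subset split and per-subset accuracy computation is given below, assuming the ast_match_rate helper from the previous sketch and treating each HumanEval problem's code as the text compared against CodeExercises (the paper does not state exactly which text is compared); all other names are hypothetical placeholders.

# Sketch: split HumanEval into "similar" / "non-similar" subsets for a given
# tau, then report a model's accuracy on each subset separately.

def split_humaneval(humaneval_code: dict, code_exercises: list, tau: float):
    """Partition HumanEval problem ids by whether they have a close match
    (AST match rate >= tau) in CodeExercises."""
    similar, non_similar = [], []
    for pid, code in humaneval_code.items():
        if any(ast_match_rate(code, ex) >= tau for ex in code_exercises):
            similar.append(pid)
        else:
            non_similar.append(pid)
    return similar, non_similar

def subset_accuracy(solved: dict, subset: list) -> float:
    """Fraction of problems in `subset` that the model solved (True/False per id)."""
    return sum(solved[pid] for pid in subset) / len(subset)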


This paper is available on arxiv under CC BY 4.0 DEED license.