Authors:
(1) Suriya Gunasekar, Microsoft Research;
(2) Yi Zhang, Microsoft Research;
(3) Jyoti Aneja, Microsoft Research;
(4) Caio C´esar Teodoro Mendes, Microsoft Research;
(5) Allie Del Giorno, Microsoft Research;
(6) Sivakanth Gopi, Microsoft Research;
(7) Mojan Javaheripi, Microsoft Research;
(8) Piero Kauffmann, Microsoft Research;
(9) Gustavo de Rosa, Microsoft Research;
(10) Olli Saarikivi, Microsoft Research;
(11) Adil Salim, Microsoft Research;
(12) Shital Shah, Microsoft Research;
(13) Harkirat Singh Behl, Microsoft Research;
(14) Xin Wang, Microsoft Research;
(15) S´ebastien Bubeck, Microsoft Research;
(16) Ronen Eldan, Microsoft Research;
(17) Adam Tauman Kalai, Microsoft Research;
(18) Yin Tat Lee, Microsoft Research;
(19) Yuanzhi Li, Microsoft Research.
We begin with publicly available Python code datasets: we use the Python subset of the deduplicated version of The Stack and the StackOverflow, which together contain over 35 million files/samples, totalling over 35B tokens. We annotate the quality of a small subset of these files (about 100k samples) using GPT-4: given a code snippet, the model is prompted to “determine its educational value for a student whose goal is to learn basic coding concepts”.
We then use this annotated dataset to train a random forest classifier that predicts the quality of a file/sample using its output embedding from a pretrained codegen model as features. We note that unlike GPT-3.5, which we use extensively to generate synthetic content (discussed below), we use GPT-4 minimally only for annotations on the quality of a small subset of The Stack and StackOverflow samples. We thus view our usage of GPT-4 as merely a way to avoid tedious human-annotation efforts [DLT+ 23].
Our filtering methodology boosts our model performance significantly even without the synthetic datasets discussed below: for 350M parameter models trained on unfiltered Stack (deduplicated python) and StackOverflow, the HumanEval performance saturates at 12.19% even after training for 96k steps (∼ 200B tokens), while training on the filtered subset achieves 17.68% on HumanEval after 36k steps. We further improve this to 20.12% (reported in Figure 2.1) by training on a combination of the filtered dataset and the synthetic textbooks dataset discussed below.
This paper is available on arxiv under CC BY 4.0 DEED license.