OpenAI's New Code Generator: GitHub Copilot (and Codex)

Written by whatsai | Published 2021/07/25
Tech Story Tags: artificial-intelligence | gpt-3 | github | copilot | machine-learning | technology | hackernoon-top-story | youtubers | web-monetization

TLDR GitHub Copilot is a tool by GitHub and OpenAI that generates code for you. You give it the name of a function along with some additional info, and it generates the code quite accurately! This is because it uses a model similar to GPT-3, the extremely powerful natural language model that you most certainly know. In fact, in the new paper released for GitHub Copilot, OpenAI tested GPT-3 without any further training on code, and it solved exactly 0 Python code-writing problems.

You’ve probably heard of the recent Copilot tool by GitHub, which generates code for you. You can see this tool as an auto-complete++ for code. You give it the name of a function along with some additional info, and it generates the code for you quite accurately! But it won’t just autocomplete your function.
Rather, it will try to understand what you are trying to do in order to generate it. It is also able to generate much bigger and more complete functions than classical autocomplete tools. This is because it uses a model similar to GPT-3, an extremely powerful natural language model that you most certainly know.
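For instance, a prompt can be as small as a function signature plus a docstring. The example below is a hand-written illustration of that workflow — both the prompt and the body shown are mine, not actual Copilot output:

```python
# Prompt: the signature and docstring are what a user would type.
# Body: the kind of completion a code-generation tool aims to produce
# (hand-written here for illustration, not generated by Copilot).

def days_between(date_a: str, date_b: str) -> int:
    """Return the absolute number of days between two ISO dates ('YYYY-MM-DD')."""
    from datetime import date
    a = date.fromisoformat(date_a)
    b = date.fromisoformat(date_b)
    return abs((b - a).days)
```

The point is that the docstring alone carries enough intent (date format, absolute difference) for a model that "understands what you are trying to do" to fill in the body.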

But if you try to generate code with the primary GPT-3 model from the OpenAI API, it won’t work. In fact, in the new paper released for GitHub Copilot, OpenAI tested GPT-3 without any further training on code, and it solved exactly 0 Python code-writing problems. So how did they take such a powerful language generation model that is completely useless for code generation and transform it to fit this new task of generating code? Watch the video to learn more!


References:
►Read the full article: https://www.louisbouchard.ai/github-copilot/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
►GitHub Copilot: https://copilot.github.com/
►Codex/copilot paper: https://arxiv.org/pdf/2107.03374.pdf
►Yannic’s video about GitHub Copilot: https://youtu.be/TrLrBL1U8z0

Video Transcript

00:00
You've probably heard of the recent Copilot tool by GitHub, which generates code for you. You can see this tool as an auto-complete++ for code. You give it the name of a function along with some additional info, and it generates the code for you quite accurately! But it won't just autocomplete your function. Rather, it will try to understand what you are trying to do in order to generate it. It is also able to generate much bigger and more complete functions than classical autocomplete tools. This is because it uses a model similar to GPT-3, an extremely powerful natural language model that you most certainly know.
00:36
If you're not sure or do not remember how it works, you should watch the video I made a year ago when GPT-3 came out.
00:42
Okay, so as you know, GPT-3 is a language model: it was trained not on code but on natural human language. If you try to generate code with the primary GPT-3 model from the OpenAI API, it won't work. In fact, in the new paper released for GitHub Copilot, OpenAI tested GPT-3 without any further training on code, and it solved exactly 0 Python code-writing problems. So how did they take such a powerful language generation model that is completely useless for code generation and transform it to fit this new task of generating code?
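"Solved" here means passing unit tests: the Codex paper scores models with a metric called pass@k — the probability that at least one of k generated samples for a problem passes its tests. A minimal sketch of the unbiased estimator the paper describes (the function name is my own), computed in a numerically stable form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper: given n generated
    samples for a problem, of which c pass the unit tests, estimate the
    probability that at least one of k randomly drawn samples passes.
    Equals 1 - C(n-c, k) / C(n, k), evaluated as a stable product."""
    if n - c < k:
        # Fewer failing samples than draws: at least one pass is guaranteed.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

With no further training on code, GPT-3's samples never pass, so c = 0 for every problem and pass@k is 0 — the "exactly 0 problems solved" result above.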
01:16
The first part is easy: the model has to understand what the user wants, which GPT-3 is already pretty good at. The second part is hard to achieve, since GPT-3 had barely seen any code before. As you know, to become such a powerful language model, GPT-3 was trained on text from pretty much the whole internet.
01:34
And now, OpenAI and GitHub are trying to build a similar model, but for code generation. Without entering into all the privacy dilemmas spawned by the copyright issues of the code used for training on GitHub, you clearly could not be in a better place to do that. I will come back to these privacy issues at the end!
01:52
Since GPT-3 is the most powerful language model that currently exists, they started from there. Using a very similar model, they attacked the second part of the problem, generating code, by training this GPT model on billions of lines of publicly available GitHub code instead of random text from the internet. The power of GPT-3 comes largely from the amount of information it can learn from, so doing the same thing but specialized on code would certainly yield some amazing results. More precisely, they trained this adapted GPT model on 54 million public software repositories hosted on GitHub!
02:28
Now, we have a huge model trained on a lot of code examples. The problem is, as you know, a model can only be as good as the data it was trained on. So if the data is randomly sampled from GitHub, how can you be sure it works and is well written? You can't really know for sure, and it may cause a lot of issues, but a great way they found to improve the coding skills of the model further was to fine-tune it on code from competitive programming websites and from repositories with continuous integration. This means that the code is most likely good and well written, but available in smaller quantity.
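The intuition behind the continuous-integration signal can be sketched as a simple filter. This is a hypothetical heuristic of my own, not a method from the paper: keep only repositories whose file listing contains a known CI configuration path, on the assumption that CI-backed code is more likely to be tested and working.

```python
# Hypothetical heuristic (not from the Codex paper): treat a repository
# as CI-backed if its file listing contains a known CI configuration path.
CI_MARKERS = (
    ".travis.yml",          # Travis CI
    ".github/workflows",    # GitHub Actions (a directory of workflow files)
    ".gitlab-ci.yml",       # GitLab CI
    "Jenkinsfile",          # Jenkins
    "azure-pipelines.yml",  # Azure Pipelines
)

def has_ci(file_paths):
    """Return True if any path equals, or sits under, a CI marker."""
    return any(
        path == marker or path.startswith(marker + "/")
        for path in file_paths
        for marker in CI_MARKERS
    )
```

A crude proxy like this trades quantity for quality, which is exactly the trade-off the fine-tuning dataset makes.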
03:01
They fine-tuned the model on this new training dataset in a supervised way. This means that they trained the same model a second time on a smaller and more specific dataset of curated examples. Fine-tuning is a powerful technique often used to adapt an existing model to our specific needs instead of starting from nothing. A model is often much more powerful when first trained on lots of data, even data that is not directly useful for our task, and then further adapted to our task, than when a new model is trained from nothing with little curated data. When it comes to data and deep learning, it's most often the more, the better.
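The two-stage recipe described above — broad pretraining, then a second training pass on a small curated set — can be sketched as a toy in PyTorch. Everything here is a stand-in of my own (a tiny embedding-plus-linear "language model", random tokens as "internet text", a deterministic pattern as "curated code"), chosen only to keep the example self-contained; it is not the real Codex setup.

```python
# Toy sketch of the two-stage recipe: pretrain a causal next-token model
# on broad data, then fine-tune the *same* weights on a smaller curated set.
import torch
from torch import nn

torch.manual_seed(0)

class TinyLM(nn.Module):
    """Stand-in for a GPT-style model: maps each token id to next-token logits."""
    def __init__(self, vocab=64, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):              # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.embed(tokens))

def train(model, x, y, steps, lr=0.05):
    """One training phase; fine-tuning is simply calling this a second time."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss = None
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x).flatten(0, 1), y.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

model = TinyLM()
# "Internet text": random token pairs; "curated code": a deterministic pattern.
broad_x = torch.randint(0, 64, (32, 8))
broad_y = torch.randint(0, 64, (32, 8))
code_x = torch.arange(8).repeat(4, 1)
code_y = (code_x + 1) % 64

train(model, broad_x, broad_y, steps=50)              # pretraining phase
final_loss = train(model, code_x, code_y, steps=50)   # supervised fine-tuning
```

The key point the toy mirrors is that fine-tuning reuses the pretrained weights rather than reinitializing them, so the curated data only has to steer an already-capable model.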
03:36
The descendants of this model are what's used in GitHub Copilot and the Codex models in the OpenAI API. Of course, Copilot is not perfect yet and has many limitations. It won't replace programmers anytime soon, but it has shown amazing results and can speed up the work of many programmers for coding simple but tedious functions and classes.
03:57
As I mentioned, they trained Copilot's model on billions of lines of public code, but from repositories under all kinds of licenses, and since it was made in collaboration with OpenAI, they will, of course, sell this product. It's perfectly cool that they want to make money out of a powerful tool they built, but it may have some complications when it was made using your code with restrictive licenses. If you would like to hear more about this issue in relation to copyright law, the GPL license, and terms of service, I'd strongly recommend you watch the great video Yannic Kilcher made a few days ago. It is linked in the description. Thank you for watching!

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/07/25