In this blog post, I will explain the tech behind GPT Pilot – a dev tool that uses GPT-4 to code an entire, production-ready app.
The primary premise here is that AI has reached a point where it can autonomously generate a substantial portion of an application’s code, potentially up to 95%.
That sounds great, right?
Well, an app won’t work unless all the code works completely. How can this be achieved? This post is a component of my research project, aimed at assessing whether AI can effectively handle 95% of developers' coding tasks. My approach involves utilizing GPT-4 as a tool to assist in crafting scalable applications under the guidance of developers.
To continue, I'll share the core idea behind GPT Pilot, the concepts underpinning it, and the workflow leading up to the coding phase.
Currently, the project is still in its early stages and can create only simple web apps. Still, I will cover how it works, how much real coding work it can do, and how much oversight it requires from the developer.
Here are some apps that I created with it:
Ok, let’s dive in.
First, you enter a description of an app you want to build. Then, GPT Pilot works with an LLM (currently GPT-4) to clarify the app requirements, and finally, it writes the code. It uses many AI Agents that mimic the workflow of a development agency.
After you describe the app, the Product Owner Agent breaks down the business specifications and asks you questions to clear up any unclear areas.
Then, the Software Architect Agent breaks down the technical requirements and lists the technologies that will be used to build the app.
Then, the DevOps Agent sets up the environment on the machine based on the architecture.
Then, the Tech Team Lead Agent breaks down the app development process into development tasks where each task needs to have:
Description of the task (this is the main description upon which the Developer agent will later create code)
Description of automated tests that will need to be written so that GPT Pilot can follow TDD principles
Description for human verification, which is basically how you, the human developer, can check if the task was successfully implemented
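To make the three parts of a task concrete, here is a minimal sketch of how one task could be represented. The class name, field names, and the example route are my own illustrative assumptions, not GPT Pilot's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DevelopmentTask:
    """Hypothetical container for one task from the Tech Team Lead Agent."""

    description: str         # what the Developer agent will turn into code
    automated_tests: str     # description of the tests to write (TDD)
    human_verification: str  # how a human can check that the task works


def is_complete(task: DevelopmentTask) -> bool:
    """A task is usable only if all three descriptions are non-empty."""
    parts = (task.description, task.automated_tests, task.human_verification)
    return all(part.strip() for part in parts)


# Made-up example task for a simple web app.
example_task = DevelopmentTask(
    description="Add a GET /health route that returns HTTP 200.",
    automated_tests="Request /health and assert that the status code is 200.",
    human_verification="Open http://localhost:3000/health in a browser.",
)
```

Keeping the three descriptions together means that the agent writing the code, the automated tests, and the human reviewer all work from the same definition of "done".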
Finally, the Developer and the Code Monkey Agents take tasks one by one and start coding the app. The Developer breaks down each task into smaller steps, which are lower-level technical requirements that might not need to be reviewed by a human or tested with an automated test (e.g., installing a package).
In the next blog post, I will write in more detail about how the Developer and Code Monkey work (here’s a sneak peek diagram that shows the coding workflow), but now, let’s see the main pillars upon which GPT Pilot is built.
I call these the pillars because, since this is a research project, I wanted to be reminded of them as I work on GPT Pilot. I want to explore the extent to which AI can boost developers’ productivity. As such, every improvement I make needs to lead to that goal.
As I mentioned above, I think that we are still far away from an LLM that can just be hooked up to a CLI and create any app entirely by itself. Nevertheless, GPT-4 works really well when writing code. I use ChatGPT all the time to speed up my development process, especially when I need to work with a new technology or API, or when I need to create a standalone script. The first time I realized how powerful it can be was a couple of months ago, when it took me 2 hours with ChatGPT to create a Redis proxy that would usually take 20 hours to develop from scratch. I wrote a whole post about that here.
So, to enable AI to generate a fully working app, we need to allow it to work closely with the developer, who oversees the development process and acts as a tech team lead while AI writes most of the code. The developer needs to be able to change the code at any moment (e.g., to add an API key or fix an issue if the AI model gets stuck), and GPT Pilot needs to continue working with those changes.
Here are the areas in which the developer can intervene in the development process:
After each development task is finished, the developer should review it and make sure it works as expected (this is the point where you would usually commit the latest changes)
After each failed test or command run, when it might be easier for the developer to debug something (e.g., if a port on your machine is reserved but the generated app is trying to use it, you need to hardcode some other port)
When the AI doesn’t have access to an external service (e.g., when you need to get an API key and add it to the environment)
Let’s say you want to create a simple app, and you know everything you need to code and have the entire architecture in your head. Even then, you won’t code it out entirely, then run it for the first time and debug all the issues at once. Instead, you will split the app development into smaller tasks, implement one (like adding routes), run it, debug, and then move on to the next task. This way, you can fix issues as they arise.
The same should be the case when AI codes.
Like a human, it will certainly make mistakes. So that it has an easier time debugging, and so that the developer can understand what is happening in the generated code, the AI shouldn’t just spit out the entire codebase at once. Instead, the app should be generated and debugged step by step, just like a developer would do it: set up routes, add a database connection, and so on.
Other code generators like Smol Developer and GPT Engineer work differently: you write a prompt about the app you want to build, and they try to code the entire app and give you the whole codebase at once. While AI is great, it’s still far from coding a fully working app on the first try. So these tools give you a codebase that is really hard to get into and, more importantly, one that’s much harder to debug.
I think that if GPT Pilot creates the app step by step, both AI and the developer overseeing it will be able to fix issues more easily, and the entire development process will flow much more smoothly.
GPT Pilot must have the capability to develop extensive, production-ready applications, not limited to small apps that can fit within the context of a language model. The challenge lies in the fact that an LLM (Large Language Model) learns exclusively within the context it's provided. While it's conceivable that in the future, LLMs could be fine-tuned for each unique project, currently, this process appears to be time-consuming and repetitive.
The way that GPT Pilot addresses this issue is with context rewinding, recursive conversations, and TDD.
Let’s discuss these:
The idea behind context rewinding is relatively simple: for each development task, the context size of the first message to the LLM has to be roughly the same. For example, the context size of the first LLM message while implementing development task #5 has to be more or less the same as that of the first message while implementing task #50. Because of this, the conversation needs to be rewound to the first message for each task.
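As a rough illustration of rewinding, here is a minimal sketch in which a conversation remembers its base messages and is reset to them before every task. The `Conversation` class and `first_message_for` helper are names I made up for this example; they are not GPT Pilot's real API, and no LLM is called.

```python
from typing import Dict, List

Message = Dict[str, str]


class Conversation:
    """Hypothetical message history that can be rewound to its base state."""

    def __init__(self, base_messages: List[Message]) -> None:
        self._base = list(base_messages)
        self.messages = list(base_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def rewind(self) -> None:
        # Drop everything that was added while working on the previous task.
        self.messages = list(self._base)


def first_message_for(convo: Conversation, task_description: str) -> List[Message]:
    """Rewind, then start the new task from the same base context."""
    convo.rewind()
    convo.add("user", f"Implement this task: {task_description}")
    return list(convo.messages)
```

With this shape, the first request for task #50 contains the same number of messages as the first request for task #5, no matter how much debugging back-and-forth happened in between.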
For GPT Pilot to solve tasks #5 and #50 in the same way, it has to understand what has been coded so far, along with the business context behind all the code that’s currently written. With this, it can create new code only for the task that it’s currently solving and not rewrite the entire app.
I’ll go deeper into this concept in the next blog post, but essentially, when GPT Pilot creates code, it makes the pseudocode for each code block that it writes, as well as descriptions for each file and folder it needs to create. So, when we need to implement each task, in a separate conversation, we show the LLM the current folder/file structure; it selects only the code that is relevant to the current task, and then, we add only that code to the original conversation that will write the actual implementation of the task.
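The two-step flow described above can be sketched roughly like this: first pick the relevant files from their descriptions, then put only their code into the task prompt. Everything here is an assumption for illustration. In particular, `keyword_selector` is a trivial stand-in for the separate LLM conversation that does the selecting, and the file names are invented.

```python
from typing import Callable, Dict, List

Selector = Callable[[str, Dict[str, str]], List[str]]


def keyword_selector(task: str, descriptions: Dict[str, str]) -> List[str]:
    """Stand-in for the LLM: pick files whose description shares a word with the task."""
    task_words = set(task.lower().split())
    return [
        path
        for path, description in descriptions.items()
        if task_words & set(description.lower().split())
    ]


def build_task_context(
    task: str,
    descriptions: Dict[str, str],
    sources: Dict[str, str],
    selector: Selector = keyword_selector,
) -> str:
    """Build the prompt for one task using only the files the selector chose."""
    parts = [f"Task: {task}"]
    for path in selector(task, descriptions):
        parts.append(f"--- {path} ---\n{sources[path]}")
    return "\n\n".join(parts)
```

The point of the split is that the selection step only ever sees short descriptions, so the conversation that writes the implementation receives just the code it needs rather than the whole codebase.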
Recursive conversations are conversations with the LLM that are set up in a way that they can be used “recursively”. For example, if GPT Pilot detects an error, it needs to debug it, but let’s say that another error happens during the debugging process. Then GPT Pilot needs to stop debugging the first issue, fix the second one, and then get back to fixing the first issue. This is a crucial concept that, I believe, needs to work for AI to build large and scalable apps. It works by rewinding the context and explaining each error in the recursion separately. Once the deepest-level error is fixed, we move up in the recursion and continue fixing errors until the entire recursion is completed.
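Here is a minimal sketch of the order in which errors get handled in such a recursion. It models only the control flow, not the context rewinding or the LLM calls. The `attempt_fix` callback is a hypothetical placeholder for "ask the LLM to fix this error and run the result"; it returns `None` on success or a new error that surfaced along the way. The depth and attempt limits are my own additions to keep the sketch from looping forever.

```python
from typing import Callable, List, Optional


def fix_recursively(
    error: str,
    attempt_fix: Callable[[str], Optional[str]],
    log: List[str],
    depth: int = 0,
    max_depth: int = 5,
    max_attempts: int = 3,
) -> None:
    """Fix `error`, first fixing any nested error that appears while doing so."""
    if depth > max_depth:
        raise RuntimeError(f"gave up: nesting too deep while fixing {error!r}")
    indent = "  " * depth
    for _ in range(max_attempts):
        log.append(f"{indent}fixing: {error}")
        nested_error = attempt_fix(error)
        if nested_error is None:
            log.append(f"{indent}fixed: {error}")
            return
        # Pause this error, resolve the nested one, then retry this error.
        fix_recursively(nested_error, attempt_fix, log, depth + 1, max_depth, max_attempts)
    raise RuntimeError(f"gave up: too many attempts at {error!r}")
```

Each level only needs to know about its own error, which matches the idea of explaining each error in the recursion separately.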
For GPT Pilot to scale the codebase, improve it, change requirements, and add new features, it needs to be able to create new code without breaking previously written code. There is no better way to do this than working with TDD methodology. For all code that GPT Pilot writes, it needs to write tests that check if the code works as intended so that whenever new changes are made, all regression tests can be run to check if anything breaks.
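As a small illustration of the regression idea, the sketch below keeps every test written so far in one suite and reruns the whole suite after each change. The `slugify` function, the test names, and the `run_regression` helper are invented for this example; a real project would use a test runner such as `unittest` or `pytest`.

```python
from typing import Callable, Dict, List


def slugify(title: str) -> str:
    """Example application code written for an earlier task."""
    return title.strip().lower().replace(" ", "-")


def test_slugify_basic() -> None:
    assert slugify("Hello World") == "hello-world"


def test_slugify_trims() -> None:
    assert slugify("  Hi ") == "hi"


# Every test written so far is kept and rerun after each new change.
regression_suite: Dict[str, Callable[[], None]] = {
    "test_slugify_basic": test_slugify_basic,
    "test_slugify_trims": test_slugify_trims,
}


def run_regression(tests: Dict[str, Callable[[], None]]) -> List[str]:
    """Run all tests and return the names of the ones that fail."""
    failures = []
    for name, test in tests.items():
        try:
            test()
        except AssertionError:
            failures.append(name)
    return failures
```

If a later task changes `slugify` in a way that breaks an earlier expectation, `run_regression(regression_suite)` returns the failing test names, which is the signal to start debugging before moving on.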
I’ll go deeper into these three concepts in the next blog post, where I’ll break down the entire development process of GPT Pilot.
In this first blog post, I discussed the high-level overview of how GPT Pilot works. In posts 2 and 3, I show you:
But while you are waiting, head over to GitHub, clone the GPT Pilot repository, and experiment with it. Let me know if you are successful, and while you’re there, don’t forget to star the repo – it would mean a lot to me.
Thank you for reading 🙏, and I’ll see you in the next post!
Also published here.
If you have any feedback or ideas, please let me know in the comments or email me at [email protected].