This is the second blog post in a 3-part series where I explain how my team and I created GPT Pilot – the AI coding agent that’s designed to work at scale and build production-ready apps with a developer’s help. In part #1 of this series, I discussed the high-level overview of GPT Pilot. The idea is that AI can now do 95% of all coding that we, developers, are doing. See how I used ChatGPT to code out an entire Redis proxy in 2 hours, which would usually take 20-30 developer hours. However, an app is of no use if it doesn’t fully work or solve the user’s problem. So, until real AGI arrives, you need a developer.
So, this is how GPT Pilot came to life. It is designed to do 95% of the required coding and asks developers for reviews, such as when it becomes stuck and cannot move forward or needs something outside the app, like an API key.
In this post, I walk you through the entire process GPT Pilot goes through when coding an app. I share diagrams to provide a visual representation of everything that’s going on behind the scenes in GPT Pilot. I’m a visual person, so I always create diagrams. To understand how GPT Pilot’s coding works, you need three concepts – context rewinding, recursive conversations, and TDD. I introduced them in part #1 of this series.
The GPT Pilot coding workflow contains 5 steps:
Coding workflow is my favorite part of GPT Pilot, so let’s dive in. Here is a diagram of how it looks visually:
Two important concepts will be mentioned throughout this blog post – development tasks and development steps.
GPT Pilot works in a way that, after breaking down the specifications for developing an app, it creates development tasks that will lead to a fully working app. Development tasks are basically high-level descriptions of what needs to be done that a developer will take and start implementing. Think of them as tasks in Jira (btw, I hate Jira…not sure if anyone relates, but I just wanted to let it out of my system). Here is an example of a development task:
In the diagram above, you see 3 task properties:
Now, when you start developing a task from Jira (development task), you will split it into smaller chunks (we call them development steps) that are actionable items you would set out to implement into the codebase. Each development step can be one of the following:
Command run – a command that needs to be run on the machine, such as a command to install a dependency, start the app to check if the previously implemented steps work, or create a folder.
Code change – the most important development step that explains what exactly needs to be implemented in the actual code to fulfill the current step. It can contain new code that needs to be written or code that needs to be changed. The way it works is the code change is a detailed, human-readable description of what needs to be implemented. It contains both the code that needs to be implemented and the description of what it is being used for. This is very similar to when you ask ChatGPT to code something. It will give you the code as well as the explanations of why it wrote that code.
The reason for this is that code implementation is not so simple. Sometimes, we need to add a snippet to existing code or change the existing implementation. That is why we separated the outline of the coding change (this development task) and the actual implementation of this change that the CodeMonkey agent is dedicated to. I will go deeper into that in the #3 Coding section. Here is an example of a code change:
Human intervention – a development step that AI cannot do by itself and needs human help to fulfill. GPT Pilot asks the developer to do something, and when they are done, they write “continue,” and GPT Pilot continues with the implementation. Here are some reasons why human intervention might be needed:
It’s easy for AI to write a new file that contains code, but in reality, that is rarely the case. For the most part, we write into the existing files and either change the existing code or add new code. Now, AI can do this easily if you give it all of the existing code and instructions for what needs to be implemented. The problem arises when an app scales and the codebase becomes so large that it cannot fit into the LLM context. And this is actually a very common case – at least until we have LLMs with 1M-token context windows, which don’t seem to be coming soon.
When you work on a task in a big codebase, you usually look at a smaller part of the codebase (maybe 1,000 lines) and work only with that subset of code to implement the task.
So, to address this issue and make GPT Pilot truly scalable so that it can create and upgrade large production-ready codebases, we must create a way for the AI to select the smaller part of the codebase (e.g., those 1,000 lines) on which it will implement the current task. Once it’s finished, we can simply add the finished lines back into the original codebase. Let me start explaining this by telling you what happens when GPT Pilot writes code and creates new files and folders. For each file and folder it must create, it needs to write a description of what the idea is behind the file or folder it wants to create. For example, it might want to create a folder utils for which it will write:
Contains utility modules that provide generic, reusable solutions to common problems encountered throughout the application. These utilities are not specific to the app's core domain but offer auxiliary functionality to support and streamline the primary codebase. They encapsulate best practices, reduce code repetition, and make the overall code cleaner and easier to maintain. Examples include functions for data formatting, error handling, debugging tools, string manipulation, data validation, and other shared operations that don't fit within specific modules or components of the app.
Likewise, for each function GPT Pilot creates, it writes a description of what the function is supposed to do. Taken together, these descriptions form a pseudocode version of the entire codebase.
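To make the idea of a description-only "pseudocode" layer concrete, here is a minimal sketch. The class names (`FileNote`, `FunctionNote`), the `render_outline` helper, and the sample file are hypothetical and are not taken from GPT Pilot's source; they only illustrate storing one description per file and per function and rendering them without any code.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionNote:
    """Hypothetical record: what one function is supposed to do."""
    name: str
    description: str


@dataclass
class FileNote:
    """Hypothetical record: the idea behind one file, plus its functions."""
    path: str
    description: str
    functions: list[FunctionNote] = field(default_factory=list)


def render_outline(files: list[FileNote]) -> str:
    """Render descriptions only (no code) as a compact text outline."""
    lines = []
    for file_note in files:
        lines.append(f"{file_note.path}: {file_note.description}")
        for fn in file_note.functions:
            lines.append(f"  {fn.name}(): {fn.description}")
    return "\n".join(lines)


# Illustrative data only.
notes = [
    FileNote(
        path="utils/format.py",
        description="Helpers for formatting data for display.",
        functions=[
            FunctionNote("format_date", "Turn a datetime into a short readable string."),
        ],
    ),
]
outline = render_outline(notes)
print(outline)
```

An outline like this is far smaller than the code it describes, which is what makes it practical to show to an LLM when the full codebase would not fit into the context.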
Now that you know what happens when GPT Pilot writes code, you can understand how it fetches the relevant code for each development step.
Before GPT Pilot codes each step, it first fetches the relevant part of the codebase in a completely separate LLM conversation. That conversation is set up in 3 steps.
If the app becomes extremely large, we can improve this by first giving the LLM only the folders, from which it selects the relevant ones, and then giving it the files inside those folders. Before each of these steps, we can also rewind the conversation to the beginning to leave more room in the context.
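The narrowing idea can be sketched roughly as follows. This is an assumption-heavy illustration, not GPT Pilot's implementation: the three stages (pick files from descriptions, pick functions from descriptions, then load the real code), the `fetch_relevant_code` function, and the `keyword_selector` that stands in for an LLM call are all hypothetical.

```python
from typing import Callable

# Hypothetical stand-in for one LLM call: given a goal and a menu of
# {name: description}, return the names judged relevant.
Selector = Callable[[str, dict[str, str]], list[str]]


def fetch_relevant_code(
    goal: str,
    file_descriptions: dict[str, str],
    function_descriptions: dict[str, dict[str, str]],
    source: dict[str, str],
    select: Selector,
) -> dict[str, str]:
    # Stage 1 (assumed): show only file descriptions, get back file paths.
    files = select(goal, file_descriptions)
    # Stage 2 (assumed): show only function descriptions from those files.
    menu = {}
    for path in files:
        for name, description in function_descriptions.get(path, {}).items():
            menu[f"{path}::{name}"] = description
    chosen = select(goal, menu)
    # Stage 3 (assumed): load the actual code for the chosen functions.
    return {key: source[key] for key in chosen if key in source}


def keyword_selector(goal: str, menu: dict[str, str]) -> list[str]:
    """Toy selector used instead of an LLM: keep entries sharing a word with the goal."""
    words = set(goal.lower().split())
    return [name for name, desc in menu.items() if words & set(desc.lower().split())]


# Illustrative data only.
file_descriptions = {
    "auth.py": "user login and session handling",
    "report.py": "build pdf sales summaries",
}
function_descriptions = {
    "auth.py": {
        "check_password": "verify a login password hash",
        "logout": "end the current session",
    },
    "report.py": {"render": "render the pdf document"},
}
source = {
    "auth.py::check_password": "def check_password(user, password): ...",
    "auth.py::logout": "def logout(session): ...",
    "report.py::render": "def render(data): ...",
}

relevant = fetch_relevant_code(
    "fix login password check",
    file_descriptions,
    function_descriptions,
    source,
    keyword_selector,
)
print(relevant)
```

Because each stage only sees descriptions, the actual code enters the conversation once, at the end, and only for the parts that were selected.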
Here is a diagram of what this looks like:
Now that we can create an LLM message that contains all the code necessary for someone to implement a specific task, we can start with the actual coding process. This happens in a 2-part process:
Here is where the CodeMonkey agent steps in. It is called code monkey because it doesn’t make any decisions but rather simply implements the code that the Developer agent writes. It is given the code relevant for the current task (previously selected by the LLM in the code-fetching phase) and the description that the Developer agent created in development step #1. Then, the only thing it needs to return is the completely coded sections/files that we can just insert/replace in the codebase.
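The "insert/replace" part can be pictured with a small sketch. It assumes, purely for illustration, that the codebase is held as a dictionary mapping paths to file contents and that the agent returns whole files; `apply_code_monkey_output` is a hypothetical name, not a GPT Pilot function.

```python
def apply_code_monkey_output(
    codebase: dict[str, str],
    returned_files: dict[str, str],
) -> list[str]:
    """Insert new files and replace existing ones; return the paths that changed."""
    changed = []
    for path, content in returned_files.items():
        if codebase.get(path) != content:
            codebase[path] = content  # same operation for new and existing files
            changed.append(path)
    return changed


# Illustrative data only.
codebase = {"app.py": "print('v1')\n", "README.md": "# Demo\n"}
returned = {
    "app.py": "print('v2')\n",                       # existing file, replaced
    "utils/log.py": "def log(msg): print(msg)\n",    # new file, inserted
}
changed = apply_code_monkey_output(codebase, returned)
print(changed)
```

Keeping the agent's output limited to finished files or sections means no decision-making is needed at this point: whatever comes back is written into place.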
There are two places where testing is done – (1) after each development task when GPT Pilot creates integration tests that test if the high-level features work as intended and (2) after each development step when it creates smaller unit tests that ensure all functions work as expected.
GPT Pilot has three different types of tests it can do:
After running each test, if successful, GPT Pilot takes on the next task or step and continues with coding, but when the test fails, GPT Pilot needs to debug the error.
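As a rough sketch of that control flow, the loop below runs a unit-level check after each development step and an integration-level check after the whole task, handing failures to a debugger. The function `run_task` and its callback parameters are hypothetical names chosen for this example, and what happens after debugging (here, simply moving on) is an assumption.

```python
from typing import Callable


def run_task(
    steps: list[Callable[[], None]],
    run_unit_test: Callable[[int], bool],
    run_integration_test: Callable[[], bool],
    debug: Callable[[str], None],
) -> None:
    """Run each step, test it, and call the debugger whenever a test fails."""
    for index, step in enumerate(steps, start=1):
        step()
        if not run_unit_test(index):
            debug(f"step {index}")
    # After all steps of the task are done, check the high-level feature.
    if not run_integration_test():
        debug("task")


# Illustrative run: the unit test for step 2 fails, everything else passes.
log: list[str] = []
steps = [lambda: log.append("step 1"), lambda: log.append("step 2")]
run_task(
    steps,
    run_unit_test=lambda index: index != 2,
    run_integration_test=lambda: True,
    debug=lambda what: log.append(f"debug {what}"),
)
print(log)
```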
The debugging process needs to be robust so that it can be started on any bug that arises, regardless of the error. It also needs to be able to debug any issue that happens during the debugging process. This is where recursive conversations come in, which are conversations with the LLM that are set up in a way that they can be used “recursively.”
Let’s look at the example in the image below. It represents the flow that GPT Pilot goes through when working on a development task that has five development steps. In this example, an error occurs during development step #3 – let’s say GPT Pilot implements a specific code change, but the test it runs afterward fails. It then enters recursion level #1 to debug this issue. It breaks down the fix into two steps, but during the implementation of the first step, another error happens. For example, a dependency needed for fixing error #1 doesn’t exist. GPT Pilot then enters recursion level #2, which it breaks down into three steps. In the third step, another error occurs, so it enters recursion level #3, which has only one step. Once that step is successfully executed, GPT Pilot goes back to recursion level #2 and finishes debugging error #2. After that, it goes back to debugging error #1, and finally, after error #1 is fixed, it returns to development step #3 and continues the app implementation.
When the recursions go five levels deep, GPT Pilot will stop the debugging process and ask the developer to fix the initial issue it started with. Once the developer resolves this issue, they write the results to GPT Pilot. Then, it can continue the development process as if it debugged the issue itself.
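The recursion and its depth limit can be sketched as below. This is an illustration under assumptions, not GPT Pilot's code: `plan_fix` and `execute` are hypothetical callbacks standing in for LLM conversations and command/code execution, and treating "five levels deep" as "stop when level 5 would be entered" is one possible reading of the limit.

```python
from typing import Callable, Optional

MAX_DEPTH = 5  # recursion level at which debugging is handed to the developer


class TooDeep(Exception):
    """Raised when nested debugging reaches the maximum recursion level."""


def debug(
    error: str,
    plan_fix: Callable[[str], list[str]],
    execute: Callable[[str], Optional[str]],
    depth: int = 1,
) -> None:
    if depth >= MAX_DEPTH:
        raise TooDeep(error)
    for step in plan_fix(error):
        nested_error = execute(step)  # returns None when the step succeeds
        if nested_error is not None:
            # Debug the new error one level deeper, then resume this level.
            debug(nested_error, plan_fix, execute, depth + 1)


def debug_or_escalate(initial_error, plan_fix, execute, ask_developer) -> None:
    try:
        debug(initial_error, plan_fix, execute)
    except TooDeep:
        # The developer is asked about the issue the debugging started with.
        ask_developer(initial_error)


# Scenario from the text: error #1 needs 2 steps, error #2 needs 3, error #3 needs 1.
plans = {
    "error #1": ["1a", "1b"],
    "error #2": ["2a", "2b", "2c"],
    "error #3": ["3a"],
}
failures = {"1a": "error #2", "2c": "error #3"}
trace: list[str] = []


def execute(step: str) -> Optional[str]:
    trace.append(step)
    return failures.get(step)


debug_or_escalate("error #1", plans.__getitem__, execute, ask_developer=print)
print(trace)

# A fix that always fails eventually escalates the initial error.
asked: list[str] = []
debug_or_escalate("boom", lambda e: ["retry"], lambda s: "again", asked.append)
print(asked)
```

The same `debug` function handles every level, which is what lets the process start on any bug, including bugs that appear while fixing another bug.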
In the first post of this series, I discussed the high-level overview of how GPT Pilot works. In this post, I described the GPT Pilot Coding Workflow, including:
In the final post, I will dive deep into how all the agents are structured. We built the agents modularly because we know they will evolve over time. Please head over to GitHub, clone the GPT Pilot repository, experiment with it, and send me your feedback. I want GPT Pilot to be as helpful to developers as possible, so let me know what you think, how it can be improved, or what works well. Add comments at the bottom of this post, or email me at [email protected].
🌟🌟🌟
Finally, we're trying to raise funds to continue developing GPT Pilot, so it would mean A LOT if you could
🌟🌟🌟
Also published here.