This is the second blog post in a 3-part series where I explain how my team and I created GPT Pilot – the AI coding agent that’s designed to work at scale and build production-ready apps with a developer’s help.

In part #1 of this series, I discussed the high-level overview of GPT Pilot. The idea is that AI can now do 95% of all coding that we, developers, are doing. See how I used ChatGPT to code out an entire Redis proxy in 2 hours, which would usually take 20-30 developer hours. However, an app is of no use if it doesn’t fully work or solve the user’s problem. So, until real AGI arrives, you need a developer.

So, this is how GPT Pilot came to life. It is designed to do 95% of the required coding and asks developers for reviews, such as when it becomes stuck and cannot move forward or needs something outside the app, like an API key.

In this post, I walk you through the entire process GPT Pilot goes through when coding an app. I share diagrams to provide a visual representation of everything that’s going on behind the scenes in GPT Pilot. I’m a visual person, so I always create diagrams.

To understand how GPT Pilot’s coding works, you need three concepts – context rewinding, recursive conversations, and TDD. I described them in my introduction in part #1 of this series.

The GPT Pilot coding workflow consists of the following steps:

1. Take the next development task in line
2. Break down the task into development steps
3. Take the next development step
4. Fetch the currently implemented code
5. Write code for the current step
6. Run the code or a command
7. Test the new code changes
8. Debug the development step or go to the next step

Coding workflow is my favorite part of GPT Pilot, so let’s dive in. Here is a diagram of how it looks visually:

[Diagram: GPT Pilot coding workflow]

#1 Task breakdown

Two important concepts will be mentioned throughout this blog post – development tasks and development steps.

GPT Pilot works in a way that, after breaking down the specifications for developing an app, it creates development tasks that will lead to a fully working app. Development tasks are basically high-level descriptions of what needs to be done that a developer will take and start implementing. Think of them as tasks in Jira (btw, I hate Jira…not sure if anyone relates, but I just wanted to let it out of my system). Here is an example of a development task:

[Diagram: example of a development task]

In the diagram above, you see 3 task properties:

description: what needs to be implemented to fulfill this task

user_review_goal: how the lead developer can determine if the current task is finished – a crucial pillar of GPT Pilot is that a developer must be involved throughout the coding process so that you can ensure the development process is going as planned and understand the codebase along the way

programmatic_goal: the kind of automated test GPT Pilot should write to test if this entire development task works as expected. After a development step, GPT Pilot writes unit tests, and after a development task, it writes integration or E2E tests.

Now, when you start developing a task from Jira (development task), you will split it into smaller chunks (we call them development steps) that are actionable items you would set out to implement into the codebase. Each development step can be one of the following:

Command run – a command that needs to be run on the machine, such as a command to install a dependency, start the app to check if the previously implemented steps work, or create a folder.

Code change – the most important development step that explains exactly what needs to be implemented in the actual code to fulfill the current step. It can contain new code that needs to be written or code that needs to be changed. A code change is a detailed, human-readable description of what needs to be implemented. It contains both the code that needs to be implemented and the description of what it is being used for. This is very similar to when you ask ChatGPT to code something. It will give you the code as well as the explanations of why it wrote that code.

The reason for this is that code implementation is not so simple. Sometimes, we need to add a snippet to existing code or change the existing implementation. That is why we separated the outline of the coding change (this development step) from the actual implementation of this change, which the CodeMonkey agent is dedicated to. I will go deeper into that in the #3 Coding section. Here is an example of a code change:

[Diagram: example of a code change]

Human intervention – a development step that AI cannot do by itself and needs human help in fulfilling. In that case, GPT Pilot asks the developer to do something, and when they are done, they write “continue,” and GPT Pilot continues with the implementation. Here are some reasons why human intervention might be needed:

- An API key is needed (e.g., a Twitter API key to fetch data from Twitter).
- GPT Pilot became stuck in a debugging process, and it either filled the entire context length or the recursive conversation became too deep and unproductive to continue.
- GPT Pilot needs verification that something works as expected – e.g., GPT Pilot is not sure if Mongo is installed properly on the machine and might ask the developer to run some sudo commands and see if it works as expected.

#2 Fetching of currently implemented code

It’s easy for AI to write a new file that contains code, but in reality, that is rarely what is needed. For the most part, we write into the existing files and either change the existing code or add new code. Now, AI can do this easily if you give it all of the existing code and instructions for what needs to be implemented. The problem arises when an app scales and the codebase becomes so large that it cannot fit into the LLM context.
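To get a feel for when a codebase stops fitting, here is a minimal sketch that estimates token counts from file contents. Everything in it is an assumption made for illustration: the 4-characters-per-token ratio is only a common rule of thumb (a real tokenizer will give different numbers, especially for code), and the function names, the reserve for instructions, and any context limit you pass in are hypothetical rather than taken from GPT Pilot.

```python
from typing import Dict

# Assumption: a rough rule of thumb, not the output of a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(files: Dict[str, str], context_limit: int, reserved: int = 1000) -> bool:
    """Check whether all file contents fit, leaving `reserved` tokens for instructions.

    `files` maps a path to its source text. `context_limit` and `reserved`
    are illustrative parameters, not values used by GPT Pilot.
    """
    total = sum(estimate_tokens(source) for source in files.values())
    return total + reserved <= context_limit
```

A check like this only tells you that a selection step is needed; it says nothing about which part of the code to select.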
This is actually a very common case – at least until we have LLMs with 1M-token context windows, which doesn’t seem to be coming soon. When you work on a task in a big codebase, you usually look at a smaller part of the codebase (maybe 1,000 lines) and work only with that subset of code to implement the task.

So, to address this issue and make GPT Pilot truly scalable so that it can create and upgrade large production-ready codebases, we must create a way for the AI to select the smaller part of the codebase (e.g., those 1,000 lines) on which it will implement the current task. Once it’s finished, we can simply add the finished lines back into the original codebase.

Let me start explaining this by telling you what happens when GPT Pilot writes code and creates new files and folders. For each file and folder it must create, it needs to write a description of the idea behind that file or folder. For example, it might want to create a folder utils for which it will write:

Contains utility modules that provide generic, reusable solutions to common problems encountered throughout the application. These utilities are not specific to the app's core domain but offer auxiliary functionality to support and streamline the primary codebase. They encapsulate best practices, reduce code repetition, and make the overall code cleaner and easier to maintain. Examples include functions for data formatting, error handling, debugging tools, string manipulation, data validation, and other shared operations that don't fit within specific modules or components of the app.

Similarly, for each function GPT Pilot creates, it writes a description of what the function is supposed to do – together, these descriptions act as pseudocode for the entire codebase.

Now that you know what happens when GPT Pilot writes code, you can understand how it fetches the relevant code for each development step. Before GPT Pilot codes each step, it first fetches the relevant part of the codebase in a completely separate LLM conversation. That conversation is set up in 3 steps:

1. The AI is given the development step description along with the entire project file/folder structure and descriptions for each file and folder. From this, the LLM tells us which files are relevant for the mentioned step.
2. After narrowing down the necessary files, we give the LLM the pseudocode for each file it listed and ask it to tell us which functions are relevant for the current development step.
3. Once we know the pseudocode it selected, we can fetch the actual code and put it into the original conversation, where the LLM will write the description of what needs to be implemented.

If the app becomes extremely large, we can improve this by first giving the LLM only the folders, from which it selects the relevant ones, and then giving it the files in those folders. Before each of these steps, we can also rewind the conversation to the beginning to leave more room in the context. Here is a diagram of what this looks like:

[Diagram: fetching of currently implemented code]

#3 Coding

Now that we can create an LLM message that contains all the code necessary for someone to implement a specific task, we can start with the actual coding process. This happens in a 2-part process:

First, the LLM writes the description of what needs to be implemented along with the code.
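For reference, the three-step fetching conversation from section #2, which gathers the code that goes into this message, can be sketched roughly as follows. This is an illustration and not GPT Pilot's actual code: the index layout, the `select` callback that stands in for the LLM call, and all function names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical index layout:
# path -> {"description": str, "functions": {name: {"description": str, "code": str}}}
ProjectIndex = Dict[str, dict]

# Hypothetical stand-in for an LLM call: given the step and a mapping of
# option name -> description, return the names it considers relevant.
Selector = Callable[[str, Dict[str, str]], List[str]]


def fetch_relevant_code(step: str, index: ProjectIndex, select: Selector) -> str:
    """Narrow the codebase from files to functions to actual code."""
    # Step 1: show only file descriptions and ask which files matter.
    file_descriptions = {path: meta["description"] for path, meta in index.items()}
    relevant_files = select(step, file_descriptions)

    snippets = []
    for path in relevant_files:
        functions = index[path]["functions"]
        # Step 2: show function descriptions (the "pseudocode") for those files only.
        function_descriptions = {name: f["description"] for name, f in functions.items()}
        for name in select(step, function_descriptions):
            # Step 3: pull the real code for the selected functions.
            snippets.append(f"# {path}::{name}\n{functions[name]['code']}")
    return "\n\n".join(snippets)


def keyword_selector(step: str, options: Dict[str, str]) -> List[str]:
    """Toy replacement for the LLM: pick options sharing a word with the step."""
    words = set(step.lower().split())
    return [name for name, desc in options.items() if words & set(desc.lower().split())]
```

The point of the design is that the model only ever sees descriptions until the last step, so the amount of text in the conversation stays small even when the codebase is large.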
If the entire file needs to be coded, the LLM’s response will contain all the code, but if only a part of the code inside a file needs to be changed, the LLM will tell us things like “after the Mongo setup, add the following lines of code…”

As you can imagine, because this output is stochastic rather than deterministic, we need to ensure that the written code is inserted into appropriate places or changed correctly. Here is where the CodeMonkey agent steps in. It is called code monkey because it doesn’t make any decisions but rather simply implements the code that the Developer agent writes. It is given the code relevant for the current task (previously selected by the LLM in the code-fetching phase) and the description that the Developer agent created in the first part of this process. Then, the only thing it needs to return is the completely coded sections/files that we can just insert/replace in the codebase.

#4 Testing

There are two places where testing is done – (1) after each development task, when GPT Pilot creates integration tests that check if the high-level features work as intended, and (2) after each development step, when it creates smaller unit tests that ensure all functions work as expected.

GPT Pilot has three different types of tests it can do:

Automated tests are the preferred way of testing a step or task because they will be used in a regression test suite so that GPT Pilot can be sure that new code changes don’t break old features. However, automated tests are not always the best way to test new code.

Command run is a test where we run a specific command and give the output to the LLM, which then tells us if the implementation was successful. For example, we don’t need to create an automated test that will check if we can run an app with npm run start – for that, a simple command run is enough to check if we successfully set up our environment.

Human intervention is the final way to test the app, and it is needed whenever AI cannot test the implementation itself. This is needed, for example, when there are some visual aspects (e.g., CSS animations) that must be checked to see if they work correctly.

After each test, if it is successful, GPT Pilot takes on the next task or step and continues with coding, but when the test fails, GPT Pilot needs to debug the error.

#5 Debugging

The debugging process needs to be robust so that it can be started on any bug that arises, regardless of the error. It also needs to be able to debug any issue that happens during the debugging process itself. This is where recursive conversations come in, which are conversations with the LLM that are set up in a way that they can be used “recursively.”

Let’s look at the example in the image below. It represents the flow that GPT Pilot goes through when working on a development task that has five development steps.

[Diagram: recursive debugging of a development task with five development steps]

In this example, during the development of step #3, an error occurs – let’s say GPT Pilot implements a specific code change, but the test it runs afterward fails. Then, it goes into recursion level #1 to debug this issue. It breaks down what needs to be done to fix this issue into two steps, but during the implementation of the first step, another error happens. For example, a dependency needed for fixing error #1 doesn’t exist. GPT Pilot then goes into recursion level #2, which it breaks down into three steps. In the third step, another error occurs. Then, it goes to the third recursion level, which has only 1 step. Once that step is successfully executed, GPT Pilot goes back to recursion level #2 and finishes debugging error #2. After that, it goes back to debugging error #1, and finally, after error #1 is fixed, it goes back to development step #3, after which it continues the app implementation.

When the recursions go five levels deep, GPT Pilot will stop the debugging process and ask the developer to fix the initial issue it started with. Once the developer resolves this issue, they write the results to GPT Pilot.
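The recursion described in this section can be sketched in a few lines. Treat this as a simplified illustration rather than GPT Pilot's implementation: `execute` and `plan_fix` are hypothetical callbacks standing in for running a step and asking the LLM for fix steps, and the exact cutoff (stopping just before a fifth recursion level is entered) is my reading of the five-level limit.

```python
from typing import Callable, List, Optional

# From the post: at five levels deep, debugging stops and the developer is asked.
MAX_RECURSION_DEPTH = 5


class NeedsHuman(Exception):
    """Raised with the initial error when debugging recursed too deep."""


def run_steps(
    steps: List[str],
    execute: Callable[[str], Optional[str]],  # hypothetical: returns an error or None
    plan_fix: Callable[[str], List[str]],     # hypothetical: error -> list of fix steps
    depth: int = 0,
    root_error: Optional[str] = None,
) -> None:
    """Run steps in order; on an error, recurse into the steps that should fix it."""
    for step in steps:
        error = execute(step)
        if error is None:
            continue
        # Remember the issue that started this chain of debugging.
        initial = root_error or error
        if depth + 1 >= MAX_RECURSION_DEPTH:
            raise NeedsHuman(initial)
        # Debug this error one level deeper, then resume with the next step.
        run_steps(plan_fix(error), execute, plan_fix, depth + 1, initial)
```

Because each level is just another call to the same function, returning from a fix automatically resumes the level above it, which mirrors how the example walks back from recursion level #3 to development step #3.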
After that, GPT Pilot can continue the development process as if it had debugged the issue itself.

Conclusion

In the first post of this series, I discussed the high-level overview of how GPT Pilot works. In this post, I described the GPT Pilot coding workflow, including:

- How the Developer and CodeMonkey agents work together to implement code (write new files or update existing ones),
- How recursive conversations and context rewinding work in practice, and
- Rewinding the app development process and restoring it from any development step.

In the final post, I will dive deep into how all the agents are structured. We built the agents modularly because we know they will evolve over time.

Please head over to GitHub, clone the GPT Pilot repository, experiment with it, and send me your feedback. I want GPT Pilot to be as helpful to developers as possible, so let me know what you think, how it can be improved, or what works well. Add comments at the bottom of this post, or email me at zvonimir@pythagora.ai.

🌟🌟🌟 Finally, we're trying to raise funds to continue developing GPT Pilot, so it would mean A LOT if you could star the GPT Pilot Github repository and/or share it with your friends. Thank you 🙏 🌟🌟🌟

Also published here.


How GPT Pilot Codes 95% of Your App [Part 2]
