In recent years, the role of artificial intelligence has grown noticeably across all industries. When it comes to development, every team in general, and every developer in particular, is trying to bring AI as close as possible to their project and codebase. In this article, I will discuss how you can automate the generation of acceptance tests, what problems to expect during implementation, and how to reduce these difficulties. At the end, I have put together a roadmap for implementing acceptance test autogeneration, but for now, I suggest we start by looking at the basic concept of acceptance tests.
Acceptance tests are high-level tests that repeat user actions and check the system's responses against the desired behavior. They are performed in an environment that resembles real production as closely as possible, and new code is often deployed to the server only after these tests pass. In this article, we will not consider manual acceptance tests, as any good developer should perform them at every deployment. Let's consider the pros and cons of implementing automatic acceptance tests.
Project Quality
Acceptance tests help developers to be sure that changes in the project code will lead to the expected results. This enables the developer to do less manual testing, which ultimately leads to faster development.
Customer Satisfaction
Regardless of who the customer is, once a clear and specific Definition of Done (DoD) is in place and acceptance tests are implemented, it can be said clearly and accurately that the developer understood the task correctly and delivered exactly what the customer had in mind, which practically eliminates unnecessary rework.
Regression Testing
The most obvious plus is that in a project where acceptance tests already exist, it is much easier to perform regression testing of the existing code during development. The human factor plays a large role here, and there is always a chance that a developer will miss testing some part of the application.
Agile
When acceptance tests are run during deployment, it becomes possible to understand reliably, clearly and, most importantly, quickly whether there are any errors in the new code, which is an additional safety feature.
Expensive to implement
The initial process of implementing acceptance tests can take quite a long time: in addition to choosing a tool, it includes writing the acceptance tests, setting up the environment where they will run, and training your colleagues to write these tests, which can be difficult.
Support
Unfortunately, acceptance tests must always be kept fully up to date, because only then can they be trusted and used to check the state of the system.
Balance
When working with acceptance tests, it is also necessary to find a balance between insufficient and redundant coverage. Insufficient coverage may not fully reflect the state of the project, while redundant coverage can lead to long verification times and to burnout for the developers who have to keep track of the redundant states.
Additional dependencies
Adding a new dependency always means an additional chance of error. Implementing acceptance tests properly means adding at least one more environment that is completely similar to the real production.
And while you cannot get rid of this last drawback, since acceptance tests always need a safe environment, you can try to get rid of all the others with the help of autogeneration.
In addition to full autogeneration of acceptance tests, there is also a semi-automatic approach, where every request for test creation is written manually but the test itself is generated by the AI. This approach is a plus for developers, as they get a better understanding of the principles of working with AI and build the skill of writing prompts, but it has one big drawback: it does not speed up development, because writing a prompt takes quite a long time every time, and wiring together all the tools needed for such a process is complicated. Because of this, full autogeneration is the only option if you want to keep most of the advantages of acceptance tests while getting rid of almost all of their disadvantages.
To build the process, we need to do a few things, as follows:
Any task tracker, such as Jira, can be used to create tasks. Nothing prevents you from using other trackers as well, as long as they have an API or a callback system. Jira has both and is the de facto standard, so I will show an example of working with it. Documentation on callbacks from Jira can be found here.
Unfortunately, selecting a tracker alone is not enough; several other changes need to be made to the way tasks are put together.
A big problem can be the task description itself. Since task creation in a team is often handled by people who are not deeply immersed in technical details, a task template can help them. The template may vary from team to team, but globally the same few things should always be in there.
Formalizing the data leads to a higher success rate in generating acceptance tests. This part of the process is crucial because the incoming data will always be different for different tasks.
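As a purely illustrative example of such formalization (this is not a prescribed format, and the exact sections will vary by team), a task template might look like this:

```
Title: <short name of the feature>

As a: <role of the user>
I want: <action the user performs>
So that: <value the user gets>

Acceptance criteria:
Given <initial state of the system>
When <user action or request>
Then <expected response or state>

Affected endpoints: <list of endpoints, if known>
```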
The second important action is to modify the standard set of fields in the task. You may already have something similar, so the list of fields below is not required, but it's what I'm going to use next.
Here is the list of fields to be added:
New task types
Through trial and error, I came to the conclusion that all tasks can be divided into several types that change their behavior during generation.
The simplest type is tasks that add something new to the project. Generating tests for them is not complicated: it simply adds test case files to the appropriate place.
A much more complex case is generating tests for tasks that modify already existing tests. This happens when we change the behavior of methods or extend them. I called this type of task updating old code.
Besides these two types, there are tasks that do not need to be sent to the AI at all, such as bugs and research. With research everything is clear: it usually does not require writing code and exists to develop an understanding of what to do next.
As for bugs, such tasks can be split into research and updating old code. After detailed information about the bug has been gathered, you can create a second task of the required type, and depending on that type the information will or will not be sent to the AI.
List of affected endpoints
This is an additional field. It is optional, but in the beginning, when introducing the new way of writing tasks, you should make it mandatory to exclude situations where someone forgets to specify an endpoint in the task text.
Field for inserting ready acceptance tests
This field is also optional, but in my implementation, the generated tests are put here.
After setting up the fields, to speed up adoption it is worth adding a guide that shows the whole process with screenshots or in a video. Also, so that people unfamiliar with the names of technical entities can write a correct task, it is necessary to write a glossary. It brings a lot of benefit and greatly reduces the time needed to write and implement tasks; in addition, it helps new colleagues get into the development and understand the principles of the system. Additionally, the grooming process can be extended with a discussion of technical details so that the project manager or business representatives understand them better and set tasks faster and more accurately.
After a task has been created in Jira with all the necessary fields and according to the given template, the next step is to send its callbacks to a separate service. Any language or framework can be used to implement such a service; in my case, it was the Laravel framework and PHP, a choice determined by the team's competence and the speed of development. If you have the same stack, you can use the openai-php/client and lesstif/php-jira-rest-client libraries to quickly get a high-quality result.
The service itself is quite simple: it is a proxy that collects information from the tracker and sends it to the AI, then receives the response and delivers it to the right place. In my case, that place is a field in the tracker task, but it could also push the code to the version control system by creating a branch named after the task. This was not done by default because of the specifics of the language model, which we will discuss now.
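To make the shape of such a proxy more concrete, here is a minimal sketch of a Laravel webhook handler built on the openai-php/client and lesstif/php-jira-rest-client libraries mentioned above. The issue type name, the config key, and the choice to post the result back as a comment (rather than into a dedicated custom field) are illustrative assumptions, not the author's exact implementation.

```php
<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use JiraRestApi\Issue\Comment;
use JiraRestApi\Issue\IssueService;
use OpenAI;

class JiraWebhookController extends Controller
{
    public function handle(Request $request)
    {
        // Jira sends the issue itself in the webhook payload.
        $issueKey    = $request->input('issue.key');
        $summary     = $request->input('issue.fields.summary');
        $description = $request->input('issue.fields.description');
        $taskType    = $request->input('issue.fields.issuetype.name');

        // Only "new code" tasks are sent to the AI in this sketch
        // (the type name is an assumption; use your own task types).
        if ($taskType !== 'New code') {
            return response()->noContent();
        }

        // Ask the model for acceptance tests via openai-php/client.
        // The config key 'services.openai.key' is an assumption.
        $client = OpenAI::client(config('services.openai.key'));

        $result = $client->chat()->create([
            'model'    => 'gpt-4',
            'messages' => [
                ['role' => 'system', 'content' => 'You generate acceptance tests for a PHP project.'],
                ['role' => 'user', 'content' => "Task {$issueKey}: {$summary}\n\n{$description}"],
            ],
        ]);

        $generatedTests = $result->choices[0]->message->content;

        // Deliver the result back to the tracker. Here it is added as a comment
        // (lesstif/php-jira-rest-client); writing into a custom field, as the
        // author does, works along the same lines.
        $comment = new Comment();
        $comment->setBody($generatedTests);
        (new IssueService())->addComment($issueKey, $comment);

        return response()->noContent();
    }
}
```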
In order for the autogeneration to work, I had to choose a language model. Since, at the time of development, ChatGPT with Bing search was one of the few language models that could access the Internet, I decided to use it, and I am quite satisfied with the result. To be specific, the model used was gpt-4. There are more advanced versions of this popular language model today, so I suggest choosing the most current version unless you have other constraints. You can see the list of current language models for ChatGPT here. If you switch to gpt-3.5-turbo, the results get worse, but not so much that version 4 is the only viable choice. Next, you need to choose the prompt.
A prompt is a precisely formulated query that the developer sends to the AI to autogenerate acceptance tests. Prompt preparation is the most important part of the generation process: it determines how well the AI understands you and whether it generates what you need in the form you want.
Prompts for test generation when creating new code need only a clear description of the task, a description of the expected response, and a description of the generation process. The success of the queries depends, in addition to a good prompt, on how well the task has been described in Jira. An example of a prompt for new code could look like this:
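The sketch below is only an illustration of that three-part structure (task description, desired response, generation process), not the author's original prompt; the placeholders in braces are filled in by the service from the Jira fields, and the test framework and directory it names are assumptions.

```
You are an experienced PHP developer writing acceptance tests.

Task description:
{summary and description taken from the Jira task}

Affected endpoints:
{contents of the "list of affected endpoints" field}

Generate acceptance tests for the behavior described above.
Respond only with a single PHPUnit test class, ready to be placed in
tests/Acceptance, with one test method per acceptance criterion.
Do not add any explanations outside the code.
```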
In the case of updating old code, the prompt should include not only the test generation request but also the existing code that needs to be changed. This can be a problem, because language models have limits on the amount of data they can accept per request: for the gpt-4 model this limit is 8,192 tokens, although there are already gpt-4 models that support 128,000 tokens. However, so far they return at most 4,096 tokens and are not a production-grade solution. To get an idea of how many tokens your request takes, you can try submitting it to the tokenizer via the link here.
In our PHP project, the amount of code relates to the number of tokens roughly as 5 to 2. Since almost any modern product contains more code than fits into the per-request limit, we need to change the dataset we send: instead of sending all the code, we send only its metadata, which contains just the method names with their arguments and return types. To generate this metadata in PHP, you can use functions that ship with many frameworks, for example the getClassMetadata() method of the Doctrine EntityManager in Symfony, or the php artisan ide-helper:{generate|meta|models} commands in Laravel from the barryvdh/laravel-ide-helper package.
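If you prefer not to rely on framework helpers, plain PHP reflection is enough to produce this kind of compact metadata. The sketch below is one possible way to do it; the class name in the usage comment is only an example.

```php
<?php

// Build a compact description of a class: public method signatures only,
// no bodies, so far fewer tokens are sent to the model.
function classMetadata(string $className): string
{
    $reflection = new ReflectionClass($className);
    $lines = ["class {$reflection->getShortName()}"];

    foreach ($reflection->getMethods(ReflectionMethod::IS_PUBLIC) as $method) {
        $params = array_map(
            fn (ReflectionParameter $p) => ($p->getType() ? (string) $p->getType() . ' ' : '') . '$' . $p->getName(),
            $method->getParameters()
        );
        $return = $method->getReturnType() ? ': ' . (string) $method->getReturnType() : '';

        $lines[] = sprintf('    %s(%s)%s', $method->getName(), implode(', ', $params), $return);
    }

    return implode("\n", $lines);
}

// Example usage with a hypothetical controller class:
// echo classMetadata(App\Http\Controllers\OrderController::class);
```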
If this optimization is not enough, you can use more complicated solutions. One of them is to first make a separate query to the AI with the file tree in order to find the right test file to be corrected. This option is not the best, as it doubles the number of queries and, unfortunately, the answer to this query will not always be correct. The second option is to use vector databases such as Pinecone or Milvus. They let you index the files of your codebase and retrieve only the relevant ones, which makes the query to the AI much smaller in size.
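As a rough sketch of the second option, the snippet below ranks existing test files by semantic similarity to the task description using the embeddings endpoint of openai-php/client; a plain cosine-similarity loop stands in for a real vector database such as Pinecone or Milvus, and the directory path is only an example.

```php
<?php

// Pick the most relevant existing test file for a task by comparing
// embeddings of the task description and of each test file.
function embed(\OpenAI\Client $client, string $text): array
{
    $response = $client->embeddings()->create([
        'model' => 'text-embedding-ada-002',
        'input' => $text,
    ]);

    return $response->embeddings[0]->embedding;
}

function cosine(array $a, array $b): float
{
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

$client = OpenAI::client(getenv('OPENAI_API_KEY'));

$taskDescription = '...'; // summary and description taken from the Jira task
$taskVector = embed($client, $taskDescription);

$scores = [];
foreach (glob('tests/Acceptance/*.php') as $file) {
    $scores[$file] = cosine($taskVector, embed($client, file_get_contents($file)));
}

arsort($scores);
$mostRelevantFile = array_key_first($scores); // this file goes into the prompt
```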
In my case such edits were not needed, so my prompt looks like this:
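As before, the sketch below only illustrates the structure such a prompt can take when updating old code (task description, class metadata, the existing test); it is not the author's actual prompt, and the placeholders in braces are filled in by the service.

```
You are an experienced PHP developer maintaining acceptance tests.

Task description:
{summary and description taken from the Jira task}

Metadata of the affected classes (method signatures only):
{output of the metadata generation step}

Existing test that must be updated:
{contents of the current test file}

Update the existing test so that it covers the changed behavior described
in the task. Respond only with the full updated test class, without any
explanations outside the code.
```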
Now that we've got the prompt figured out, all that's left is to talk about how cost-effective this is. Implementing the task formalization process with the necessary template, guide and glossary took about two sprints. During this time, managers who had previously been poorly immersed in technical processes became much more aware of how the system works and began to make fewer mistakes when creating tasks. The second significant outcome is that team grooming now lasts longer, but in the long run it should return to normal and even shrink a bit, as managers can figure some things out themselves without taking up the team's time.
What we've gotten so far is:
Requests to ChatGPT cost about 9¢ per task with our model. Writing a task now takes 10-15 minutes longer, and this overhead keeps shrinking. Writing tests now takes 15-20 minutes less, and since manual test writing slows down as the project grows more complex, the advantage in writing speed will gradually increase.
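For a rough sanity check of that figure (the token counts here are assumptions, and prices change over time): at gpt-4's 2023 list price of roughly $0.03 per 1,000 prompt tokens and $0.06 per 1,000 completion tokens, a request with about 2,000 prompt tokens and 500 completion tokens costs around 2 × $0.03 + 0.5 × $0.06 = $0.09, which lines up with the 9¢ per task above.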
While we are completely satisfied with the first point, we are not yet satisfied with the second, and we continue to work on improving the success rate. In addition to refining the prompt, we try to send the AI only the files that are actually needed for the tests and to describe the tasks better. Unfortunately, the current level of accuracy does not yet allow us to push the generated code straight to a branch in the version control system, but the process is not over yet.
The implementation of acceptance test generation has shown that AI has improved a great deal in recent years and can genuinely be used as a helper in your work. The integration with AI also has a positive impact on the qualifications of colleagues: in addition to the obvious advantage of spending less time writing code, developers build skills for working with AI prompts, which is undoubtedly important today, while managers gain a deeper knowledge of the system's processes and the percentage of tasks that need rework after testing goes down. Continuous improvement of the prompt gives us reason to expect the team's work to speed up further.
To make it easier for you to do this yourself, I've put together a roadmap for implementing autogeneration in your project.