AI makes mistakes. Just like people. That's why we developed an entire QA industry to independently verify our work—and it's thriving. This sector grows 7% annually and is expected to reach $115.4 billion by 2032, driven in part by the critical need to validate AI-generated outputs.
Meanwhile, browser automation has emerged as a core area of AI service development. OpenAI recently added this functionality to ChatGPT, their most widely-used product. So the obvious idea comes to many minds: why not use browser automation as part of AI-assisted coding, providing coding agents with real-time feedback? Wouldn't this make the whole system more stable? So how useful are AI-powered QA tools today? Can they actually improve both the quality of software delivered to production and the overall developer experience?
The Idea
What should we expect from AI-powered quality assurance services? The most reasonable approach comes from crowd-testing platforms, where real users perform testing tasks. This methodology formulates test cases more abstractly and delivers richer feedback than simple pass/fail results. Teams gain valuable insights into usability, performance issues, and genuine user pain points that traditional testing often misses. With AI, this approach can become even more flexible, fast, and cheap.
AI can also address one of the most critical infrastructure challenges in modern QA: the prohibitive cost of maintenance. Test cases become outdated, components get relocated, and CSS selectors prove brittle at scale. QA maintenance becomes a separate burden that teams often put off.
Moreover, QA agents can act as sidekicks in AI-assisted coding workflows, integrated directly into the human-agent development loop: sharing feedback, taking actions, and playing a central role in the so-called “shift-left” approach.
A comprehensive AI-powered QA service should therefore meet these general requirements (a rough interface sketch follows the list):
- On-Demand Testing During Development. Developers should be able to complete a feature implementation and immediately request AI-powered testing.
- Automated Test Case Development. AI should autonomously generate comprehensive test cases for new features. These test cases need flexible storage options—either locally alongside the codebase or remotely in cloud repositories—ensuring they remain accessible and maintainable.
- Intelligent Test Execution and Management. The system should handle test execution and management with automatic updates. Ideally, it maintains access to both the codebase and documentation, allowing it to update test cases dynamically as requirements evolve.
- Comprehensive Test Reporting with Business Intelligence. Test reports should go beyond simple pass/fail metrics to provide actionable insights across multiple dimensions: Business Logic Analysis, Usability Assessment, Performance Metrics.
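To make these requirements a bit more tangible, here is a rough sketch of what the contract for such a service could look like in code. All names and fields below are illustrative assumptions, not taken from any existing tool.

```python
# Hypothetical contract for an AI-powered QA service; names/fields are illustrative.
from dataclasses import dataclass, field
from enum import Enum
from typing import Protocol


class Verdict(Enum):
    PASSED = "passed"
    FAILED = "failed"
    BLOCKED = "blocked"


@dataclass
class TestReport:
    """Machine-readable report that goes beyond pass/fail (requirement 4)."""
    verdict: Verdict
    business_logic_findings: list[str] = field(default_factory=list)
    usability_findings: list[str] = field(default_factory=list)
    performance_metrics: dict[str, float] = field(default_factory=dict)


class QAAgent(Protocol):
    """What a comprehensive AI-powered QA agent would need to expose."""

    def generate_test_cases(self, feature_description: str) -> list[str]:
        """Requirement 2: derive test cases from a feature description or PRD."""
        ...

    def run(self, test_case: str, base_url: str) -> TestReport:
        """Requirements 1 and 3: on-demand execution against a deployed build."""
        ...

    def heal(self, test_case: str, failure: TestReport) -> str:
        """Requirement 3: update an outdated test case instead of letting it rot."""
        ...
```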
Below we will see that full-fledged services that satisfy all the described requirements do not exist. However, some of these tasks can be solved quite well. QA tools are being developed by both LLM providers - particularly OpenAI - and independent startups. There are also a number of open-source solutions.
OpenAI: Computer-Using Agent
Browser automation has been on Sam Altman's company's radar for quite some time. Six months ago, in January 2025, they launched an agent that can solve tasks on the web using a browser. They call it Operator, and it's powered by a dedicated model based on GPT-4o: the Computer-Using Agent (yes, this is the model name, CUA for short). Essentially it's a vision LLM plus a browser farm, controlled through the chat interface.
The company continues to invest further in Operator, making it widely available for ChatGPT users with Pro subscriptions. OpenAI also made it clear that they are working on their own Chromium-based browser. One can speculate that advanced browser automation will be one of their new product features, and ChatGPT can be used as a playground for this technology.
Operator is not a QA tool specifically, so it would be tricky to make it work that way through the chat UI. But it's possible to build on top of CUA, as it's available through the API. OpenAI demonstrates how it can be done with a small open-source tool called Testing Agent.
Essentially, Testing Agent is CUA with access to Playwright and some basic UI on top of it. It's easy to launch; you'll only need an OpenAI API key.
It works in a quite naive manner. You give the agent a natural-language task, point it to the URL you want to test, configure it with some very limited test data, launch it, and wait. It then creates a test plan according to your task, starts a Chrome browser locally, takes screenshots, exchanges them with OpenAI's CUA model, and runs the actions the LLM generates.
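To make the mechanics concrete, here is a minimal sketch of that screenshot-action loop. The Playwright calls are real, but `get_next_action` is a placeholder for the call to OpenAI's computer-use model, and the action dictionary format is an assumption for illustration, not OpenAI's exact contract.

```python
# Minimal sketch of a CUA-style loop: screenshot -> model -> action -> repeat.
# Playwright calls are real; get_next_action() stands in for the actual CUA API call,
# and the action dict schema is an assumption made for illustration.
from playwright.sync_api import sync_playwright


def get_next_action(screenshot: bytes, task: str) -> dict:
    """Placeholder: send the screenshot and task to the CUA model, get one action back."""
    raise NotImplementedError("wire this up to the OpenAI API")


def run_task(task: str, url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url)

        for _ in range(max_steps):
            screenshot = page.screenshot()              # current state of the page
            action = get_next_action(screenshot, task)  # model decides the next step

            if action["type"] == "click":
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "done":
                break

        browser.close()
```

Every step costs a full round trip of a screenshot to the model, which is exactly why the tool feels slow.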
Even though it's just a demo, it's extremely limited:
- It's controlled with a very basic UI, not an MCP server, so it's not pluggable into your code editor of choice
- It's very slow, mostly because it waits for the model to recognise each screenshot
- It doesn't provide any machine-readable output
- It doesn't support passing env variables to tests
- It doesn't follow instructions well enough; it's basically designed to serve a single use case
I doubt that OpenAI's tool can be used for QA unless something quite sophisticated is built on top of it. And there is no such service at the moment. This applies to both prompt preparation and the contract for browser interaction - screenshots alone are clearly insufficient to effectively avoid misunderstandings and hallucinations.
Browser Use
The task of more precise and controlled interaction between browsers and LLMs is being solved by the Browser Use team. This Californian startup raised a $17 million seed round and has received quite a lot of attention lately. They are developing their own protocol for browser-LLM communication based not on screenshots but on a text-based context consisting of indexed interactive elements, open tabs, HTML attributes, user parameters, and more. Since its core is open-sourced, you can dive into it yourself.
The extended context approach by itself solves many problems:
- Browser Use is LLM-agnostic; it doesn't even require a vision LLM to operate
- It's more stable and performant than screenshot-based solutions like Operator
- It can be prompted more specifically and accurately, as the LLM has a better understanding of the environment it operates in
- It's extendable with custom functions and hooks, so it's easy to build on top of it (see the sketch below)
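For reference, driving Browser Use from Python takes only a few lines. The sketch below follows the pattern from the project's documentation, but note that the exact imports, and in particular which LLM wrapper the Agent expects, vary between Browser Use versions.

```python
# A minimal Browser Use agent, roughly following the project's documented pattern.
# Exact imports and the expected LLM wrapper differ between versions; here the
# model comes from langchain_openai, as in earlier releases.
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI


async def main() -> None:
    agent = Agent(
        task="Open example.com, find the pricing page and report whether a free tier exists",
        llm=ChatOpenAI(model="gpt-4o"),  # any supported chat model; vision is optional
    )
    history = await agent.run()    # the agent plans, acts in the browser, and stops
    print(history.final_result())  # final answer produced by the agent


if __name__ == "__main__":
    asyncio.run(main())
```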
One of Browser Use's primary applications is, of course, web scraping. Imagine writing your scraping code just once, without worrying about the target website's markup changing. Wait, this is exactly what we want from a new generation of QA tools. And yes, there are already a couple of them built on top of Browser Use.
We are now approaching tools that can be unstable, and it took me a while to run some of them locally. Since all of them use MCP as an interface, the best debugging tool available is the MCP inspector.
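Besides the GUI inspector, you can also poke at an MCP server programmatically with the official `mcp` Python SDK. A minimal sketch, assuming the server is launched as a local stdio process; the command and arguments below are placeholders for whatever server you are debugging.

```python
# List the tools an MCP server exposes, using the official `mcp` Python SDK.
# The command/args are placeholders, not a real package name.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(command="uvx", args=["some-qa-mcp-server"])  # placeholder


async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)


asyncio.run(main())
```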
Vibetest Use
Vibetest Use, Browser Use's child project, is a very basic MCP tool that provides functionality for testing web pages. Users control a very limited set of parameters: the URL of the web page being tested, the number of simultaneously running agents, and a toggle between headless and non-headless modes.
The tool begins by spawning a "scout agent" that identifies the main interactive elements of the page and creates tasks for the testing agents. The prompts used are extremely abstract - all decisions regarding the test tasks are left to the LLM. Clarifying instructions are not supported.
Next, Vibetest Use distributes testing tasks across agents, collects and summarises test reports, and provides feedback to the user. And that's it. Intelligent distribution is the only strong point of Vibetest Use; however, there are plenty of problems:
- It's adapted only for Gemini family models, so you'll need a Google API Key to run it
- It provides very limited control over the testing process
- It doesn't allow extending automatically generated scenarios
- It's poorly maintained. To get it running, we first had to fix incompatibility issues with the new version of Browser Use
Overall, Vibetest Use is unlikely to be particularly useful directly, but similar tools could be built on top of it for specific tasks within various business processes, along the lines of the sketch below.
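The pattern itself is easy to reproduce: one scout pass to enumerate what to test, then parallel worker agents. Here is a conceptual sketch of that scout-and-distribute idea, reusing Browser Use's Agent and asyncio; it is not Vibetest Use's actual implementation, and the same version caveat about imports applies as in the earlier sketch.

```python
# Conceptual scout-and-distribute pattern, similar in spirit to Vibetest Use,
# but not its actual code. Agent/LLM setup is the same assumption as above.
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")


async def scout(url: str) -> list[str]:
    """One pass over the page to turn its interactive elements into test tasks."""
    agent = Agent(task=f"Visit {url} and list the main user flows worth testing", llm=llm)
    history = await agent.run()
    return [line for line in (history.final_result() or "").splitlines() if line.strip()]


async def run_suite(url: str, concurrency: int = 3) -> list[str]:
    tasks = await scout(url)
    semaphore = asyncio.Semaphore(concurrency)  # cap simultaneous browsers

    async def worker(task: str) -> str:
        async with semaphore:
            history = await Agent(task=f"{task} on {url}; report any problems", llm=llm).run()
            return history.final_result() or "no result"

    return await asyncio.gather(*(worker(t) for t in tasks))


if __name__ == "__main__":
    for report in asyncio.run(run_suite("https://example.com")):
        print(report)
```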
Another solution based on Browser Use is a "browser agent that can vibe-test web applications via MCP" from a startup called Operative.
Web Eval Agent
Naming is not the strongest part of AI tooling in general, and Web Eval Agent is no exception. Operative's main product is Lovable-like software focused on generating applications based on an organisation's existing internal APIs. But surprisingly, they started with a testing tool first, and it eventually became part of their primary use case.
Web Eval Agent is also available through MCP and provides some functionality on top of Browser Use. First, like Operator, it comes with a convenient UI for interacting with the testing agent. Next, it orchestrates and distributes your testing tasks. However, it doesn't use any parallelisation, so you'll end up with just one agent/browser per request.
The tool then executes the task, collects screenshots, network and console logs and errors, and finally summarises everything in a text report. The idea is that you can immediately feed these errors back to your code-generation agent and close the loop. Unlike Vibetest Use, Web Eval Agent allows you to provide a prompt that it uses to instruct the testing agent and draw conclusions from test results.
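Collecting that kind of evidence doesn't require anything exotic; with plain Playwright it comes down to a couple of event handlers. A minimal sketch of the idea, not Operative's actual code:

```python
# Collecting console errors and failed network requests with plain Playwright:
# the kind of evidence Web Eval Agent attaches to its reports (conceptual sketch).
from playwright.sync_api import sync_playwright


def collect_page_evidence(url: str) -> dict[str, list[str]]:
    evidence: dict[str, list[str]] = {"console_errors": [], "failed_requests": []}

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Record console errors and requests that never completed.
        page.on("console", lambda msg: evidence["console_errors"].append(msg.text)
                if msg.type == "error" else None)
        page.on("requestfailed", lambda req: evidence["failed_requests"].append(req.url))

        page.goto(url, wait_until="networkidle")
        browser.close()

    return evidence


if __name__ == "__main__":
    print(collect_page_evidence("https://example.com"))
```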
There are some downsides as well - the tool only works with Operative's internal credits. Every chat completion consumes one credit, and even a basic test can easily take 20-30 requests. So you'll need another subscription for the tool to work.
But the main limitation is that it's very basic in functionality - it only provides a thin layer on top of Browser Use without any additional QA tooling.
As with Vibetest Use, this tool would require significant adaptation for effective use: eliminating vendor lock-in, enriching the context with necessary business logic, and formalising and structuring the output.
AI-QA Startups
None of the tools mentioned above really works out of the box, partly because they are side projects of their creators, developed to showcase the technology or as experiments. On the other hand, there are a large number of startups solving exactly this problem - AI-assisted testing - such as TestSprite and Thunder Code, which come up below.
Most of these tools go far beyond simple MCP servers—they're full-featured testing platforms similar to BrowserStack and SauceLabs. Unlike basic test generators, these platforms handle the entire testing lifecycle: they generate, store, and execute automated tests. What sets them apart from previous generations is their deep AI integration, which unlocks new capabilities:
- Auto-generation of tests from prompts and product requirements documents (PRDs)
- AI-powered test steps, where all logic is described in natural language
- Testing personas that simulate different user behaviors and scenarios
- Auto-healing tests that adapt and fix themselves when applications change (a conceptual sketch follows the list)
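Auto-healing in particular is easy to picture: try the stored selector, and when it no longer matches, fall back to an LLM (or a Browser Use-style agent) that re-locates the element by its description and updates the stored selector. A conceptual sketch with a hypothetical `relocate_with_llm` helper; this is not the API of any specific platform.

```python
# Conceptual auto-healing locator: fall back to an LLM when a stored selector breaks.
# `relocate_with_llm` is a hypothetical helper, not part of any real platform's API.
from playwright.sync_api import Page


def relocate_with_llm(page: Page, description: str) -> str:
    """Hypothetical: ask an LLM to produce a new selector for the described element."""
    raise NotImplementedError


def healed_click(page: Page, selector: str, description: str) -> str:
    """Click the element; if the stored selector is gone, heal it and return the new one."""
    if page.locator(selector).count() == 0:  # stored selector no longer matches
        selector = relocate_with_llm(page, description)
    page.locator(selector).click()
    return selector                          # caller persists the healed selector
```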
The typical workflow starts with AI-assisted analysis of product requirements, followed by test case creation through the platform's interface—whether generated automatically or built manually by the user.
The resulting tests are presented in two distinct formats: either as generated code (like in TestSprite) or abstracted through UI-based test steps (Thunder Code). For code-based approaches, users edit test steps directly through the platform's built-in code editor. With UI-abstracted formats, modifications happen conversationally through an AI assistant chat interface.
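To make the code-based format concrete: such generated tests usually look like ordinary Playwright/pytest scripts. The example below is purely illustrative (placeholder URL, labels, and credentials), not actual TestSprite output.

```python
# Illustrative example of what a code-based generated test tends to look like
# (an ordinary pytest + Playwright script); not actual output of any platform.
from playwright.sync_api import Page, expect


def test_login_with_valid_credentials(page: Page) -> None:
    page.goto("https://app.example.com/login")  # placeholder URL
    page.get_by_label("Email").fill("user@example.com")
    page.get_by_label("Password").fill("correct-horse")
    page.get_by_role("button", name="Sign in").click()
    expect(page.get_by_role("heading", name="Dashboard")).to_be_visible()
```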
The value of these platforms lies primarily in adapting LLMs to QA needs. They can already noticeably reduce the time spent on developing and maintaining automated tests. However, their problems are also obvious: vendor lock-in, high costs, and difficulties integrating into development workflows and AI-assisted coding.
Conclusions
Despite the clear demand for AI-powered quality assurance solutions, none of the current tools meet all four outlined requirements. OpenAI's Testing Agent is too basic and inflexible, while open-source tools like Vibetest Use and Web Eval Agent built on Browser Use lack the stability and control needed for production workflows. Even the dedicated AI-QA startups, though more comprehensive, suffer from vendor lock-in and integration challenges that limit their utility in AI-assisted coding environments.
However, the foundation exists. Browser Use's text-based protocol works well, and it's already possible to build custom solutions on top of these tools to meet specific requirements.