I've been running the same vibe coding benchmark across multiple AI models: generate Tetris, Snake, Pac-Man, and Space Invaders from scratch using only high-level prompts inside VS Code. No hand-crafted specs. No corrective micromanagement. Just describe the game and let the model figure it out.
I did this with Claude Sonnet 4.5 in GitHub Copilot. Impressive. I did it with GPT-5.3 Codex. Not impressive. Now I've done it with GPT-5.4.
Here's what actually happened.
Watch Full Video
https://youtu.be/qQNupv9UYkU?embedable=true
Setup and Constraints
- **Environment:** VS Code with the Codex extension for GPT-5.4 access
- **Project scaffold:** .NET Razor Pages (pre-created — no framework setup required during the session)
- **Methodology:** Vibe coding. Prompts stay high-level, with no line-by-line debugging. If the model can't figure out the implementation, that's the data point.
The Razor Pages scaffold matters for one reason: it lets me run the code immediately without dependency hell. The model has to produce working, executable code — not a web-rendered demo that only lives in a browser sandbox.
That distinction is more important than it sounds. I'll come back to it.
Results: Game by Game
Tetris
GPT-5.4 produced working Tetris on the first attempt. The game logic was sound. The "next block" preview was present — something GPT-5.3 Codex notably missed.
The problem: the layout was built for reading, not playing. The game was styled like a blog post. You had to scroll to see the full board. One follow-up prompt fixed it, but it shouldn't have needed one.
Score: Logic ✅ | Layout ❌ (one-prompt fix) | First-try completeness: 80%
Pac-Man
This is where the comparison gets interesting. GPT-5.3 Codex produced something that functionally wasn't Pac-Man. Wrong mechanics, missing core features, broken structure. A complete miss.
GPT-5.4 is measurably better. The maze renders. Pac-Man moves. There's a ghost gang.
But:
- Ghosts don't chase. They move, they don't pursue.
- Bottom boundary detection is broken.
- Visually, this looks like a late-90s school project. Technically functional. Not Pac-Man.
If we're being precise, this is still a failed implementation of Pac-Man. It's a better failed implementation than 5.3 produced, but the game is incomplete.
Score: Recognizable ✅ | Correct ghost AI ❌ | Boundary detection ❌ | First-try completeness: 45%
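For context on what was missing: even a naive chase behavior is only a few lines of logic. Each tick, the ghost picks the legal neighbor tile that most reduces its Manhattan distance to Pac-Man. A minimal illustrative sketch (hypothetical names and grid representation, not GPT-5.4's actual output — and note the real arcade ghosts each use more distinctive targeting rules):

```python
# Naive "chase" ghost step: greedily move to the open neighbor tile
# closest (by Manhattan distance) to Pac-Man. Illustrative only.

def chase_step(ghost, pacman, walls):
    gx, gy = ghost
    px, py = pacman
    moves = [(gx + dx, gy + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
    legal = [m for m in moves if m not in walls]
    if not legal:
        return ghost  # boxed in: stay put
    # Pick the move that minimizes distance to Pac-Man.
    return min(legal, key=lambda m: abs(m[0] - px) + abs(m[1] - py))
```

Ghosts that merely wander, as GPT-5.4's did, skip exactly this selection step.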
Space Invaders
Works. Nothing interesting to report. The game is simple enough that the model handles it without incident.
Score: ✅
Snake
Also works. Snake is essentially a linked list with collision detection. Not a meaningful test of model capability at this level.
Score: ✅
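That claim is easy to check. The entire core loop fits in a double-ended queue (standing in for the linked list) plus a collision test: push a new head each tick, pop the tail unless food was eaten. A minimal illustrative sketch with hypothetical names:

```python
from collections import deque

# Core Snake tick: body is a deque of (x, y) cells, head first.
# Collision = new head leaves the board or hits the body. Illustrative only.

def step(body, direction, food, width, height):
    hx, hy = body[0]
    dx, dy = direction
    head = (hx + dx, hy + dy)
    grew = head == food
    # Unless the snake grows, the tail cell vacates this tick,
    # so it doesn't count for self-collision.
    occupied = list(body) if grew else list(body)[:-1]
    dead = (not (0 <= head[0] < width and 0 <= head[1] < height)
            or head in occupied)
    if dead:
        return body, False
    body.appendleft(head)
    if not grew:
        body.pop()
    return body, True
```

Everything else (input handling, rendering, food placement) is plumbing, which is why Snake says little about model capability.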
The Time Problem
GPT-5.4 is slow. The full session — four games, layout fix, review — took approximately 90 minutes.
That's relevant if you're evaluating this for professional use. Rapid prototyping this is not.
Why Web Demos Look Different (And Why That Matters)
Here's something worth addressing directly: if you've seen AI coding demos on YouTube or Twitter, the results usually look incredible. Games that work, polish that surprises, UIs that feel real.
Those demos are almost always done in web chat interfaces. There's a reason for that.
When you ask ChatGPT or Claude.ai to "make a Pac-Man clone," the model is drawing on an extremely optimized pattern. Millions of users have asked similar things. The output has been refined — implicitly, through RLHF and data — toward what people find impressive in a browser context.
This experiment asked for something different: source code that a developer runs locally on their own machine. The model has to generate an actual project structure, handle runtime compatibility, and produce something that executes outside a sandboxed environment.
That's harder. The results reflect it.
Web demo quality and developer-tool quality are not the same metric. Evaluating AI coding assistants only through web demos is like benchmarking a database on a single SELECT query.
GPT-5.4 vs. GPT-5.3 Codex vs. Claude Sonnet 4.5
Here's the blunt ranking for this specific use case:
| Model | UI Quality | Game Logic | First-Try Completeness | Speed |
|---|---|---|---|---|
| Claude Sonnet 4.5 (Copilot) | High | Strong | High | Fast |
| GPT-5.4 | Mediocre | Partial | Moderate | Slow (~90 min) |
| GPT-5.3 Codex | Poor | Weak | Low | — |
GPT-5.4 is a genuine improvement over 5.3. The Pac-Man delta alone proves that — going from "wrong game" to "broken implementation of the right game" is real progress. But it's still not close to what Claude Sonnet 4.5 delivers in Copilot on this benchmark.
What This Tells Us
1. Model version alone doesn't predict developer experience. GPT-5.4 is a newer model. It's not the best tool for this workflow. Recency ≠ quality for specific tasks.
2. Vibe coding still needs a quality baseline. The "high-level prompt only" methodology is a legitimate way to evaluate AI assistants. If a model can't implement Pac-Man ghost AI without explicit instruction, that's information. Developers shouldn't have to spec out pathfinding to get working ghost behavior.
3. The in-VS-Code experience is not the web chat experience. This keeps coming up in evaluations, and people keep ignoring it. If you're choosing a coding assistant, test it in your actual environment, not in a browser demo.
Bottom Line
GPT-5.4 is better than GPT-5.3 Codex for vibe coding arcade games in VS Code. It is not impressive. It is not the best tool for this job. It took 90 minutes and still produced an incomplete Pac-Man.
If you're doing serious vibe coding work inside VS Code, GitHub Copilot with Claude Sonnet 4.5 is the current benchmark to beat. The gap isn't close.
I'll keep running this challenge across new model releases. If GPT-5.4 improves or a new contender emerges, the results will speak for themselves.
Have you run similar benchmarks across models? What's your current go-to for in-editor AI coding? Drop it in the comments.
