I Built 10 Pokemon Agents That Learn From Each Other Using Kafka and Flink

Written by bdougie | Published 2026/03/20
Tech Story Tags: ai | pokemon | apache-kafka | pokemon-agents | kafka | flink | deepmind | data-flow

TL;DR: The agent runs headless inside a stereOS VM with no display server. It reads game memory directly from known RAM addresses.

I built an agent to play Pokemon, taking inspiration from ClaudePlaysPokemon and the AlphaEvolve paper from DeepMind. My initial approach was just brute force in a single-threaded loop, but I quickly started exploring spinning up 10 agents at a time to speedrun and then learn. Each subsequent run gets slightly better.

Data Flow:

Agent → Tapes Proxy → Kafka (agent.telemetry.raw)
                          ├→ telemetry-consumer (JSONL)
                          ├→ Flink SQL (anomaly detection)
                          │    └→ Kafka (agent.telemetry.alerts)
                          └→ DuckDB (ad-hoc queries)
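
To make the first hop concrete, here is a sketch of producing an event onto agent.telemetry.raw, assuming the kafka-python client and a local broker. The event shape and field names are invented for illustration; the project's real schema isn't shown in this post.

```python
import json
import time
import uuid

def make_telemetry_event(agent_id, turn, action, input_tokens, output_tokens):
    """Build one telemetry record. Field names are illustrative,
    not the project's actual schema."""
    return {
        "event_id": str(uuid.uuid4()),
        "agent_id": agent_id,
        "turn": turn,
        "action": action,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "ts": time.time(),
    }

def publish(event):
    # Requires `pip install kafka-python` and a broker on localhost:9092.
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("agent.telemetry.raw", value=event)
    producer.flush()
```

Downstream consumers (the JSONL sink, Flink, DuckDB) all read the same topic, which is what makes the fan-out in the diagram cheap.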

The setup I ended up with uses Kafka and Flink for the reinforcement loop, dumping to JSONL and DuckDB for a quick-and-dirty observational memory.

stereOS VM (/workspace)
┌──────────────────────────────────────────────────┐
│                                                  │
│  PyBoy (headless, window="null")                 │
│    ↓ memory addresses                            │
│  MemoryReader → BattleState / OverworldState     │
│    ↓                                             │
│  Strategy Engine (heuristic or LLM)              │
│    ↓ button inputs                               │
│  GameController → PyBoy                          │
│                                                  │
│  Tapes ← proxies LLM API calls, records sessions │
│                                                  │
└──────────────────────────────────────────────────┘
  ↕ shared mount (./ ↔ /workspace)
Host: frames/  .tapes/  pokedex/

The agent runs headless inside a stereOS VM. PyBoy emulates Pokemon Red with no display server. It reads game memory directly from known RAM addresses (battle state, HP, moves, map position, badges) and picks actions through a strategy engine. No screenshots, no vision model. Just memory reads and button inputs. The emulator runs roughly 100x faster than real time with rendering disabled.
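
As a rough sketch of the memory-read approach: the addresses below are commonly cited Pokemon Red RAM locations from community disassemblies, not values verified against this project, and the `read` callable stands in for the emulator's memory access (something like `lambda a: pyboy.memory[a]` with PyBoy v2).

```python
from dataclasses import dataclass

# Commonly cited Pokemon Red RAM addresses (community disassemblies);
# treat these as illustrative.
ADDR_MAP_ID  = 0xD35E  # current map
ADDR_Y_COORD = 0xD361  # player Y on map
ADDR_X_COORD = 0xD362  # player X on map
ADDR_BADGES  = 0xD356  # badge bitmask
ADDR_MON1_HP = 0xD16C  # lead party mon current HP (big-endian, 2 bytes)

@dataclass
class OverworldState:
    map_id: int
    x: int
    y: int
    badges: int
    lead_hp: int

class MemoryReader:
    """Decodes game state from raw byte reads. `read` is any
    addr -> byte callable, so it works against any emulator backend."""
    def __init__(self, read):
        self.read = read

    def overworld(self) -> OverworldState:
        hp = (self.read(ADDR_MON1_HP) << 8) | self.read(ADDR_MON1_HP + 1)
        badges = bin(self.read(ADDR_BADGES)).count("1")
        return OverworldState(
            map_id=self.read(ADDR_MAP_ID),
            x=self.read(ADDR_X_COORD),
            y=self.read(ADDR_Y_COORD),
            badges=badges,
            lead_hp=hp,
        )
```

Because the reader only depends on a byte-read function, you can unit test it against a dict of fake memory without booting the emulator at all.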

I ran the strategy engine at three tiers. Low is pure heuristics: true/false, move left, up, down, right. Medium introduces the LLM to reason based on what the agent is looking at. High is full LLM-powered reasoning with collective context across sessions. I started at low, moved to medium with Claude Code, then went full high to get past the hard parts.
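
A minimal sketch of that tier split, with invented function names and a deliberately dumb low-tier heuristic; the real engine is more involved:

```python
from enum import Enum

class Tier(Enum):
    LOW = "heuristic"        # pure rules: movement plus A-mashing
    MEDIUM = "llm_local"     # LLM reasons over the current state only
    HIGH = "llm_collective"  # LLM plus cross-session Pokedex context

def choose_action(tier, state, llm=None, pokedex=None):
    """Illustrative dispatch. Returns a button name like 'a' or 'up'."""
    if tier is Tier.LOW:
        # Dumb but fast: mash A in dialogue, otherwise walk a fixed pattern.
        return "a" if state.get("in_dialogue") else "up"
    prompt = f"State: {state}"
    if tier is Tier.HIGH and pokedex is not None:
        prompt += f"\nLessons from past runs: {pokedex}"
    return llm(prompt)  # the LLM is expected to answer with a button name
```

The useful property is that all three tiers share one interface, so a run can start cheap and escalate only when the heuristics stall.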

The hard parts were not what I expected. The first six Pokedex logs are just me discovering game mechanics that are invisible to an agent. There is an 8-second cooldown on every door transition. The game enforces this so you do not accidentally walk back through a door you just entered. The agent had no idea. It would walk through the door to leave the bedroom, immediately try to move, hit the cooldown, and interpret the lack of movement as being stuck. Hundreds of turns burned on this before I figured it out.
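
The fix generalizes to a small state machine: suppress stuck detection while a door cooldown is active. A sketch, with invented names and thresholds (only the 8-second figure comes from the runs above):

```python
import time

class StuckDetector:
    """Distinguishes 'actually stuck' from 'waiting out the door cooldown'."""
    DOOR_COOLDOWN_S = 8.0  # the cooldown discovered in the Pokedex logs

    def __init__(self, stuck_threshold=5):
        self.stuck_threshold = stuck_threshold
        self.last_pos = None
        self.no_move_count = 0
        self.last_door_ts = float("-inf")

    def on_door_transition(self, now=None):
        self.last_door_ts = now if now is not None else time.time()

    def update(self, pos, now=None):
        """Returns True only when the agent is stuck for real."""
        now = now if now is not None else time.time()
        if pos == self.last_pos:
            self.no_move_count += 1
        else:
            self.no_move_count = 0
        self.last_pos = pos
        in_cooldown = (now - self.last_door_ts) < self.DOOR_COOLDOWN_S
        return self.no_move_count >= self.stuck_threshold and not in_cooldown
```

With this shape, the hundreds of wasted turns become a single suppressed alert during the cooldown window.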

Getting outside the house was log two. Getting past Professor Oak was worse. After you pick a starter Pokemon, you have to talk to the professor. He gives you introductions. You have to physically walk around him in the lab. That navigation is clunky even for humans. For an agent reading memory addresses, it was a wall. The other thing that changes at that point in the game is dialogue choices. Up until the Pokemon selection, you can just mash A through every text blob and keep moving. But when you pick a Pokemon, you have to say yes or no. The agent was defaulting to nothing. It was not selecting either option. All of this nuance had to be discovered through trial and error, logged in the Pokedex, and fed back into the next run.

Talking to NPCs turned out to be critical. Without NPC conversations, the agent just wanders the map aimlessly. The context clues for what to do next, where to go, what items to find, all of that comes from dialogue. The agent needed to learn that talking to people is not optional. It is how you unlock the next objective.

The evolution loop is where it gets interesting. Inspired by AlphaEvolve, the agent treats its navigation parameters (stuck threshold, door cooldown, waypoint skip distance, axis preference) as a genome. Each generation, 10 variants run in parallel. A fitness function scores them. The winner survives. But before AlphaEvolve kicks in, I run a Factorial Learning Environment pass. FLE works like this: if you take two steps forward and get blocked, take three steps back and reevaluate where to go next. The agent wanders through the dark, figures out what works, and writes the successful route to JSON inside the Pokedex. The next time it runs, it does not go through the mess of rediscovering where the door is. It already knows. That is how you bootstrap into AlphaEvolve. FLE builds the map. AlphaEvolve optimizes the route.
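
Here is roughly what an FLE-style pass could look like as code: a backtracking walk over a tile grid that records the first route that works and saves it to JSON for the next run. Everything here (the grid model, the function names) is an invented illustration of the idea, not the project's implementation.

```python
import json

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def explore(start, goal, passable, max_steps=10_000):
    """Advance until blocked, then step back and reevaluate from there.
    Returns the successful button route, or None if boxed in."""
    pos, route, visited = start, [], {start}
    for _ in range(max_steps):
        if pos == goal:
            return route
        for name, (dx, dy) in MOVES.items():
            nxt = (pos[0] + dx, pos[1] + dy)
            if passable(nxt) and nxt not in visited:
                visited.add(nxt)
                route.append(name)
                pos = nxt
                break
        else:
            if not route:
                return None  # nowhere left to try
            # Blocked: retreat one step and try a different direction.
            dx, dy = MOVES[route.pop()]
            pos = (pos[0] - dx, pos[1] - dy)
    return None

def save_route(path, name, route):
    # The next run loads this instead of rediscovering where the door is.
    with open(path, "w") as f:
        json.dump({name: route}, f)
```

The saved JSON is the bootstrap: FLE pays the exploration cost once, AlphaEvolve then only has to optimize a route that already exists.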

Cold start: 1/10 generations improved. With historical telemetry: up to 4/10.

| Run | Historical entries | Gens improved | Final score |
|-----|-------------------|---------------|-------------|
| 1   | 0                 | 1/10          | 39,415      |
| 2   | 10                | 3/10          | 12,836      |
| 3   | 20+               | 3/10          | 17,319      |
| 4   | 30+               | 4/10          | 39,423      |

A cold start run improved in only 1 out of 10 generations. With accumulated historical telemetry from previous runs, that jumped to 3 or 4 out of 10. The best run (Run 4) explored for 7 generations before finding a breakthrough at Gen 8 by touching parameters no previous run had tried. The historical observer showed the standard parameter space was exhausted, which pushed the LLM to explore new dimensions.
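
The generation loop itself can be sketched in a few lines. The four genome parameters are the ones named above; the mutation scheme, value ranges, and fitness function are placeholders for illustration.

```python
import random

# The article's four navigation parameters as a genome.
BASE_GENOME = {
    "stuck_threshold": 5,
    "door_cooldown": 8.0,
    "waypoint_skip_distance": 3,
    "axis_preference": 0.5,  # 0 = prefer x moves, 1 = prefer y moves
}

def mutate(genome, rng):
    """Jitter one parameter by up to +/-30% (invented scheme)."""
    child = dict(genome)
    key = rng.choice(list(child))
    jitter = 1 + rng.uniform(-0.3, 0.3)
    child[key] = type(child[key])(child[key] * jitter)
    return child

def run_generation(best, fitness, rng, population=10):
    """Spawn `population` variants of the current best, keep the winner.
    Including `best` itself makes the score monotone (elitism)."""
    variants = [best] + [mutate(best, rng) for _ in range(population - 1)]
    scored = [(fitness(g), g) for g in variants]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[0]
```

In the real system the fitness function is where all the game knowledge lives: progress markers, badges, distance to the next objective.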

What makes this work across runs is the telemetry pipeline. Every conversation turn flows through a Tapes.dev proxy into Kafka. Flink SQL jobs run anomaly detection in real time: stuck loop detection (10+ repeated actions in 30 seconds) and token spike detection (input tokens exceeding 2x the rolling average). Alerts feed back into a SQLite database as nodes, which the observational memory system picks up on the next run.
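
The two detection rules are easy to state in Python even though the production jobs run as Flink SQL. The thresholds (10+ repeats in 30 seconds, 2x rolling average) match the ones above; the implementation is a sketch.

```python
from collections import deque

class StuckLoopRule:
    """Flags n+ identical actions within a sliding time window."""
    def __init__(self, n=10, window_s=30.0):
        self.n, self.window_s = n, window_s
        self.events = deque()  # (ts, action)

    def check(self, ts, action):
        self.events.append((ts, action))
        while self.events and ts - self.events[0][0] > self.window_s:
            self.events.popleft()
        return sum(1 for _, a in self.events if a == action) >= self.n

class TokenSpikeRule:
    """Flags input-token counts above `factor` times the rolling average."""
    def __init__(self, factor=2.0, history=50):
        self.factor = factor
        self.history = deque(maxlen=history)

    def check(self, input_tokens):
        # Compare against the average of *prior* turns, then record.
        avg = sum(self.history) / len(self.history) if self.history else None
        self.history.append(input_tokens)
        return avg is not None and input_tokens > self.factor * avg
```

Running the same logic as streaming SQL buys you the real-time alert topic without the agent having to self-monitor.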

  Agent Session
  ┌───────────┐
  │ Reads at  │◄──── observations.md (local, single-session)
  │ startup   │◄──── historical_insights.md (cross-session, from JSONL)
  └─────┬─────┘
        │
        ▼
  Agent runs, makes API calls through tapes.dev Proxy
        │
   ┌────▼──────────┐
   │ Tapes Proxy   │──────────── THE SERVICE BOUNDARY
   └──┬────────┬───┘
      │        │
 ┌────▼────┐ ┌─▼───────────────────┐
 │ SQLite  │ │ Kafka               │
 │ (local) │ │ agent.telemetry.raw │
 └────┬────┘ └──┬────────────┬─────┘
      │         │            │
      │    ┌────▼────┐   ┌───▼────────────┐
      │    │ Flink   │   │ JSONL sink     │
      │    │ (real-  │   │ data/telemetry │
      │    │  time)  │   └───┬────────────┘
      │    └────┬────┘       │
      │         │       ┌────▼─────────────────┐
 ┌────▼─────┐   │       │ Historical Observer  │
 │ Observer │   │       │ (DuckDB queries)     │
 │ (single  │   │       │                      │
 │ session) │   │       │ Cross-session        │
 └────┬─────┘   │       │ insights:            │
      │         │       │ - token trends       │
      │    ┌────▼────┐  │ - recurring failures │
      │    │ alerts  │  │ - efficiency deltas  │
      │    │ consumer│  └────┬─────────────────┘
      │    └────┬────┘       │
      │         │            │
 ┌────▼─────────▼────────────▼───┐
 │   observations.md             │  ◄── single-session
 │   historical_insights.md      │  ◄── cross-session
 └───────────────────────────────┘

The observational memory is the piece I am most interested in expanding on for the self-healing aspect. It reads the Tapes database after each run, extracts noteworthy events through heuristic pattern matching (errors, file creations, token usage anomalies), tags them by priority, and writes them to markdown files alongside the database. No LLM calls required for this step. The next generation of agents reads those observations before mutating parameters.
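
A sketch of that heuristic extraction step, with an invented event format and priority rules; the real system reads the Tapes database, but the shape is the same:

```python
# Heuristic pattern matching over recorded session events: no LLM calls.
# Event dicts and priority rules here are invented for illustration.
PATTERNS = [
    ("high",   lambda e: e.get("type") == "error"),
    ("medium", lambda e: e.get("type") == "file_created"),
    ("medium", lambda e: e.get("input_tokens", 0) > 2 * e.get("avg_tokens", float("inf"))),
]

def extract_observations(events):
    out = []
    for e in events:
        for priority, match in PATTERNS:
            if match(e):
                out.append((priority, e.get("summary", str(e))))
                break  # first matching rule wins
    return out

def to_markdown(observations):
    lines = ["# Observations", ""]
    for priority, summary in sorted(observations):
        lines.append(f"- **{priority}**: {summary}")
    return "\n".join(lines)
```

Because the step is pure pattern matching, it runs in milliseconds after every session and costs zero tokens.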

Once the agent picks up a Pokemon, battles start. I built a separate battle memory system for this. When the agent finds a Pokemon's weakness, it logs that in the Pokedex. The next time it encounters that Pokemon, it already knows the matchup. It can set up its team composition before the fight instead of discovering the weakness mid-battle. Losing a battle is the equivalent of a failed test. The agent goes back, iterates, and comes back with a better loadout.
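
A minimal version of that battle memory, with an invented file layout under pokedex/; the persistence format is mine, not the project's:

```python
import json
from pathlib import Path

class BattleMemory:
    """Remembers discovered weaknesses per species so the next
    encounter starts with the matchup already known."""
    def __init__(self, path="pokedex/battle_memory.json"):
        self.path = Path(path)
        self.weaknesses = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def record(self, species, move_type, effective):
        entry = self.weaknesses.setdefault(species, {"weak_to": [], "resists": []})
        bucket = "weak_to" if effective else "resists"
        if move_type not in entry[bucket]:
            entry[bucket].append(move_type)

    def best_known_type(self, species):
        # First discovered weakness, or None if this species is new.
        return (self.weaknesses.get(species, {}).get("weak_to") or [None])[0]

    def save(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.weaknesses, indent=2))
```

A lost battle just means `record()` ran with `effective=False`; the retry queries `best_known_type()` before building the loadout.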

There is a clear gap though. Every run eventually hits a plateau where the LLM proposes near-identical variants for multiple consecutive generations. The historical observer records the convergence but nothing acts on it yet. Run 4's breakthrough happened despite that gap, not because of a designed escape mechanism. Closing that loop, detecting convergence and injecting a diversification signal automatically, is the next step. That is the self-healing piece.
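
Purely speculative, but closing that loop could be as small as a distance check over recent generation winners plus a prompt amendment. All names and thresholds below are invented:

```python
def genome_distance(a, b):
    """Average normalized per-parameter distance between two genome dicts."""
    return sum(abs(a[k] - b[k]) / (abs(a[k]) + 1e-9) for k in a) / len(a)

def is_converged(recent_winners, eps=0.02):
    """True when consecutive winners are near-identical:
    the plateau signature described above."""
    pairs = zip(recent_winners, recent_winners[1:])
    return len(recent_winners) >= 3 and all(
        genome_distance(a, b) < eps for a, b in pairs
    )

def diversification_prompt(history_note):
    # The injected signal could be as simple as a prompt amendment
    # telling the LLM which dimensions are exhausted.
    return (
        "Recent generations converged on near-identical variants. "
        f"{history_note} Propose mutations along parameters no prior run has touched."
    )
```

The observer already has the winner history, so the detection side is nearly free; the open design question is when and how hard to inject the signal.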

The feedback loop (agent runs, telemetry persists, observer surfaces patterns, next run reads those patterns) applies well beyond games. What I am building is essentially a runbook, a runbook to speedrun Pokemon. If anybody needs to go beat Brock or Misty, the pathway is already mapped. The same pattern applies to large-scale refactors where each PR is a generation, to sprint telemetry revealing which modules have the highest revision rates, and to massive COBOL monoliths where you map every capability, every code path, and every bug, then use that map as the iteration loop for an agent to continuously refactor toward a target: 5,000-line files that need to be broken apart. The whole thing runs locally. The repo README goes into more detail.

https://github.com/papercomputeco/pokemon-kafka



Written by bdougie | Brian is the founder of the Paper Compute Company, which builds distributed systems primitives for AI agents.
Published by HackerNoon on 2026/03/20