I trained 1,500 parallel bots to play DDNet, a cooperative 2D platformer with grappling hooks, freeze mechanics, and maps that take humans hours to complete. This is what it looks like:

I grew up playing this game. Back in high school I watched SethBling’s MarI/O, a neural network learning to play Super Mario World, and immediately thought: what if I could do this for DDNet? Years later, Yosh’s RL agent teaching itself TrackMania brought the idea back. By that point I’d studied computer science and actually had the tools to try. So I did.

If you’re not familiar with reinforcement learning, Lilian Weng’s overview and OpenAI’s Spinning Up are great starting points. The short version: an agent learns by trial and error, no labeled data required.

The First Attempt (and Why It Failed)

DDNet is a C++ codebase with 40k+ commits dating back to 2007. There’s no Python API, no gym environment, no convenient env.step(). My first idea was to keep everything in C++ and run neural network inference directly in the game server using ONNX Runtime. No inter-process communication, no serialization, everything in one process.

I got it working. PPO training, frame-skip, reward shaping. Then the policy collapsed. The agent learned to output the same action regardless of what it observed. Debugging a neural network in C++ with limited tooling is not fun. After a few weeks of this I scrapped the approach.

The lesson: prototype fast, abandon fast.

The Architecture That Worked

The fix was splitting responsibilities. Keep inference in Python (better debugging, TensorBoard, fast iteration) but move the bots into the game server (no client rendering overhead).

I modified the DDNet server to run 127 headless bots per process and communicate with Python through /dev/shm (shared memory, no sockets, no serialization overhead). Why 127? The server supports 128 client slots without touching internal data structures. I kept one open so I could connect as a human observer to debug.

With 8-12 server processes running in parallel, that’s 1,000-1,500 bots training simultaneously.

How it works each tick

  1. The C++ server extracts observations (tile grid, player state, checkpoint info) and writes them to shared memory
  2. Python reads the observations, runs neural network inference, and writes actions back
  3. The server applies the actions, steps the physics, and calculates rewards
  4. Frame-skip of 3: the agent decides every 3rd tick, with rewards accumulating between decisions

Each server gets a single shared memory region with contiguous arrays of [actions | observations | rewards | dones]. Switching from an interleaved memory layout to this contiguous one gave a 2.4x speedup (single memcpy instead of N strided writes). The server pauses during inference to prevent stale action repeats, keeping the training signal clean.
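The contiguous layout and the pause-during-inference handshake can be sketched in Python. The sizes and names below are illustrative (the real region sizes depend on the server build), and a plain numpy buffer stands in for the actual /dev/shm mapping, but the view structure is the same: slices into one flat buffer, so a write to `actions` lands directly in shared memory with no extra copy.

```python
import numpy as np

# Hypothetical sizes -- the real values depend on the server build.
N_BOTS = 127
OBS_DIM = 1397          # tile grid + player state + checkpoint info
ACT_DIM = 5             # direction, jump, hook, aim angle, aim magnitude

# Stand-in for the /dev/shm region: one contiguous float32 buffer laid out
# as [actions | observations | rewards | dones].
shm = np.zeros(N_BOTS * (ACT_DIM + OBS_DIM + 2), dtype=np.float32)

a_end = N_BOTS * ACT_DIM
o_end = a_end + N_BOTS * OBS_DIM
r_end = o_end + N_BOTS

# All four arrays are views into the same buffer -- no serialization,
# and the server can read each section with a single memcpy.
actions = shm[:a_end].reshape(N_BOTS, ACT_DIM)
obs     = shm[a_end:o_end].reshape(N_BOTS, OBS_DIM)
rewards = shm[o_end:r_end]
dones   = shm[r_end:]

def policy(o):
    # Placeholder for neural network inference.
    return np.zeros((o.shape[0], ACT_DIM), dtype=np.float32)

def decision_step():
    """One agent decision: read observations, infer, write actions back.

    The server pauses its physics loop while this runs, then accumulates
    rewards over the frame-skip ticks before the next call."""
    actions[:] = policy(obs)   # in-place write = immediately visible
    return rewards.copy(), dones.copy()
```

The 2.4x speedup comes from the slicing above: each section is one contiguous run of memory, so copying it in or out of the game server is a single bulk copy rather than N per-bot strided writes.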

The shared memory implementation and C++/Python integration were new territory for me. Claude Code helped a lot here, especially for the tedious parts like getting the memory layout right and debugging cross-language data alignment issues. For a side project, having an AI assistant handle that kind of grunt work makes a real difference.

What the agent sees and does

The observation space:

  • 43x31 tile grid (float32), a local view centered on the player, about 21 tiles horizontally and 15 vertically in each direction
  • Player state: position, velocity, hook state, freeze status (15 dimensions, stacked 4 frames for temporal context)
  • Checkpoint navigation: next checkpoint position + progress (4 dimensions)
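Putting the pieces above together, a minimal sketch of how the flat observation vector could be assembled (the grouping into these three components is from the post; the flattening order is an assumption):

```python
import numpy as np

GRID_H, GRID_W = 31, 43        # local tile view around the player
PLAYER_DIM, STACK = 15, 4      # player state, stacked over 4 frames
CHECKPOINT_DIM = 4             # next checkpoint position + progress

def build_observation(tile_grid, player_history, checkpoint_info):
    """Flatten the three components into one float32 vector."""
    assert tile_grid.shape == (GRID_H, GRID_W)
    assert player_history.shape == (STACK, PLAYER_DIM)
    assert checkpoint_info.shape == (CHECKPOINT_DIM,)
    return np.concatenate([
        tile_grid.ravel(),
        player_history.ravel(),
        checkpoint_info,
    ]).astype(np.float32)

obs = build_observation(np.zeros((31, 43)), np.zeros((4, 15)), np.zeros(4))
# total: 31*43 + 4*15 + 4 = 1333 + 60 + 4 = 1397 dimensions
```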

The action space is hybrid:

  • Discrete: direction (left/right/none), jump (yes/no), hook (yes/no)
  • Continuous: aim angle + magnitude

This hybrid space is more natural for platformers than discretizing aim into angular buckets.
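One way to sample from such a hybrid space, sketched with numpy (the head layout and bounds here are assumptions, not the actual policy network): categorical/Bernoulli sampling for the discrete parts, a Gaussian for the continuous aim.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hybrid_action(logits_dir, logit_jump, logit_hook,
                         aim_mean, aim_log_std):
    """Sample one hybrid action from hypothetical policy-head outputs."""
    # Discrete: softmax over 3 directions, sigmoid-Bernoulli for jump/hook.
    p_dir = np.exp(logits_dir - logits_dir.max())
    p_dir /= p_dir.sum()
    direction = int(rng.choice(3, p=p_dir))        # 0=left, 1=none, 2=right
    jump = bool(rng.random() < 1 / (1 + np.exp(-logit_jump)))
    hook = bool(rng.random() < 1 / (1 + np.exp(-logit_hook)))
    # Continuous: Gaussian aim, clipped to angle in [-pi, pi], magnitude in [0, 1].
    aim = rng.normal(aim_mean, np.exp(aim_log_std))
    angle = float(np.clip(aim[0], -np.pi, np.pi))
    magnitude = float(np.clip(aim[1], 0.0, 1.0))
    return direction, jump, hook, angle, magnitude
```

The continuous Gaussian head is what makes this more natural than angular buckets: the policy can smoothly refine its aim instead of jumping between discrete angles.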

Observation space visualization

When the Agent Confidently Goes the Wrong Way

With the architecture working, I wrote a simple reward function: scan the map for checkpoint tiles, reward the agent for reaching them in order. Seemed reasonable.

One of the first training runs ended like this:

The agent learned to sprint confidently in the completely wrong direction. Some quick investigation in-game revealed why: there’s an unreachable teleporter way off the intended course, and the reward function was happily guiding the agent toward it.

Checkpoint weirdness

This is a DDNet-specific problem. The game has existed since 2007 and maps have been made for over 15 years. Every map author places checkpoints, teleporters, and freeze tiles differently. It’s the wild west. A reward function that assumes sequential, reachable checkpoints will break on maps that don’t follow that convention. And a lot of maps don’t.

The DDNet community of maintainers, modders, and mappers was very helpful here. Without people who actually understand the mapping ecosystem, I would have spent a lot longer figuring out why things were breaking.

Reward Engineering: Getting It Right

Version 1: Checkpoint rewards

The first real reward function scanned the map for time checkpoint tiles at startup:

  • +10 for reaching a new checkpoint
  • Continuous progress signal: reward proportional to distance-toward-next-checkpoint improvement (high water mark only, no penalty for backtracking)
  • Small velocity bonus toward the checkpoint direction
  • Penalties for death tiles, freeze, and time
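The four terms above combine into something like the sketch below. The coefficients are made up for illustration; the high-water-mark trick (only reward improving on the best distance so far) is the piece that prevents the agent from farming reward by oscillating back and forth.

```python
def checkpoint_reward(state, prev_best_dist, reached_new_cp, died, frozen):
    """Sketch of the v1 shaping; all coefficients are illustrative.

    `prev_best_dist` is the high-water-mark distance to the next
    checkpoint -- backtracking is never penalized, only non-improvement
    goes unrewarded."""
    r = 0.0
    if reached_new_cp:
        r += 10.0
    # Progress: only reward improving on the best distance so far.
    improvement = max(0.0, prev_best_dist - state["dist_to_next_cp"])
    r += 0.1 * improvement
    # Small bonus for velocity toward the checkpoint.
    r += 0.01 * max(0.0, state["vel_toward_cp"])
    if died:
        r -= 5.0
    if frozen:
        r -= 1.0     # this term turned out to be a mistake, see below
    r -= 0.001       # per-step time penalty
    new_best = min(prev_best_dist, state["dist_to_next_cp"])
    return r, new_best
```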

This worked okay on well-structured maps but fell apart on maps with teleporters, since checkpoint 5 might be spatially behind checkpoint 4. It also had a freeze penalty problem: on many DDNet maps you must get frozen to proceed. Penalizing freeze teaches the agent to avoid it at all costs, which means it never learns the actual route.

Version 2: Waypoint-based rewards

The fix was to stop relying on checkpoint positions and instead trace the actual path through the map.

I built two tools for this: a Huffman decompressor for DDNet’s ghost replay format (.gho files, which use a static frequency table baked into the engine) to extract waypoints from human replays, and a tile-level BFS pathfinder that respects teleporters and avoids death tiles.
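The pathfinder's core idea is that a teleporter is just another edge in the graph. A minimal sketch (with a toy two-character tile alphabet instead of DDNet's real tile set):

```python
from collections import deque

def bfs_path(grid, start, goal, teleports):
    """BFS over a tile grid. `grid[y][x]` is '.' for walkable, 'X' for a
    death tile. `teleports` maps a tile coordinate to its destination.
    Illustrative only -- real DDNet tiles are far richer than this."""
    h, w = len(grid), len(grid[0])
    prev = {start: None}
    q = deque([start])
    while q:
        pos = q.popleft()
        if pos == goal:
            path = []
            while pos is not None:     # walk the predecessor chain back
                path.append(pos)
                pos = prev[pos]
            return path[::-1]
        x, y = pos
        neighbors = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        if pos in teleports:           # a teleporter exit is just an edge
            neighbors.append(teleports[pos])
        for nx, ny in neighbors:
            if 0 <= nx < w and 0 <= ny < h and grid[ny][nx] != 'X' \
                    and (nx, ny) not in prev:
                prev[(nx, ny)] = pos
                q.append((nx, ny))
    return None   # goal unreachable
```

Because teleporter destinations are ordinary neighbors, the resulting path naturally jumps "backward" in space when the map routes through a teleporter, which is exactly what the checkpoint-based reward could not handle.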

The reward became:

  • Progress along the waypoint path (high water mark, no backtrack penalty)
  • Distance-from-path penalty with a 128-unit grace zone (~4 tiles, enough slack for different movement styles)
  • +100 finish bonus
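A sketch of how those three terms could fit together (coefficients and the nearest-waypoint progress measure are my assumptions; the 128-unit grace zone and +100 finish bonus are from the description above):

```python
import numpy as np

def waypoint_reward(pos, waypoints, best_idx, finished):
    """Sketch of the v2 reward; coefficients are illustrative.

    `best_idx` is the furthest waypoint index credited so far
    (high water mark, so backtracking is never penalized)."""
    pts = np.asarray(waypoints, dtype=float)
    dists = np.linalg.norm(pts - pos, axis=1)
    nearest = int(dists.argmin())
    r = 0.0
    if nearest > best_idx:                 # progress along the path only
        r += 1.0 * (nearest - best_idx)
        best_idx = nearest
    grace = 128.0                          # ~4 tiles of slack off-path
    off_path = max(0.0, float(dists.min()) - grace)
    r -= 0.01 * off_path
    if finished:
        r += 100.0
    return r, best_idx
```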

This handled teleporters correctly because the waypoint path encodes the actual route, not the spatial layout.

Checkpoint Respawning: Don’t Always Start From Scratch

Early in training the agent dies within the first few seconds. Later it can reach checkpoint 10 but still restarts from checkpoint 0 every episode. Most training time is spent re-doing sections the agent has already mastered.

The fix: adaptive checkpoint respawning.

  • Track a rolling buffer of episode outcomes (highest checkpoint reached)
  • On reset, sample a spawn position from the top-performing episodes
  • 20% of resets still start from the beginning to maintain full-map skill
  • Spawns are capped at a “frontier”, the highest checkpoint proven reachable from the start, so the agent can’t spawn ahead of its actual capability
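The four rules above can be sketched as a small class (names, buffer size, and the top-quartile sampling rule are my assumptions; the 20% from-start probability and the frontier cap are from the list above):

```python
import random
from collections import deque

class AdaptiveRespawn:
    """Sketch of adaptive checkpoint respawning; details are illustrative."""

    def __init__(self, checkpoints, buffer_size=100, from_start_prob=0.2):
        self.checkpoints = checkpoints              # spawn position per checkpoint
        self.outcomes = deque(maxlen=buffer_size)   # rolling buffer: highest cp per episode
        self.frontier = 0                           # highest cp proven reachable from start
        self.from_start_prob = from_start_prob

    def record_episode(self, highest_cp, started_from_cp0):
        self.outcomes.append(highest_cp)
        # Only from-start episodes can advance the frontier, so the agent
        # never gets credit for progress from a spawn it was handed.
        if started_from_cp0:
            self.frontier = max(self.frontier, highest_cp)

    def sample_spawn(self):
        # Some resets always start from the beginning to keep full-map skill.
        if not self.outcomes or random.random() < self.from_start_prob:
            return self.checkpoints[0]
        # Sample from the top quartile of recent outcomes, capped at the frontier.
        top = sorted(self.outcomes)[-max(1, len(self.outcomes) // 4):]
        cp = min(random.choice(top), self.frontier)
        return self.checkpoints[cp]
```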

This is similar to curriculum learning but driven by the agent’s own performance data rather than a manual schedule.

Tooling

A few things that made this project possible:

  • TensorBoard: essential for spotting policy collapse, reward plateaus, and entropy death
  • Ghost parser: full DDNet Huffman decompression in Python for extracting human demonstrations
  • Map pathfinder: BFS on the tile grid with teleporter and freeze-tile awareness
  • Waypoint visualizer (visualize_waypoints.py): verify that generated paths make sense before burning 35 minutes on a training run

What’s Next

Curriculum learning across maps. Right now every agent is trained on a single map. The goal is to start on easy maps and gradually introduce harder ones, so the agent builds transferable skills instead of memorizing one route.

(Way) longer training. 35-minute runs were enough to validate the approach and tune hyperparameters, but they barely scratch the surface. The interesting question is what happens when you let it run for hours or days.

Multi-agent cooperation. This is DDNet’s signature feature. Many maps require two players to coordinate (one freezes, the other rescues), and that’s a whole new level of difficulty for RL.


References