In 2022 I was tasked with designing an interactive robot system that takes a natural language instruction and outputs a working robot trajectory. This is a write-up of the design decisions and a showcase of a web demo built with MuJoCo. The code in this demo is a rebuild of the original.

"Prompt: Put all the cubes on the plate"

How do you make robots listen to what a user says? Human-Robot Interaction (HRI) has come a long way over the last couple of years, and Vision-Language-Action models are the de facto state of the art in robotics research. PhAIL, a leaderboard aimed at measuring real-world performance against humans, makes it clear that general physical AI is not there yet.

Industry often still relies on hand-crafted solutions. In 2022, I was tasked with building a natural language interface for an industrial robot. Inspired by a Google paper called Code as Policies, I started designing a prototype. Try the demo here, or look at the source code here.

The architecture

Current Vision-Language-Action models reason over vision and language. In this system, only text is used. What stood out to me after initial testing was the unreasonable effectiveness of solving tasks in 3D space from a text representation of the scene alone.
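The scene-as-text idea can be sketched like this. The scene table and the prompt format are assumptions for illustration, not the demo's actual format:

```python
# Hypothetical internal scene state: names -> positions (meters, robot base frame).
SCENE = {
    "red_cube": (0.40, 0.10, 0.02),
    "blue_cube": (0.35, -0.05, 0.02),
    "plate": (0.55, 0.00, 0.01),
}

def scene_to_text(scene: dict) -> str:
    """Build the textual scene description the LLM is prompted with.

    Only the symbolic names are exposed; the coordinates above stay on
    the program side, as described below."""
    return "Objects in the scene: " + ", ".join(sorted(scene))

# scene_to_text(SCENE) -> "Objects in the scene: blue_cube, plate, red_cube"
```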

The system is a pipeline: scene extraction → LLM → resolver → waypoints. The LLM never sees coordinates. Instead of risking hallucinations, we hide coordinates behind symbolic representations of objects (i.e. their names). Instead of move_to_point(5.221, 471.26, 1.95), the LLM only has to call move_to_point("Apple"). The resolver handles the name-to-position mapping.
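The indirection itself is tiny. A minimal sketch, assuming a name-to-position table held by the resolver (the table and the return value are illustrative):

```python
# Hypothetical lookup table maintained by scene extraction.
POSITIONS = {"Apple": (5.221, 471.26, 1.95)}

def move_to_point(object_name: str) -> tuple:
    """Resolve a symbolic call like move_to_point("Apple").

    The LLM only ever emits the name; the lookup happens here, so a
    hallucinated coordinate can never reach the robot. A misspelled
    name fails loudly (KeyError) instead of moving somewhere wrong."""
    return POSITIONS[object_name]
```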

Architecture overview

How the LLM produces a plan

Unlike a VLA, an LLM doesn’t output motor commands. During the design stage, we had a couple of options to consider:

Free-form code generation:

This is what Code as Policies did. The authors gave the model a maximum amount of freedom by having it generate valid Python code and then calling exec() on it. This is the most expressive option, as the model can use all control structures (if/else, loops, …), but it is unsafe not only from a cybersecurity perspective but also from a physical one: typing a wrong coordinate into a Kuka KR 100 P will (and did) punch holes into concrete.

Tool calling / slot filling

Slot filling gives the LLM templates and forces it to correctly parameterize the empty slots. If the slots are the parameters of movement primitives (like move_to_point()), valid output is guaranteed. The downside is that we only get flat sequences, without the control structures that Python provides.
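For the demo's prompt above, a slot-filled plan might look like this flat sequence (the call names and parameters are illustrative, not the demo's exact schema):

```python
# A flat tool-call plan: each entry fills the slots of one primitive.
plan = [
    {"name": "pick", "params": {"object_name": "red_cube"}},
    {"name": "place", "params": {"object_name": "plate"}},
    {"name": "pick", "params": {"object_name": "blue_cube"}},
    {"name": "place", "params": {"object_name": "plate"}},
]
# Note there is no loop: "all the cubes" gets unrolled by the LLM itself,
# which is exactly the expressiveness we give up with slot filling.
```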

Constrained decoding

Constrained decoding restricts the sampled tokens so that the model's output is guaranteed to conform to a context-free grammar (see Outlines). This would combine the expressiveness of code with tool-calling-level correctness.
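A grammar for such plans might look like this sketch (illustrative only, not a grammar Outlines was actually given):

```ebnf
plan      = statement , { statement } ;
statement = call | loop ;
loop      = "for" , name , "in" , call , ":" , statement , { statement } ;
call      = name , "(" , [ arg , { "," , arg } ] , ")" ;
arg       = string | name ;
```

Every token the model samples must keep the output inside this grammar, so loops become expressible while syntactically invalid plans remain impossible.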

Constrained decoding is very interesting, but I chose tool calling as a practical middle ground for a minimal web demo.

Primitives and composability

Primitive hierarchy

We define two types of movement primitives the LLM can reason over: atomic ones like move_to_point or open_gripper, and higher-order primitives like pick and place that are composed of atomic ones. Waypoints are resolved programmatically and hidden from the LLM.

Primitive expansion

Primitives are added manually by a developer. Since each method just returns a list of waypoints, new methods can be composed of existing ones. Each new primitive is fully interoperable with all other primitives, leading to a combinatorial action space growth for each new method.
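Composition can be sketched as plain list concatenation. The primitive bodies here are simplified assumptions (dict waypoints instead of the demo's Waypoint type):

```python
Waypoint = dict  # simplified stand-in for the demo's Waypoint type

def open_gripper(scene) -> list:
    return [{"gripper": "open"}]

def close_gripper(scene) -> list:
    return [{"gripper": "close"}]

def move_to_point(scene, name) -> list:
    return [{"position": scene[name]}]

def pick(scene, name) -> list:
    # Higher-order primitive: just the concatenation of atomic ones.
    return move_to_point(scene, name) + close_gripper(scene)

def pick_and_place(scene, obj, target) -> list:
    # A new primitive composes freely with everything that already exists,
    # which is what makes the action space grow combinatorially.
    return (open_gripper(scene) + pick(scene, obj)
            + move_to_point(scene, target) + open_gripper(scene))
```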

One function = one primitive

In Python, adding a new primitive is a single function definition. The @primitive decorator automatically registers the tool for LLM inference, and introspection allows for implicit prompt-building by reading the method signature (including docstrings).

@primitive
def pick(scene: Scene, object_name: str) -> list[Waypoint]:
    """Pick up an object. Approaches from above, grasps, and lifts."""
    obj = scene.get_object(object_name)
    return [
        Waypoint(position=obj.position.offset(z=0.10), label="approach"),
        Waypoint(gripper="open"),
        Waypoint(position=obj.position, label="lower"),
        Waypoint(gripper="close"),
        Waypoint(position=obj.position.offset(z=0.15), label="lift"),
    ]
The same primitive in the demo's TypeScript:

const pick: PrimitiveDef = {
  name: "pick",
  description: "Pick up an object. Approaches from above, grasps, and lifts.",
  parameters: {
    type: "object",
    properties: {
      object_name: { type: "string", description: "Name of the object to pick up" },
    },
    required: ["object_name"],
  },
  handler: (scene, params): Waypoint[] => {
    const name = params.object_name as string;
    const obj = scene.getObject(name);
    const above = offsetPosition(obj.position, 0, 0, 0.10);
    const lift = offsetPosition(obj.position, 0, 0, 0.15);
    return [
      { position: above, label: `approach_above(${name})` },
      { gripper: "open", label: "open_gripper" },
      { position: obj.position, label: `lower_to(${name})` },
      { gripper: "close", label: "close_gripper" },
      { position: lift, label: `lift(${name})` },
    ];
  },
};
Both map to the same tool schema handed to the LLM:

{
  "type": "function",
  "function": {
    "name": "pick",
    "description": "Pick up an object. Approaches from above, grasps, and lifts.",
    "parameters": {
      "type": "object",
      "properties": {
        "object_name": {
          "type": "string",
          "description": "Name of the object to pick up"
        }
      },
      "required": ["object_name"]
    }
  }
}
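The decorator itself can be sketched in a few lines. The registry name and schema shape are assumptions; only the introspection mechanism (signature plus docstring) mirrors what the post describes:

```python
import inspect

PRIMITIVE_REGISTRY = {}  # assumed name; maps primitive name -> tool entry

def primitive(fn):
    """Register a function as a movement primitive.

    The tool description is built by introspection: parameter names come
    from the signature (scene is injected by the resolver, so it is
    excluded), the description from the docstring."""
    sig = inspect.signature(fn)
    params = [p for p in sig.parameters if p != "scene"]
    PRIMITIVE_REGISTRY[fn.__name__] = {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
        "handler": fn,
    }
    return fn

@primitive
def open_gripper(scene):
    """Open the gripper."""
    return [{"gripper": "open"}]

@primitive
def pick(scene, object_name: str):
    """Pick up an object. Approaches from above, grasps, and lifts."""
    return []
```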

Resolving waypoints

The resolver produces a list of waypoints that any simulation backend can consume. Here we use MuJoCo, but the principle works everywhere.

def resolve(tool_calls: list[ToolCall], scene: Scene) -> Trajectory:
    trajectory = Trajectory()
    for call in tool_calls:
        handler = get_handler(call.name)
        waypoints = handler(scene, **call.params)  # each handler returns a list of waypoints
        trajectory.add(waypoints)
    return trajectory
function resolve(toolCalls: ToolCall[], scene: Scene): Trajectory {
  // flatMap, because each handler returns a list of waypoints
  return toolCalls.flatMap(call => {
    const prim = registry.get(call.name);
    return prim.handler(scene, call.params);
  });
}
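Backend-agnosticism falls out of the waypoint format: any consumer only needs to iterate the list. A minimal sketch, assuming dict waypoints and a backend interface that is not MuJoCo's actual API:

```python
def execute(trajectory, backend):
    """Step through a resolved trajectory on any backend.

    backend is any object with move_to(position) and set_gripper(state);
    MuJoCo, a real arm, or a plain logger all satisfy this interface."""
    for wp in trajectory:
        if wp.get("position") is not None:
            backend.move_to(wp["position"])
        if wp.get("gripper") is not None:
            backend.set_gripper(wp["gripper"])

class LogBackend:
    """Trivial backend that records commands instead of moving anything."""
    def __init__(self):
        self.log = []
    def move_to(self, position):
        self.log.append(("move", position))
    def set_gripper(self, state):
        self.log.append(("gripper", state))
```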

Potential improvements

Since 2022, the landscape of HRI has changed drastically. Between hooking up agentic frameworks like openclaw to physical robots and open-source vision-language models, this approach is no longer at the cutting edge.

What I find interesting is where it could go. Multi-turn planning (re-extracting the scene after each action and replanning) would handle tasks the current flat-sequence approach can’t. Constrained decoding, as mentioned above, is the most promising direction for getting code expressiveness without sacrificing correctness. Improving 3D reasoning via Vision-Language-Models and other sensor modalities would close the loop to the real world.
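The multi-turn idea would wrap the existing pipeline in a loop. In this sketch all five callables are placeholders for the stages described above, and the completion check `done` does not exist in the current system:

```python
def closed_loop(instruction, extract_scene, done, plan, execute, max_turns=10):
    """Sketch of multi-turn planning: re-extract the scene after every
    action and replan until the task looks finished (or we give up)."""
    scene = extract_scene()
    for _ in range(max_turns):
        if done(instruction, scene):       # e.g. LLM check or scene predicates
            break
        calls = plan(instruction, scene)   # LLM produces the next tool calls
        execute(calls, scene)              # resolver + backend as before
        scene = extract_scene()            # observe the result, then replan
    return scene
```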

What I really enjoy is watching this system solve a surprising number of tasks with just a few primitives and a fairly low-powered LLM.