The generalization gap in robot manipulation
Humanoid robots are surprisingly bad at picking things up. Watch a human reach for a mug in an office, then in a coffee shop, then in a home kitchen, and the task looks effortless. The hand adjusts. The eyes track. The body repositions. A humanoid robot, by contrast, can perform the same motion in one environment but fail completely in another.
The traditional approach to solving this seems obvious: collect lots of real robot videos performing manipulation tasks across many different environments. Let the robot learn from these demonstrations. But this strategy hits a wall so hard it's become the limiting factor in humanoid robotics. Each real-world trial is expensive. Hardware can break. The diversity required to cover different lighting, furniture heights, surface textures, and object types explodes the data collection burden. By the time you've gathered enough videos to show meaningful variation, you've invested months and thousands of dollars. And even then, show the robot a new coffee shop it hasn't seen, and performance collapses.
The result is that existing humanoid manipulation systems work reliably only in controlled laboratory settings. They're brittle. They don't generalize. This is the fundamental problem that HERO, a new system for loco-manipulation with humanoid robots, addresses head-on by rejecting the premise entirely.
Simulation plus vision models
Instead of trying to scale real-world data collection, what if the problem is framed differently? What if you use simulation for what it's genuinely good at, and large vision models for what they're good at?
Simulation is excellent at one thing: letting you practice control policies thousands of times without breaking hardware. A simulated robot can reach for the same object orientation 10,000 times with random variations, and each trial costs pennies. Meanwhile, large vision models trained on billions of internet images already understand object categories, spatial relationships, and scene context. They can look at an RGB-D image and reliably identify "that's a mug on a table, roughly at this position, roughly this orientation."
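The trial-randomization idea above can be sketched in a few lines. This is a minimal illustration, not HERO's actual setup: the parameter names and ranges are assumptions (the 43–92 cm surface range is borrowed from the table heights mentioned later in this article).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical domain-randomization loop: vary the grasp scenario across
# thousands of cheap simulated trials. All ranges are illustrative.
def sample_trial():
    return {
        "object_position": rng.uniform([-0.2, -0.2, 0.4], [0.2, 0.2, 1.0]),
        "object_yaw_rad": rng.uniform(-np.pi, np.pi),
        "surface_height_m": rng.uniform(0.43, 0.92),
    }

trials = [sample_trial() for _ in range(10_000)]
print(len(trials), "randomized reaching trials generated")
```

Each dictionary describes one simulated reach; on real hardware, generating this much variation would take months.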
The key insight in HERO is that these aren't competing approaches; they're complementary. Vision models handle scene understanding and generalization to new environments. Simulation-trained control policies handle the precision of execution. By separating these concerns, the system should inherit the generalization properties of large vision models while maintaining the precision that comes from intensive simulation training.
This is a genuine departure from the prevailing approach. Most humanoid manipulation work tries to solve perception and control jointly on whatever real-world data can be collected. HERO flips this by asking: why not use proven, large-scale vision systems for what they've already learned, and simulation for what it does best?
The kinematic drift problem
Here's where the paper gets at something roboticists understand but rarely discuss explicitly: the gap between where a robot thinks its hand is and where it actually is.
Every robot has a forward kinematic model: mathematics that predicts "if I move joint A by 30 degrees and joint B by 45 degrees, my hand ends up at position X." In reality, this model is never perfect. Friction in joints, cable stretch, servo lag, sensor noise, and material deformation all introduce error, and even small errors compound. Miss by 2 centimeters on one reach, and the next reach, which builds on the previous one, is now 4 centimeters off. After a few movements, the robot's estimate of where its hand is has drifted dramatically from reality. The robot reaches confidently toward empty air.
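The compounding described above is easy to see numerically. The sketch below is a toy open-loop model, not the paper's: the 1.5 cm/1.0 cm per-reach bias and the noise scale are assumed values chosen to mirror the "2 cm, then 4 cm" story.

```python
import numpy as np

rng = np.random.default_rng(0)

believed = np.zeros(2)            # where the robot thinks its hand is (m)
actual = np.zeros(2)              # where the hand really is (m)
bias = np.array([0.015, 0.010])   # assumed systematic error per reach (m)

for step in range(1, 6):
    delta = np.array([0.10, 0.0])            # commanded 10 cm reach
    believed = believed + delta              # the model says we moved exactly delta
    actual = actual + delta + bias + rng.normal(0, 0.002, size=2)
    drift = np.linalg.norm(believed - actual)
    print(f"reach {step}: drift = {drift * 100:.1f} cm")
```

Because no sensor feedback ever corrects the belief, the drift grows with every reach instead of averaging out.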
Traditional solutions to this problem are either brittle or expensive. Careful hardware calibration helps but degrades over time. Closed-loop vision tracking of the hand requires additional cameras and processing. Most existing humanoid systems accept kinematic drift as unavoidable noise and work around it with conservative safety margins.
HERO's insight is that this error is learnable and predictable. Instead of trying to eliminate the error entirely, train a system to predict and correct it. Instead of assuming the gap is random noise, treat it as systematic bias that can be learned from data.
Residual learning for hand control
The technical heart of the paper is elegant when stripped of jargon: train a neural network to predict the difference between where classical kinematics say the hand should be and where it actually ends up, then use that prediction to correct in real time.
This residual approach combines four components working together. Classical inverse kinematics converts "I want my hand at position X" into "move these joints by these amounts"; this is well-established robotics, but the mapping is never exact on real hardware. A learned forward kinematics model, trained in simulation, predicts where the hand actually ends up given the joint positions, capturing the systematic biases and errors that the classical model misses.
Then comes goal adjustment: if sensors see the hand is at position Y when theory predicts position X, nudge the target slightly to close that gap. Finally, replanning handles larger errors. If the error grows too large, don't keep pushing forward, replan the reaching motion mid-execution.
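The four components above can be sketched as a single control loop. Everything here is a toy stand-in to show the structure, not HERO's implementation: the function names, the linear "kinematics," the constant residual, and the 5 cm replan threshold are all assumptions.

```python
import numpy as np

def classical_ik(target_pos):
    """Stand-in: map a desired hand position to joint angles."""
    return target_pos * 2.0           # toy invertible mapping for illustration

def classical_fk(joints):
    """Stand-in analytic forward kinematics (the imperfect model)."""
    return joints / 2.0

def learned_residual(joints):
    """Stand-in for the network that predicts the FK error from data."""
    return np.array([0.015, -0.008])  # assumed systematic bias (m)

def predicted_hand_pos(joints):
    # Learned FK = analytic FK + learned correction
    return classical_fk(joints) + learned_residual(joints)

def reach(goal, replan_threshold=0.05, max_iters=10):
    target = goal.copy()
    for _ in range(max_iters):
        joints = classical_ik(target)
        error = goal - predicted_hand_pos(joints)
        if np.linalg.norm(error) > replan_threshold:
            target = goal.copy()      # error too large: replan from scratch
            continue
        target = target + error       # goal adjustment: nudge target to cancel bias
        if np.linalg.norm(error) < 1e-4:
            break
    return joints

joints = reach(np.array([0.30, 0.20]))
print("final predicted hand position:", predicted_hand_pos(joints))
```

In this toy version the goal-adjustment branch converges in two iterations; the replan branch exists for the large-error case the article describes.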
Together, these components reduce end-effector tracking error by a factor of 3.2 compared to naive classical kinematics. This is substantial: tracking error was the silent killer preventing reliable manipulation, and cutting it to less than a third directly translates into manipulation success rates that were previously out of reach.
Critically, this approach trains entirely in simulation. The neural network learns residuals, small corrections to the idealized kinematic model, rather than the entire control problem. This kind of learning transfers well to real robots because the residuals capture first-order effects rather than the full complexity of the real world. A learned correction for a systematic arm bias should work whether the robot is in a lab or a coffee shop.
Connecting vision to control
Now HERO has an accurate hand control system. The next step is building a complete manipulation pipeline that takes a natural language command like "pick up the mug" and executes it reliably across different environments.
The full system works like this. A large-vocabulary vision model interprets the RGB-D image, understanding object categories it's never explicitly trained on and estimating 3D locations and orientations. Given the robot's current position, the object's location, and the height of the surface (which varies from 43-centimeter office desks to 92-centimeter counters), a loco-manipulation planner generates a whole-body motion that gets the robot's torso and hand into the right position. Then the residual-aware end-effector tracker executes that motion precisely. During execution, if the vision system updates its understanding of the object location, the system replans the reaching motion.
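The modular flow above can be expressed as a skeleton where each stage is swappable. All class and method names here are assumptions made for illustration, not HERO's actual interfaces, and the stand-in return values are placeholders.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectEstimate:
    label: str
    position: np.ndarray       # estimated 3D location from RGB-D (m)
    surface_height: float      # supporting surface height (m)

class VisionModel:
    def locate(self, rgbd_image, query: str) -> ObjectEstimate:
        # Stand-in for an open-vocabulary detector + pose estimator.
        return ObjectEstimate(query, np.array([0.6, 0.1, 0.92]), 0.92)

class LocoManipulationPlanner:
    def plan(self, robot_state, est: ObjectEstimate) -> dict:
        # Stand-in: a whole-body motion positioning torso and hand.
        return {"hand_goal": est.position, "crouch": est.surface_height < 0.6}

class ResidualTracker:
    def execute(self, motion: dict) -> np.ndarray:
        # Stand-in for the residual-aware end-effector controller.
        return motion["hand_goal"]

def pick_up(command: str, rgbd_image, robot_state=None) -> np.ndarray:
    est = VisionModel().locate(rgbd_image, command)          # perception
    motion = LocoManipulationPlanner().plan(robot_state, est)  # planning
    return ResidualTracker().execute(motion)                  # control

print(pick_up("mug", rgbd_image=None))
```

The point of the skeleton is the interfaces: any stage can be replaced with a better model without touching the other two.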
The power of this architecture is modularity. Each component can be developed, tested, and improved independently. The vision system doesn't need to know about kinematics. The trajectory planner doesn't need to know about learning. The hand-tracker does its job without worrying about scene understanding. If one component improves, the entire system benefits. This is sharply different from monolithic end-to-end approaches where improving perception might break control or vice versa.
This approach also connects naturally to related work like learning adaptive neural teleoperation for humanoid robots, which explores how neural networks can refine classical control, and end-to-end learning for robot vision, which shows the complementary challenge of visual perception in dynamic environments. HERO's modular decomposition offers a way to combine insights from both directions.
Real-world validation
A system that works only in simulation proves little on its own. HERO's validation spans a spectrum designed to show both that the system works and which components matter most.
In simulation, the residual-aware controller reduces tracking error by a factor of 3.2, as claimed. Then comes modular testing in real labs: can the vision model identify objects it hasn't seen before? Can the loco-manipulation planner handle different surface heights? Finally, end-to-end real-world tests in actual environments: offices, coffee shops, homes. Real objects: mugs, apples, toys. Real variation: table heights from 43 to 92 centimeters, different lighting, different clutter.
The system generalizes to environments and objects it never trained on. It reliably manipulates objects across the full range of surface heights tested. This is the proof that the modular approach works. By separating vision generalization from control learning, the system inherits the generalization properties of large vision models while maintaining the precision of simulation-trained control.
A new approach to building humanoid robots
HERO represents a fundamentally different way of thinking about humanoid robot training. The "collect more real data" approach has hit diminishing returns. This paper shows that simulation plus large models is a more scalable direction.
More broadly, HERO demonstrates that you don't need one monolithic end-to-end system solving perception and control jointly. Vision generalization and control precision are different problems better solved with different tools. And systems that learn to correct for reality rather than trying to perfectly model it are more robust and practical.
This approach could reshape how practical humanoid robots are built. Instead of expensive real-world data collection campaigns, future work might see improved sim-to-real transfer as simulation fidelity improves (though only for what matters for control, not perception). Vision models could become the default perception backbone, freeing roboticists to focus on control and planning. New robot systems could emerge that quickly adapt to new environments because generalization is built into the architecture.
The paper demonstrates this works for tabletop manipulation of everyday objects. The frontier ahead includes more complex manipulation tasks, two-handed reaching, long-horizon reasoning, objects that deform, and navigation combined with manipulation. HERO's modular approach gives a clear path forward for each. That clarity is itself the contribution: not just a working system, but a blueprint for how to think about the problem.
This is a Plain English Papers summary of a research paper called Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.
