Section 46.4: Learning from humans: HumanPlus, OmniH2O/HOVER, motion retargeting | Building Embodied AI: From Perception to Autonomous Action

"Human motion is not the answer; it is a clue that still has to survive embodiment."
A Retargeting Review Session

**Figure 46.4A**: Retargeting human motion into humanoid whole-body behavior. Human data becomes useful only after intent, timing, and contact are mapped into the robot's own body and constraints.

Big Picture

Learning from humans gives humanoids a data advantage, but only when retargeting preserves task intent while respecting contact, reach, and torque limits.

A common retargeting objective is $\min_q \|\phi_{\mathrm{human}} - \phi_{\mathrm{robot}}(q)\|_W^2 + \lambda_c C(q) + \lambda_l L(q)$, where $q$ is the robot joint configuration, $\phi$ encodes task-relevant pose features, $W$ is a weight matrix that sets how much each feature matters, $C(q)$ penalizes contact inconsistency, and $L(q)$ penalizes joint-limit or balance-limit violations. The scalars $\lambda_c$ and $\lambda_l$ trade those penalties off against feature tracking. The critical idea is that not every human detail matters equally: end-effector intent and contact timing often matter more than exact elbow angle.
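To make the objective concrete, the sketch below evaluates it for a hypothetical planar two-link arm. Everything in it is an illustrative assumption rather than any published pipeline: the link lengths, the diagonal form of $W$, the specific penalty shapes chosen for $C(q)$ and $L(q)$, and the coarse grid search that stands in for a real optimizer.

```python
import math

LINKS = (0.30, 0.25)                      # assumed link lengths (m), total reach 0.55 m
JOINT_LIMITS = ((-1.5, 1.5), (0.0, 2.4))  # assumed joint ranges (rad)
W = (1.0, 1.0)                            # assumed diagonal feature weights
LAMBDA_C, LAMBDA_L = 5.0, 10.0            # assumed penalty weights

def phi_robot(q):
    """Task feature: planar end-effector position of the two-link arm."""
    x = LINKS[0] * math.cos(q[0]) + LINKS[1] * math.cos(q[0] + q[1])
    y = LINKS[0] * math.sin(q[0]) + LINKS[1] * math.sin(q[0] + q[1])
    return (x, y)

def contact_penalty(q, surface_height=0.0):
    """C(q): squared penetration of the end effector below a surface."""
    _, y = phi_robot(q)
    return max(0.0, surface_height - y) ** 2

def limit_penalty(q):
    """L(q): squared distance outside each joint range."""
    total = 0.0
    for angle, (lo, hi) in zip(q, JOINT_LIMITS):
        total += max(0.0, lo - angle) ** 2 + max(0.0, angle - hi) ** 2
    return total

def retarget_cost(q, phi_human):
    """Weighted feature-tracking error plus the two penalty terms."""
    feat = phi_robot(q)
    tracking = sum(w * (h - r) ** 2 for w, h, r in zip(W, phi_human, feat))
    return tracking + LAMBDA_C * contact_penalty(q) + LAMBDA_L * limit_penalty(q)

def grid_search(phi_human, steps=121):
    """Coarse search over the joint ranges; a real solver would use gradients."""
    best_q, best_cost = None, float("inf")
    for i in range(steps):
        for j in range(steps):
            q = tuple(lo + (hi - lo) * k / (steps - 1)
                      for k, (lo, hi) in zip((i, j), JOINT_LIMITS))
            cost = retarget_cost(q, phi_human)
            if cost < best_cost:
                best_q, best_cost = q, cost
    return best_q, best_cost

# A human hand target about 0.61 m away lies beyond the arm's 0.55 m reach,
# so the best achievable cost stays above zero.
q_star, cost = grid_search(phi_human=(0.60, 0.10))
```

The residual cost for the out-of-reach target is the embodiment gap in miniature: the solver returns the closest executable configuration instead of a copy of the human pose.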

HumanPlus, HOVER, OmniH2O-style work, and related motion-retargeting pipelines all confront the same embodiment gap: the human demonstrator and the humanoid do not share mass distribution, joint ranges, or contact mechanics. Retargeting is therefore an inference problem, not a copy problem.

Intent Survives, Coordinates Do Not

Good retargeting preserves what the human was trying to accomplish, not every raw joint angle from the original motion.

Figure 46.4A treats retargeting as a loop: observe human behavior, infer task-relevant features, solve the embodiment mapping, and verify executable success on the robot.

Theory

The right retargeting features depend on the task. For locomotion, center-of-mass timing and foot contacts matter. For manipulation, hand pose, gaze, and object-relative trajectories matter. For loco-manipulation, all of them matter together.
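One way to express task-dependent feature choice is a per-task weight profile. The sketch below is a minimal illustration under stated assumptions: the feature names and weight values are hypothetical, the per-feature errors are assumed to be already normalized to comparable scales, and features missing from the error dictionary are treated as zero error.

```python
# Hypothetical weight profiles; feature names and values are illustrative only.
FEATURE_WEIGHTS = {
    "locomotion": {"com_timing": 1.0, "foot_contact": 1.0, "hand_pose": 0.1},
    "manipulation": {"hand_pose": 1.0, "gaze": 0.5, "object_relative_traj": 1.0},
}
# Loco-manipulation keeps every feature from both profiles at full weight.
FEATURE_WEIGHTS["loco_manipulation"] = {
    name: 1.0
    for profile in (FEATURE_WEIGHTS["locomotion"], FEATURE_WEIGHTS["manipulation"])
    for name in profile
}

def weighted_error(task, feature_errors):
    """Sum of weight * squared error over the features the task cares about."""
    weights = FEATURE_WEIGHTS[task]
    return sum(w * feature_errors.get(name, 0.0) ** 2 for name, w in weights.items())

# The same raw errors score differently depending on the task profile.
errors = {"com_timing": 0.05, "foot_contact": 0.0, "hand_pose": 0.2}
loco_cost = weighted_error("locomotion", errors)
manip_cost = weighted_error("manipulation", errors)
```

With these numbers the hand-pose error dominates the manipulation cost but barely registers for locomotion, which is the intended behavior of task-specific weighting.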

This is why motion datasets alone are not enough. A good dataset carries timing, contact, object state, and sometimes force cues so the retargeter can distinguish stylistic variation from essential task structure.

Evaluation should therefore include both geometric metrics and executable metrics: pose similarity, contact timing agreement, balance margin, torque peaks, and actual task completion.
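A minimal evaluation report might combine these metrics as follows. The field names, units, sample values, and the simple executability rule are assumptions made for illustration; a real evaluation would read these signals from a simulator or from robot logs.

```python
import math

def pose_rmse(human_pose, robot_pose):
    """Geometric metric: root-mean-square difference over matched pose values."""
    squared = [(h - r) ** 2 for h, r in zip(human_pose, robot_pose)]
    return math.sqrt(sum(squared) / len(squared))

def contact_agreement(human_contacts, robot_contacts):
    """Fraction of frames where the per-frame contact flags match."""
    matches = sum(h == r for h, r in zip(human_contacts, robot_contacts))
    return matches / len(human_contacts)

def evaluate_trial(trial, torque_limit_nm):
    """Combine geometric and executable metrics into one report."""
    torque_peak = max(abs(t) for t in trial["torques_nm"])
    min_margin = min(trial["balance_margin_m"])
    return {
        "pose_rmse": pose_rmse(trial["human_pose"], trial["robot_pose"]),
        "contact_agreement": contact_agreement(
            trial["human_contacts"], trial["robot_contacts"]),
        "min_balance_margin_m": min_margin,
        "torque_peak_nm": torque_peak,
        "task_completed": trial["task_completed"],
        # Assumed rule: executable if torque stays in range and balance margin is positive.
        "executable": torque_peak <= torque_limit_nm and min_margin > 0.0,
    }

# Hypothetical logged trial.
trial = {
    "human_pose": [0.0, 0.1, 0.2], "robot_pose": [0.0, 0.1, 0.5],
    "human_contacts": [1, 1, 0, 0], "robot_contacts": [1, 0, 0, 0],
    "balance_margin_m": [0.04, 0.02, 0.03],
    "torques_nm": [12.0, -35.0, 20.0],
    "task_completed": True,
}
report = evaluate_trial(trial, torque_limit_nm=30.0)
```

In this example the task completes, yet the 35 N·m torque peak exceeds the assumed 30 N·m limit, so the trial is flagged as not executable. Geometric similarity and task success alone would have missed that.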

Algorithm: Embodied Motion Retargeting

1. Capture human motion and task context, including objects and contact timing if possible.
2. Choose task-relevant features rather than copying all joints equally.
3. Solve the retargeting objective under joint, balance, and contact constraints.
4. Replay on the robot or simulator and log feasibility violations and timing drift.
5. If the motion is not executable, revise the feature set before blaming the controller.
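The steps above can be sketched as a loop that tries candidate feature sets in order and keeps a log of every replay. The `solve` and `replay` callables are hypothetical placeholders supplied by the caller, and the toy versions below exist only so the loop runs end to end. They do not model real dynamics.

```python
def retarget_with_revision(demo, feature_sets, solve, replay):
    """Try feature sets in order until replay reports an executable motion.

    `solve(demo, features)` returns a motion; `replay(motion)` returns a dict
    with at least a boolean "feasible" key. Both are caller-supplied.
    """
    log = []
    for features in feature_sets:
        motion = solve(demo, features)
        report = replay(motion)
        log.append({"features": features, **report})
        if report["feasible"]:
            return motion, log
    return None, log  # no feature set was executable; inspect the log

# Toy stand-ins (assumptions for illustration only).
def toy_solve(demo, features):
    return {"tracked": features, "source": demo["name"]}

def toy_replay(motion):
    # Pretend that tracking raw elbow angles violates a joint limit.
    violations = ["elbow_limit"] if "elbow_angle" in motion["tracked"] else []
    return {"feasible": not violations, "violations": violations,
            "timing_drift_s": 0.02}

feature_sets = [
    ("hand_pose", "foot_contact", "elbow_angle"),  # near-copy of the human
    ("hand_pose", "foot_contact"),                 # task-relevant subset
]
motion, log = retarget_with_revision(
    {"name": "box_lift"}, feature_sets, toy_solve, toy_replay)
```

Keeping the log of failed attempts, rather than only the final motion, preserves the diagnostics that explain which features were not transferable.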

Worked Example

A small retargeting ledger can already separate good task-intent preservation from geometric overfitting.

# Task-level features measured on the human demonstration and on the robot replay.
human_features = {"left_hand_to_box_cm": 4.0, "right_foot_contact": 1, "torso_yaw_deg": 18}
robot_trial = {"left_hand_to_box_cm": 5.3, "right_foot_contact": 1, "torso_yaw_deg": 15}

# Score task features and contact agreement instead of raw joint angles.
errors = {
    "hand_error_cm": round(abs(human_features["left_hand_to_box_cm"] - robot_trial["left_hand_to_box_cm"]), 1),
    "contact_match": int(human_features["right_foot_contact"] == robot_trial["right_foot_contact"]),
    "yaw_error_deg": abs(human_features["torso_yaw_deg"] - robot_trial["torso_yaw_deg"]),
}
print(errors)

{'hand_error_cm': 1.3, 'contact_match': 1, 'yaw_error_deg': 3}

Expected output interpretation. The hand and torso errors are small while the contact event is preserved. That suggests the retargeting kept task intent and the support contact, which matters more than exact whole-body imitation for many tasks. A single contact flag cannot confirm timing, so a fuller ledger would compare contact flags frame by frame.

Code Fragment 46.4.1: Retargeting should be evaluated on task features and contact agreement, not on raw pose similarity alone.

Library Shortcut

Use motion-retargeting pipelines, whole-body simulators, and robot-data stacks such as LeRobot to keep demonstration and execution artifacts synchronized.

Practical Recipe

Select the task features that actually matter before collecting imitation data.
Record contact timing and object state whenever possible.
Retarget with explicit feasibility penalties.
Evaluate on execution metrics, not only geometric similarity.
Keep failed motions as diagnostics because they reveal missing embodiment features.

Common Failure Mode

A visually plausible retargeted motion can still be dynamically impossible, unsafe, or task-irrelevant for the robot body.

Practical Example

A human can lean and twist to place a box on a shelf while compensating with subtle ankle control. A humanoid with different hip or ankle limits may need a step adjustment rather than a direct pose imitation.
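A very rough way to decide between leaning and stepping is a static check on where the center of mass projects relative to the foot support. The sketch below is a one-dimensional simplification with hypothetical numbers: it ignores dynamics, ankle torque limits, and lateral balance, and the 2 cm safety margin is an assumption.

```python
def support_margin(com_x, heel_x, toe_x):
    """Distance from the CoM ground projection to the nearest support edge.

    Negative values mean the projection lies outside the foot support.
    """
    return min(com_x - heel_x, toe_x - com_x)

def choose_strategy(com_x_when_leaning, heel_x, toe_x, min_margin=0.02):
    """Return 'lean' if the lean keeps a static margin, otherwise 'step'."""
    margin = support_margin(com_x_when_leaning, heel_x, toe_x)
    return "lean" if margin >= min_margin else "step"

# Same forward lean, two hypothetical foot geometries (meters, sagittal plane).
long_foot = choose_strategy(0.16, heel_x=-0.05, toe_x=0.20)   # 'lean'
short_foot = choose_strategy(0.16, heel_x=-0.05, toe_x=0.17)  # 'step'
```

The same demonstrated lean is acceptable for one body and not for the other, which is why the retargeter may need to change the strategy rather than scale the pose.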

Memory Hook

The robot is not a puppet. It is an organism with different bones, muscles, and excuses.

Research Frontier

Recent work pushes from motion tracking toward video-driven, object-aware whole-body learning and motion priors that fill gaps between sparse demonstrations. The open problem is preserving intent under large embodiment mismatch.

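Paper Spotlight

"HumanPlus: Humanoid Shadowing and Imitation from Humans" (Fu et al., CoRL 2024) presents a full-stack system for learning humanoid motion and autonomous skills from human data. A low-level policy is trained in simulation with reinforcement learning on roughly 40 hours of existing human motion data, then transferred to hardware so the humanoid can shadow an operator's body and hand motion in real time using only an RGB camera, without motion-capture suits. Demonstrations collected through shadowing are then used to train skill policies by behavior cloning on egocentric vision, with up to 40 demonstrations per task. Reported tasks include putting on a shoe and walking, unloading warehouse racks, folding a sweatshirt, and typing. The key contribution is the shadowing pipeline, which lowers the cost of collecting whole-body demonstrations for manipulation and loco-manipulation.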

Self Check

Which feature would you preserve first for a carry task: hand trajectory, foot contacts, torso orientation, or joint angles, and why?

This section is useful for teaching the distinction between imitation and embodiment. Students often begin by assuming the goal is faithful visual copying. The real goal is executable task transfer.

It is also a natural place to introduce data contracts. Demonstration data becomes much more valuable when it records task context and contact semantics rather than only pose streams.
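A data contract can be as simple as a typed record plus a validation pass. The schema below is hypothetical and is not the LeRobot dataset format. The field names, the required contact keys, and the three checks are assumptions chosen to illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class DemoFrame:
    """One synchronized sample of a demonstration (hypothetical schema)."""
    t: float                                          # seconds since episode start
    pose: dict                                        # joint or keypoint name -> value
    contacts: dict = field(default_factory=dict)      # body part -> bool
    object_state: dict = field(default_factory=dict)  # object name -> state

def contract_violations(frames, required_contacts=("left_foot", "right_foot")):
    """List human-readable problems; an empty list means the contract holds."""
    problems = []
    for i, frame in enumerate(frames):
        if i > 0 and frame.t <= frames[i - 1].t:
            problems.append(f"frame {i}: timestamp not increasing")
        missing = [name for name in required_contacts if name not in frame.contacts]
        if missing:
            problems.append(f"frame {i}: missing contact flags {missing}")
        if not frame.object_state:
            problems.append(f"frame {i}: no object state recorded")
    return problems

frames = [
    DemoFrame(0.00, {"torso_yaw_deg": 18},
              {"left_foot": True, "right_foot": True},
              {"box": (0.4, 0.0, 0.9)}),
    DemoFrame(0.05, {"torso_yaw_deg": 17}),  # pose-only frame breaks the contract
]
problems = contract_violations(frames)
```

Running the check at collection time, rather than at training time, catches pose-only recordings while the demonstration can still be repeated.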

Retargeting Tool Map

Tool or Library	Role in the Topic	Builder Advice
LeRobot-style data tooling	Store demonstrations with synchronized metadata	Keep contact and object state beside pose data.
Whole-body simulators	Check executability before hardware rollout	Reject motions that only look right in kinematics space.
Retargeting pipelines	Map human features into robot features	Tune feature weighting by task, not by generic motion similarity.

Cross-References

This section connects to robot datasets, teleoperation, and cross-embodiment learning.

Mini Lab

Retarget one short human demonstration into a humanoid simulation, then compare raw pose error against task-feature error and balance feasibility.

When retargeting fails, ask whether the missing piece is feature choice, contact semantics, embodiment mismatch, or controller feasibility. Different failures imply different dataset improvements.

Section References

HumanPlus project page. https://humanplus.github.io/

Primary current source for human-motion-driven humanoid control.

HOVER project page. https://www.hover-policy.org/

Current reference for versatile neural whole-body control.

LeRobot documentation. https://huggingface.co/docs/lerobot/en/index

Practical stack for storing and training from robot demonstrations.

Key Takeaway

The purpose of human data is not mimicry. It is executable task transfer under a different body.

Exercise 46.4.1

Define a retargeting evaluation for a shelf-placement task. Include one geometric metric, one contact metric, one balance metric, and one task-completion metric.