Part Overview
This part covers interaction-driven learning, from policy gradients and off-policy methods to safe exploration and sim-to-real transfer. It connects formal ideas with the tools and labs needed to build working systems.
Chapters: 7. Each chapter includes theory, recipes, practical code, a library shortcut, and exercises.
Part IV, Reinforcement Learning for Embodied Agents, gives the reader a working layer of the embodied AI stack. Later chapters assume this layer when agents must perceive, plan, act, and recover from mistakes.
This chapter provides a reinforcement learning refresher as part of the embodied AI stack.
- 14.1 Learning from interaction; return and discounting
- 14.2 Policies and value functions
- 14.3 Exploration vs. exploitation
- 14.4 Model-free vs. model-based; on- vs. off-policy
- 14.5 Why RL is hard in embodied systems (sample cost, reward, safety)
This chapter develops policy gradient methods and PPO as part of the embodied AI stack.
- 15.1 Direct policy optimization; stochastic policies
- 15.2 The policy gradient theorem; REINFORCE
- 15.3 Actor-critic and advantage estimation (GAE)
- 15.4 Trust regions; TRPO to PPO
- 15.5 PPO in practice: the implementation details that matter
This chapter develops value-based and off-policy methods as part of the embodied AI stack.
- 16.1 Q-learning; deep Q-networks
- 16.2 Replay buffers and target networks
- 16.3 Continuous control: DDPG, TD3, SAC
- 16.4 Maximum-entropy RL
- 16.5 Sample efficiency and off-policy failure modes
This chapter develops massively parallel, GPU-accelerated RL as part of the embodied AI stack.
- 17.1 Why thousands of parallel envs changed the field
- 17.2 Learning to walk in minutes: the parallel-RL recipe
- 17.3 Isaac Lab with SKRL / rl_games / RSL-RL
- 17.4 MJX/Brax-training and JAX RL
- 17.5 Teacher-student and privileged-information distillation
This chapter develops reward design and goal specification as part of the embodied AI stack.
- 18.1 Why rewards are dangerous
- 18.2 Sparse vs. dense; shaping done right
- 18.3 Goal-conditioned policies; hindsight experience replay
- 18.4 Reward hacking, with case studies
- 18.5 Human preferences and learned reward models (RLHF for control)
This chapter develops exploration in embodied worlds as part of the embodied AI stack.
- 19.1 Why embodied exploration is expensive and risky
- 19.2 Intrinsic motivation, curiosity, count-based and novelty methods
- 19.3 Safe exploration
- 19.4 Exploration under partial observability
This chapter develops sim-to-real transfer (RL focus) as part of the embodied AI stack.
- 20.1 The reality gap revisited
- 20.2 What transfers and what does not
- 20.3 Domain randomization, system identification, adaptation (RMA)
- 20.4 Fine-tuning on hardware; safe real-world RL
- 20.5 Measuring transfer performance
What's Next?
After this part, Part V: Learning from Demonstration and Robot Data extends the stack.
How to Read Part IV as a Builder
Part IV treats reinforcement learning as a closed-loop engineering discipline: define the observation and action contract, reach for a maintained library only after you have seen the mechanism it implements, and save one construct-matched evaluation artifact for every comparison. The sequence links domain randomization, PPO, GPU-scale RL, reward design, and sim-to-real transfer into one practical workflow.
The five contracts are policy, reward, exploration, scale, and transfer. If a robot learning result is hard to reproduce, check them before changing the neural network.
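One way to make the observation and action contract and the saved evaluation artifact concrete is sketched below. This is a minimal illustration, not the book's implementation: the names `IOContract` and `evaluation_artifact`, the fields they record, and the numbers in the usage lines are all hypothetical placeholders, and the artifact shown is only one plausible reading of what a comparison should save.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class IOContract:
    """Hypothetical observation/action contract for one environment."""
    obs_dim: int      # expected length of the observation vector
    act_dim: int      # expected length of the action vector
    act_low: float    # lower bound applied to every action component
    act_high: float   # upper bound applied to every action component

    def check(self, obs, act):
        """Return True if one (observation, action) pair satisfies the contract."""
        return (
            len(obs) == self.obs_dim
            and len(act) == self.act_dim
            and all(self.act_low <= a <= self.act_high for a in act)
        )


def evaluation_artifact(contract, seed, episode_returns):
    """Bundle the contract, seed, and summary statistics into one JSON string."""
    record = {
        "contract": asdict(contract),
        "seed": seed,
        "episodes": len(episode_returns),
        "mean_return": sum(episode_returns) / len(episode_returns),
    }
    return json.dumps(record, sort_keys=True)


# Placeholder values for illustration only; they are not measured results.
contract = IOContract(obs_dim=3, act_dim=2, act_low=-1.0, act_high=1.0)
artifact = evaluation_artifact(contract, seed=0, episode_returns=[1.0, 2.0, 3.0])
print(artifact)
```

Storing the contract next to the seed and the summary statistics means two runs can be compared only when their contracts match, which is the kind of check the paragraph above asks for before touching the network.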