Part IV: Reinforcement Learning for Embodied Agents | Building Embodied AI: From Perception to Autonomous Action

Part Overview

This part covers interaction-driven learning, from policy gradients and off-policy methods to safe exploration and sim-to-real transfer. It connects formal ideas with the tools and labs needed to build working systems.

Chapters: 7. Each chapter includes theory, recipes, practical code, a library shortcut, and exercises.

Why This Part Matters

Reinforcement learning is the layer of the embodied AI stack in which an agent improves its behavior through its own interaction with the world. Later chapters assume this layer when agents must perceive, plan, act, and recover from mistakes.

Chapter 14 Reinforcement Learning Refresher

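This chapter refreshes the core reinforcement learning concepts used throughout Part IV: returns and discounting, policies and value functions, exploration, and the reasons RL is hard on embodied systems.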

14.1 Learning from interaction; return and discounting
14.2 Policies and value functions
14.3 Exploration vs. exploitation
14.4 Model-free vs. model-based; on- vs. off-policy
14.5 Why RL is hard in embodied systems (sample cost, reward, safety)

Chapter 15 Policy Gradient Methods and PPO

This chapter develops policy gradient methods, from REINFORCE through actor-critic and trust regions to PPO and the implementation details that make it work in practice.

15.1 Direct policy optimization; stochastic policies
15.2 The policy gradient theorem; REINFORCE
15.3 Actor-critic and advantage estimation (GAE)
15.4 Trust regions; TRPO to PPO
15.5 PPO in practice: the implementation details that matter

Chapter 16 Value-Based and Off-Policy Methods

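This chapter covers value-based and off-policy methods, from Q-learning and deep Q-networks to the continuous-control algorithms DDPG, TD3, and SAC, along with their sample efficiency and failure modes.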

16.1 Q-learning; deep Q-networks
16.2 Replay buffers and target networks
16.3 Continuous control: DDPG, TD3, SAC
16.4 Maximum-entropy RL
16.5 Sample efficiency and off-policy failure modes

Chapter 17 Massively Parallel and GPU RL

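This chapter covers massively parallel, GPU-based RL: why thousands of parallel environments matter, the parallel-RL recipe, the Isaac Lab and JAX tooling, and teacher-student distillation.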

17.1 Why thousands of parallel envs changed the field
17.2 Learning to walk in minutes: the parallel-RL recipe
17.3 Isaac Lab with SKRL / rl_games / RSL-RL
17.4 MJX/Brax-training and JAX RL
17.5 Teacher-student and privileged-information distillation

Chapter 18 Reward Design and Goal Specification

This chapter covers reward design and goal specification: sparse and dense rewards, shaping, goal-conditioned policies, reward hacking, and reward models learned from human preferences.

18.1 Why rewards are dangerous
18.2 Sparse vs. dense; shaping done right
18.3 Goal-conditioned policies; hindsight experience replay
18.4 Reward hacking, with case studies
18.5 Human preferences and learned reward models (RLHF for control)

Chapter 19 Exploration in Embodied Worlds

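This chapter covers exploration in embodied worlds: why it is expensive and risky, intrinsic motivation and novelty methods, safe exploration, and exploration under partial observability.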

19.1 Why embodied exploration is expensive and risky
19.2 Intrinsic motivation, curiosity, count-based and novelty methods
19.3 Safe exploration
19.4 Exploration under partial observability

Chapter 20 Sim-to-Real Transfer (RL focus)

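This chapter covers sim-to-real transfer with an RL focus: the reality gap, what does and does not transfer, domain randomization, system identification, adaptation, fine-tuning on hardware, and how to measure transfer performance.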

20.1 The reality gap revisited
20.2 What transfers and what does not
20.3 Domain randomization, system identification, adaptation (RMA)
20.4 Fine-tuning on hardware; safe real-world RL
20.5 Measuring transfer performance

What's Next?

After this part, Part V: Learning from Demonstration and Robot Data extends the stack.

How to Read Part IV as a Builder

Part IV treats reinforcement learning as a closed-loop engineering discipline: define the observation and action contract, choose a maintained library only after the mechanism is visible, and save one construct-matched evaluation artifact for every comparison. The sequence links domain randomization, PPO, GPU-scale RL, reward design, and sim-to-real transfer into one practical workflow.
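As a rough illustration of the workflow above, the sketch below writes down an observation and action contract as data, checks one sample against it, and bundles a small evaluation record that carries the contract with it. Everything here is hypothetical and not taken from the book: the names `IOContract` and `evaluation_artifact`, the chosen fields, and the reading of "construct-matched" as "evaluated under the same declared contract" are assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class IOContract:
    """Hypothetical observation/action contract for one policy."""
    obs_dim: int        # number of observation entries per step
    act_dim: int        # number of action entries per step
    act_low: float      # lower bound applied to every action entry
    act_high: float     # upper bound applied to every action entry
    control_hz: float   # rate at which the policy is queried

    def check(self, obs, act):
        """Raise ValueError if one (obs, act) pair violates the contract."""
        if len(obs) != self.obs_dim:
            raise ValueError(f"expected {self.obs_dim} observations, got {len(obs)}")
        if len(act) != self.act_dim:
            raise ValueError(f"expected {self.act_dim} actions, got {len(act)}")
        if any(a < self.act_low or a > self.act_high for a in act):
            raise ValueError("action outside declared bounds")


def evaluation_artifact(contract, method, seed, episode_returns):
    """Bundle the fields needed to compare two runs under the same contract."""
    return {
        "contract": asdict(contract),
        "method": method,
        "seed": seed,
        "episodes": len(episode_returns),
        "mean_return": sum(episode_returns) / len(episode_returns),
    }


# Illustrative values only; a real task would define its own dimensions and bounds.
contract = IOContract(obs_dim=3, act_dim=2, act_low=-1.0, act_high=1.0, control_hz=50.0)
contract.check([0.1, 0.0, -0.2], [0.5, -1.0])
artifact = evaluation_artifact(contract, "ppo", seed=0, episode_returns=[1.0, 2.0, 3.0])
print(json.dumps(artifact, indent=2, sort_keys=True))
```

Storing the contract inside each record means two results can be compared only after confirming that their `contract` entries match, which is one simple way to keep a comparison honest before swapping in a maintained library.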

Part IV Memory Anchor

Policy, reward, exploration, scale, transfer. If a robot learning result is hard to reproduce, check those five contracts before changing the neural network.