Section 7.7: Controllers vs. policies; when learning helps and when it makes control unsafe | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Big Picture

The choice between a classical controller and a learned policy is one lens on Control for AI Practitioners. We study it because an embodied agent needs decisions that survive contact with noisy sensors, delayed effects, and changing environments.

This section turns that choice into a usable mental model. First we define the object of study, then we connect it to the agent loop, then we test it with a compact implementation.

The key question is practical: what must the agent know, what can it observe, what action is available, and what evidence shows that the action worked under the stated conditions?

Action Is The Test

A representation earns its place when it changes the measurable action interface. When weighing a controller against a learned policy, keep asking which decision becomes easier, safer, or more reliable.

Theory

The practical design rule is to make the interface inspectable before optimization begins: inputs, outputs, units, latency, bounds, and failure labels should all be visible in the saved artifact.

A classical controller is usually strongest when the goal, state, model class, and safety limits are explicit. A learned policy is strongest when perception, contact variation, human preference, or high-dimensional context is too complex to hand-code. The safest hybrid treats the learned policy as a proposal generator and the controller or safety filter as the executable contract.

Safety Filter Before Deployment

Let $\tilde u_t=\pi_\theta(o_t)$ be the learned policy command. A safety filter chooses the closest admissible command, $u_t=\arg\min_u\|u-\tilde u_t\|^2$ subject to actuator limits, collision margins, stability constraints, and emergency-stop rules. If the filter changes many commands, the policy is not ready for the robot even if the task reward is high.
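When the only constraints are per-actuator limits, the projection above has a closed form: clip each component to its range. The sketch below shows that special case and computes the intervention rate that the paragraph treats as a readiness signal. The limits, the sample proposals, and the helper names `project_to_box` and `intervention_rate` are illustrative assumptions; collision margins and stability constraints generally need a constrained optimizer rather than a clip.

```python
import numpy as np

def project_to_box(u_proposed, u_min, u_max):
    """Closest command (Euclidean norm) to u_proposed inside [u_min, u_max]."""
    return np.clip(u_proposed, u_min, u_max)

def intervention_rate(proposed, executed, tol=1e-9):
    """Fraction of time steps in which the filter changed the command."""
    changed = np.any(np.abs(proposed - executed) > tol, axis=-1)
    return float(np.mean(changed))

# Hypothetical limits for two actuators and a short batch of policy proposals.
u_min = np.array([-1.0, -2.0])
u_max = np.array([1.0, 2.0])
proposed = np.array([[0.5, 0.0], [1.5, 0.0], [0.0, -3.0], [0.2, 1.9]])

executed = project_to_box(proposed, u_min, u_max)
rate = intervention_rate(proposed, executed)
print(executed)
print(f"intervention rate = {rate:.2f}")
```

Two of the four proposals leave the box, so the filter rewrites half of the commands in this batch. What counts as "too many" interventions is a design threshold that has to be chosen per robot and per task.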

Mechanism

The mechanism here is the contract between representation and action. Name what enters the module, what leaves it, which assumptions make that transformation valid, and which log would reveal a bad handoff.

Worked Example

The safest place for learning in a control loop is as a residual on top of a classical controller, wrapped by a safety filter. The command is $u = u_\text{base}(x) + \pi_\theta(x)$: the base controller owns the nominal behavior, and the learned residual nudges it. A control barrier function (CBF) then enforces a hard safety set. For a barrier $h(x)\ge 0$ (here $h = x_\text{wall} - x$, staying left of a wall), the familiar first-order condition $\dot h(x) + \alpha h(x) \ge 0$ does not constrain the command on this plant, because the force enters only through $\ddot h$. The filter therefore requires the second-order condition $\ddot h(x) + 2\alpha \dot h(x) + \alpha^2 h(x) \ge 0$ and projects the requested command onto the closest admissible one. Code Fragment 7.7.1 runs the same residual policy with and without the filter.

import numpy as np

# u = u_base(x) + pi_theta(x), guarded by a control barrier function.
# Plant: 1D mass. Barrier h(x) = x_wall - x >= 0  (stay left of the wall).
m, dt, x_wall, alpha = 1.0, 0.05, 1.0, 4.0

def u_base(x, v):   return 8.0 * (0.9 - x) - 4.0 * v   # nominal PD toward x=0.9
def pi_theta(x, v): return 5.0                          # learned residual (unsafe alone)

def cbf_filter(u, x, v):
    # h = x_wall - x, hdot = -v.  Enforce hddot + 2*alpha*hdot + alpha^2*h >= 0
    # with acceleration a = u/m affecting hddot = -a.
    h, hdot, a = (x_wall - x), -v, u / m
    margin = (-a) + 2 * alpha * hdot + alpha ** 2 * h
    if margin >= 0:
        return u, False
    u_safe = m * (2 * alpha * hdot + alpha ** 2 * h)    # minimal correction: margin -> 0
    return u_safe, True

for label, use_filter in [("no safety filter", False), ("with CBF filter", True)]:
    x = v = 0.0
    interventions = 0
    xmax = -9.0
    for _ in range(200):
        u = u_base(x, v) + pi_theta(x, v)   # base command plus learned residual
        if use_filter:
            u, did = cbf_filter(u, x, v)
            interventions += int(did)
        x += v * dt                         # explicit Euler: position uses the old velocity
        v += (u / m) * dt
        xmax = max(xmax, x)
    breached = "BREACHED" if xmax > x_wall + 1e-3 else "safe"
    print(f"{label:>17}: max x={xmax:.3f} wall={x_wall} -> {breached}  interventions={interventions}")

 no safety filter: max x=1.616 wall=1.0 -> BREACHED  interventions=0
  with CBF filter: max x=1.000 wall=1.0 -> safe  interventions=198

Code Fragment 7.7.1: the residual policy alone drives the mass through the wall. The CBF filter holds it exactly at the barrier, but it had to override nearly every command. That intervention rate is the key diagnostic: a filter that rewrites almost every action is a signal that the learned policy is not ready for the robot, even if its task reward looks high. Learning should improve perception or task selection, not be the layer that quietly removes a safety constraint.
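The filter in Code Fragment 7.7.1 can be read as a one-line projection. The derivation below is a worked restatement of the condition in the `cbf_filter` comment for this specific 1D plant; the symbol $u_\text{max}(x, v)$ is introduced here for convenience and does not appear in the code.

```latex
\begin{aligned}
h &= x_\text{wall} - x, \qquad \dot h = -v, \qquad \ddot h = -\frac{u}{m},\\
\ddot h + 2\alpha \dot h + \alpha^2 h \ge 0
  &\iff u \le u_\text{max}(x, v) := m\left(\alpha^2 (x_\text{wall} - x) - 2\alpha v\right),\\
u &= \min\bigl(\tilde u,\; u_\text{max}(x, v)\bigr).
\end{aligned}
```

While the filter is active, $h$ obeys $\ddot h + 2\alpha \dot h + \alpha^2 h = 0$, a critically damped system with a double pole at $-\alpha$. In continuous time $h$ therefore decays to zero without crossing it, provided $h_0 \ge 0$ and $\dot h_0 + \alpha h_0 \ge 0$. The simulation uses explicit Euler steps, so this is an approximation of the continuous argument; with $dt = 0.05$ and $\alpha = 4$ the discrete update has the repeated multiplier $1 - \alpha\,dt = 0.8$, which is consistent with the non-oscillatory approach to the wall seen in the output.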

Library Shortcut

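Whichever library hosts the loop, the code should expose where the learned policy enters the feedback loop, which monitor bounds it, and which controller owns recovery. In a ROS 2 control stack, the safety filter should log every authority transition, meaning each hand-off of command authority between the policy, the filter, and the recovery controller.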

Practical Recipe

Write the observation, action, and success metric before choosing a model.
Build a baseline that is simple enough to debug by inspection.
Add the library implementation only after the baseline behavior is understood.
Record failures as structured cases: perception error, state error, planning error, control error, or evaluation error.
Run at least one perturbation test before trusting the result.
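The structured failure cases in the recipe can be kept as typed records rather than free-text notes. The following is a minimal sketch: the five labels come from the recipe, while the class names, the fields, and the three example cases are made-up illustrations, not logged results.

```python
from dataclasses import dataclass
from enum import Enum

class FailureLabel(Enum):
    PERCEPTION = "perception error"
    STATE = "state error"
    PLANNING = "planning error"
    CONTROL = "control error"
    EVALUATION = "evaluation error"

@dataclass(frozen=True)
class FailureCase:
    episode: int
    step: int
    label: FailureLabel
    note: str

def count_by_label(cases):
    """Tally failure cases per label, keeping labels with zero cases visible."""
    counts = {label: 0 for label in FailureLabel}
    for case in cases:
        counts[case.label] += 1
    return counts

# Hypothetical cases, written by hand for illustration only.
cases = [
    FailureCase(3, 41, FailureLabel.CONTROL, "actuator saturated for 12 steps"),
    FailureCase(7, 5, FailureLabel.STATE, "pose estimate 80 ms stale"),
    FailureCase(9, 60, FailureLabel.CONTROL, "overshoot past design bound"),
]
for label, n in count_by_label(cases).items():
    print(f"{label.value:>17}: {n}")
```

Keeping the zero-count labels in the tally makes it obvious which failure classes were never observed, which is different from never having been tested.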

Common Failure Mode

The common mistake is to celebrate the component score before checking the closed-loop handoff. The failure usually appears at the boundary: stale state, wrong frame, delayed action, saturated actuator, or a metric that ignores the real task cost.

Practical Example

A robotics team comparing a classical controller with a learned policy should log not only final success, but also intermediate observations, chosen actions, controller status, and recovery events. The logs reveal whether the method is solving the task or merely passing the easiest episodes.

Memory Hook

A good embodied system makes the split between controller and policy visible twice: once in the design sketch and once in the replay artifact. The second view keeps the first one honest.

Research Frontier

Treat frontier claims about learned control as hypotheses until they expose enough detail to reproduce the result: data boundary, embodiment, controller interface, evaluation panel, and failure cases.

Self Check

Can you name the observation, state estimate, action, success metric, and most likely failure mode for your controller or policy? If not, the system boundary is still too vague.

Production Pattern

This section sits inside the Part II robotics contract: geometry defines where things are, kinematics defines what motion is possible, dynamics defines what motion costs, control defines how errors are corrected, and sensing defines what the agent can know on time.

Compare controllers and learned policies only under the same sensors, action limits, disturbances, and safety filters. This makes the section useful to students, builders, and researchers at the same time: the idea has an intuitive role, a formal interface, a runnable check, and a failure mode that can be reproduced.

Mechanism To Watch

Control closes the loop between estimated state and action. Keep reference, measured state, error signal, control law, actuator limits, and safety fallback separate in the evidence record.

Library Choices And Verification Checks

Tool or Library	What It Handles	Verification Check
python-control	analyzes linear systems, transfer functions, state-space models, and feedback loops	Verify units, sample time, poles, stability margin, and reference scaling.
CasADi	formulates optimization-based controllers with constraints and horizons	Verify constraints, warm start, solver status, and deadline behavior.
Drake	models dynamical systems, multibody plants, optimization, and controllers	Verify scalar type, plant finalization, frame convention, and solver status.
do-mpc	formulates optimization-based controllers with constraints and horizons	Verify constraints, warm start, solver status, and deadline behavior.
ROS 2 control	manages hardware interfaces and the loading, scheduling, and switching of controllers on ROS 2 robots	Verify the library output against the hand-built baseline on one small case.

Use this recipe when turning a controller-versus-policy comparison into code, a simulator experiment, or a robot diagnostic. The point is not to use every library. The point is to keep the hand-built baseline and the maintained-tool path comparable.

Write the control objective, measured state, actuator command, update rate, and saturation policy.
Run a step-response test before adding learning, with overshoot, settling time, and steady-state error logged.
Compare the hand controller with python-control, CasADi, Drake, do-mpc, or ROS 2 control on the same plant model.
Record latency, missed deadlines, saturation events, constraint violations, and recovery actions.
Only compare controllers and policies when they share sensors, action limits, disturbance tests, and safety checks.
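The step-response item in the list above can be checked with a few lines before any learned component is added. This is a sketch under stated assumptions: the plant is the 1D mass and the PD gains from Code Fragment 7.7.1, the actuator limit `u_limit`, the 2% settling band, and the function names are choices made here for illustration.

```python
import numpy as np

def simulate_step(kp=8.0, kd=4.0, ref=0.9, m=1.0, dt=0.05, steps=200, u_limit=20.0):
    """Explicit-Euler step response of a 1D mass under saturated PD control."""
    x = v = 0.0
    xs = np.empty(steps)
    saturation_events = 0
    for k in range(steps):
        xs[k] = x                                   # position at time k * dt
        u_raw = kp * (ref - x) - kd * v
        u = min(max(u_raw, -u_limit), u_limit)      # actuator saturation
        saturation_events += int(u != u_raw)
        x, v = x + v * dt, v + (u / m) * dt
    return xs, saturation_events

def step_metrics(xs, ref, dt, band=0.02):
    """Overshoot (fraction of ref), settling time (s), and steady-state error."""
    overshoot = max(0.0, (float(xs.max()) - ref) / abs(ref))
    outside = np.flatnonzero(np.abs(xs - ref) > band * abs(ref))
    if outside.size == 0:
        settling_time = 0.0
    elif outside[-1] == len(xs) - 1:
        settling_time = float("inf")                # never settled within the run
    else:
        settling_time = float(outside[-1] + 1) * dt
    return {
        "overshoot": overshoot,
        "settling_time_s": settling_time,
        "steady_state_error": abs(ref - float(xs[-1])),
    }

xs, saturation_events = simulate_step()
metrics = step_metrics(xs, ref=0.9, dt=0.05)
print(metrics, "saturation events:", saturation_events)
```

The same `step_metrics` function should be reused unchanged when a library controller or a learned policy replaces the hand-built law, so that differences in the numbers come from the controller and not from the metric code.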

Evidence Gate

Compare methods only through one saved artifact that preserves the inputs, outputs, units, timestamps, latency budget, configuration, seed, metric definition, and failure labels relevant to this section. The comparison is meaningful only when the same script evaluates the same panel.
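One way to enforce the "same panel" rule mechanically is to fingerprint the fields that two runs must share. The sketch below is hypothetical throughout: `RunArtifact`, its field names, and `PANEL_FIELDS` are stand-ins built from the list in the paragraph above, not the book's EvidenceRecord schema, and the metric values are placeholders rather than measurements.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace

@dataclass
class RunArtifact:
    method: str
    seed: int
    config: dict
    units: dict
    latency_budget_s: float
    metric_name: str
    metric_value: float
    timestamps_s: list = field(default_factory=list)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    failure_labels: list = field(default_factory=list)

# Fields that must match for two runs to count as the same evaluation panel.
PANEL_FIELDS = ("seed", "units", "latency_budget_s", "metric_name", "timestamps_s", "inputs")

def panel_fingerprint(artifact):
    record = asdict(artifact)
    shared = {name: record[name] for name in PANEL_FIELDS}
    blob = json.dumps(shared, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def comparable(a, b):
    return panel_fingerprint(a) == panel_fingerprint(b)

baseline = RunArtifact(
    method="pd_baseline", seed=0, config={"kp": 8.0, "kd": 4.0},
    units={"x": "m", "u": "N"}, latency_budget_s=0.01,
    metric_name="overshoot", metric_value=0.0,      # placeholder value
    timestamps_s=[0.0, 0.05], inputs=[0.9, 0.9], outputs=[7.2, 7.2],
)
hybrid = replace(baseline, method="pd_plus_residual", config={"kp": 8.0, "kd": 4.0, "residual": True},
                 outputs=[12.2, 12.2])
reseeded = replace(hybrid, seed=1)

print(comparable(baseline, hybrid))    # same panel, different method
print(comparable(baseline, reseeded))  # different seed, so not the same panel
```

Method, configuration, outputs, and metric value are deliberately left out of the fingerprint, since those are exactly what is allowed to differ between the runs being compared.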

Exercise Extension

Extend the section exercise by adding one perturbation specific to the controller-versus-policy comparison and one latency or uncertainty check. Save the result in the EvidenceRecord schema, then explain which library output you trust and why.

A learned policy can hide an unsafe control interface until the disturbance changes. Check action limits, latency, recovery authority, safety-filter intervention rate, out-of-distribution observations, and fallback behavior before scaling training. For this section, first reproduce one controller-only case and one policy-proposal case under the same disturbance panel. If the two disagree, inspect whether learning improved perception or task selection, or merely bypassed a constraint that the classical controller was enforcing.

Technical Core

The technical core of this section has five parts: variables, equations or system contracts, an algorithmic procedure, an expected output, and a failure diagnosis. Figure 7.7.T summarizes the chain this section must preserve when moving from a teaching example to a real embodied system.

Figure 7.7.T: The technical core for this section connects assumptions, model, algorithm, evidence, and failure analysis.

Formal Object

A hybrid controller can be written as $\tilde u_t=\pi_\theta(o_t)$ followed by $u_t=\mathcal F(\tilde u_t,\hat s_t,\mathcal C)$, where $\mathcal F$ enforces constraints $\mathcal C$. Learning helps when $\pi_\theta$ supplies useful context or candidate actions; classical control remains responsible for timing, limits, recovery, and interpretable safety evidence.
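The composition $u_t=\mathcal F(\tilde u_t,\hat s_t,\mathcal C)$ can be sketched as a function that returns both the executed command and the name of the component that currently holds authority. Everything concrete below is an assumption made for illustration: the stand-in `policy`, the braking `fallback`, the numeric limits in `Constraints`, and the rule that the fallback takes over once $x \ge x_\text{max}$. It shows authority logging only and is not a safety guarantee.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraints:
    u_min: float
    u_max: float
    x_max: float   # at or beyond this position only the fallback may act

def policy(obs):
    """Stand-in for pi_theta(o_t): a proposal that grows with position."""
    x, _ = obs
    return 1.0 + 4.0 * x

def fallback(state):
    """Recovery law owned by the classical side: brake toward rest."""
    _, v = state
    return -4.0 * v

def clamp(u, c):
    return min(max(u, c.u_min), c.u_max)

def safety_filter(u_proposed, state, c):
    """F(u_tilde, s_hat, C): executed command plus the current authority."""
    x, _ = state
    if x >= c.x_max:
        return clamp(fallback(state), c), "fallback"
    u = clamp(u_proposed, c)
    return u, ("policy" if u == u_proposed else "filter")

def run(steps=60, dt=0.05):
    c = Constraints(u_min=-2.0, u_max=2.0, x_max=0.8)
    x = v = 0.0
    authority, transitions = "policy", []
    for k in range(steps):
        u, owner = safety_filter(policy((x, v)), (x, v), c)
        if owner != authority:
            transitions.append((k, authority, owner))   # log the hand-off
            authority = owner
        x, v = x + v * dt, v + u * dt                   # unit-mass Euler step
    return transitions

transitions = run()
for step, old, new in transitions:
    print(f"step {step:3d}: {old} -> {new}")
```

The transition log is the interpretable safety evidence the paragraph refers to: it records when the policy's proposals stopped being executed as issued and which component took over.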

Controller evaluation loop

Define the reference, measured state, error signal, actuator command, update rate, and saturation policy.
Run a step or disturbance response before adding learning.
Log overshoot, settling time, steady-state error, latency, saturation, and recovery behavior.
Compare PID, LQR, or MPC only under the same plant, sensors, limits, disturbance panel, and metric code.

Technical Contract

Contract Field	What To Specify	Why It Matters
State and observation	Variables, units, timestamps, frames, and uncertainty.	Prevents a model score from being mistaken for robot capability.
Action interface	Command type, limits, update rate, and safety fallback.	Makes the learned or planned output executable.
Evidence artifact	Trace, metric, configuration, seed, and failure label.	Allows baseline and library path to be compared in one pass.
Tool path	python-control, CasADi, do-mpc, Drake, ROS 2 control, MuJoCo	Shows the practical library route after the mechanism is understood.

The expected output is a trace where the relevant error decreases, overshoot stays within the design bound, and actuator commands remain within limits under the stated timing budget.
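The expected-output criteria can be turned into an explicit acceptance check on a saved trace. This is a sketch with assumed definitions: "error decreases" is taken to mean that the mean absolute error over the last quarter of the trace is below that of the first quarter, overshoot is measured for a positive reference approached from below, and the synthetic trace exists only to exercise the function.

```python
import numpy as np

def check_trace(x, u, latency_s, ref, overshoot_bound, u_limit, latency_budget_s):
    """Return one boolean per expected-output criterion."""
    x, u, latency_s = (np.asarray(a, dtype=float) for a in (x, u, latency_s))
    err = np.abs(ref - x)
    quarter = max(1, len(err) // 4)
    return {
        "error_decreases": bool(err[-quarter:].mean() < err[:quarter].mean()),
        "overshoot_within_bound": bool(x.max() - ref <= overshoot_bound * abs(ref)),
        "commands_within_limits": bool(np.all(np.abs(u) <= u_limit)),
        "latency_within_budget": bool(np.all(latency_s <= latency_budget_s)),
    }

# Synthetic trace for illustration: a smooth approach to ref with a decaying command.
t = np.arange(0.0, 5.0, 0.05)
x = 0.9 * (1.0 - np.exp(-2.0 * t))
u = 1.8 * np.exp(-2.0 * t)
latency = np.full_like(t, 0.004)

report = check_trace(x, u, latency, ref=0.9, overshoot_bound=0.10,
                     u_limit=2.0, latency_budget_s=0.010)
print(report)
```

Returning one flag per criterion, instead of a single pass or fail, keeps the failure diagnosis attached to the specific clause that was violated.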

Failure Mode To Test

The loop should be stress-tested under delay, integral windup, actuator saturation, unmodeled friction, and reference-frame mismatch before the nominal trace is trusted.
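Of the perturbations listed, command delay is among the easiest to inject. The sketch below reuses the 1D mass and PD gains from Code Fragment 7.7.1 and delays each command by a fixed number of steps through a queue. The function name `peak_position`, the delay values, and the choice of peak position as the reported quantity are assumptions made here.

```python
from collections import deque

def peak_position(delay_steps, kp=8.0, kd=4.0, ref=0.9, m=1.0, dt=0.05, steps=200):
    """Peak position of a PD-controlled 1D mass when commands arrive late."""
    x = v = 0.0
    pending = deque([0.0] * delay_steps)       # commands still in transit
    peak = x
    for _ in range(steps):
        pending.append(kp * (ref - x) - kd * v)
        u = pending.popleft()                  # command computed delay_steps ago
        x, v = x + v * dt, v + (u / m) * dt
        peak = max(peak, x)
    return peak

for d in (0, 1, 2):
    print(f"delay = {d} steps ({d * 0.05:.2f} s): peak x = {peak_position(d):.3f}")
```

The nominal trace alone would not show how quickly the peak grows as commands arrive later, which is why the delay test belongs before any claim that the controller or the policy is safe near a limit.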

Section References

Core references for this section: Modern Robotics; Murray, Li, and Sastry; Siciliano et al.; LaValle; and official documentation for Drake, MuJoCo, Pinocchio, CasADi, python-control, GTSAM, ROS 2, and OpenCV as applicable.

Use these references to check notation, frame conventions, units, solver assumptions, and maintained-library behavior.

Key Takeaway

A learned policy belongs in the control loop when it makes the perception-action loop more reliable, not when it merely adds a more impressive model name.

Exercise 7.7.1

Design a method-matched experiment that compares a classical controller with a learned policy. Specify the environment, observations, actions, metric, one perturbation, and the library output you would compare against the hand-built baseline.