Section 52.3: Safety violations and constraint satisfaction | Building Embodied AI: From Perception to Autonomous Action

For safety violations and constraint satisfaction, a benchmark conclusion survives reruns only when the panel, seed policy, exclusion rules, and raw episode artifacts are inspectable.
An Evaluation Methodologist

Big Picture

Embodied systems operate inside hard or soft constraints: collision margins, force limits, speed caps, no-go zones, thermal budgets, and human-separation rules. Evaluation is incomplete if these are not measured explicitly.

**Figure 52.3.1**: A constraint-aware evaluation view highlights forbidden states, action-rate limits, and intervention counts alongside the ordinary task outcome.

Why This Matters

Measuring safety violations and constraint satisfaction matters because evaluation choices change the scientific claim being made. If the metric drops time, energy, or safety terms that the deployment team cares about, the benchmark no longer matches the real decision.

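A common constraint statistic is the satisfaction rate $$C = 1 - \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\{\exists t: g(x_{i,t}, u_{i,t}) < 0\},$$ where $N$ is the number of episodes, $x_{i,t}$ and $u_{i,t}$ are the state and action of episode $i$ at time step $t$, and $g(x,u) \ge 0$ defines the allowed set. The indicator equals 1 when episode $i$ leaves the allowed set at any step, so $C$ is the fraction of episodes that never violate the constraint. For soft constraints, also log the violation magnitude and duration.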
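The satisfaction rate can be computed directly from logged margin traces. The following is a minimal sketch, assuming each episode is stored as a list of $g(x_{i,t}, u_{i,t})$ values; the function name `satisfaction_rate` and the sample traces are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def satisfaction_rate(episodes: Sequence[Sequence[float]]) -> float:
    """Fraction of episodes in which g stayed >= 0 at every time step."""
    if not episodes:
        raise ValueError("need at least one episode")
    # An episode counts as violated if g < 0 at any step (the 'exists t' in C).
    violated = sum(1 for g_trace in episodes if any(g < 0 for g in g_trace))
    return 1.0 - violated / len(episodes)


# Hypothetical margin traces, one list of g values per episode.
episodes = [
    [0.40, 0.25, 0.10],    # always inside the allowed set
    [0.30, -0.05, 0.20],   # one step outside: the episode counts as violated
    [0.50, 0.00, 0.15],    # g == 0 lies on the boundary, which is still allowed
    [-0.10, -0.20, 0.05],  # two steps outside: still only one violated episode
]

C = satisfaction_rate(episodes)
print(C)  # 0.5
```

Note that the fourth episode contributes the same amount to $C$ as the second, even though it spends longer outside the allowed set. That is why magnitude and duration need to be logged separately.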

Key Insight

Constraint satisfaction is not a detail added after task scoring. It changes which episodes count as acceptable and often changes which baseline should be considered competitive at all.
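A small sketch makes the ranking effect concrete. It assumes each episode is logged as a pair (task success, hard-constraint violation); the two policies and their outcomes are invented for illustration and do not come from any real benchmark.

```python
from typing import List, Tuple

Episode = Tuple[bool, bool]  # (task_success, hard_violation)


def raw_success_rate(episodes: List[Episode]) -> float:
    """Task success, ignoring constraints."""
    return sum(success for success, _ in episodes) / len(episodes)


def acceptable_rate(episodes: List[Episode]) -> float:
    """An episode is acceptable only if it succeeds without a hard violation."""
    return sum(success and not violated for success, violated in episodes) / len(episodes)


# Hypothetical panels of five episodes per policy.
policy_a = [(True, False), (True, True), (True, True), (True, False), (False, False)]
policy_b = [(True, False), (True, False), (True, False), (False, False), (False, False)]

print(raw_success_rate(policy_a), acceptable_rate(policy_a))  # 0.8 0.4
print(raw_success_rate(policy_b), acceptable_rate(policy_b))  # 0.6 0.6
```

On raw success, policy A leads. Once violating episodes are excluded, policy B leads, which is the sense in which constraints can change which baseline is competitive.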

Algorithmic View

State every hard and soft constraint in measurable units before running the benchmark.
Log the first violation time, maximum violation magnitude, and total time outside the safe set.
Distinguish near-boundary episodes from true violations so threshold tuning can be audited.
Aggregate per-constraint statistics before merging them into a benchmark-level summary.
Pair every violation with a replay artifact and causal postmortem label.
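The logging steps above can be sketched for a single constraint as follows. This is an illustrative sketch, not a reference implementation: it assumes the margin $g$ is sampled at a fixed period `dt`, and the names `ConstraintStats`, `summarize`, and `near_margin` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ConstraintStats:
    first_violation_time: Optional[float]  # seconds; None if never violated
    max_violation: float                   # largest -g observed while g < 0, else 0.0
    time_outside: float                    # seconds spent with g < 0
    near_boundary: bool                    # no violation, but minimum margin < near_margin


def summarize(g_trace: Sequence[float], dt: float, near_margin: float) -> ConstraintStats:
    """Summarize one constraint over one episode from its sampled margin g."""
    if not g_trace:
        raise ValueError("empty margin trace")
    first = None
    max_violation = 0.0
    steps_outside = 0
    for k, g in enumerate(g_trace):
        if g < 0:
            steps_outside += 1
            max_violation = max(max_violation, -g)
            if first is None:
                first = k * dt
    # Near-boundary episodes are kept separate from true violations.
    near = steps_outside == 0 and min(g_trace) < near_margin
    return ConstraintStats(first, max_violation, steps_outside * dt, near)


# Hypothetical traces sampled every 0.1 s.
violating = summarize([0.30, 0.10, -0.05, -0.20, 0.15], dt=0.1, near_margin=0.05)
grazing = summarize([0.30, 0.04, 0.20], dt=0.1, near_margin=0.05)
print(violating)
print(grazing)
```

Running `summarize` once per constraint and per episode yields the per-constraint table that should exist before any merged summary is computed. The `near_margin` threshold is itself a tuning choice, so it belongs in the benchmark configuration where it can be audited.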

Worked Example

A manipulator that completes a cabinet-opening task in 95 percent of trials but exceeds wrist torque limits in 8 percent of them is not simply 'slightly worse'. It fails a deployment gate.

```python
# Hard limits for this example: minimum clearance and maximum speed.
MIN_CLEARANCE_CM = 8.0
MAX_SPEED_MPS = 1.0

samples = [
    {"clearance_cm": 12.0, "speed_mps": 0.7},
    {"clearance_cm": 4.0, "speed_mps": 0.8},
    {"clearance_cm": 9.5, "speed_mps": 1.3},
]

# Flag each constraint separately so every failure stays attributable.
violations = []
for s in samples:
    v = {
        "clearance_violation": s["clearance_cm"] < MIN_CLEARANCE_CM,
        "speed_violation": s["speed_mps"] > MAX_SPEED_MPS,
    }
    violations.append(v)

print(violations)
```

```text
[{'clearance_violation': False, 'speed_violation': False}, {'clearance_violation': True, 'speed_violation': False}, {'clearance_violation': False, 'speed_violation': True}]
```

Code Fragment 52.3.1 evaluates two explicit constraints per sample, which is the minimum structure needed to discuss constraint satisfaction clearly.

Expected output: The output separates clearance and speed failures instead of collapsing them into one vague unsafe label. That separation is what supports diagnosis and mitigation.

Library Shortcut

In production, constraint evaluation belongs in the controller or monitor stack, with alerts exported into the benchmark artifact. Maintained logging and analysis tools save you from hand-parsing timestamps and threshold crossings after the fact.

Safety-violation evaluation needs time-resolved evidence: Pandas aggregates constraint margin and violation duration, SciPy compares paired severity measures, DVC pins hazard scenarios, MLflow or Weights & Biases records policy versions, and ROS 2 bags retain the exact frames where a guard was late.

Constraint tables often reveal that two models with similar utility differ sharply in safety profile. One may violate rarely but severely, while another grazes boundaries often without crossing them. Both patterns matter.

The section's concrete artifact is a violation ledger with minimum margin, duration, speed at violation, intervention source, and post-intervention state. That ledger is what turns a binary unsafe count into an engineering diagnosis.
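One possible shape for such a ledger is sketched below using only the Python standard library. The field names follow the list above, but the `LedgerEntry` class, the episode identifiers, and every value are hypothetical; a real ledger would be filled from the monitor's logs.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class LedgerEntry:
    episode_id: str
    constraint: str
    min_margin: float              # most negative margin reached, in the constraint's own units
    duration_s: float              # time spent outside the allowed set
    speed_at_violation_mps: float
    intervention_source: str       # e.g. "monitor", "operator", "none"
    post_intervention_state: str   # e.g. "stopped", "continued"


# Hypothetical entries for illustration only.
ledger = [
    LedgerEntry("ep-0007", "pedestrian_clearance", -0.12, 0.40, 0.9, "monitor", "stopped"),
    LedgerEntry("ep-0031", "pedestrian_clearance", -0.03, 0.10, 0.5, "none", "continued"),
    LedgerEntry("ep-0031", "speed_cap", -0.30, 1.20, 1.3, "operator", "stopped"),
]

# Rank by duration, which is in seconds for every constraint and so comparable.
# Margins are not compared across constraints because their units differ.
worst_first = sorted(ledger, key=lambda e: e.duration_s, reverse=True)

# Export as CSV so the ledger can travel with the benchmark artifact.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(LedgerEntry)])
writer.writeheader()
for entry in worst_first:
    writer.writerow(asdict(entry))
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping one row per violation, rather than one row per episode, lets the same episode appear under several constraints, as `ep-0031` does here.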

A common failure pattern is to count only whether a violation happened, not how long it lasted or how large it became. That throws away the information needed for risk ranking and controller redesign.

Cross-References

This section sets up Section 54.2 on safe exploration and Section 54.3 on barrier functions, where constraint satisfaction moves from measurement to enforcement.

Lab Recipe

Define two hard constraints and one soft constraint for an existing embodied task. Run a panel, compute satisfaction rate, maximum violation, and time-outside-safe-set, then inspect the worst episode replay.

Failure Mode

Do not average all violations into one severity-free percentage when the underlying hazards have different consequences. A minor workspace excursion and a force spike on a human-contact surface should not carry the same semantic weight.
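The difference between a pooled percentage and a per-severity report can be sketched in a few lines. The severity classes, constraint names, and counts below are hypothetical; which hazards count as critical is a decision for the deployment team, not something this snippet can supply.

```python
# Hypothetical mapping from constraint to severity class.
SEVERITY = {
    "workspace_excursion": "minor",
    "human_contact_force": "critical",
}

# Hypothetical violation log: (episode id, violated constraint).
violations = [
    ("ep-01", "workspace_excursion"),
    ("ep-02", "workspace_excursion"),
    ("ep-02", "human_contact_force"),
    ("ep-05", "workspace_excursion"),
]
n_episodes = 10

# Pooled view: share of episodes with any violation, regardless of hazard.
pooled_rate = len({ep for ep, _ in violations}) / n_episodes

# Per-severity view: share of episodes with at least one violation in each class.
episodes_by_class = {}
for ep, constraint in violations:
    episodes_by_class.setdefault(SEVERITY[constraint], set()).add(ep)
rate_by_class = {cls: len(eps) / n_episodes for cls, eps in episodes_by_class.items()}

print(pooled_rate)    # 0.3
print(rate_by_class)  # {'minor': 0.3, 'critical': 0.1}
```

The pooled figure of 0.3 hides the fact that one episode in ten involved a critical hazard, which is the number a safety review would ask for first.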

Practical Example

For a delivery robot, relevant constraints include pedestrian clearance, maximum cornering speed, and stop-distance budget. The benchmark should tell you which one failed first and how close nominal runs operate to the boundary.
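Both questions can be answered from per-constraint margin traces. The sketch below assumes each margin is logged as limit minus measured value, so that positive means inside the envelope, and that all traces share the sampling period `dt`. The function names and the numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def first_failure(margins: Dict[str, List[float]], dt: float) -> Optional[Tuple[str, float]]:
    """Return (constraint name, time in seconds) of the earliest violation, or None."""
    earliest = None
    for name, trace in margins.items():
        for k, g in enumerate(trace):
            if g < 0:
                # Only the first crossing of each constraint matters here.
                if earliest is None or k * dt < earliest[1]:
                    earliest = (name, k * dt)
                break
    return earliest


def min_margins(margins: Dict[str, List[float]]) -> Dict[str, float]:
    """Smallest margin per constraint: how close the run came to each boundary."""
    return {name: min(trace) for name, trace in margins.items()}


# Hypothetical delivery-robot episode sampled every 0.5 s.
episode = {
    "pedestrian_clearance_m": [0.8, 0.5, 0.2, -0.1],
    "cornering_speed_mps": [0.4, -0.2, 0.1, 0.3],
    "stop_distance_m": [1.0, 0.9, 0.7, 0.6],
}

print(first_failure(episode, dt=0.5))  # ('cornering_speed_mps', 0.5)
print(min_margins(episode))
```

For a nominal run, `first_failure` returns `None`, and the output of `min_margins` shows how much headroom each constraint had. If two constraints fail at the same step, this sketch reports whichever appears first in the dictionary, so a real benchmark should state its tie-breaking rule.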

Research Frontier

Open problems include combining learned uncertainty with formal constraints, calibrating near-boundary warnings, and deciding which soft constraint violations are acceptable during adaptation or exploration.

Self Check

Can you list one hard constraint, one soft constraint, and one severity field you would log for your platform? If not, the safety envelope is still underspecified.

Key Takeaway

Constraint satisfaction turns evaluation into an operational review. The benchmark must say whether the robot stayed inside the allowed envelope, not just whether it reached the goal.

Exercise 52.3.1

Take a benchmark you know and rewrite its success definition so any hard-constraint violation marks the episode unacceptable. Explain how that changes the leaderboard logic.

Fun Note

Reporting 95 percent task success without mentioning the wrist torque spikes is a bit like reporting a flight on time while omitting that the landing gear scraped the runway. The headline is technically accurate, which is the problem.

Section References

Ames, A. D. et al. "Control Barrier Function Based Quadratic Programs for Safety Critical Systems." (2017). https://arxiv.org/abs/1609.06408

Useful background for thinking about measurable constraint sets.

Koopman, P., and Wagner, M. "Challenges in Autonomous Vehicle Safety." (2017). https://arxiv.org/abs/1705.01284

A reminder that safety evaluation depends on explicit operational constraints.

What's Next

Section 52.4 extends this constraint-aware view into robustness by asking how performance degrades under shift, perturbation, and worst-case tails.