Section 52.1: Why accuracy is not enough | Building Embodied AI: From Perception to Autonomous Action

A robot can classify the scene perfectly and still drive into the wrong next action.
A Closed-Loop Experimentalist

Big Picture

Embodied evaluation starts from the fact that task value is produced by a closed loop, not by a single predictor. Accuracy can improve while latency, safety violations, and failed recoveries make the overall robot worse.

**Figure 52.1.1**: An evaluation dashboard where a high perception score coexists with timeout, collision, and energy penalties. This section begins with the core mismatch: a system can look accurate at one stage while failing as a closed-loop embodied agent.

**Figure 52.1.2**: Accuracy is only one upstream input. The deployment claim is made on utility after the whole loop, including monitor intervention, has run.

Why This Matters

In an embodied system, the useful question is not whether a perception or policy submodule has a high scalar score. The useful question is whether the full loop reaches goals faster, more safely, and more reproducibly under the same rollout panel.

A compact utility model is $$J = \mathbb{E}\left[\mathbb{1}\{\text{success}\} - \lambda_v V - \lambda_t T - \lambda_e E - \lambda_r R\right],$$ where $V$ counts violations, $T$ is completion time, $E$ is resource or energy cost, $R$ counts recoveries or rescues, and each $\lambda$ is a nonnegative weight that converts its term into units of task success. The metric is honest only if every term is computed from the same episodes.
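In practice the expectation is estimated by a sample mean over the logged panel. As a sketch, with $N$ episodes and $s_i$ as shorthand introduced here for the success indicator of episode $i$:

$$\hat{J} = \frac{1}{N}\sum_{i=1}^{N}\left(s_i - \lambda_v V_i - \lambda_t T_i - \lambda_e E_i - \lambda_r R_i\right).$$

Writing $u_i$ for the term inside the sum, a comparison of methods $A$ and $B$ on the same episodes reduces to the mean paired difference

$$\hat{\Delta} = \frac{1}{N}\sum_{i=1}^{N}\left(u_i^{A} - u_i^{B}\right),$$

which is only meaningful when both methods were run on the same $N$ episodes.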

Key Insight

Figure 52.1.2 matters because it shows exactly where isolated accuracy can disappear: state estimation can be stale, action selection can be slow, and monitors can intervene often enough to erase any upstream gain.

Algorithmic View

1. Freeze a task panel with explicit initial states, perturbations, and reset rules.
2. Run the closed loop, not only the predictor, and log action timestamps, monitor states, and termination causes.
3. Aggregate success, violations, time, energy, and recovery into one utility table.
4. Compare methods only on paired episodes or a fixed seed schedule.
5. Inspect failure traces before celebrating any utility improvement.
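The steps above can be sketched as a small paired harness. This is a toy illustration, not a real simulator: `rollout`, its delay model, the 25-second timeout, and the single time weight are all assumptions made up for the example. The point it shows is that both methods see the same seed, and therefore the same perturbation, in every episode.

```python
import random
from collections import Counter
from statistics import mean

SEEDS = list(range(20))  # fixed seed schedule shared by every method


def rollout(policy_delay_s: float, seed: int) -> dict:
    """Toy stand-in for one closed-loop episode (hypothetical dynamics)."""
    rng = random.Random(seed)              # same seed -> same perturbation
    disturbance_s = rng.uniform(0.0, 10.0)  # frozen per-episode perturbation
    time_s = 10.0 + disturbance_s + 20 * policy_delay_s
    timed_out = time_s > 25.0              # assumed timeout threshold
    return {
        "seed": seed,
        "time_s": time_s,
        "success": int(not timed_out),
        "termination": "timeout" if timed_out else "goal_reached",
    }


def utility(ep: dict, lambda_t: float = 0.02) -> float:
    """Success minus a time penalty; other terms omitted to keep the toy small."""
    return ep["success"] - lambda_t * ep["time_s"]


def paired_deltas(delay_a: float, delay_b: float, seeds=SEEDS) -> list:
    """Per-episode utility difference (A - B) on identical seeds."""
    return [
        utility(rollout(delay_a, s)) - utility(rollout(delay_b, s))
        for s in seeds
    ]


deltas = paired_deltas(delay_a=0.1, delay_b=0.4)
causes_b = Counter(rollout(0.4, s)["termination"] for s in SEEDS)
print("mean paired delta:", round(mean(deltas), 3))
print("termination causes for B:", dict(causes_b))
```

Logging the termination cause next to the utility is what makes the last step possible: a mean delta says that one method lost, and the cause counts say how.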

Worked Example

A pick-and-place policy can improve grasp-point classification from 88 percent to 93 percent, yet lower the final task score because it now pauses longer before actuation and triggers more timeout recoveries.

from dataclasses import dataclass

@dataclass
class Episode:
    success: int
    violations: int
    time_s: float
    energy_j: float
    recoveries: int


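def utility(ep: Episode,
            lambda_v: float = 4.0,
            lambda_t: float = 0.02,
            lambda_e: float = 0.005,
            lambda_r: float = 0.5) -> float:
    """Per-episode utility: success minus weighted violation, time, energy, and recovery costs."""
    # Each lambda converts its term into units of task success.
    return (
        ep.success
        - lambda_v * ep.violations
        - lambda_t * ep.time_s
        - lambda_e * ep.energy_j
        - lambda_r * ep.recoveries
    )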

baseline = Episode(success=1, violations=0, time_s=16.0, energy_j=38.0, recoveries=0)
accurate_but_slow = Episode(success=1, violations=0, time_s=29.0, energy_j=52.0, recoveries=1)

print({
    "baseline_utility": round(utility(baseline), 3),
    "accurate_but_slow_utility": round(utility(accurate_but_slow), 3),
})

{'baseline_utility': 0.49, 'accurate_but_slow_utility': -0.34}

Code Fragment 52.1.1 compares two successful episodes under one utility contract, showing that a slower, recovery-heavy policy can lose despite identical task success.

Expected output: The two episodes both succeed, but the second receives a lower utility because extra delay, energy use, and one recovery consume the apparent gain. That is the signature of a stage metric that is not sufficient on its own.
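The gap can be decomposed term by term using the default weights from Code Fragment 52.1.1, with $\Delta T = 13$ s, $\Delta E = 14$ J, and $\Delta R = 1$ read off the two episodes:

$$\begin{aligned} u_{\text{baseline}} - u_{\text{slow}} &= \lambda_t\,\Delta T + \lambda_e\,\Delta E + \lambda_r\,\Delta R \\ &= 0.02 \times 13 + 0.005 \times 14 + 0.5 \times 1 \\ &= 0.26 + 0.07 + 0.50 = 0.83. \end{aligned}$$

Under these particular weights the single recovery accounts for roughly 60 percent of the gap, the extra delay for about 30 percent, and energy for the rest. A different choice of weights would shift that split, so the weights themselves belong in the utility contract.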

Library Shortcut

The hand-built utility is about 20 lines. In practice, a benchmark runner can stream episode traces into Pandas, MLflow, or Weights and Biases Artifacts so the same utility contract is computed automatically for every rollout while preserving the raw evidence.

Accuracy audits in this section should join perception scores to downstream decisions: Pandas groups failures by task phase, SciPy estimates paired confidence intervals, DVC pins the evaluation panel, MLflow or Weights and Biases records model lineage, and ROS 2 bags preserve the sensor frames that explain why a correct label still produced a bad action.
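A paired confidence interval can be sketched without any third-party dependency. The example below swaps in a plain percentile bootstrap written with the standard library rather than a SciPy routine; the function name `paired_bootstrap_ci` and the six per-episode utilities are made up for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(u_a, u_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-episode utility difference (A - B)."""
    if len(u_a) != len(u_b):
        raise ValueError("paired comparison needs the same episodes for both methods")
    diffs = [a - b for a, b in zip(u_a, u_b)]
    rng = random.Random(seed)  # seeded so the interval is reproducible
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Made-up utilities for two methods on the same six episodes.
u_new = [0.41, 0.10, -0.34, 0.38, 0.05, -0.20]
u_base = [0.49, 0.45, 0.47, 0.50, 0.44, 0.48]

point, lo, hi = paired_bootstrap_ci(u_new, u_base)
print(f"mean delta = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling the differences, rather than each method's utilities separately, is what keeps the episode pairing intact. With only six episodes the interval is rough; the sketch shows the mechanics, not a recommended sample size.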

When teams say accuracy is not enough, the real scientific claim is that the stage metric is not construct-matched to the deployment objective. The cure is not to discard accuracy, but to embed it inside a utility table that also sees timing, control, and intervention.

The practical stack for an accuracy-is-not-enough review is a single episode table with columns for observation, predicted state, selected action, outcome, latency, and recovery. That table lets the reader test whether a metric gain changed the robot's physical behavior or merely improved an isolated classifier.
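One way to read such a table is to cross-tabulate label correctness against episode outcome. The sketch below is illustrative only: the rows are invented, the observation column is left out for brevity, and `true_state` is an extra column assumed here so that classifier correctness can be computed.

```python
from collections import Counter

# (episode_id, predicted_state, true_state, selected_action, outcome, latency_s, recovery)
rows = [
    ("ep0", "cup",  "cup",  "grasp", "success",   0.12, 0),
    ("ep1", "cup",  "cup",  "grasp", "timeout",   0.95, 1),
    ("ep2", "cup",  "bowl", "grasp", "collision", 0.10, 0),
    ("ep3", "bowl", "bowl", "grasp", "success",   0.15, 0),
    ("ep4", "bowl", "bowl", "wait",  "timeout",   1.20, 1),
]

# Count episodes by (label correct?, task succeeded?).
crosstab = Counter()
for _, pred, true, _, outcome, _, _ in rows:
    crosstab[(pred == true, outcome == "success")] += 1

accuracy = sum(n for (ok, _), n in crosstab.items() if ok) / len(rows)
success_rate = sum(n for (_, won), n in crosstab.items() if won) / len(rows)

# Episodes where the classifier was right but the robot still failed.
correct_but_failed = [r[0] for r in rows if r[1] == r[2] and r[4] != "success"]

print("accuracy:", accuracy, "success rate:", success_rate)
print("correct label, failed episode:", correct_but_failed)
```

The episodes in `correct_but_failed` are the ones worth replaying first, since their latency and recovery columns show whether the loss came after perception.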

A common postmortem pattern is that the high-scoring method moved the error to a later stage: better detection, worse pose estimate; better pose estimate, slower planning; better planning, more aggressive actions. Closed-loop artifacts reveal where the gain leaked away.

Cross-References

This section connects backward to Chapter 12 on task suites, anticipates Section 52.2 on multi-objective metrics, and points forward to Chapter 53 on uncertainty and Chapter 54 on safety.

Lab Recipe

Create an episode table for one embodied task with columns for `success`, `violations`, `time_s`, `energy_j`, and `recoveries`, matching the `Episode` fields in Code Fragment 52.1.1. Add one model change that improves an upstream score, then verify whether the total utility also improves.

Failure Mode

Do not compare a perception accuracy number from one task distribution with a utility score from another. Once the panels diverge, the comparison stops being an evaluation and becomes a story.
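A cheap guard against this failure is to refuse the comparison when the two tables do not cover the same episodes. The sketch below assumes each row carries an `episode_id` key; that key name and the helper `assert_same_panel` are hypothetical.

```python
def assert_same_panel(accuracy_rows, utility_rows, key="episode_id"):
    """Raise if the accuracy table and the utility table cover different episodes."""
    ids_acc = {row[key] for row in accuracy_rows}
    ids_util = {row[key] for row in utility_rows}
    if ids_acc != ids_util:
        only_acc = sorted(ids_acc - ids_util)
        only_util = sorted(ids_util - ids_acc)
        raise ValueError(
            f"panels diverge: only in accuracy={only_acc}, only in utility={only_util}"
        )
    return len(ids_acc)  # number of shared episodes


# Made-up rows for illustration.
acc_rows = [{"episode_id": f"ep{i}", "label_correct": 1} for i in range(3)]
util_rows = [{"episode_id": f"ep{i}", "utility": 0.4} for i in range(3)]
shifted_rows = [{"episode_id": f"ep{i}", "utility": 0.4} for i in range(1, 4)]

print(assert_same_panel(acc_rows, util_rows))  # same panel: returns the episode count

try:
    assert_same_panel(acc_rows, shifted_rows)
except ValueError as err:
    print(err)  # names the episodes that appear in only one table
```

Matching identifiers is a necessary check, not a sufficient one: it does not verify that initial states, perturbations, or reset rules were also held fixed.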

Practical Example

For a warehouse mobile manipulator, the deployment review should compare utility across matched pallets, aisle widths, battery states, and operator reset rules. A model that localizes objects better but demands more interventions does not earn promotion.

Research Frontier

Recent embodied benchmarks increasingly log videos, monitor traces, and action latency beside scalar task scores. The open problem is to make these richer artifacts easy to aggregate without losing comparability.

Self Check

Can you state one case where success rate stayed constant while the overall utility changed sign? If not, your grasp of the difference between stage metrics and closed-loop value is not yet solid.

Key Takeaway

Accuracy is a diagnostic input, not the deployment objective. The real claim lives in a same-panel utility table that includes success, violations, delay, energy, and recovery.

Exercise 52.1.1

Take one embodied benchmark you know well and design a utility function that would demote a policy that succeeds often but needs frequent human rescue or burns excessive time.

Section References

Henderson, P. et al. "Deep Reinforcement Learning that Matters." (2018). https://arxiv.org/abs/1709.06560

A reminder that seemingly better stage metrics often disappear under careful end-to-end evaluation.

Sutton, R. S., and Barto, A. G. "Reinforcement Learning: An Introduction." (2018). http://incompleteideas.net/book/the-book-2nd.html

Useful background for reward, utility, and rollout-based assessment.

What's Next

Section 52.2 keeps the same matched-panel discipline and asks how to combine success, path quality, time, and energy into a more interpretable score.