A robot can classify the scene perfectly and still drive into the wrong next action.
A Closed-Loop Experimentalist
Embodied evaluation starts from the fact that task value is produced by a closed loop, not by a single predictor. Accuracy can improve while latency, safety violations, and failed recoveries make the overall robot worse.
Why This Matters
In an embodied system, the useful question is not whether a perception or policy submodule has a high scalar score. The useful question is whether the full loop reaches goals faster, more safely, and more reproducibly under the same rollout panel.
A compact utility model is $$J = \mathbb{E}\left[\mathbb{1}\{\text{success}\} - \lambda_v V - \lambda_t T - \lambda_e E - \lambda_r R\right],$$ where $V$ counts violations, $T$ is completion time, $E$ is resource or energy cost, and $R$ counts recoveries or rescues. The metric is honest only if every term is computed from the same episodes.
The diagram matters because it shows exactly where isolated accuracy can disappear: state estimation can be stale, action selection can be slow, and monitors can intervene often enough to erase any upstream gain.
- Freeze a task panel with explicit initial states, perturbations, and reset rules.
- Run the closed loop, not only the predictor, and log action timestamps, monitor states, and termination causes.
- Aggregate success, violations, time, energy, and recovery into one utility table.
- Compare methods only on paired episodes or a fixed seed schedule.
- Inspect failure traces before celebrating any utility improvement.
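The paired-comparison step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed procedure: the per-seed utilities are invented numbers, `paired_differences` and `paired_bootstrap_ci` are hypothetical helper names, and a percentile bootstrap is only one reasonable choice for the interval.

```python
import random
from statistics import mean


def paired_differences(baseline: dict[int, float],
                       candidate: dict[int, float]) -> list[float]:
    """Per-seed utility differences (candidate minus baseline).

    Refuses to compare unless both methods ran on exactly the same seeds.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("methods were not evaluated on the same seed schedule")
    return [candidate[s] - baseline[s] for s in sorted(baseline)]


def paired_bootstrap_ci(diffs: list[float],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        rng_seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(rng_seed)
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Illustrative per-seed utilities (invented for this sketch).
baseline_u = {0: 0.49, 1: 0.41, 2: 0.52, 3: 0.38, 4: 0.45, 5: 0.50}
candidate_u = {0: -0.34, 1: 0.30, 2: 0.47, 3: -0.10, 4: 0.40, 5: 0.44}

diffs = paired_differences(baseline_u, candidate_u)
mean_diff = mean(diffs)
ci_low, ci_high = paired_bootstrap_ci(diffs)
print(f"mean paired difference: {mean_diff:.3f}")
print(f"95% bootstrap interval: [{ci_low:.3f}, {ci_high:.3f}]")
```

Pairing by seed removes episode-to-episode difficulty from the comparison, so the interval reflects the method change rather than the luck of the draw. The failure traces behind the largest differences still need to be read by hand.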
Worked Example
A pick-and-place policy can improve grasp-point classification from 88 percent to 93 percent, yet lower the final task score because it now pauses longer before actuation and triggers more timeout recoveries.
from dataclasses import dataclass


@dataclass
class Episode:
    success: int
    violations: int
    time_s: float
    energy_j: float
    recoveries: int


def utility(ep: Episode,
            lambda_v: float = 4.0,
            lambda_t: float = 0.02,
            lambda_e: float = 0.005,
            lambda_r: float = 0.5) -> float:
    return (
        ep.success
        - lambda_v * ep.violations
        - lambda_t * ep.time_s
        - lambda_e * ep.energy_j
        - lambda_r * ep.recoveries
    )


baseline = Episode(success=1, violations=0, time_s=16.0, energy_j=38.0, recoveries=0)
accurate_but_slow = Episode(success=1, violations=0, time_s=29.0, energy_j=52.0, recoveries=1)

print({
    "baseline_utility": round(utility(baseline), 3),
    "accurate_but_slow_utility": round(utility(accurate_but_slow), 3),
})
{'baseline_utility': 0.49, 'accurate_but_slow_utility': -0.34}

Expected output: The two episodes both succeed, but the second receives a lower utility because extra delay, energy use, and one recovery consume the apparent gain. That is the signature of a stage metric that is not sufficient on its own.
The hand-built utility is about 20 lines. In practice, a benchmark runner can stream episode traces into Pandas, MLflow, or Weights and Biases Artifacts so the same utility contract is computed automatically for every rollout while preserving the raw evidence.
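As a rough sketch of what computing the same utility contract for every rollout could look like, the block below reads episode records and builds a per-method utility table using only the standard library. It does not use Pandas, MLflow, or Weights and Biases; the JSON-lines record layout, the `method` and `episode_id` fields, and the trace values are all assumptions made for illustration. The penalty weights are copied from the worked example.

```python
import json
from collections import defaultdict
from statistics import mean

# Penalty weights copied from the worked example.
WEIGHTS = {"violations": 4.0, "time_s": 0.02, "energy_j": 0.005, "recoveries": 0.5}

# Hypothetical trace log: one JSON object per episode.
TRACE_LINES = """
{"method": "baseline", "episode_id": 0, "success": 1, "violations": 0, "time_s": 16.0, "energy_j": 38.0, "recoveries": 0}
{"method": "baseline", "episode_id": 1, "success": 0, "violations": 0, "time_s": 20.0, "energy_j": 30.0, "recoveries": 0}
{"method": "candidate", "episode_id": 0, "success": 1, "violations": 0, "time_s": 29.0, "energy_j": 52.0, "recoveries": 1}
{"method": "candidate", "episode_id": 1, "success": 1, "violations": 0, "time_s": 27.0, "energy_j": 50.0, "recoveries": 1}
""".strip().splitlines()


def utility(record: dict) -> float:
    """Success minus weighted penalties, all read from a single episode record."""
    return record["success"] - sum(w * record[k] for k, w in WEIGHTS.items())


def utility_table(lines: list[str]) -> dict[str, dict[str, float]]:
    """Group raw records by method; report column means, mean utility, and count."""
    by_method = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        by_method[record["method"]].append(record)
    table = {}
    for method, records in by_method.items():
        row = {k: mean(r[k] for r in records) for k in ("success", *WEIGHTS)}
        row["utility"] = mean(utility(r) for r in records)
        row["n_episodes"] = len(records)
        table[method] = row
    return table


table = utility_table(TRACE_LINES)
for method, row in table.items():
    print(method, {k: round(v, 3) for k, v in row.items()})
```

In this invented panel the candidate has the higher success rate and the lower mean utility, because every success costs a recovery. Keeping the raw records next to the table is what allows that claim to be audited later.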
Accuracy audits in this section should join perception scores to downstream decisions: Pandas groups failures by task phase, SciPy estimates paired confidence intervals, DVC pins the evaluation panel, MLflow or Weights and Biases records model lineage, and ROS 2 bags preserve the sensor frames that explain why a correct label still produced a bad action.
When teams say accuracy is not enough, the real scientific claim is that the stage metric is not construct-matched to the deployment objective. The cure is not to discard accuracy, but to embed it inside a utility table that also sees timing, control, and intervention.
The practical stack for an accuracy-is-not-enough review is a single episode table with columns for observation, predicted state, selected action, outcome, latency, and recovery. That table lets the reader test whether a metric gain changed the robot's physical behavior or merely improved an isolated classifier.
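One way to run that test on such a table is sketched below. The row layout follows the columns named above, plus a `true_state` column that is an assumption added here so predictions can be scored. The `Row` class, the `label_fix_effects` helper, and the three observations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    observation_id: str
    true_state: str
    predicted_state: str
    selected_action: str
    outcome: str          # "success" or "failure"
    latency_ms: float
    recovery: bool


def label_fix_effects(old: list[Row], new: list[Row]) -> dict[str, int]:
    """Count label fixes and how many of them changed the action or the outcome."""
    old_by_obs = {r.observation_id: r for r in old}
    counts = {"label_fixed": 0, "action_changed": 0, "outcome_improved": 0}
    for n in new:
        o = old_by_obs[n.observation_id]  # KeyError if the panels diverge
        fixed = (o.predicted_state != o.true_state
                 and n.predicted_state == n.true_state)
        if not fixed:
            continue
        counts["label_fixed"] += 1
        counts["action_changed"] += o.selected_action != n.selected_action
        counts["outcome_improved"] += (o.outcome == "failure"
                                       and n.outcome == "success")
    return counts


# Invented rows: the same three observations under the old and new model.
old_rows = [
    Row("a", "upright", "tilted", "top_grasp", "success", 40.0, False),
    Row("b", "upright", "tilted", "side_grasp", "failure", 42.0, True),
    Row("c", "tilted", "tilted", "side_grasp", "success", 41.0, False),
]
new_rows = [
    Row("a", "upright", "upright", "top_grasp", "success", 65.0, False),
    Row("b", "upright", "upright", "top_grasp", "success", 66.0, False),
    Row("c", "tilted", "tilted", "side_grasp", "success", 64.0, False),
]

effects = label_fix_effects(old_rows, new_rows)
print(effects)
```

Here two labels were fixed but only one fix changed the selected action and the outcome. The other is an accuracy gain with no behavioral effect, bought at higher latency.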
A common postmortem pattern is that the high-scoring method moved the error to a later stage: better detection, worse pose estimate; better pose estimate, slower planning; better planning, more aggressive actions. Closed-loop artifacts reveal where the gain leaked away.
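A simple way to make that migration visible is to tally termination causes by stage for each method and difference the tallies. The sketch below assumes each failed episode has been labeled with a single responsible stage. The stage names, the `stage_shift` helper, and the failure lists are invented for illustration.

```python
from collections import Counter

# Hypothetical termination causes, one per failed episode, keyed by method.
FAILURES = {
    "baseline": ["detection", "detection", "detection", "pose", "planning"],
    "candidate": ["detection", "pose", "pose", "pose", "planning"],
}
STAGES = ("detection", "pose", "planning", "control")


def stage_shift(failures: dict[str, list[str]], a: str, b: str) -> dict[str, int]:
    """Per-stage change in failure count when moving from method a to method b."""
    count_a, count_b = Counter(failures[a]), Counter(failures[b])
    return {stage: count_b[stage] - count_a[stage] for stage in STAGES}


shift = stage_shift(FAILURES, "baseline", "candidate")
print(shift)
print("net change in failures:", sum(shift.values()))
```

In this toy data the candidate removes two detection failures and adds two pose failures, so the total is unchanged even though the detection score would look better in isolation.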
Cross-References
This section connects backward to Chapter 12 on task suites, anticipates Section 52.2 on multi-objective metrics, and points forward to Chapter 53 on uncertainty and Chapter 54 on safety.
Create an episode table for one embodied task with columns for `success`, `time_s`, `energy_j`, `violation_count`, and `recovery_count`. Add one model change that improves an upstream score, then verify whether the total utility also improves.
Do not compare a perception accuracy number from one task distribution with a utility score from another. Once the panels diverge, the comparison stops being an evaluation and becomes a story.
For a warehouse mobile manipulator, the deployment review should compare utility across matched pallets, aisle widths, battery states, and operator reset rules. A model that localizes objects better but demands more interventions does not earn promotion.
Recent embodied benchmarks increasingly log videos, monitor traces, and action latency beside scalar task scores. The open problem is to make these richer artifacts easy to aggregate without losing comparability.
Can you state one case where success rate stayed constant while the overall utility changed sign? If not, the difference between stage metrics and closed-loop value is not yet solid.
Accuracy is a diagnostic input, not the deployment objective. The real claim lives in a same-panel utility table that includes success, violations, delay, energy, and recovery.
Take one embodied benchmark you know well and design a utility function that would demote a policy that succeeds often but needs frequent human rescue or burns excessive time.
Section References
Henderson, P. et al. "Deep Reinforcement Learning that Matters." (2018). https://arxiv.org/abs/1709.06560
A reminder that seemingly better stage metrics often disappear under careful end-to-end evaluation.
Sutton, R. S., and Barto, A. G. "Reinforcement Learning: An Introduction." (2018). http://incompleteideas.net/book/the-book-2nd.html
Useful background for reward, utility, and rollout-based assessment.
Section 52.2 keeps the same matched-panel discipline and asks how to combine success, path quality, time, and energy into a more interpretable score.