For success rate, path efficiency, time, and energy cost, a benchmark conclusion survives reruns only when the panel, seed policy, exclusion rules, and raw episode artifacts are inspectable.
An Evaluation Methodologist
Embodied evaluation needs vector metrics because time, path length, smoothness, and energy all compete under fixed task success. This section shows how to keep those quantities interpretable without hiding tradeoffs.
Why This Matters
Reporting success rate, path efficiency, time, and energy cost together matters because evaluation choices rewrite the scientific claim. If the metric drops time, energy, or safety terms that the deployment team cares about, the benchmark no longer matches the real decision.
One useful representation is the episode score vector $$m_i = [s_i,\; \rho_i,\; t_i,\; e_i],$$ with $s_i$ for success, $\rho_i = d^*_i / d_i$ for path efficiency, $t_i$ for completion time, and $e_i$ for energy. A scalar summary is acceptable only after the vector is logged and inspectable.
Path efficiency and energy cost should remain visible as separate columns even if the benchmark publishes one scalar rank. Otherwise teams optimize the weighted sum while reviewers lose the tradeoff surface.
- Compute shortest-feasible reference distance or nominal task budget before running candidate methods.
- Record actual path length, action count, elapsed time, and energy or torque proxy for every episode.
- Normalize each quantity against a baseline or physical reference when cross-task aggregation is needed.
- Report both the raw vector and any scalarized utility.
- Audit whether the scalar ranking changes under small weight perturbations.
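The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed procedure: the two method vectors, the choice of the baseline as the normalization reference, the base weights, and the perturbation size `delta` are all hypothetical values chosen for the example.

```python
import itertools

# Hypothetical per-method mean vectors (illustrative numbers, not measured results).
methods = {
    "baseline":  {"success": 0.90, "path_efficiency": 0.85, "time_s": 24.0, "energy_j": 480.0},
    "candidate": {"success": 0.90, "path_efficiency": 0.78, "time_s": 20.0, "energy_j": 560.0},
}
REFERENCE = "baseline"  # assumption: normalize every quantity against the baseline


def normalize(vec, ref):
    """Express each quantity relative to the reference so that higher is better."""
    return {
        "success": vec["success"] / ref["success"],
        "path_efficiency": vec["path_efficiency"] / ref["path_efficiency"],
        # Time and energy are costs, so the ratio is inverted.
        "time": ref["time_s"] / vec["time_s"],
        "energy": ref["energy_j"] / vec["energy_j"],
    }


def utility(norm, weights):
    """Weighted sum of normalized quantities (one possible scalarization)."""
    return sum(weights[k] * norm[k] for k in weights)


def ranking(weights):
    """Method names sorted from highest to lowest scalar utility."""
    ref = methods[REFERENCE]
    scores = {name: utility(normalize(vec, ref), weights) for name, vec in methods.items()}
    return sorted(scores, key=scores.get, reverse=True)


def perturbed(weights, delta=0.1):
    """Yield renormalized weight sets with one weight shifted by +/- delta."""
    for key, sign in itertools.product(weights, (+1, -1)):
        w = dict(weights)
        w[key] = max(0.0, w[key] + sign * delta)
        total = sum(w.values())
        yield {k: v / total for k, v in w.items()}


base_weights = {"success": 0.4, "path_efficiency": 0.2, "time": 0.2, "energy": 0.2}
nominal = ranking(base_weights)
flips = [w for w in perturbed(base_weights) if ranking(w) != nominal]

print("nominal ranking:", nominal)
print("perturbed weight sets that change the ranking:", len(flips))
```

With these illustrative numbers the baseline ranks first under the nominal weights, yet three of the eight perturbed weight sets reverse the order. A ranking that fragile is a sign that the scalar is reporting the weights more than the methods.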
Worked Example
A quadruped navigation policy may tie the baseline on success rate but use 30 percent more turning and 18 percent more battery because it oscillates in narrow passages. A scalar success metric would never show the regression.
```python
# Two successful episodes, each with an 8 m reference path.
episodes = [
    {"success": 1, "optimal_path_m": 8.0, "actual_path_m": 9.0, "time_s": 21.5, "energy_j": 440.0},
    {"success": 1, "optimal_path_m": 8.0, "actual_path_m": 11.6, "time_s": 28.2, "energy_j": 590.0},
]

summary = []
for ep in episodes:
    # Path efficiency: reference path length divided by actual path length.
    path_eff = ep["optimal_path_m"] / ep["actual_path_m"]
    summary.append({
        "success": ep["success"],
        "path_efficiency": round(path_eff, 3),
        "time_s": ep["time_s"],
        "energy_j": ep["energy_j"],
    })
print(summary)
```

```
[{'success': 1, 'path_efficiency': 0.889, 'time_s': 21.5, 'energy_j': 440.0}, {'success': 1, 'path_efficiency': 0.69, 'time_s': 28.2, 'energy_j': 590.0}]
```

Expected output: Both episodes succeed, but the second is visibly less efficient, slower, and more expensive. That is the point: success rate should not erase operational cost.
Once the vector definition is stable, use a benchmark dataframe plus MLflow or Weights and Biases tables to compute per-task and aggregate summaries automatically. The important part is that the raw vector remains accessible.
Success, path efficiency, time, and energy belong in one paired episode table, where each route and seed is run under both the candidate and the baseline. In that workflow, pandas computes per-route deltas, SciPy bootstraps confidence intervals, DVC freezes the route panel, MLflow or Weights and Biases keeps run lineage, and ROS 2 bags expose whether a faster route came from better planning or more aggressive control.
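The paired-delta idea can be illustrated without the named tools. The sketch below is a plain-Python stand-in that uses only the standard library and a simple percentile bootstrap, so it is not SciPy's implementation and may give slightly different intervals. The route names, column names, energy values, and the `bootstrap_ci` helper are hypothetical.

```python
import random
import statistics

# Hypothetical paired episode table: one row per (route, seed),
# with both methods evaluated on the same panel.
paired = [
    {"route": "r1", "seed": 0, "baseline_energy_j": 440.0, "candidate_energy_j": 455.0},
    {"route": "r1", "seed": 1, "baseline_energy_j": 450.0, "candidate_energy_j": 600.0},
    {"route": "r2", "seed": 0, "baseline_energy_j": 510.0, "candidate_energy_j": 550.0},
    {"route": "r2", "seed": 1, "baseline_energy_j": 505.0, "candidate_energy_j": 595.0},
    {"route": "r3", "seed": 0, "baseline_energy_j": 380.0, "candidate_energy_j": 405.0},
    {"route": "r3", "seed": 1, "baseline_energy_j": 390.0, "candidate_energy_j": 450.0},
]

# Per-row delta: positive means the candidate used more energy on that route and seed.
deltas = [row["candidate_energy_j"] - row["baseline_energy_j"] for row in paired]


def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired deltas."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


mean_delta = statistics.fmean(deltas)
low, high = bootstrap_ci(deltas)
print(f"mean energy delta: {mean_delta:.1f} J, 95% CI [{low:.1f}, {high:.1f}] J")
```

Resampling the deltas rather than the two columns separately preserves the pairing, which is what makes the interval a statement about the same routes and seeds.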
Scalarization is a policy decision, not a law of nature. Different applications weight speed, smoothness, and energy differently, so good benchmark design publishes the underlying vector and documents any chosen weights.
The reproducible artifact for this section is a run ledger that stores path length, elapsed time, watt-hours or battery drop, replan count, controller saturation, and terminal success for the same seed and map. Comparing any subset without the rest invites a misleading leaderboard.
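One possible shape for such a ledger is sketched below. The class name, field names, units, and the definition of `saturation_fraction` are assumptions made for illustration; the only part taken from the text is the list of quantities and the requirement that comparisons share the same seed and map.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LedgerEntry:
    """One episode in the run ledger (hypothetical schema)."""
    method: str
    map_id: str
    seed: int
    path_length_m: float
    elapsed_s: float
    energy_wh: float
    replan_count: int
    saturation_fraction: float  # assumed: share of control steps at actuator limits
    success: bool


def paired_keys(entries, method_a, method_b):
    """Return the (map_id, seed) keys logged for both methods."""
    keys_a = {(e.map_id, e.seed) for e in entries if e.method == method_a}
    keys_b = {(e.map_id, e.seed) for e in entries if e.method == method_b}
    return sorted(keys_a & keys_b)


# Illustrative entries, not measured data.
ledger = [
    LedgerEntry("baseline", "warehouse_a", 0, 9.0, 21.5, 0.122, 1, 0.02, True),
    LedgerEntry("candidate", "warehouse_a", 0, 11.6, 28.2, 0.164, 4, 0.11, True),
    LedgerEntry("candidate", "warehouse_a", 1, 10.1, 24.0, 0.140, 2, 0.05, True),
]

# Only episodes with a matching seed and map on both sides are comparable.
comparable = paired_keys(ledger, "baseline", "candidate")
print(comparable)
print(json.dumps(asdict(ledger[0])))
```

Here the candidate's seed 1 episode is excluded from the comparison because the baseline has no matching run, which keeps unpaired episodes out of the leaderboard.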
When teams overcompress these metrics, they often rediscover hidden regressions late in deployment, especially thermal issues, battery drain, and route oscillations that were invisible in success-only dashboards.
Cross-References
This section links back to Chapter 7 on control-oriented costs and tradeoffs and forward to Section 52.4 on robustness metrics, where the same vector idea is extended across perturbation families.
Instrument one navigation or manipulation task with path efficiency and energy proxy. Then perturb controller gains and verify whether the tradeoff surface changes even when success stays flat.
Do not compute path efficiency against a reference planner that quietly violates the robot's kinematic or dynamic limits. The denominator must be a feasible reference, not a fantasy route.
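A simple guard can catch one kind of infeasible reference before it reaches the denominator. The sketch below assumes a planar robot with a minimum turning radius and a reference path given as 2D waypoints; it estimates the turn radius from each consecutive waypoint triple, so the result depends on waypoint spacing and is only a coarse screen. The function names and the 1.0 m radius are hypothetical, and platforms without a turning-radius limit would need a different feasibility test.

```python
import math


def circumradius(p, q, r):
    """Radius of the circle through three 2D points (infinite if collinear)."""
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(p, r)
    # Twice the triangle area, from the 2D cross product.
    cross = abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))
    if cross == 0.0:
        return math.inf
    return a * b * c / (2.0 * cross)


def min_turn_radius(path):
    """Tightest turn radius estimated over consecutive waypoint triples."""
    return min(
        (circumradius(*path[i:i + 3]) for i in range(len(path) - 2)),
        default=math.inf,
    )


def path_length(path):
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))


def path_efficiency(reference, actual, min_radius_m):
    """Reference length over actual length, refusing infeasible references."""
    if min_turn_radius(reference) < min_radius_m:
        raise ValueError("reference path turns tighter than the robot can")
    return path_length(reference) / path_length(actual)


# Illustrative waypoints in metres.
sharp_reference = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]    # right-angle corner
gentle_reference = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.5)]
actual = [(0.0, 0.0), (2.0, 0.5), (3.0, -0.3), (4.0, 0.5)]

print(round(min_turn_radius(sharp_reference), 3))
print(round(path_efficiency(gentle_reference, actual, min_radius_m=1.0), 3))
```

With a 1.0 m minimum radius, the right-angle reference would be rejected, while the gentle reference passes and yields a usable efficiency value.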
For autonomous driving, this metric vector can include route completion, jerk, delay, and energy. For drones, swap path efficiency for trajectory deviation and power draw. The structure stays the same while the physics changes.
Recent embodied benchmarks are moving toward Pareto-front reporting and scenario-conditioned scorecards, especially for fleets where battery, heat, and wear matter as much as mission completion.
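As a rough illustration of Pareto-front reporting, the sketch below filters a set of policies down to the non-dominated ones. The policy names and objective values are hypothetical, and every objective is oriented so that lower is better (failure rate, time, energy); a real scorecard would also condition on scenario.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.

    All objectives are oriented so that lower is better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the entries that no other entry dominates."""
    return {
        name: obj
        for name, obj in points.items()
        if not any(
            dominates(other, obj)
            for other_name, other in points.items()
            if other_name != name
        )
    }


# Hypothetical (failure_rate, time_s, energy_j) per policy.
policies = {
    "A": (0.10, 24.0, 480.0),
    "B": (0.10, 20.0, 560.0),
    "C": (0.12, 25.0, 600.0),
}

front = pareto_front(policies)
print(sorted(front))
```

Policy C is dominated by A and drops out. A and B both remain because A uses less energy while B is faster, which is the tradeoff a single scalar rank would have to resolve by choosing weights.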
Can you explain why a benchmark should publish both the metric vector and the scalar rank? If not, you are still trusting the aggregation more than the evidence.
Success rate is the entry ticket, not the whole evaluation. Path efficiency, time, and energy reveal whether the robot succeeded in a deployable way.
Choose one embodied task and design a feasible reference path or action budget. Then define a vector metric and one scalar utility, and explain which tradeoffs the scalar hides.
Two robots can tie on success rate while one spirals through the room like a caffeinated Roomba and the other walks the shortest path. A success-only leaderboard will not tell you which one is deploying Monday.
Section References
Paden, B. et al. "A Survey of Motion Planning and Control Techniques for Self-Driving Urban Vehicles." (2016). https://arxiv.org/abs/1604.07446
A useful reference for trajectory quality and control-oriented evaluation quantities.
Official robot benchmarking and fleet telemetry documentation for the platform under study.
Use platform-native energy, thermal, or power interfaces rather than guessed proxies when possible.
Section 52.3 adds hard constraints to this vector view by asking whether the robot moved efficiently and remained inside the allowed envelope.