Section 17.6: Throughput, wall-clock, and cost engineering | Building Embodied AI: From Perception to Autonomous Action


Big Picture

Throughput, wall-clock, and cost engineering is where GPU RL becomes an engineering system. A useful run is not the one with the largest steps-per-second number; it is the one that reaches a held-out success target with known memory use, known evaluation overhead, and known compute cost.

Cost engineering for GPU RL depends on simulator fidelity, PPO rollout semantics, reward terms, and the reset distribution all being versioned in the same training artifact. A change to any of them changes what a step costs and what a success means.

This section develops the cost contract for massively parallel RL. We define the denominator for every claim: environment steps per second, policy updates per minute, wall-clock to target success, GPU memory at peak, evaluation time, and dollars per successful checkpoint.

The key question is practical: did the optimization make learning cheaper, or did it only move time from rollout collection into compilation, synchronization, logging, or evaluation?

Report The Target, Not Only The Rate

Steps per second is a systems metric. Wall-clock to held-out success is the learning metric that matters to a builder deciding what to run next.

Theory

Let a run collect $S$ environment steps in $W$ wall-clock seconds. The raw throughput is $S/W$, but the cost metric should include training, evaluation, checkpointing, and failed runs. If the target is a 90 percent held-out success rate, the relevant question is how many dollars and minutes were spent before the first checkpoint reached that target.
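One way to make this precise is sketched below. The symbols beyond $S$ and $W$ are assumptions introduced here for illustration: $W_{\text{eval}}$, $W_{\text{ckpt}}$, and $W_{\text{fail}}$ are the evaluation, checkpointing, and failed-run seconds billed before the first target-reaching checkpoint, and $p$ is the instance price in dollars per hour. $W$ is read as the training seconds over the same window.

$$
\text{throughput} = \frac{S}{W}, \qquad
C_{\text{target}} = \frac{p}{3600}\,\bigl(W + W_{\text{eval}} + W_{\text{ckpt}} + W_{\text{fail}}\bigr)
$$

Under this reading, two runs with identical $S/W$ can differ in $C_{\text{target}}$ whenever their evaluation, checkpointing, or failed-run time differs.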

There is a common tension between throughput and sample efficiency. Larger batches can improve device utilization, but they may reduce update frequency or increase policy lag. Smaller batches may learn with fewer samples but waste the accelerator.
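The tension can be illustrated with a small calculation. The sketch below assumes a synchronous loop that collects one batch and then runs one learner update; the function name `loop_stats` and every number are hypothetical, not measurements from any framework.

```python
# Sketch of the batch-size trade-off for a synchronous rollout-then-update loop.
# All numbers are hypothetical; they are not measurements from any framework.

def loop_stats(num_envs, horizon, sim_steps_per_second, learner_seconds):
    """Return (policy updates per minute, end-to-end env steps per second)."""
    batch_steps = num_envs * horizon                       # steps collected per update
    rollout_seconds = batch_steps / sim_steps_per_second   # time to fill one batch
    iteration_seconds = rollout_seconds + learner_seconds  # rollout, then one update
    return 60.0 / iteration_seconds, batch_steps / iteration_seconds

# Assumption: the larger batch uses the device better but needs a longer update.
small = loop_stats(num_envs=1024, horizon=24,
                   sim_steps_per_second=200_000, learner_seconds=0.05)
large = loop_stats(num_envs=8192, horizon=24,
                   sim_steps_per_second=500_000, learner_seconds=0.20)

for name, (updates_per_minute, steps_per_second) in (("small batch", small),
                                                     ("large batch", large)):
    print(f"{name}: {updates_per_minute:,.0f} updates/min, "
          f"{steps_per_second:,.0f} env steps/s")
```

With these assumed inputs the large batch wins on steps per second and loses on updates per minute. Which one reaches the target sooner is an empirical question that neither number answers alone.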

Mechanism

The mechanism is an accounting loop: measure rollout time, learner time, evaluation time, synchronization time, peak memory, and target success in the same run. Only then can you decide whether the bottleneck is simulation, policy inference, advantage computation, optimizer updates, logging, or evaluation.
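A minimal sketch of such an accounting loop follows, using only the Python standard library. The `timed` helper and the phase names are hypothetical, and the `time.sleep` calls stand in for real work. On a GPU the work is dispatched asynchronously, so a real loop would need to synchronize the device before reading the clock; a framework profiler handles that more reliably than this hand-rolled timer.

```python
# Sketch: accumulate wall-clock time per phase within one run.
# The sleeps are placeholders for rollout, learner, and evaluation work.
import time
from collections import defaultdict
from contextlib import contextmanager

phase_seconds = defaultdict(float)

@contextmanager
def timed(phase):
    """Add the elapsed wall-clock time of the enclosed block to phase_seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start

for iteration in range(3):
    with timed("rollout"):
        time.sleep(0.002)   # placeholder for environment stepping
    with timed("learner"):
        time.sleep(0.001)   # placeholder for advantage computation and updates
with timed("evaluation"):
    time.sleep(0.002)       # placeholder for a held-out evaluation pass

total = sum(phase_seconds.values())
for phase, seconds in sorted(phase_seconds.items(), key=lambda item: -item[1]):
    print(f"{phase:<12}{seconds:8.4f} s  {seconds / total:6.1%}")
```

Because every phase is timed in the same process and the same run, the shares sum to the measured total and can be stored next to the success number for that run.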

Worked Example

Code Fragment 17.6.1 computes the metrics that should appear together in a cost report. The same artifact contains throughput, wall-clock to target, evaluation overhead, and compute spend.

# Compute throughput and cost from one RL training run.
# The target-success checkpoint, not peak steps per second, drives the decision.
env_steps = 1_200_000_000
train_minutes = 42.0
eval_minutes = 6.0
gpu_dollars_per_hour = 2.20
heldout_success = 0.92
target_success = 0.90

total_minutes = train_minutes + eval_minutes
steps_per_second = env_steps / (train_minutes * 60)
total_cost = (total_minutes / 60) * gpu_dollars_per_hour
cost_per_billion_steps = total_cost / (env_steps / 1_000_000_000)

print(f"train throughput: {steps_per_second:,.0f} env steps/s")
print(f"total wall-clock: {total_minutes:.1f} min")
print(f"held-out success: {heldout_success:.2f}")
print(f"target reached: {heldout_success >= target_success}")
print(f"total compute cost: ${total_cost:.2f}")
print(f"cost per billion steps: ${cost_per_billion_steps:.2f}")

train throughput: 476,190 env steps/s
total wall-clock: 48.0 min
held-out success: 0.92
target reached: True
total compute cost: $1.76
cost per billion steps: $1.47

Code Fragment 17.6.1 computes a cost report from one training run. The target-success line keeps throughput honest by tying the speed claim to held-out behavior, while evaluation minutes keep wall-clock accounting complete.

Expected output: the trace should include both systems and learning metrics. A run that reports 476,190 steps per second but omits held-out success has not shown that those steps bought a deployable policy.

Library Shortcut

In practical systems, rely on the framework's profiler, GPU telemetry, and logger rather than hand-timing a single function. Isaac Lab, RSL-RL, rl_games, SKRL, Brax, and MJX can all produce impressive throughput; the engineering shortcut is to export comparable cost records from the same evaluation script.

Practical Recipe

Define the success target before training, such as 90 percent held-out success across 256 evaluation seeds.
Measure rollout, learning, evaluation, logging, and checkpoint time separately.
Track peak GPU memory, utilization, host-device transfer time, and compilation time where relevant.
Report dollars per target-reaching checkpoint, not only dollars per hour.
Repeat the run across seed panels when the result will support a paper table or hardware decision.

Common Failure Mode

The common mistake is to maximize training throughput by reducing evaluation frequency, then miss the first checkpoint that actually generalizes. Evaluation is part of the wall-clock budget, not an optional afterthought.
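The effect of evaluation cadence can be sketched with a toy success curve. The function `first_detected`, the curve, and the per-checkpoint minutes are all hypothetical values chosen for illustration.

```python
# Sketch: how evaluation cadence changes the wall-clock at which the first
# target-reaching checkpoint is detected. All numbers are hypothetical.

def first_detected(success_by_checkpoint, eval_every, target,
                   train_minutes_per_checkpoint, eval_minutes_per_pass):
    """Return (checkpoint index, total minutes) at first detection, or (None, None)."""
    eval_minutes = 0.0
    for index, success in enumerate(success_by_checkpoint, start=1):
        if index % eval_every != 0:
            continue                      # checkpoint saved but never evaluated
        eval_minutes += eval_minutes_per_pass
        if success >= target:
            train_minutes = index * train_minutes_per_checkpoint
            return index, train_minutes + eval_minutes
    return None, None

# Assumed held-out success at checkpoints 1..8; the target is first met at 4.
curve = [0.40, 0.62, 0.78, 0.91, 0.93, 0.92, 0.94, 0.95]

for eval_every in (1, 2, 3):
    checkpoint, minutes = first_detected(curve, eval_every, target=0.90,
                                         train_minutes_per_checkpoint=5.0,
                                         eval_minutes_per_pass=1.5)
    print(f"evaluate every {eval_every}: detected at checkpoint {checkpoint}, "
          f"{minutes:.1f} min total")
```

In this toy setting, evaluating every third checkpoint skips checkpoint 4 and detects the target only at checkpoint 6, so the run is billed for two extra training intervals. Evaluating every checkpoint finds it on time but pays for more evaluation passes. The cadence belongs in the ledger because it changes both numbers.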

Practical Example

A team choosing between two GPU instances should compare them on one script that trains to the same held-out success target. The faster instance can still be the worse choice if it runs out of memory, needs a smaller batch, or spends more time compiling and evaluating.
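As a rough sketch of that comparison, the snippet below scores two runs by minutes and dollars to the same target. The class `InstanceRun` and the values for `instance-b` are hypothetical; `instance-a` reuses the numbers from the worked example above.

```python
# Sketch: compare two instances on minutes and dollars to the same target.
# instance-b and its numbers are hypothetical, chosen only for illustration.
from dataclasses import dataclass

@dataclass
class InstanceRun:
    name: str
    dollars_per_hour: float
    compile_minutes: float
    train_minutes: float
    eval_minutes: float
    reached_target: bool

    @property
    def minutes_to_target(self) -> float:
        # All billable phases count, not only steady-state training.
        return self.compile_minutes + self.train_minutes + self.eval_minutes

    @property
    def cost_to_target(self) -> float:
        return self.minutes_to_target / 60 * self.dollars_per_hour

runs = [
    InstanceRun("instance-a", 2.20, 0.0, 42.0, 6.0, True),
    InstanceRun("instance-b", 4.10, 4.0, 24.0, 6.0, True),
]

# Only runs that reached the held-out target are comparable on cost.
eligible = [run for run in runs if run.reached_target]
fastest = min(eligible, key=lambda run: run.minutes_to_target)
cheapest = min(eligible, key=lambda run: run.cost_to_target)

for run in eligible:
    print(f"{run.name}: {run.minutes_to_target:.1f} min, ${run.cost_to_target:.2f}")
print(f"fastest: {fastest.name}, cheapest: {cheapest.name}")
```

Under these assumed prices the faster instance is not the cheaper one, which is why the decision needs both columns.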

Memory Hook

The fastest run is not always the cheapest run. The cheapest successful run is the one that reaches the target before curiosity turns into a hyperparameter sweep.

Research Frontier

The frontier is shifting from raw simulator speed toward full-stack efficiency: GPU-resident sensors, batched rendering, compilation-aware training, automatic curriculum scheduling, and evaluation systems that can keep up with faster learners. The open problem is measuring cost per robust behavior, not cost per simulator step.

Self Check

Can you report steps per second, wall-clock to target, evaluation overhead, peak memory, instance cost, failed-run count, and held-out success from one artifact? If not, the cost claim is incomplete.

The idea in this section becomes useful when every number has a denominator. Steps per second uses training seconds. Wall-clock to target uses training plus evaluation seconds. Cost uses all billable time. Paper tables and engineering decisions should not mix these denominators.

The graduate-level habit is to report the bottleneck breakdown. A low-throughput run may be simulation-bound, learner-bound, memory-bound, evaluation-bound, logging-bound, or compilation-bound. Each bottleneck implies a different fix.
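A bottleneck breakdown can be as simple as a table of phase shares. The per-phase seconds below are hypothetical; they are chosen so that the training phases sum to 42 minutes and evaluation to 6 minutes, matching the worked example.

```python
# Sketch: turn per-phase seconds into shares and name the dominant phase.
# The phase values are hypothetical, not profiler output.
phase_seconds = {
    "simulation": 1100.0,
    "learner": 900.0,
    "compilation": 250.0,
    "host_device_transfer": 180.0,
    "logging": 90.0,
    "evaluation": 360.0,
}

total_seconds = sum(phase_seconds.values())
shares = {phase: seconds / total_seconds for phase, seconds in phase_seconds.items()}
dominant = max(shares, key=shares.get)

for phase, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{phase:<22}{share:6.1%}")
print(f"total: {total_seconds / 60:.1f} min, dominant phase: {dominant}")
```

A breakdown like this does not prescribe the fix, but it does rule out fixes aimed at a phase that holds only a small share of the budget.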

Practical Tool Choices For This Section

Tool or Library	Role in the Topic	Builder Advice
Steps per second	Raw simulator and learner throughput	Use it to locate systems bottlenecks, not to claim policy quality.
Wall-clock to target	Minutes until held-out success threshold is reached	Use it as the primary training-speed metric.
Peak GPU memory	Capacity pressure from rollout, model, and optimizer state	Use it to explain batch-size limits and instance choice.
Evaluation overhead	Time spent measuring held-out behavior	Use it in total wall-clock because evaluation frequency changes checkpoint selection.
Cost per success	Billable compute until the first target-reaching checkpoint	Use it when comparing GPU instances, frameworks, or recipes.

A robust implementation starts with a cost ledger. The ledger makes it impossible to claim a throughput win from one run and a success win from another run.

Start the ledger before launch with instance type, price, batch size, target metric, and evaluation cadence.
Update the ledger at each checkpoint with train seconds, eval seconds, peak memory, success, and reset reasons.
Mark failed runs explicitly so cost estimates include search and debugging, not only the winning run.
Save profiler summaries with the same run ID as the policy checkpoint.
Compare cost only when success is computed on the same held-out panel.

# Record one cost ledger row for a GPU RL checkpoint.
# Keep systems metrics and held-out behavior in the same artifact.
from dataclasses import dataclass, asdict

@dataclass
class CostLedgerRow:
    checkpoint: int
    train_minutes: float
    eval_minutes: float
    peak_gpu_gb: float
    heldout_success: float
    billable_cost_usd: float

    def as_row(self) -> dict[str, object]:
        return asdict(self)

row = CostLedgerRow(
    checkpoint=320,
    train_minutes=42.0,
    eval_minutes=6.0,
    peak_gpu_gb=18.4,
    heldout_success=0.92,
    billable_cost_usd=1.76,
)
print(row.as_row())

{'checkpoint': 320, 'train_minutes': 42.0, 'eval_minutes': 6.0, 'peak_gpu_gb': 18.4, 'heldout_success': 0.92, 'billable_cost_usd': 1.76}

Code Fragment 17.6.2 records one checkpoint-level cost ledger row. Keeping peak memory, evaluation time, held-out success, and billable cost together prevents later tables from comparing numbers that came from different runs.
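Extending the single-row idea to a whole search, the sketch below charges failed and below-target runs to the cost of the one checkpoint that reached the target. The row fields `run_id` and `status` and all values except the first row are hypothetical.

```python
# Sketch: dollars per target-reaching checkpoint across a small search.
# run-b and run-c are hypothetical rows added to show failed-run accounting.
rows = [
    {"run_id": "run-a", "status": "ok", "heldout_success": 0.92, "billable_cost_usd": 1.76},
    {"run_id": "run-b", "status": "failed", "heldout_success": None, "billable_cost_usd": 0.40},
    {"run_id": "run-c", "status": "ok", "heldout_success": 0.74, "billable_cost_usd": 1.10},
]
target_success = 0.90

# Every billed run counts toward the total, including failures.
total_cost = sum(row["billable_cost_usd"] for row in rows)
winners = [
    row for row in rows
    if row["heldout_success"] is not None and row["heldout_success"] >= target_success
]
cost_per_success = total_cost / len(winners) if winners else float("inf")

print(f"runs billed: {len(rows)}, target-reaching: {len(winners)}")
print(f"total cost: ${total_cost:.2f}")
print(f"cost per target-reaching checkpoint: ${cost_per_success:.2f}")
```

Reporting only the winning row would understate the cost of this search by the amount spent on the other two runs.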

When a cost result disappoints, avoid changing the algorithm first. Identify whether the budget was lost to simulator stepping, learner updates, GPU memory pressure, host-device transfers, compilation, evaluation, logging, or failed hyperparameter searches. Each cause has a different repair.

Evaluation Recipe

For throughput and cost claims, compare only metrics that were computed together, in one pass, on one configuration: same environment panel, same policy checkpoint, same held-out seed set, same perturbation suite, same success definition, and same cost accounting window. Save throughput, wall-clock, memory, evaluation overhead, success, failure labels, and billable cost as one artifact.
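One hedged way to enforce this is to attach an evaluation context to every metric record and refuse to combine records whose contexts differ. The key names, identifiers, and the helper `mismatched_keys` below are hypothetical.

```python
# Sketch: check that metric records share one evaluation context before
# they are combined into a single claim. All identifiers are hypothetical.
CONTEXT_KEYS = (
    "env_panel", "checkpoint", "heldout_seed_set",
    "perturbation_suite", "success_definition", "cost_window",
)

def mismatched_keys(records):
    """Return the context keys whose values differ across the given records."""
    return [
        key for key in CONTEXT_KEYS
        if len({record["context"][key] for record in records}) > 1
    ]

shared = {
    "env_panel": "panel-v3", "checkpoint": 320, "heldout_seed_set": "seeds-256-a",
    "perturbation_suite": "suite-v1", "success_definition": "reach-and-hold",
    "cost_window": "launch-to-first-target",
}
throughput_record = {"metric": "steps_per_second", "value": 476_190, "context": dict(shared)}
success_record = {"metric": "heldout_success", "value": 0.92,
                  "context": {**shared, "heldout_seed_set": "seeds-256-b"}}

problems = mismatched_keys([throughput_record, success_record])
print("comparable" if not problems else f"not comparable, differs on: {problems}")
```

Here the success number was measured on a different seed set, so the check flags the pair rather than letting it appear in one table row.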

Key Takeaway

Throughput is useful when it lowers wall-clock and cost to a held-out behavior target. Raw steps per second is only the first line of the ledger.

Exercise 17.6.1

Build a cost ledger for two GPU RL recipes. For each, record steps per second, train minutes, evaluation minutes, peak memory, instance price, held-out success, failed-run count, and dollars to first target-reaching checkpoint.

What's Next?

This section turned throughput into a cost ledger: steps per second, wall-clock to target, evaluation overhead, peak memory, failure count, and dollars per successful checkpoint. Return to Chapter 17 with one rule for every GPU RL result: speed and success belong in the same artifact.

References & Further Reading

Foundational Papers, Tools, and Practice References

Makoviychuk, V. et al. (2021). Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning. arXiv.

Isaac Gym is the right source for understanding raw GPU simulation throughput. In this section, use it to separate impressive step rates from full cost accounting that includes evaluation and failed runs.

Paper

Freeman, C. D. et al. (2021). Brax: A Differentiable Physics Engine for Large Scale Rigid Body Simulation. arXiv.

Brax is relevant because JAX-native simulation can shift the bottleneck from stepping to compilation, memory, or evaluation. Cost reports should account for those phases separately.

Paper

NVIDIA Isaac Lab documentation.

Isaac Lab provides a realistic setting for end-to-end cost measurement: task setup, training runner, checkpointing, play scripts, and evaluation videos. Those components all consume wall-clock time.

Tool

Google DeepMind MuJoCo MJX documentation.

MJX is useful when comparing accelerator-native MuJoCo-style workloads. Its static-shape and compilation behavior should be logged separately from steady-state steps per second.

Tool

Rudin, N. et al. (2022). Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning. CoRL.

Rudin et al. provide a benchmark for fast wall-clock locomotion. For cost engineering, read the result as a prompt to ask what hardware, evaluation cadence, and success threshold define the headline time.

Paper

RSL-RL repository.

RSL-RL gives readers concrete PPO runner code to profile. It is useful for locating whether time is spent in rollout storage, advantage computation, optimizer updates, logging, or evaluation.

Tool