A Careful Control Loop
Throughput, wall-clock, and cost engineering is where GPU RL becomes an engineering system. A useful run is not the one with the largest steps-per-second number; it is the one that reaches a held-out success target with known memory use, known evaluation overhead, and known compute cost.
Throughput, wall-clock, and cost claims for GPU RL are only reproducible when simulator fidelity, PPO rollout semantics, reward terms, and the reset distribution are versioned in the same training artifact.
This section develops the cost contract for massively parallel RL. We define the denominator for every claim: environment steps per second, policy updates per minute, wall-clock to target success, GPU memory at peak, evaluation time, and dollars per successful checkpoint.
The key question is practical: did the optimization make learning cheaper, or did it only move time from rollout collection into compilation, synchronization, logging, or evaluation?
Steps per second is a systems metric. Wall-clock to held-out success is the learning metric that matters to a builder deciding what to run next.
Theory
Let a run collect $S$ environment steps in $W$ wall-clock seconds. The raw throughput is $S/W$, but the cost metric should include training, evaluation, checkpointing, and failed runs. If the target is a 90 percent held-out success rate, the relevant question is how many dollars and minutes were spent before the first checkpoint reached that target.
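One way to make this precise is sketched below. Every symbol other than $S$ and $W$ is notation introduced here for illustration. Let $\tau$ be the target success rate (for example $\tau = 0.9$), let $W_{\text{train}}(k)$, $W_{\text{eval}}(k)$, and $W_{\text{ckpt}}(k)$ be the cumulative seconds spent on training, evaluation, and checkpointing up to checkpoint $k$, let $p$ be the instance price per second, and let $\mathcal{F}$ be the set of failed runs, where run $f$ consumed $W_f$ billable seconds.

$$
k^\star = \min\{\,k : \text{success}_k \ge \tau\,\}, \qquad
C_{\text{target}} = p\,\bigl[\,W_{\text{train}}(k^\star) + W_{\text{eval}}(k^\star) + W_{\text{ckpt}}(k^\star)\,\bigr] + p \sum_{f \in \mathcal{F}} W_f
$$

Raw throughput $S/W$ does not appear in $C_{\text{target}}$ directly. It matters only to the extent that it shortens $W_{\text{train}}(k^\star)$ without inflating the other terms.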
There is a common tension between throughput and sample efficiency. Larger batches can improve device utilization, but they may reduce update frequency or increase policy lag. Smaller batches may learn with fewer samples but waste the accelerator.
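The arithmetic behind this tension can be sketched with a toy calculation. The numbers below are assumptions chosen for illustration, and the sketch holds simulator throughput fixed across batch sizes. Real systems rarely behave that way, because throughput usually rises with the number of environments. The function name `updates_per_minute` is hypothetical.

```python
# Sketch with assumed numbers: how rollout batch size changes update frequency
# when simulator throughput is held fixed.

def updates_per_minute(num_envs: int, horizon: int, steps_per_second: float) -> float:
    """Policy updates per minute if each update consumes one full rollout batch."""
    batch_steps = num_envs * horizon  # env steps collected per policy update
    return steps_per_second * 60.0 / batch_steps


STEPS_PER_SECOND = 400_000.0  # assumption: same steady-state throughput for every batch size
HORIZON = 24                  # assumption: rollout steps per environment per update

for num_envs in (1024, 4096, 16384):
    rate = updates_per_minute(num_envs, HORIZON, STEPS_PER_SECOND)
    print(f"{num_envs:>6} envs: {num_envs * HORIZON:>7,} steps/update, {rate:8.1f} updates/min")
```

Under these assumptions, each fourfold increase in environments cuts the update rate by a factor of four. Whether the larger batch still reaches the target sooner is an empirical question that only the held-out success curve can answer.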
The mechanism is an accounting loop: measure rollout time, learner time, evaluation time, synchronization time, peak memory, and target success in the same run. Only then can you decide whether the bottleneck is simulation, policy inference, advantage computation, optimizer updates, logging, or evaluation.
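A minimal sketch of such an accounting loop follows, using only the Python standard library. The class `PhaseTimer` and the phase names are hypothetical, and the `time.sleep` calls stand in for real rollout, learner, and evaluation work. On a GPU, kernels launch asynchronously, so a real version would need to synchronize the device before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch or `block_until_ready()` in JAX), or rely on the framework profiler instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase of one training run."""

    def __init__(self) -> None:
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add elapsed time even if the phase raised an exception.
            self.seconds[name] += time.perf_counter() - start

    def fractions(self) -> dict:
        """Share of total measured time spent in each phase."""
        total = sum(self.seconds.values())
        if total <= 0:
            return {}
        return {name: spent / total for name, spent in self.seconds.items()}


timer = PhaseTimer()
for _ in range(3):
    with timer.phase("rollout"):
        time.sleep(0.002)  # placeholder for simulator stepping and policy inference
    with timer.phase("learner"):
        time.sleep(0.001)  # placeholder for advantage computation and optimizer updates
with timer.phase("evaluation"):
    time.sleep(0.001)      # placeholder for the held-out evaluation pass

for name, share in timer.fractions().items():
    print(f"{name:>10}: {share:.0%} of measured time")
```

Keeping all phases in one timer object means the shares sum to one, so time cannot silently move from one phase into an unmeasured one.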
Worked Example
Code Fragment 17.6.1 computes the metrics that should appear together in a cost report. The same artifact contains throughput, wall-clock to target, evaluation overhead, and compute spend.
```python
# Compute throughput and cost from one RL training run.
# The target-success checkpoint, not peak steps per second, drives the decision.
env_steps = 1_200_000_000
train_minutes = 42.0
eval_minutes = 6.0
gpu_dollars_per_hour = 2.20
heldout_success = 0.92
target_success = 0.90

total_minutes = train_minutes + eval_minutes
steps_per_second = env_steps / (train_minutes * 60)
total_cost = (total_minutes / 60) * gpu_dollars_per_hour
cost_per_billion_steps = total_cost / (env_steps / 1_000_000_000)

print(f"train throughput: {steps_per_second:,.0f} env steps/s")
print(f"total wall-clock: {total_minutes:.1f} min")
print(f"held-out success: {heldout_success:.2f}")
print(f"target reached: {heldout_success >= target_success}")
print(f"total compute cost: ${total_cost:.2f}")
print(f"cost per billion steps: ${cost_per_billion_steps:.2f}")
```
Expected output: the trace should include both systems and learning metrics. A run that reports 476,190 steps per second but omits held-out success has not shown that those steps bought a deployable policy.
In practical systems, rely on the framework's profiler, GPU telemetry, and logger rather than hand timing one function. Isaac Lab, RSL-RL, rl_games, SKRL, Brax, and MJX can all produce impressive throughput; the engineering shortcut is to export comparable cost records from the same evaluation script.
Practical Recipe
- Define the success target before training, such as 90 percent held-out success across 256 evaluation seeds.
- Measure rollout, learning, evaluation, logging, and checkpoint time separately.
- Track peak GPU memory, utilization, host-device transfer time, and compilation time where relevant.
- Report dollars per target-reaching checkpoint, not only dollars per hour.
- Repeat the run across seed panels when the result will support a paper table or hardware decision.
The common mistake is to maximize training throughput by reducing evaluation frequency, then miss the first checkpoint that actually generalizes. Evaluation is part of the wall-clock budget, not an optional afterthought.
A team choosing between two GPU instances should compare them on one script that trains to the same held-out success target. The faster instance can still be the worse choice if it runs out of memory, needs a smaller batch, or spends more time compiling and evaluating.
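The comparison can be sketched as follows. All names and numbers are hypothetical. `instance_a` reuses the figures from the worked example, and `instance_b` is an invented faster but more expensive machine that also pays a compilation cost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceRun:
    name: str
    dollars_per_hour: float
    compile_minutes: float
    train_minutes: float
    eval_minutes: float
    reached_target: bool  # False if the run ran out of memory or never hit the threshold

    def cost_to_target(self) -> Optional[float]:
        """Billable dollars until the target checkpoint, or None if never reached."""
        if not self.reached_target:
            return None
        billable = self.compile_minutes + self.train_minutes + self.eval_minutes
        return billable / 60.0 * self.dollars_per_hour


# Assumed measurements from the same script and the same held-out target.
runs = [
    InstanceRun("instance_a", 2.20, 0.0, 42.0, 6.0, True),
    InstanceRun("instance_b", 4.10, 5.0, 25.0, 5.0, True),
]

successful = [r for r in runs if r.cost_to_target() is not None]
cheapest = min(successful, key=lambda r: r.cost_to_target())

for r in runs:
    print(f"{r.name}: train {r.train_minutes:.0f} min, cost to target ${r.cost_to_target():.2f}")
print(f"cheapest to target: {cheapest.name}")
```

With these assumed figures, the instance that trains faster still costs more per target-reaching checkpoint, because its hourly price and compilation time outweigh the shorter training phase.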
The fastest run is not always the cheapest run. The cheapest successful run is the one that reaches the target before curiosity turns into a hyperparameter sweep.
The frontier is shifting from raw simulator speed toward full-stack efficiency: GPU-resident sensors, batched rendering, compilation-aware training, automatic curriculum scheduling, and evaluation systems that can keep up with faster learners. The open problem is measuring cost per robust behavior, not cost per simulator step.
Can you report steps per second, wall-clock to target, evaluation overhead, peak memory, instance cost, failed-run count, and held-out success from one artifact? If not, the cost claim is incomplete.
The idea in this section becomes useful when every number has a denominator. Steps per second uses training seconds. Wall-clock to target uses training plus evaluation seconds. Cost uses all billable time. Paper tables and engineering decisions should not mix these denominators.
The graduate-level habit is to report the bottleneck breakdown. A low-throughput run may be simulation-bound, learner-bound, memory-bound, evaluation-bound, logging-bound, or compilation-bound. Each bottleneck implies a different fix.
| Metric | What It Measures | Builder Advice |
|---|---|---|
| Steps per second | Raw simulator and learner throughput | Use it to locate systems bottlenecks, not to claim policy quality. |
| Wall-clock to target | Minutes until held-out success threshold is reached | Use it as the primary training-speed metric. |
| Peak GPU memory | Capacity pressure from rollout, model, and optimizer state | Use it to explain batch-size limits and instance choice. |
| Evaluation overhead | Time spent measuring held-out behavior | Use it in total wall-clock because evaluation frequency changes checkpoint selection. |
| Cost per success | Billable compute until the first target-reaching checkpoint | Use it when comparing GPU instances, frameworks, or recipes. |
A robust implementation starts with a cost ledger. The ledger makes it impossible to claim a throughput win from one run and a success win from another run.
- Start the ledger before launch with instance type, price, batch size, target metric, and evaluation cadence.
- Update the ledger at each checkpoint with train seconds, eval seconds, peak memory, success, and reset reasons.
- Mark failed runs explicitly so cost estimates include search and debugging, not only the winning run.
- Save profiler summaries with the same run ID as the policy checkpoint.
- Compare cost only when success is computed on the same held-out panel.
```python
# Record one cost ledger row for a GPU RL checkpoint.
# Keep systems metrics and held-out behavior in the same artifact.
from dataclasses import dataclass, asdict


@dataclass
class CostLedgerRow:
    checkpoint: int
    train_minutes: float
    eval_minutes: float
    peak_gpu_gb: float
    heldout_success: float
    billable_cost_usd: float

    def as_row(self) -> dict[str, object]:
        return asdict(self)


row = CostLedgerRow(
    checkpoint=320,
    train_minutes=42.0,
    eval_minutes=6.0,
    peak_gpu_gb=18.4,
    heldout_success=0.92,
    billable_cost_usd=1.76,
)
print(row.as_row())
```
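A single row does not yet cover failed runs. One possible extension, sketched here under stated assumptions, appends rows to a JSON Lines file and sums the spend of every run, failed or not, up to the first target-reaching checkpoint. The field `run_id`, the helper names, and the example rows are hypothetical. The sketch assumes rows are written in chronological order and that `billable_cost_usd` is cumulative within a run.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Optional


def append_row(path: Path, row: dict) -> None:
    """Append one ledger row as a JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def read_rows(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def dollars_to_first_success(rows: list[dict], target: float) -> Optional[float]:
    """Total spend across all runs until the first row that meets the target."""
    spent_by_run: dict[str, float] = {}
    for row in rows:
        # Cumulative cost within a run: keep only the latest value per run.
        spent_by_run[row["run_id"]] = row["billable_cost_usd"]
        success = row["heldout_success"]
        if success is not None and success >= target:
            return sum(spent_by_run.values())
    return None  # no checkpoint reached the target


example_rows = [
    # A failed run: no held-out success was ever measured.
    {"run_id": "run_a", "checkpoint": 40, "heldout_success": None, "billable_cost_usd": 0.40},
    {"run_id": "run_b", "checkpoint": 160, "heldout_success": 0.71, "billable_cost_usd": 0.90},
    {"run_id": "run_b", "checkpoint": 320, "heldout_success": 0.92, "billable_cost_usd": 1.76},
]

with TemporaryDirectory() as tmp:
    ledger_path = Path(tmp) / "ledger.jsonl"
    for r in example_rows:
        append_row(ledger_path, r)
    rows = read_rows(ledger_path)

total = dollars_to_first_success(rows, target=0.90)
print(f"dollars to first target-reaching checkpoint: ${total:.2f}")
```

In this example the winning run alone costs $1.76, but the ledger total is higher because the failed run's spend is included.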
When a cost result disappoints, avoid changing the algorithm first. Identify whether the budget was lost to simulator stepping, learner updates, GPU memory pressure, host-device transfers, compilation, evaluation, logging, or failed hyperparameter searches. Each cause has a different repair.
For throughput and cost claims, compare only metrics that measure the same thing and were computed together, in one pass on one configuration: same environment panel, same policy checkpoint, same held-out seed set, same perturbation suite, same success definition, and same cost accounting window. Save throughput, wall-clock, memory, evaluation overhead, success, failure labels, and billable cost as one artifact.
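One way to enforce this rule when comparing two recipes, sketched here as an assumption and not as a feature of any particular framework, is to fingerprint the evaluation configuration stored with each record and refuse to compare records whose fingerprints differ. The key names in `COMPARISON_KEYS` and the record fields are hypothetical.

```python
import hashlib
import json

# Hypothetical configuration fields that must match before two cost records are compared.
COMPARISON_KEYS = (
    "env_panel",
    "heldout_seeds",
    "perturbation_suite",
    "success_definition",
    "cost_window",
)


def fingerprint(record: dict) -> str:
    """Short, order-independent hash of the fields that define a fair comparison."""
    subset = {key: record[key] for key in COMPARISON_KEYS}
    blob = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def comparable(a: dict, b: dict) -> bool:
    return fingerprint(a) == fingerprint(b)


base = {
    "env_panel": "panel_v1",
    "heldout_seeds": list(range(256)),
    "perturbation_suite": "suite_v1",
    "success_definition": "success_rate>=0.90",
    "cost_window": "launch_to_target_checkpoint",
}
recipe_a = {**base, "recipe": "a", "billable_cost_usd": 1.76}
recipe_b = {**base, "recipe": "b", "billable_cost_usd": 2.39}
recipe_c = {**base, "recipe": "c", "heldout_seeds": list(range(64)), "billable_cost_usd": 1.10}

print("a vs b comparable:", comparable(recipe_a, recipe_b))
print("a vs c comparable:", comparable(recipe_a, recipe_c))
```

Here `recipe_c` looks cheapest, but it was evaluated on a smaller seed set, so the check flags it as not comparable with the other two.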
Throughput is useful when it lowers wall-clock and cost to a held-out behavior target. Raw steps per second is only the first line of the ledger.
Build a cost ledger for two GPU RL recipes. For each, record steps per second, train minutes, evaluation minutes, peak memory, instance price, held-out success, failed-run count, and dollars to first target-reaching checkpoint.
What's Next?
This section turned throughput into a cost ledger: steps per second, wall-clock to target, evaluation overhead, peak memory, failure count, and dollars per successful checkpoint. Return to Chapter 17 with one rule for every GPU RL result: speed and success belong in the same artifact.
Isaac Gym is the right source for understanding raw GPU simulation throughput. In this section, use it to separate impressive step rates from full cost accounting that includes evaluation and failed runs.
Brax is relevant because JAX-native simulation can shift the bottleneck from stepping to compilation, memory, or evaluation. Cost reports should account for those phases separately.
NVIDIA Isaac Lab documentation.
Isaac Lab provides a realistic setting for end-to-end cost measurement: task setup, training runner, checkpointing, play scripts, and evaluation videos. Those components all consume wall-clock time.
Google DeepMind MuJoCo MJX documentation.
MJX is useful when comparing accelerator-native MuJoCo-style workloads. Its static-shape and compilation behavior should be logged separately from steady-state steps per second.
The Rudin et al. result is a benchmark for fast wall-clock locomotion training. For cost engineering, read it as a prompt to ask what hardware, evaluation cadence, and success threshold define the headline time.
RSL-RL gives readers concrete PPO runner code to profile. It is useful for locating whether time is spent in rollout storage, advantage computation, optimizer updates, logging, or evaluation.