A Careful Control Loop
DDPG, TD3, and SAC are the continuous-control branch of value-based, off-policy learning. We study them because an embodied agent needs decisions that survive contact with noisy sensors, delayed effects, and changing environments.
A fair comparison between these methods and policy-gradient methods requires fixing the replay semantics, environment API, target computation, and GPU-scale batching first. Otherwise differences in those details are easily mistaken for differences between algorithms.
Continuous control creates a problem that tabular Q-learning cannot solve directly: a robot torque, steering angle, or gripper velocity can take infinitely many values. Taking a max over every possible action is no longer a small lookup.
DDPG, TD3, and SAC solve this by pairing a critic with an actor. The critic estimates value for a continuous action, while the actor proposes the action to evaluate or execute. This section explains how each method controls the bootstrapping error that appears when the critic and actor improve each other from imperfect off-policy data.
In discrete DQN, the max over actions is explicit. In continuous control, the actor network becomes the mechanism that searches the action space, so actor errors and critic errors can amplify each other.
Theory
DDPG uses a deterministic actor $\mu_\phi(o)$ and a critic $Q_\theta(o,a)$. The actor is trained to choose actions that the critic values highly, while the critic is trained from replayed Bellman targets. This is efficient, but brittle: if the critic overestimates an action, the actor will move toward that action.
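The brittleness described above can be made concrete with a toy one-dimensional sketch. Everything here is an assumption for illustration: the actor is a single scalar action rather than a network, `true_value` and `biased_critic` are hand-written functions, and a finite difference stands in for backpropagation through the critic. It is not the DDPG update itself, only the "actor follows the critic gradient" step in isolation.

```python
# Toy illustration (hypothetical 1-D problem, not a DDPG reference implementation).
# The actor is one scalar action; the critics are fixed hand-written functions.

def true_value(action: float) -> float:
    """Assumed ground-truth value: the best action is 0.2."""
    return -(action - 0.2) ** 2


def biased_critic(action: float) -> float:
    """Critic estimate with an assumed error that grows with the action."""
    return true_value(action) + 1.5 * action


def critic_gradient(critic, action: float, step: float = 1e-4) -> float:
    """Central finite difference standing in for backprop through the critic."""
    return (critic(action + step) - critic(action - step)) / (2 * step)


def train_actor(critic, action: float = 0.0, lr: float = 0.1, steps: int = 200) -> float:
    """Gradient ascent on the critic, clipped to the actuator range [-1, 1]."""
    for _ in range(steps):
        action += lr * critic_gradient(critic, action)
        action = max(-1.0, min(1.0, action))
    return action


honest_action = train_actor(true_value)
chased_action = train_actor(biased_critic)
print(f"actor with accurate critic: {honest_action:.2f}")
print(f"actor with biased critic:   {chased_action:.2f}")
print(f"true value lost by chasing: {true_value(honest_action) - true_value(chased_action):.2f}")
```

With the accurate critic the actor settles at the true optimum; with the biased critic it settles wherever the overestimate is largest relative to the true value, which in this toy case is close to the actuator limit.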
TD3 addresses that brittleness with three design choices. It trains two critics and uses the smaller target value, delays actor updates so the critic has time to improve, and adds small clipped noise to the target action so the critic cannot exploit a narrow spike in value. The target commonly has the form:
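$$y = r + \gamma \min_i Q_{\theta_i^-}(o', \mu_{\phi^-}(o') + \epsilon)$$
Here $\theta_i^-$ and $\phi^-$ are the target-network parameters, $i \in \{1, 2\}$ indexes the two critics, and $\epsilon \sim \operatorname{clip}(\mathcal{N}(0, \sigma^2), -c, c)$ is the clipped smoothing noise.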
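The target above can be sketched as a small function. This is a minimal version under stated assumptions: scalar actions, target critics passed in as plain callables (`q1_target` and `q2_target` are hypothetical stand-ins for target networks), and a `done` flag that zeroes the bootstrap at terminal transitions, which the equation omits but implementations commonly include. The default noise scale and clip are commonly used values, not requirements.

```python
import random


def td3_target(reward, done, next_action, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               action_low=-1.0, action_high=1.0, rng=random):
    """Clipped double-Q target with target policy smoothing (scalar action)."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))
    smoothed = max(action_low, min(action_high, next_action + noise))
    # Clipped double-Q: bootstrap from the smaller of the two target critics.
    bootstrap = min(q1_target(smoothed), q2_target(smoothed))
    # Assumed convention: done = 1.0 removes the bootstrap term.
    return reward + gamma * (1.0 - done) * bootstrap


# Hypothetical target critics for illustration only.
def optimistic_critic(action):
    return 1.40 - (action - 0.62) ** 2


def cautious_critic(action):
    return 0.90 - (action - 0.62) ** 2


y = td3_target(0.3, 0.0, 0.62, optimistic_critic, cautious_critic,
               gamma=0.98, rng=random.Random(0))
y_terminal = td3_target(0.3, 1.0, 0.62, optimistic_critic, cautious_critic,
                        gamma=0.98, rng=random.Random(0))
print(f"non-terminal target: {y:.3f}")
print(f"terminal target:     {y_terminal:.3f}")
```

Because the cautious critic never exceeds 0.90, the non-terminal target cannot exceed `0.3 + 0.98 * 0.90`, whatever noise is drawn.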
SAC changes the objective by rewarding both task return and action entropy. The policy is stochastic, so the agent keeps useful diversity in its action choices:
$$J(\pi)=\mathbb{E}\left[\sum_t r(o_t,a_t) + \alpha \mathcal{H}(\pi(\cdot|o_t))\right]$$
The temperature $\alpha$ controls how much the policy values entropy. In contact-rich robotics, that entropy can help discover recovery actions that a deterministic actor would stop trying too early.
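To see how $\alpha$ enters the numbers, the sketch below computes the entropy of a one-dimensional Gaussian policy and the soft bootstrap target that SAC implementations typically use, $y = r + \gamma\,(\min_i Q_i(o', a') - \alpha \log \pi(a'|o'))$. The assumptions are loud ones: an unsquashed Gaussian policy (practical SAC policies usually add a tanh squashing correction, omitted here), scalar actions, and made-up critic values.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian policy, in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy at the given action."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def soft_target(reward, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """Soft Bellman target: clipped double-Q value plus an entropy bonus."""
    return reward + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)


targets = {}
for std in (0.05, 0.5):
    # Evaluate the log-probability at the policy mean for a simple comparison.
    log_prob = gaussian_log_prob(action=0.0, mean=0.0, std=std)
    targets[std] = soft_target(0.3, 1.40, 0.90, log_prob, alpha=0.2, gamma=0.98)
    print(f"std={std:.2f} entropy={gaussian_entropy(std):+.3f} "
          f"log_prob={log_prob:+.3f} soft_target={targets[std]:.3f}")
```

The wider policy has higher entropy and a lower log-probability at the sampled action, so its soft target is larger. Setting `alpha=0` recovers the plain clipped double-Q target.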
Addressing Function Approximation Error in Actor-Critic Methods (Fujimoto et al., ICML 2018): clipped double-Q targets and delayed actor updates reduce the overestimation bias that makes continuous-control policies brittle. For embodied agents, that bias is dangerous because an overvalued action becomes a high-torque command the actor learns to chase onto hardware.
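The delayed actor update is mostly a scheduling decision, which a short loop can show. This sketch assumes a delay of two critic steps per actor step and a soft (Polyak) target update with rate `tau`, both common choices rather than anything fixed by the text above. Parameters are plain lists of floats, and the actual gradient steps are left as comments.

```python
def polyak_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


policy_delay = 2          # assumed: one actor update per two critic updates
critic_updates = 0
actor_updates = 0
online_params = [1.0, -2.0]   # hypothetical online-network parameters
target_params = [0.0, 0.0]    # hypothetical target-network parameters

for step in range(1, 11):
    # Critic update would happen here on every step.
    critic_updates += 1
    if step % policy_delay == 0:
        # Actor update would happen here, only on delayed steps,
        # followed by the slow target-network update.
        actor_updates += 1
        target_params = polyak_update(target_params, online_params)

print(f"critic updates: {critic_updates}, actor updates: {actor_updates}")
print(f"target params after soft updates: {target_params}")
```

The target parameters move only a small fraction of the way toward the online parameters, which is what keeps the bootstrap value from shifting as fast as the critic being trained.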
DDPG is the simplest actor-critic route for continuous actions, TD3 is a conservative correction for critic overestimation, and SAC is a maximum-entropy route that keeps exploration inside the policy objective.
Worked Example
Code Fragment 1 computes a TD3-style target from two critics. The smaller critic value is used because overestimated value is more dangerous than underestimated value when the actor is trained to chase high values.
```python
# Compute a TD3 target with clipped double critics.
# The smaller critic value limits overestimation before the actor sees it.
reward = 0.3
gamma = 0.98
# Both critic values are assumed to be evaluated at smoothed_action.
critic_1_target = 1.40
critic_2_target = 0.90
target_policy_noise = 0.05
smoothed_action = 0.62 + target_policy_noise
bootstrap = min(critic_1_target, critic_2_target)
td3_target = reward + gamma * bootstrap
print(f"smoothed_action={smoothed_action:.2f}")
print(f"bootstrap={bootstrap:.2f}")
print(f"td3_target={td3_target:.2f}")
```
The expected output shows the TD3 target being anchored by the lower critic value, not the optimistic one. Readers should interpret the td3_target of 1.18 as a deliberately conservative bootstrap built from a slightly perturbed target action.
bootstrap uses the smaller of critic_1_target and critic_2_target, which is TD3's clipped double-Q idea. The smoothed_action value represents target policy smoothing, a guard against learning a critic spike at one precise continuous action.
The same target logic maps cleanly to embodied control. For a torque-controlled arm, target smoothing says the critic should value a small neighborhood of torques, not a single fragile torque vector that only works in simulation.
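The "neighborhood of torques" idea can be checked numerically. The sketch below uses a hypothetical critic, `spiky_critic`, that has a broad, modest peak near action 0 and a tall, very narrow spike at 0.62, then averages it under clipped Gaussian noise. The critic shape, the noise scale, and the sample count are all assumptions chosen to make the effect visible.

```python
import math
import random


def spiky_critic(action):
    """Hypothetical critic: broad peak near 0 plus a narrow spike at 0.62."""
    broad = math.exp(-((action / 0.5) ** 2))
    spike = 2.0 * math.exp(-(((action - 0.62) / 0.01) ** 2))
    return broad + spike


def smoothed_value(critic, action, noise_std=0.2, noise_clip=0.5,
                   samples=2000, seed=0):
    """Monte Carlo average of the critic under clipped Gaussian action noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
        total += critic(action + noise)
    return total / samples


for action in (0.0, 0.62):
    print(f"action={action:.2f} raw={spiky_critic(action):.2f} "
          f"smoothed={smoothed_value(spiky_critic, action):.2f}")
```

Without smoothing the spike looks like the best action; after averaging over nearby actions the broad peak wins, because almost none of the spike's neighbors share its value.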
Stable-Baselines3 provides DDPG, TD3, and SAC behind a compact API, while CleanRL exposes each update in a readable script. Use the library to avoid fragile training boilerplate, but still log critic disagreement, action saturation, entropy, and environment-condition labels.
Practical Recipe
- Use DDPG only when a deterministic actor is acceptable and the task is well shaped.
- Prefer TD3 when critic overestimation or narrow action spikes appear in evaluation.
- Prefer SAC when exploration, recovery behavior, or multimodal actions matter.
- Log action saturation at actuator bounds, since saturated actors can hide critic problems.
- Evaluate under mass, friction, delay, and sensor-noise shifts with the same action limits.
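The saturation item in the recipe above reduces to a one-line metric. This is a minimal sketch that assumes scalar executed actions, symmetric bounds of -1 and 1, and a small tolerance for "at the bound"; the action values are made up for illustration.

```python
def saturation_rate(actions, low=-1.0, high=1.0, tol=1e-3):
    """Fraction of executed actions that sit at (or within tol of) a bound."""
    if not actions:
        raise ValueError("need at least one action")
    hits = sum(1 for a in actions if a <= low + tol or a >= high - tol)
    return hits / len(actions)


# Hypothetical executed actions from one evaluation episode.
executed = [0.10, 1.00, 0.9995, -0.40, -1.00, 0.75, 1.00, 0.20]
print(f"saturation rate: {saturation_rate(executed):.2f}")  # prints 0.50
```

Reporting this number per perturbation condition, next to return, makes it visible when a policy holds its score only by pressing against an actuator limit.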
A continuous-action policy can exploit simulator details by choosing precise torques that are unavailable, unsafe, or unstable on hardware. If evaluation reports return without action-limit violations, actuator saturation, and critic disagreement, the result is incomplete.
A manipulation team can compare TD3 and SAC on the same pushing task by co-computing success, contact force, action saturation, and recovery after an object slip. TD3 may produce crisp deterministic pushes, while SAC may preserve enough stochasticity to recover from surprising contact changes.
For DDPG, TD3, and SAC, the useful test is simple: could a teammate point to the log line, plot, or trace that proves the idea changed the agent's next action?
Continuous-control research keeps pushing on safer off-policy learning, better uncertainty estimates for critics, and policy learning from mixed simulation and robot data. The open embodied question is how to make a critic admit uncertainty before an actor turns that uncertainty into a high-torque command.
Can you state which network proposes the action, which network evaluates it, which target network supplies the bootstrap value, and which logs reveal action saturation or critic disagreement? If not, the continuous-control loop is not auditable.
DDPG, TD3, and SAC differ less in their environment interface than in their attitude toward critic error. DDPG trusts one critic and one deterministic actor. TD3 distrusts overestimation enough to train two critics. SAC distrusts premature certainty enough to optimize entropy alongside return.
That distinction matters in embodied work because physical action errors have asymmetric cost. An underestimated value may slow learning, but an overestimated high-torque action can break contact, drop an object, or leave the training distribution entirely.
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| MuJoCo | Continuous dynamics | Use it to test mass, friction, and delay perturbations under fixed action limits. |
| Stable-Baselines3 | DDPG, TD3, and SAC baselines | Use it for maintained continuous-control implementations with consistent logging. |
| CleanRL | Readable algorithm updates | Use it when you need to inspect actor loss, critic loss, entropy, and target updates. |
| Tianshou | Composable off-policy training | Use it to swap collectors, buffers, policies, and critics without rewriting the experiment. |
| ROS 2 control logs | Hardware action evidence | Use them to verify that learned actions remain inside actuator limits and safety envelopes. |
A robust continuous-control implementation logs the actor's action and the critic's evidence for that action. Code Fragment 2 records the fields that separate deterministic control, clipped double-Q control, and entropy-regularized control.
- Record action vectors before and after clipping or squashing.
- Log critic disagreement for TD3 and SAC.
- Log entropy or temperature for SAC.
- Track actuator saturation and safety-envelope violations.
- Compare all methods on one perturbation panel with the same action bounds.
```python
# Build one audit record for a continuous-control action.
# The fields expose actor output, critic disagreement, and safety limits.
from dataclasses import dataclass, asdict


@dataclass
class ContinuousControlAudit:
    algorithm: str
    raw_action: float
    executed_action: float
    critic_1: float
    critic_2: float
    entropy: float
    saturated: bool

    def as_row(self) -> dict[str, object]:
        return asdict(self)


record = ContinuousControlAudit(
    algorithm="SAC",
    raw_action=1.18,
    executed_action=1.00,
    critic_1=2.4,
    critic_2=1.7,
    entropy=0.42,
    saturated=True,
)
print(record.as_row())
```
The expected output is an audit row that immediately exposes two continuous-control risks: the actor asked for more torque than the actuator allowed, and the critics disagree materially about the value of that command. That combination should trigger closer inspection before anyone treats the rollout return as robust evidence.
ContinuousControlAudit records the raw actor output, clipped executed action, two critic values, entropy, and saturation flag. These fields explain whether a high return came from robust control or from repeatedly pushing against an action bound.
When a continuous-control method fails, first ask whether the actor left the safe action region, whether the critics disagreed, whether entropy collapsed, or whether replay lacked the new dynamics. Then rerun one perturbation while plotting action histograms and critic disagreement beside the episode video.
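The four failure questions can be turned into a small triage helper. This is a sketch with hypothetical thresholds: `disagreement_limit` and `entropy_floor` are placeholders to be tuned per task, a difference between raw and executed action is used as a proxy for leaving the safe action region, and `replay_has_condition` is an assumed flag saying whether the replay buffer contains data from the perturbed dynamics.

```python
def triage(raw_action, executed_action, critic_1, critic_2, entropy,
           replay_has_condition, disagreement_limit=0.5, entropy_floor=0.05):
    """Return failure hypotheses to inspect first, in the order the text lists them."""
    flags = []
    if raw_action != executed_action:
        # The actor asked for something the actuator limits did not allow.
        flags.append("actor_left_action_bounds")
    if abs(critic_1 - critic_2) > disagreement_limit:
        flags.append("critics_disagree")
    if entropy < entropy_floor:
        flags.append("entropy_collapsed")
    if not replay_has_condition:
        flags.append("replay_missing_dynamics")
    return flags


# Same values as the audit record above.
flags = triage(raw_action=1.18, executed_action=1.00, critic_1=2.4, critic_2=1.7,
               entropy=0.42, replay_has_condition=True)
print(flags)  # ['actor_left_action_bounds', 'critics_disagree']
```

An empty list does not mean the rollout is safe; it only means none of these four checks fired at the chosen thresholds.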
For DDPG, TD3, and SAC, compare return, success, energy use, contact force, action saturation, entropy, and critic disagreement in one evaluation script over one perturbation panel. Do not compare SAC entropy from one run to TD3 return from another run and call it an algorithmic conclusion.
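One way to enforce "one script, one panel" is to generate every row from the same loop. The sketch below is a structural outline only: the perturbation names and scales are placeholders, only a subset of the metrics listed above is included, `rollout` is a hypothetical callable you would supply, and `placeholder_rollout` returns dummy zeros so the outline runs without an environment.

```python
from itertools import product

ALGORITHMS = ("DDPG", "TD3", "SAC")
# Hypothetical perturbation panel; names and scales are placeholders.
PERTURBATIONS = (
    {"name": "nominal", "mass_scale": 1.0, "friction_scale": 1.0},
    {"name": "heavy", "mass_scale": 1.3, "friction_scale": 1.0},
    {"name": "slippery", "mass_scale": 1.0, "friction_scale": 0.6},
)
ACTION_BOUNDS = (-1.0, 1.0)  # identical for every method and condition
METRIC_KEYS = ("episode_return", "success", "saturation_rate", "critic_disagreement")


def evaluate_panel(rollout, seeds=(0, 1, 2)):
    """Run every algorithm on every condition and seed, with one metric schema."""
    rows = []
    for algorithm, perturbation, seed in product(ALGORITHMS, PERTURBATIONS, seeds):
        metrics = rollout(algorithm, perturbation, ACTION_BOUNDS, seed)
        missing = set(METRIC_KEYS) - set(metrics)
        if missing:
            raise ValueError(f"{algorithm}/{perturbation['name']} is missing {sorted(missing)}")
        row = {"algorithm": algorithm, "condition": perturbation["name"], "seed": seed}
        row.update({key: metrics[key] for key in METRIC_KEYS})
        rows.append(row)
    return rows


def placeholder_rollout(algorithm, perturbation, action_bounds, seed):
    """Stand-in that returns dummy zeros; replace with a real environment rollout."""
    return {key: 0.0 for key in METRIC_KEYS}


rows = evaluate_panel(placeholder_rollout)
print(len(rows), "rows with columns", sorted(rows[0]))
```

Rejecting rows with missing metrics is the design choice that matters: a method cannot silently skip a diagnostic that the others report.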
Continuous off-policy control is about managing actor-critic feedback. TD3 reduces overestimation, SAC preserves exploration through entropy, and both need action-level logs before embodied deployment.
Choose a continuous-control task and define one co-computed metric panel for DDPG, TD3, and SAC. Include at least one task metric, one safety metric, one action-distribution metric, and one critic diagnostic.
What's Next?
This section turned DDPG, TD3, and SAC into a testable embodied-learning contract: define the loop, choose the tool, save one comparable artifact, and diagnose failure by interface. Next, continue with Section 16.4, where the same evaluation habit carries into the next reinforcement-learning decision.
Watkins, C. J. C. H., and Dayan, P. (1992). Q-learning. Machine Learning.
The canonical derivation of tabular Q-learning and its convergence proof. Read to understand the off-policy update rule and why the max over next-state actions makes Q-learning off-policy by construction; this distinction carries through to DQN and all its successors.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature.
Demonstrates that replay buffers and target networks together stabilize Q-learning with neural function approximators. Read Section 2 for the DQN algorithm and the supplementary for network architecture; replay and target-network ideas appear in every subsequent off-policy deep RL method including DDPG, TD3, and SAC.
Lillicrap, T. P. et al. (2015). Continuous control with deep reinforcement learning. arXiv.
Adapts DQN to continuous action spaces by combining a deterministic policy gradient actor with a Q-function critic and using replay and target networks from DQN. Read Algorithm 1 for the full update loop; DDPG is the direct predecessor to TD3 and understanding its overestimation problem motivates TD3's twin-critic design.
Fujimoto, S. et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML.
Identifies and fixes the Q-value overestimation problem in DDPG through three mechanisms: clipped double critics, delayed policy updates, and target-policy smoothing. Read Section 4 for each fix and the ablation in Section 5; these three tricks are now standard practice for off-policy continuous-control and appear directly in SAC variants.
Haarnoja, T. et al. (2018). Soft Actor-Critic. ICML.
Combines off-policy learning with a maximum-entropy objective. The automatic temperature adjustment that balances exploration and exploitation without manual tuning comes from the authors' follow-up, Soft Actor-Critic Algorithms and Applications (arXiv, 2018). Read Section 4 for the soft Bellman equation and the follow-up for the entropy temperature update; SAC is one of the most widely used off-policy baselines for continuous robot control tasks.
Weng, J. et al. (2022). Tianshou: A Highly Modularized Deep Reinforcement Learning Library. JMLR.
A modular PyTorch RL library with clean separation between collector, trainer, and policy components. Use it to prototype off-policy algorithms without reimplementing replay buffers and target-network logic; the policy abstraction makes it straightforward to compare DQN, DDPG, TD3, and SAC in a common framework.