
Twin Delayed Deep Deterministic Policy Gradient (TD3)


TD3 is a popular DRL algorithm for continuous control. It extends DDPG with three techniques: 1) Clipped Double Q-Learning, 2) Delayed Policy Updates, and 3) Target Policy Smoothing Regularization. With these three techniques, TD3 achieves significantly better performance than DDPG.
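To make the three techniques concrete, below is a minimal PyTorch sketch of how the critic target is typically formed in TD3. The network definitions, tensor shapes, and hyperparameter values are illustrative assumptions for a self-contained example, not code from a particular implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, act_limit = 11, 3, 1.0            # illustrative dimensions
mlp = lambda i, o: nn.Sequential(nn.Linear(i, 256), nn.ReLU(), nn.Linear(256, o))
actor_target = mlp(obs_dim, act_dim)                 # target policy network
qf1_target = mlp(obs_dim + act_dim, 1)               # first target critic
qf2_target = mlp(obs_dim + act_dim, 1)               # second target critic

gamma, policy_noise, noise_clip = 0.99, 0.2, 0.5     # common TD3 defaults

# A fake mini-batch standing in for a replay-buffer sample
rewards = torch.randn(64, 1)
dones = torch.zeros(64, 1)
next_obs = torch.randn(64, obs_dim)

with torch.no_grad():
    # (3) Target Policy Smoothing: perturb the target action with clipped Gaussian noise
    next_actions = torch.tanh(actor_target(next_obs)) * act_limit
    noise = (torch.randn_like(next_actions) * policy_noise).clamp(-noise_clip, noise_clip)
    next_actions = (next_actions + noise).clamp(-act_limit, act_limit)

    # (1) Clipped Double Q-Learning: bootstrap from the minimum of the two target critics
    next_q_input = torch.cat([next_obs, next_actions], dim=1)
    min_q = torch.min(qf1_target(next_q_input), qf2_target(next_q_input))
    target_q = rewards + gamma * (1 - dones) * min_q

# (2) Delayed Policy Updates: the actor and the target networks are updated only once
# every few critic updates (e.g., every 2 gradient steps), not after every critic step.
```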

Original paper: Addressing Function Approximation Error in Actor-Critic Methods (Fujimoto et al., 2018)

Reference resources: sfujim/TD3 (the authors' reference implementation)

Implemented Variants

| Variants Implemented | Description |
| --- | --- |
| docs | For continuous action space |

Below are our single-file implementations of TD3:

The implementation has the following features:

  • For continuous action space
  • Works with the Box observation space of low-level features
  • Works with the Box (continuous) action space


Usage:

```bash
poetry install
poetry install -E pybullet
python cleanrl/ --help
python cleanrl/ --env-id HopperBulletEnv-v0
poetry install -E mujoco  # only works in Linux
python cleanrl/ --env-id Hopper-v3
```

Explanation of the logged metrics

Running python cleanrl/ will automatically record various metrics, such as the losses, in TensorBoard. Below is the documentation for these metrics:

  • charts/episodic_return: episodic return of the game
  • charts/SPS: number of steps per second
  • losses/qf1_loss: the MSE between the Q values at timestep \(t\) and the target Q values at timestep \(t+1\), which minimizes the temporal difference error.
  • losses/actor_loss: implemented as -qf1(data.observations, actor(data.observations)).mean(); it is the negative average Q value calculated based on 1) the observations and 2) the actions computed by the actor from these observations. By minimizing actor_loss, the optimizer updates the actor's parameters using the following gradient (Fujimoto et al., 2018, Algorithm 1)2:
\[ \nabla_{\phi} J(\phi)=\left.N^{-1} \sum \nabla_{a} Q_{\theta_{1}}(s, a)\right|_{a=\pi_{\phi}(s)} \nabla_{\phi} \pi_{\phi}(s) \]
  • losses/qf1_values: implemented as qf1(data.observations, data.actions).view(-1); it is the average Q value of the sampled data in the replay buffer; useful for gauging whether under- or over-estimation happens (see the sketch below).
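As a rough illustration of how these logged quantities relate to each other, the following sketch computes qf1_values, qf1_loss, and actor_loss from a stand-in replay-buffer sample. The simplified networks (which take concatenated observation-action inputs), the shapes, and the placeholder next_q_value are assumptions made for the sake of a self-contained example, not the exact module signatures used in the code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from types import SimpleNamespace

obs_dim, act_dim, act_limit = 11, 3, 1.0
mlp = lambda i, o: nn.Sequential(nn.Linear(i, 256), nn.ReLU(), nn.Linear(256, o))
qf1 = mlp(obs_dim + act_dim, 1)     # first Q network
actor = mlp(obs_dim, act_dim)       # policy network

# Stand-ins for a replay-buffer sample and for the bootstrapped TD target
data = SimpleNamespace(observations=torch.randn(64, obs_dim),
                       actions=torch.rand(64, act_dim) * 2 - 1)
next_q_value = torch.randn(64)      # would be r + gamma * (1 - d) * min(Q1', Q2') in practice

# losses/qf1_values: average Q value of the sampled (s, a) pairs
qf1_a_values = qf1(torch.cat([data.observations, data.actions], dim=1)).view(-1)

# losses/qf1_loss: MSE between the current Q estimates and the bootstrapped targets
qf1_loss = F.mse_loss(qf1_a_values, next_q_value)

# losses/actor_loss: negative mean Q value of the actor's actions at the sampled states
pi = torch.tanh(actor(data.observations)) * act_limit
actor_loss = -qf1(torch.cat([data.observations, pi], dim=1)).mean()
```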

Implementation details

Our implementation is based on the one from sfujim/TD3. It has the following implementation differences.

  1. Ours uses two separate objects, qf1 and qf2, to represent the two Q functions in the Clipped Double Q-learning architecture, whereas (Fujimoto et al., 2018)2 uses a single Critic class that contains both Q networks (see the sketch below). That said, the two implementations are virtually the same.
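For illustration, the sketch below contrasts the two module organizations. The layer sizes and forward signatures are simplified assumptions rather than the exact code from either repository.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 11, 3

# Style A (two separate objects, as described above): each Q function is its own module.
class QNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, obs, action):
        return self.net(torch.cat([obs, action], dim=1))

qf1, qf2 = QNetwork(), QNetwork()

# Style B (a single Critic class, as in the reference code): one module holds both Q heads.
class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.q1 = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.q2 = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, obs, action):
        x = torch.cat([obs, action], dim=1)
        return self.q1(x), self.q2(x)

critic = Critic()

# Both styles yield the same quantities; only the module organization differs.
obs, action = torch.randn(4, obs_dim), torch.rand(4, act_dim) * 2 - 1
assert torch.min(qf1(obs, action), qf2(obs, action)).shape == torch.min(*critic(obs, action)).shape
```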

Experiment results

To run benchmark experiments, see benchmark/ and execute the command documented there.

Below are the average episodic returns (3 random seeds). To ensure the quality of the implementation, we compared the results against those of (Fujimoto et al., 2018)2.

| Environment | ours | (Fujimoto et al., 2018, Table 1)2 |
| --- | --- | --- |
| HalfCheetah | 9018.31 ± 1078.31 | 9636.95 ± 859.065 |
| Walker2d | 4246.07 ± 1210.84 | 4682.82 ± 539.64 |
| Hopper | 3391.78 ± 232.21 | 3564.07 ± 114.74 |

Note that our implementation uses gym MuJoCo v2 environments, while (Fujimoto et al., 2018)2 uses gym MuJoCo v1 environments. According to openai/gym#834, the gym MuJoCo v2 environments should be equivalent to the v1 environments.

Also note that the performance of our implementation seems to be worse than the reference implementation on Walker2d. This is likely due to openai/gym#938. We would have a hard time reproducing the gym MuJoCo v1 environments because they have long been deprecated.

One other thing that could cause the performance difference: the original code reports the average episodic return using deterministic evaluation (i.e., without exploration noise), see sfujim/TD3/, whereas we report the episodic return during training, where actions carry exploration noise and the policy gets updated between environment steps (see the sketch below).
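The sketch below illustrates the difference between the two protocols; the actor architecture and the exploration_noise value are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, act_limit, exploration_noise = 11, 3, 1.0, 0.1
actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
obs = torch.randn(1, obs_dim)

with torch.no_grad():
    action = torch.tanh(actor(obs)) * act_limit

    # During training (what our reported return reflects): exploration noise is added
    train_action = (action + torch.randn_like(action) * exploration_noise).clamp(-act_limit, act_limit)

    # Deterministic evaluation (what the reference implementation reports): no noise added
    eval_action = action
```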

Learning curves:

Tracked experiments and game play videos:

  1. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. arXiv, abs/1509.02971.

  2. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. arXiv, abs/1802.09477.
