Deep Deterministic Policy Gradient Method

Paper's here:

Basically you have two networks, one called the actor, the other the critic.

In DDPG the actor follows a deterministic policy (given an input, the output is not a probability distribution but a single action value). We let the critic judge how well the actor did; the actor can then use this information to improve its policy. The problem then becomes: how can we train such a network in Keras?
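
To make the actor/critic split concrete, here is a minimal sketch in plain Python. It is not the canton or Keras code from this post: `deterministic_actor`, `toy_critic`, and the hand-picked weights are hypothetical stand-ins, chosen only to show that a deterministic actor returns the same action for the same state and that the critic turns a (state, action) pair into a single score.

```python
import math

def deterministic_actor(state, weights):
    """Map a state to one action value per output row; nothing is sampled."""
    return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]

def toy_critic(state, action):
    """Hypothetical Q(s, a): highest (zero) when the action equals state[0]."""
    return -(action[0] - state[0]) ** 2

state = [0.5, -0.2]
weights = [[1.0, 0.0]]  # one action dimension, hand-picked for illustration

action = deterministic_actor(state, weights)
same_action = deterministic_actor(state, weights)  # identical input, identical output
score = toy_critic(state, action)                  # the critic's judgement of that action
```

In the real method both functions are neural networks, and the actor is trained to push the critic's score up.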

Of course you can't. For my own convenience I wrote a small library called canton. Check for more details.

  1. Define the inputs: (state, action, reward, done?, next-state)

    s1 = tf.placeholder(tf.float32,shape=[None,self.inputdims])
    a1 = tf.placeholder(tf.float32,shape=[None,self.outputdims])
    r1 = tf.placeholder(tf.float32,shape=[None,1])
    isdone = tf.placeholder(tf.float32,shape=[None,1])
    s2 = tf.placeholder(tf.float32,shape=[None,self.inputdims])
  2. Train the critic by minimizing the MSE loss between predicted Q and calculated Q (using Temporal-Difference):

    a2 = self.actor_target(s2)
    q2 = self.critic_target([s2,a2])
    q1_target = r1 + (1-isdone) * self.discount_factor * q2
    q1_predict = self.critic([s1,a1])
    critic_loss = tf.reduce_mean((q1_target - q1_predict)**2)
  3. Train the actor by maximizing the expected reward:

    a1_predict = self.actor(s1)
    q1_predict = self.critic([s1,a1_predict])
    actor_loss = tf.reduce_mean(- q1_predict)
  4. You may have noticed that there are two networks called "actor_target" and "critic_target". They are copies of the actor and the critic whose weights are slowly shifted toward those of the actor and critic, so that the TD target changes gradually, which stabilizes learning.

    tau = tf.Variable(0.001)
    aw = self.actor.get_weights()
    atw = self.actor_target.get_weights()
    cw = self.critic.get_weights()
    ctw = self.critic_target.get_weights()
    shift1 = [tf.assign(atw[i], aw[i]*tau + atw[i]*(1-tau))
        for i,_ in enumerate(aw)]
    shift2 = [tf.assign(ctw[i], cw[i]*tau + ctw[i]*(1-tau))
        for i,_ in enumerate(cw)]
  5. That's it. By using the canton library, instead of creating multiple versions of networks in Keras or dealing with variable scopes in TensorFlow, you can now update everything in one step:

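    opt = tf.train.RMSPropOptimizer(1e-4)
    # var_list assumed: each step updates only its own network's weights
    cstep = opt.minimize(critic_loss, var_list=self.critic.get_weights())
    astep = opt.minimize(actor_loss, var_list=self.actor.get_weights())

    def feed(memory):
        [s1d,a1d,r1d,isdoned,s2d] = memory # d suffix means data
        sess = ct.get_session()
        # assumed: run both training steps and both target shifts in one call
        res = sess.run([critic_loss,actor_loss,cstep,astep,shift1,shift2],
            feed_dict={s1:s1d,a1:a1d,r1:r1d,isdone:isdoned,s2:s2d})
        return res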
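
The arithmetic in steps 2 and 4 can be checked by hand without TensorFlow. The following is a small sketch in plain Python, assuming per-sample lists instead of tensors; the helper names `td_target` and `soft_update` are made up for this illustration and are not part of canton.

```python
def td_target(rewards, dones, next_q, discount_factor=0.99):
    """q1_target = r1 + (1 - isdone) * discount_factor * q2, per sample."""
    return [r + (1.0 - d) * discount_factor * q
            for r, d, q in zip(rewards, dones, next_q)]

def soft_update(target_weights, source_weights, tau=0.001):
    """Move each target weight a fraction tau toward the source weight."""
    return [w * tau + tw * (1.0 - tau)
            for tw, w in zip(target_weights, source_weights)]

# First sample is non-terminal and bootstraps from q2;
# second sample is terminal, so only its reward remains.
targets = td_target(rewards=[1.0, 2.0], dones=[0.0, 1.0],
                    next_q=[10.0, 10.0], discount_factor=0.5)
# targets == [6.0, 2.0]

# With tau = 0.5 the target moves halfway toward the source.
shifted = soft_update(target_weights=[0.0, 1.0],
                      source_weights=[1.0, 1.0], tau=0.5)
# shifted == [0.5, 1.0]
```

With the post's tau of 0.001, each update moves the target weights only 0.1% of the way toward the trained weights, which is what keeps the target networks slow.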

Code available at

At the time of this writing:


last modified: 2017-02-18 22:08