Wednesday, November 21, 2018

[Reinforcement Learning] Getting started with the policy gradient method for reinforcement learning

This post is about my first attempt to learn the policy gradient method for reinforcement learning. There are already plenty of materials on the internet, but this time I only want to focus on the following tutorial (sorry, it is written in Chinese):
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/5-1-policy-gradient-softmax1/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/5-2-policy-gradient-softmax2/




The CartPole demo source code is at the link below:
https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/7_Policy_gradient_softmax


The most important part of it is the Policy Gradient algorithm:
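In my own words (a rough summary, not the tutorial's exact notation), the Monte-Carlo policy gradient (REINFORCE) estimate behind the demo is

\[
\nabla_\theta J(\theta) \;\approx\; \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, v_t
\]

where \(\pi_\theta(a_t \mid s_t)\) is the probability the policy network assigns to the action actually taken at step t, and \(v_t\) is the discounted (and normalized) return from that step onward. The agent first plays a whole episode, then nudges the log-probability of each action up or down in proportion to the return that followed it.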


Here I am going to present some dumped data from the demo so that it may help to understand policy gradients in this CartPole game.

An episode means the agent plays the game from start to finish.
So here I will give an example, episode 0:

The actions taken by the agent in this episode:
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
The states look like:
array([-0.03290773, -0.01920455,  0.00946775,  0.03302626]), 
array([-0.03329182,  0.17578036,  0.01012827, -0.25665452]),
...
This episode has a total reward of 13.
('episode:', 0, '  reward:', 13)
The rewards in detail:
[1.0, 1.0, 1.0, 1.0, 
 1.0, 1.0, 1.0, 1.0, 
 1.0, 1.0, 1.0, 1.0, 1.0]
We need to transform the rewards into discounted and normalized rewards. The discounted cumulative rewards (with gamma = 0.99, before normalization) are:
[12.2478977 , 11.36151283, 10.46617457,  9.5617925 ,  
 8.64827525, 7.72553056,  6.79346521,  5.85198506,  
 4.90099501,  3.940399  , 2.9701    ,  1.99      ,  1. ]
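As a quick check, the dumped numbers are exactly a discounted cumulative sum of the 13 rewards with gamma = 0.99. Below is a minimal numpy sketch of that step (my own re-implementation; the demo should have an equivalent _discount_and_norm_rewards() that also normalizes afterwards, if I read it correctly):

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    # discounted cumulative sum: v_t = r_t + gamma * v_{t+1}
    discounted = np.zeros(len(rewards))
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted

ep_rs = [1.0] * 13                # the 13 rewards of episode 0
vt = discount_rewards(ep_rs)      # [12.2478977, 11.36151283, ..., 1.99, 1.0]

# as far as I can tell, the demo then also normalizes before training:
# vt -= np.mean(vt); vt /= np.std(vt)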
Here I drew a diagram to help understand the source code in the demo, for reference.


Now, based on the tutorial, we know that the loss function for policy gradients is as follows:
 # maximizing the total reward (log_p * R) is the same as minimizing -(log_p * R), and TensorFlow can only minimize a loss
 neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts) # the -log probability of the chosen action
 # the following line is equivalent:
 # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1)
 loss = tf.reduce_mean(neg_log_prob * self.tf_vt)  # (vt = current reward + discounted future rewards) guides the gradient descent on the parameters

I dumped the neg_log_prob variable for this episode:
[0.62964076, 0.7338653 , 0.62811565, 0.65259093, 
 0.6791367 , 0.7074002 , 0.7367828 , 0.76643014, 
 0.7952844 , 0.5788585 , 0.8118235 , 0.56644785, 0.8317267 ]

Then, we can verify these values by hand:
 # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1)


import math
import numpy as np
# one-hot encoding of the 13 actions listed above
one_hot_acts = np.array([[1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.]])

# the dumped all_act_prob (softmax output) for each of the 13 steps
all_act_prob = np.array([[0.53278315, 0.46721685],
       [0.5199501 , 0.48004982],
       [0.53359634, 0.4664037 ],
       [0.520695  , 0.47930512],
       [0.50705457, 0.49294546],
       [0.49292403, 0.50707597],
       [0.47865143, 0.5213486 ],
       [0.4646689 , 0.53533113],
       [0.45145282, 0.5485472 ],
       [0.4394621 , 0.5605378 ],
       [0.44404763, 0.55595237],
       [0.4324621 , 0.56753784],
       [0.43529704, 0.56470305]])

# for neg_log_prob[0]
-math.log((all_act_prob[0]*one_hot_acts[0].transpose())[0])
==> 0.6296407856314258

# for neg_log_prob[1]
-math.log((all_act_prob[1]*one_hot_acts[1].transpose())[1])
==> 0.7338653887995161

# for neg_log_prob[2]
-math.log((all_act_prob[2]*one_hot_acts[2].transpose())[0])
==> 0.6281156434747114
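The same check can also be done for all 13 values at once with a vectorized form of the commented-out line, using the two arrays above:

# pick the probability of the chosen action via the one-hot mask, then take -log
neg_log_prob_check = -np.log(np.sum(all_act_prob * one_hot_acts, axis=1))
# neg_log_prob_check reproduces the dumped neg_log_prob values (up to float32 rounding)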
    
So we can verify that the values match the dumped neg_log_prob:
[0.62964076, 0.7338653 , 0.62811565, 0.65259093, 
 0.6791367 , 0.7074002 , 0.7367828 , 0.76643014, 
 0.7952844 , 0.5788585 , 0.8118235 , 0.56644785, 0.8317267 ]
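Putting the pieces together, the scalar loss that gets minimized is just the mean of neg_log_prob weighted by the discounted-and-normalized returns. A minimal numpy sketch, reusing discount_rewards() and neg_log_prob_check from the sketches above (the normalization step is my reading of the demo, not dumped data):

vt = discount_rewards([1.0] * 13)         # discounted returns of episode 0
vt = (vt - np.mean(vt)) / np.std(vt)      # normalization, as I understand the demo

# numpy version of: loss = tf.reduce_mean(neg_log_prob * self.tf_vt)
loss = np.mean(neg_log_prob_check * vt)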
With TensorFlow, we can train this policy gradient model based on the loss function:
with tf.name_scope('train'):
    self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
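
For completeness, the demo's run script drives all of this with a standard gym loop, roughly like the sketch below. It is written from memory, so treat the method names (choose_action, store_transition, learn), the hyperparameters, and the import as my assumptions about the demo rather than exact code:

import gym
from RL_brain import PolicyGradient   # file name as I recall it from the demo repo

env = gym.make('CartPole-v0')
RL = PolicyGradient(n_actions=env.action_space.n,
                    n_features=env.observation_space.shape[0],
                    learning_rate=0.02,       # values from memory, may differ from the demo
                    reward_decay=0.99)

for i_episode in range(1000):                 # episode count is arbitrary here
    observation = env.reset()
    while True:
        action = RL.choose_action(observation)            # sample from all_act_prob
        observation_, reward, done, info = env.step(action)
        RL.store_transition(observation, action, reward)  # collect the whole episode
        if done:
            print('episode:', i_episode, '  reward:', int(sum(RL.ep_rs)))
            vt = RL.learn()   # discount + normalize rewards, then one gradient step
            break
        observation = observation_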

In sum, by checking the input data and dumping the intermediate data, I can better understand the policy gradient method for reinforcement learning.
