https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/5-1-policy-gradient-softmax1/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/5-2-policy-gradient-softmax2/
The CartPole demo source code is in the link below:
https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/7_Policy_gradient_softmax
The most important part of it is the Policy Gradient algorithm.
Here I am going to present some dumped data from the demo example, which may help with understanding Policy Gradients in this CartPole game.
An episode means the agent plays the game from the start through to the end of the game.
So here I will give an example, episode 0.
The actions taken by the agent in this episode:
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
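In Gym's CartPole, action 0 pushes the cart to the left and action 1 pushes it to the right, and the agent samples each action from the softmax probabilities output by the policy network. Below is a minimal sketch of that sampling step (the probability values are taken from the all_act_prob dump shown later; the variable names are mine, not the demo's):

import numpy as np

# softmax output of the policy network at the first step of the episode
prob_weights = np.array([0.53278315, 0.46721685])

# sample an action (0 = push left, 1 = push right) according to these probabilities
action = np.random.choice(len(prob_weights), p=prob_weights)
print(action)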
The states look like: array([-0.03290773, -0.01920455, 0.00946775, 0.03302626]),
array([-0.03329182, 0.17578036, 0.01012827, -0.25665452]),
...
This episode has a total reward of 13: ('episode:', 0, ' reward:', 13)
The rewards in detail: [1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0]
We need to transform the rewards into discounted and normalized rewards: [12.2478977 , 11.36151283, 10.46617457, 9.5617925 ,
8.64827525, 7.72553056, 6.79346521, 5.85198506,
4.90099501, 3.940399 , 2.9701 , 1.99 , 1. ]
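These numbers can be reproduced by running a discounted cumulative sum backwards over the episode rewards. Below is a minimal sketch, assuming a discount factor gamma of 0.99 (the demo additionally normalizes the result by subtracting the mean and dividing by the standard deviation):

import numpy as np

rewards = [1.0] * 13   # the 13 rewards dumped above
gamma = 0.99           # assumed discount factor

# discounted cumulative sum, computed from the last step backwards
discounted = np.zeros(len(rewards))
running_add = 0.0
for t in reversed(range(len(rewards))):
    running_add = running_add * gamma + rewards[t]
    discounted[t] = running_add

print(discounted)   # [12.2478977, 11.36151283, ..., 1.99, 1.]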
Here I drew a diagram to help understand the source code in the demo, for reference. Now, based on the tutorial, we know the loss function for policy gradients is as follows:
# Maximizing the total reward (log_p * R) is the same as minimizing -(log_p * R), and TensorFlow can only minimize a loss
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts) # the -log probability of the chosen action
# the following way is equivalent:
# neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1)
loss = tf.reduce_mean(neg_log_prob * self.tf_vt) # (vt = current reward + discounted future rewards) guides the gradient descent on the parameters
I dumped the neg_log_prob variable for this episode:
[0.62964076, 0.7338653 , 0.62811565, 0.65259093,
0.6791367 , 0.7074002 , 0.7367828 , 0.76643014,
0.7952844 , 0.5788585 , 0.8118235 , 0.56644785, 0.8317267 ]
# neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1)
import math
import numpy as np

one_hot_acts = np.array([[1., 0.], [0., 1.], [1., 0.], [1., 0.],
                         [1., 0.], [1., 0.], [1., 0.], [1., 0.],
                         [1., 0.], [0., 1.], [1., 0.], [0., 1.], [1., 0.]])

all_act_prob = np.array([[0.53278315, 0.46721685],
                         [0.5199501 , 0.48004982],
                         [0.53359634, 0.4664037 ],
                         [0.520695  , 0.47930512],
                         [0.50705457, 0.49294546],
                         [0.49292403, 0.50707597],
                         [0.47865143, 0.5213486 ],
                         [0.4646689 , 0.53533113],
                         [0.45145282, 0.5485472 ],
                         [0.4394621 , 0.5605378 ],
                         [0.44404763, 0.55595237],
                         [0.4324621 , 0.56753784],
                         [0.43529704, 0.56470305]])

# for neg_log_prob[0]
-math.log((all_act_prob[0]*one_hot_acts[0].transpose())[0])  # ==> 0.6296407856314258
# for neg_log_prob[1]
-math.log((all_act_prob[1]*one_hot_acts[1].transpose())[1])  # ==> 0.7338653887995161
# for neg_log_prob[2]
-math.log((all_act_prob[2]*one_hot_acts[2].transpose())[0])  # ==> 0.6281156434747114
So we can verify that the values are the same:
[0.62964076, 0.7338653 , 0.62811565, 0.65259093,
0.6791367 , 0.7074002 , 0.7367828 , 0.76643014,
0.7952844 , 0.5788585 , 0.8118235 , 0.56644785, 0.8317267 ]
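Putting the two dumps together, the scalar loss for this episode is just the mean of neg_log_prob weighted element-wise by vt. Below is a minimal sketch using the dumped values (here the discounted rewards are used directly as vt, whereas the demo actually feeds in the normalized version):

import numpy as np

neg_log_prob = np.array([0.62964076, 0.7338653 , 0.62811565, 0.65259093,
                         0.6791367 , 0.7074002 , 0.7367828 , 0.76643014,
                         0.7952844 , 0.5788585 , 0.8118235 , 0.56644785, 0.8317267 ])

vt = np.array([12.2478977 , 11.36151283, 10.46617457, 9.5617925 ,
               8.64827525, 7.72553056, 6.79346521, 5.85198506,
               4.90099501, 3.940399  , 2.9701    , 1.99      , 1.        ])

# same computation as loss = tf.reduce_mean(neg_log_prob * self.tf_vt)
loss = np.mean(neg_log_prob * vt)
print(loss)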
With TensorFlow, we can train this policy gradient model based on the loss function:
with tf.name_scope('train'):
    self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
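At the end of each episode, one training step feeds the stored observations, actions, and the discounted (and normalized) rewards into this train_op. A rough sketch of that call, assuming the demo's placeholders are named tf_obs, tf_acts, and tf_vt (tf_acts and tf_vt appear in the loss code above; tf_obs is my assumption):

# one policy-gradient update at the end of an episode (sketch)
self.sess.run(self.train_op, feed_dict={
    self.tf_obs: np.vstack(self.ep_obs),    # the dumped states, shape [13, 4] here
    self.tf_acts: np.array(self.ep_as),     # the dumped actions [0, 1, 0, ...]
    self.tf_vt: discounted_ep_rs_norm,      # the discounted/normalized rewards
})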
In sum, by checking the input data and dumping the intermediate data, I can better understand the policy gradient method for reinforcement learning.