Thursday, November 22, 2018

[Reinforcement Learning] Getting started with Q-Learning for reinforcement learning

The previous post about reinforcement learning:
[Reinforcement Learning] Get started to learn gradient method for reinforcement learning

For this Q-Learning tutorial, I refer to the following resources (sorry, they are written in Chinese):
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-2-tabular-q1/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-3-tabular-q2/



The Q-Learning demo example is on GitHub:
https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/2_Q_Learning_maze

The most important part of it is the Q-Learning update rule:

Q(s, a) ← Q(s, a) + α * [ r + γ * max_a' Q(s', a') - Q(s, a) ]

where α is the learning rate, γ is the discount factor, r is the reward for taking action a in state s, and s' is the next state. If s' is terminal, the target is simply r.
Here I am going to dump the data during the training process of Q-Learning so that we can understand the algorithm more quickly.
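
The dump below comes from printing each transition together with the current Q_table right before calling learn(). A minimal sketch of such a logging helper (dump_step is my own hypothetical name; it just mirrors the dump format shown next):

def dump_step(q_table, s, a, r, s_):
    # Print one transition and the current Q-table in the format used below.
    print("Current state: %r" % s)
    print("Action: %s" % a)
    print("Reward: %s" % r)
    print("Next state: %r" % s_)
    print("Current Q_table:")
    print(q_table)

In the demo's training loop, it would be called as dump_step(RL.q_table, str(observation), action, reward, str(observation_)) just before RL.learn(...).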

Suppose we get the following dump at one learning step:
Current state: '[45.0, 45.0, 75.0, 75.0]'
Action: 1
Reward: -1
Next state: 'terminal'

Current Q_table:
State \ Action            0    1    2    3
[5.0, 5.0, 35.0, 35.0]    0.0  0.0  0.0  0.0
[5.0, 45.0, 35.0, 75.0]   0.0  0.0  0.0  0.0
[45.0, 45.0, 75.0, 75.0]  0.0  0.0  0.0  0.0
terminal                  0.0  0.0  0.0  0.0
We don't need to pay too much attention to the state values in the lists, because the states are generated by the environment. So, based on this data, how does the Q-Learning algorithm update the table?
    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)  # check whether s_ already exists in q_table (see below)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update the corresponding state-action value
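learn() first calls check_state_exist(s_) so that an unseen state gets an all-zero row before it is looked up. For completeness, here is a minimal sketch of the class around it, assuming (as in the repo) that the Q-table is a pandas DataFrame indexed by state strings with one column per action, with a default learning rate of 0.01 and discount factor of 0.9:

import pandas as pd

class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9):
        self.actions = actions     # list of action indices, e.g. [0, 1, 2, 3]
        self.lr = learning_rate    # the learning rate in the update rule
        self.gamma = reward_decay  # the discount factor in the update rule
        self.q_table = pd.DataFrame(columns=self.actions, dtype=float)

    def check_state_exist(self, state):
        # If this state has never been seen, append an all-zero row for it.
        if state not in self.q_table.index:
            self.q_table.loc[state] = [0.0] * len(self.actions)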
First, we compute q_predict and q_target with the learn() method above. Since the next state is 'terminal', q_target is simply the reward:
q_predict = 0.0
q_target = r = -1

Then, with the learning rate lr = 0.01, the update is:
self.q_table.loc[s, a] += 0.01 * (-1 - 0.0)
self.q_table.loc['[45.0, 45.0, 75.0, 75.0]', 1] = 0.0 + 0.01 * (-1 - 0.0) = -0.01
So, we will get the new Q_table:
Current Q_table:
State \ Action            0     1    2    3
[5.0, 5.0, 35.0, 35.0]    0.0  0.00  0.0  0.0
[5.0, 45.0, 35.0, 75.0]   0.0  0.00  0.0  0.0
[45.0, 45.0, 75.0, 75.0]  0.0 -0.01  0.0  0.0
terminal                  0.0  0.00  0.0  0.0
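
We can verify this single update with a small self-contained script (a sketch that rebuilds the Q_table from the first dump and applies the same update as learn()):

import pandas as pd

actions = [0, 1, 2, 3]
states = ['[5.0, 5.0, 35.0, 35.0]',
          '[5.0, 45.0, 35.0, 75.0]',
          '[45.0, 45.0, 75.0, 75.0]',
          'terminal']

# Build the all-zero Q-table from the first dump.
q_table = pd.DataFrame(0.0, index=states, columns=actions)

lr, gamma = 0.01, 0.9
s, a, r, s_ = '[45.0, 45.0, 75.0, 75.0]', 1, -1, 'terminal'

q_predict = q_table.loc[s, a]
q_target = r if s_ == 'terminal' else r + gamma * q_table.loc[s_, :].max()
q_table.loc[s, a] += lr * (q_target - q_predict)

print(q_table)

Running it prints the table above, with -0.01 at row '[45.0, 45.0, 75.0, 75.0]', column 1.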
