Preface
Recently I have been studying Reinforcement Learning again, which reminded me of Deeplizard, the course I used when I first started (course official website).
So I followed it for a refresher. It covers the Frozen Lake example, so I am writing down my notes here.
Main Content
1. Frozen Lake
Here is a short introduction to Frozen Lake. You and a friend are playing frisbee on a frozen lake. The frisbee flies off and you have to go retrieve it, but the ice is dotted with holes: if you fall into one, you can no longer pick up the frisbee, because you are about to freeze. So the task is to retrieve the frisbee while steering around the holes.
For Reinforcement Learning, the agent in this environment essentially does the following (a minimal interaction sketch follows this list):
- decide which action to take at each step
- if the new state is neither a hole nor the frisbee, there is no reward, the game is not over, and the agent keeps choosing actions
- if the new state is a hole, the episode ends; in the Gym implementation the reward here is 0 as well (the penalty is simply that the +1 is never collected)
- if the new state is the frisbee (the goal), the reward is +1 and the episode ends
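Here is a minimal interaction sketch (my own addition, not part of the course code) that plays one episode with random actions, just to make the state/action/reward/termination loop concrete:

import gym

env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=False)
state, _ = env.reset()
done = False
while not done:
    action = env.action_space.sample()                       # random action: 0=Left, 1=Down, 2=Right, 3=Up
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated                           # hole/goal ends the episode, or the time limit hits
    print(f"state={state}, reward={reward}, done={done}")
env.close()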
2. Python Code
The Python code is fairly self-explanatory and the comments are inline, so I will not add much extra explanation.
2.1 Python Environment
from IPython.display import clear_output
import numpy as np
import random
import time
import gym
import sys
print("Python Version:", sys.version_info)
print("numpy:", np.__version__)
print("gym:", gym.__version__)
Output:
Python Version: sys.version_info(major=3, minor=8, micro=17, releaselevel='final', serial=0)
numpy: 1.24.4
gym: 0.26.2
2.2 When is_slippery = False
is_slippery controls whether the ice is slippery. If it is, then even when you intend to turn left you may slide somewhere else because your footing is unstable. For the agent this means that choosing the action Left does not guarantee moving left: in the Gym implementation, the intended action is executed with probability 1/3, and the agent slips into each of the two perpendicular directions with probability 1/3 each. With is_slippery=False none of this happens, and that deterministic case is the one we look at first.
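As a side note (my own sketch, not part of the course code), the toy_text FrozenLake implementation exposes its transition table as env.unwrapped.P, so you can check this behaviour directly. P[state][action] is a list of (probability, next_state, reward, terminated) tuples:

import gym

for slippery in (False, True):
    probe_env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=slippery)
    print(f"is_slippery={slippery}, state 0, action 0 (Left):")
    # With slipping there are three equally likely outcomes; without it, just one
    for prob, next_state, reward, terminated in probe_env.unwrapped.P[0][0]:
        print(f"  prob={prob:.2f} -> next_state={next_state}, reward={reward}, terminated={terminated}")
    probe_env.close()

With is_slippery=False each (state, action) pair has a single outcome with probability 1.0.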
env = gym.make('FrozenLake-v1', render_mode="ansi", desc=None, map_name="4x4", is_slippery=False)
# Preview
# Actions: 0=Left, 1=Down, 2=Right, 3=Up
action_space_size = env.action_space.n
# States: 4x4 grids
state_space_size = env.observation_space.n
# action Q table
q_table = np.zeros((state_space_size, action_space_size))
print(q_table)
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
Below are the hyperparameters for our Q-Learning setup:
# Reinforcement Learning Related
num_episodes = 10000
max_steps_per_episode = 1000
# Alpha factor
learning_rate = 0.05
# Gamma factor
discount_rate = 0.99
# Exploration & Exploitation trade-off
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.005
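To get a feel for the decay schedule, here is a tiny sketch (my own addition) that evaluates the same exponential-decay formula used in the training loop below at a few episode indices:

import numpy as np

min_eps, max_eps, decay = 0.01, 1.0, 0.005
for episode in (0, 100, 500, 1000, 2000, 5000):
    # Same formula as the "exploration rate decay" step in the training loop
    eps = min_eps + (max_eps - min_eps) * np.exp(-decay * episode)
    print(f"episode {episode:5d}: exploration_rate ~= {eps:.3f}")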
Below is the Q-Learning algorithm itself:
# See how our game's result changes over time
rewards_all_episodes = []

# Q-learning algorithm
for episode in range(num_episodes):
    # Put the agent back at the start point
    state, _ = env.reset()
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            # Exploitation: pick the action with the highest Q-value
            action = np.argmax(q_table[state, :])
        else:
            # Exploration: sample a random action from the action space
            action = env.action_space.sample()

        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Update Q-table entry q(s, a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    # Record this episode's reward
    rewards_all_episodes.append(rewards_current_episode)
# Afterwards, compute and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("**********Average reward per thousand episodes**********\n")
for r in rewards_per_thousand_episodes:
    print(count, ":", str(sum(r / 1000)))
    count += 1000

# Print the updated Q-table
print("**********Q-table**********\n")
print(q_table)
**********Average reward per thousand episodes**********
1000 : 0.7520000000000006
2000 : 0.9880000000000008
3000 : 0.9960000000000008
4000 : 0.9900000000000008
5000 : 0.9880000000000008
6000 : 0.9910000000000008
7000 : 0.9910000000000008
8000 : 0.9890000000000008
9000 : 0.9860000000000008
10000 : 0.9910000000000008
**********Q-table**********
[[8.06589460e-01 9.50990050e-01 5.67892119e-01 7.73781214e-01]
[8.22740851e-01 0.00000000e+00 8.47199256e-08 1.49560383e-02]
[1.16552527e-02 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[8.23207313e-01 9.60596010e-01 0.00000000e+00 7.42094769e-01]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[8.32759183e-01 0.00000000e+00 9.70299000e-01 7.42466200e-01]
[8.55055630e-01 9.80100000e-01 8.15494994e-01 0.00000000e+00]
[7.73686938e-02 9.79971383e-01 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 8.24959400e-01 9.90000000e-01 8.84943244e-01]
[7.80887353e-01 8.69641342e-01 1.00000000e+00 7.78435427e-01]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
We can see that the agent reaches the goal almost every time (~99%). Since the ice never slips, the route the agent learns is completely stable: one fixed path is all it needs.
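As a quick sanity check (my own addition, not part of the course code), you can also read the greedy policy straight off the learned Q-table and print it as a 4x4 grid of arrows; holes and the goal show up as "<" only because argmax over an all-zero row defaults to action 0:

import numpy as np

arrows = {0: "<", 1: "v", 2: ">", 3: "^"}      # FrozenLake action indices
greedy_actions = np.argmax(q_table, axis=1)    # q_table is the one trained above
for row in greedy_actions.reshape(4, 4):
    print(" ".join(arrows[a] for a in row))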
2.3 When is_slippery = True
With is_slippery = True, the only thing we change compared to the False case is the hyperparameters:
# Reinforcement Learning Related
num_episodes = 10000
max_steps_per_episode = 1000
# Alpha factor
learning_rate = 0.05
# Gamma factor
discount_rate = 0.99
# Exploration & Exploitation trade-off
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
Everything else stays the same. The final result is:
**********Average reward per thousand episodes**********
1000 : 0.027000000000000017
2000 : 0.08700000000000006
3000 : 0.2770000000000002
4000 : 0.5980000000000004
5000 : 0.6950000000000005
6000 : 0.7460000000000006
7000 : 0.7530000000000006
8000 : 0.7710000000000006
9000 : 0.7490000000000006
10000 : 0.7870000000000006
**********Q-table**********
[[0.56152929 0.52858721 0.52380312 0.52891231]
[0.25312457 0.25966285 0.19375211 0.51240278]
[0.37648444 0.38706739 0.39410148 0.47019449]
[0.26373164 0.29498732 0.28893064 0.4563159 ]
[0.5795818 0.46690116 0.45234356 0.36347191]
[0. 0. 0. 0. ]
[0.34992959 0.15112735 0.24753333 0.07792382]
[0. 0. 0. 0. ]
[0.39069635 0.4385058 0.37673981 0.61005618]
[0.48619616 0.66728611 0.5007645 0.43743515]
[0.57924865 0.45193664 0.30692979 0.27656299]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]
[0.47638143 0.59430757 0.75652722 0.48823196]
[0.73234353 0.89196141 0.74558653 0.76624524]
[0. 0. 0. 0. ]]
2.4 Comparing is_slippery = True and is_slippery = False
We can see that the slippery run (is_slippery=True) no longer reaches the goal ~99% of the time: the agent can slide off its intended path, so the goal is much harder to reach reliably.
Note also that the only hyperparameter we changed is exploration_decay_rate: 0.005 for is_slippery=False versus 0.001 for is_slippery=True. The non-slippery environment does not need much exploration, because nothing unexpected happens and the correct route is easy to find, so we can shift to exploitation quickly. The slippery environment is stochastic: accidents do happen, and a near-optimal policy is harder to find, so we decay the exploration rate more slowly and give the agent more time to explore before it settles into exploitation.
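To make the comparison concrete, here is a hedged sketch (my own addition) that evaluates a trained setup acting purely greedily, with no exploration, and reports the success rate; env and q_table are assumed to be the environment and Q-table from the training above:

import numpy as np

n_eval = 1000
successes = 0
for _ in range(n_eval):
    state, _ = env.reset()
    done = False
    while not done:
        # Always take the greedy action; the TimeLimit wrapper ends stuck episodes
        action = np.argmax(q_table[state, :])
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    successes += reward          # reward is 1 only when the goal is reached
print(f"Greedy success rate: {successes / n_eval:.3f}")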
2.5 Visualization
Below is the code for a text-based visualization:
# Watch the trained agent play Frozen Lake
for episode in range(3):
    state, _ = env.reset()
    done = False
    print("**********Episode ", episode + 1, "**********\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        print(env.render())
        time.sleep(0.3)

        # Always act greedily w.r.t. the learned Q-table
        action = np.argmax(q_table[state, :])
        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        if done:
            clear_output(wait=True)
            print(env.render())
            if reward == 1:
                print("*****You got the frisbee!*****")
            else:
                print("*****You fell into the hole! XD*****")
            time.sleep(3)
            clear_output(wait=True)
            break

        state = new_state

env.close()
Output:
(Right)
SFFF
FHFH
FFFH
HFFG
*****You got the frisbee!*****
Summary
This was a quick refresher on Reinforcement Learning and a slightly deeper look at Q-Learning.
Keep going, keep learning!
References
[1] Reinforcement Learning - Developing Intelligent Agents
[2] Watch Q-learning Agent Play Game with Python - Reinforcement Learning Code Project
[3] Github - gym/gym/envs/toy_text/frozen_lake.py