Preface
Recently I have been studying Reinforcement Learning again, which reminded me of Deeplizard, the course I used when I first started (course official website).
So I followed it for a refresher. It covers the Frozen Lake example, so I am writing down my notes here.
Main Content
1. Frozen Lake
Here is a short introduction to Frozen Lake. You and a friend are playing frisbee on a frozen lake. The frisbee flies off and you have to go retrieve it, but the ice is dotted with holes: if you fall into one, you can no longer pick up the frisbee, because you are about to freeze. So the task is to retrieve the frisbee while steering around the holes.
For Reinforcement Learning, the agent in this environment essentially does the following (a minimal interaction sketch follows this list):
- decide which action to take at each step
- if the new state is neither a hole nor the frisbee, there is no reward, the game is not over, and the agent keeps choosing actions
- if the new state is a hole, the episode ends; in the Gym implementation the reward here is 0 as well (the penalty is simply that the +1 is never collected)
- if the new state is the frisbee (the goal), the reward is +1 and the episode ends
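Here is a minimal interaction sketch (my own addition, not part of the course code) that plays one episode with random actions, just to make the state/action/reward/termination loop concrete:

import gym

env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=False)
state, _ = env.reset()
done = False
while not done:
    action = env.action_space.sample()                       # random action: 0=Left, 1=Down, 2=Right, 3=Up
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated                           # hole/goal ends the episode, or the time limit hits
    print(f"state={state}, reward={reward}, done={done}")
env.close()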
2. Python Code
The Python code is fairly self-explanatory and the comments are inline, so I will not add much extra explanation.
2.1 Python Environment
from IPython.display import clear_output
import numpy as np
import random
import time
import gym
import sys
print("Python Version:", sys.version_info)
print("numpy:", np.__version__)
print("gym:", gym.__version__)
Output:
Python Version: sys.version_info(major=3, minor=8, micro=17, releaselevel='final', serial=0)
numpy: 1.24.4
gym: 0.26.2
2.2 When is_slippery = False
is_slippery controls whether the ice is slippery. If it is, then even when you intend to turn left you may slide somewhere else because your footing is unstable. For the agent this means that choosing the action Left does not guarantee moving left: in the Gym implementation, the intended action is executed with probability 1/3, and the agent slips into each of the two perpendicular directions with probability 1/3 each. With is_slippery=False none of this happens, and that deterministic case is the one we look at first.
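As a side note (my own sketch, not part of the course code), the toy_text FrozenLake implementation exposes its transition table as env.unwrapped.P, so you can check this behaviour directly. P[state][action] is a list of (probability, next_state, reward, terminated) tuples:

import gym

for slippery in (False, True):
    probe_env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=slippery)
    print(f"is_slippery={slippery}, state 0, action 0 (Left):")
    # With slipping there are three equally likely outcomes; without it, just one
    for prob, next_state, reward, terminated in probe_env.unwrapped.P[0][0]:
        print(f"  prob={prob:.2f} -> next_state={next_state}, reward={reward}, terminated={terminated}")
    probe_env.close()

With is_slippery=False each (state, action) pair has a single outcome with probability 1.0.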
env = gym.make('FrozenLake-v1', render_mode="ansi", desc=None, map_name="4x4", is_slippery=False)
# Preview
# Actions: 0=Left, 1=Down, 2=Right, 3=Up
action_space_size = env.action_space.n
# States: 4x4 grids
state_space_size = env.observation_space.n
# action Q table
q_table = np.zeros((state_space_size, action_space_size))
print(q_table)
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
Below are the hyperparameters for our Q-Learning setup:
# Reinforcement Learning Related
num_episodes = 10000
max_steps_per_episode = 1000
# Alpha factor
learning_rate = 0.05
# Gamma factor
discount_rate = 0.99
# Exploration & Exploitation trade-off
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.005
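To get a feel for the decay schedule, here is a tiny sketch (my own addition) that evaluates the same exponential-decay formula used in the training loop below at a few episode indices:

import numpy as np

min_eps, max_eps, decay = 0.01, 1.0, 0.005
for episode in (0, 100, 500, 1000, 2000, 5000):
    # Same formula as the "exploration rate decay" step in the training loop
    eps = min_eps + (max_eps - min_eps) * np.exp(-decay * episode)
    print(f"episode {episode:5d}: exploration_rate ~= {eps:.3f}")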
Below is the Q-Learning algorithm itself:
# See how our game's result changes over time
rewards_all_episodes = []

# Q-learning algorithm
for episode in range(num_episodes):
    # Put the agent back at the start point
    state, _ = env.reset()
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            # Exploitation: pick the action with the highest Q-value
            action = np.argmax(q_table[state, :])
        else:
            # Exploration: sample a random action from the action space
            action = env.action_space.sample()

        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Update Q-table entry q(s, a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    # Record this episode's reward
    rewards_all_episodes.append(rewards_current_episode)
# Afterwards, compute and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("**********Average reward per thousand episodes**********\n")
for r in rewards_per_thousand_episodes:
    print(count, ":", str(sum(r / 1000)))
    count += 1000

# Print the updated Q-table
print("**********Q-table**********\n")
print(q_table)
**********Average reward per thousand episodes**********
1000 : 0.7520000000000006
2000 : 0.9880000000000008
3000 : 0.9960000000000008
4000 : 0.9900000000000008
5000 : 0.9880000000000008
6000 : 0.9910000000000008
7000 : 0.9910000000000008
8000 : 0.9890000000000008
9000 : 0.9860000000000008
10000 : 0.9910000000000008
**********Q-table**********
[[8.06589460e-01 9.50990050e-01 5.67892119e-01 7.73781214e-01]
[8.22740851e-01 0.00000000e+00 8.47199256e-08 1.49560383e-02]
[1.16552527e-02 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[8.23207313e-01 9.60596010e-01 0.00000000e+00 7.42094769e-01]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[8.32759183e-01 0.00000000e+00 9.70299000e-01 7.42466200e-01]
[8.55055630e-01 9.80100000e-01 8.15494994e-01 0.00000000e+00]
[7.73686938e-02 9.79971383e-01 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[0.00000000e+00 8.24959400e-01 9.90000000e-01 8.84943244e-01]
[7.80887353e-01 8.69641342e-01 1.00000000e+00 7.78435427e-01]
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
We can see that the agent reaches the goal almost every time (~99%). Since the ice never slips, the route the agent learns is completely stable: one fixed path is all it needs.
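As a quick sanity check (my own addition, not part of the course code), you can also read the greedy policy straight off the learned Q-table and print it as a 4x4 grid of arrows; holes and the goal show up as "<" only because argmax over an all-zero row defaults to action 0:

import numpy as np

arrows = {0: "<", 1: "v", 2: ">", 3: "^"}      # FrozenLake action indices
greedy_actions = np.argmax(q_table, axis=1)    # q_table is the one trained above
for row in greedy_actions.reshape(4, 4):
    print(" ".join(arrows[a] for a in row))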
2.3 When is_slippery = True
With is_slippery = True, the only thing we change compared to the False case is the hyperparameters:
# Reinforcement Learning Related
num_episodes = 10000
max_steps_per_episode = 1000
# Alpha factor
learning_rate = 0.05
# Gamma factor
discount_rate = 0.99
# Exploration & Exploitation trade-off
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
Everything else stays the same. The final result is:
**********Average reward per thousand episodes**********
1000 : 0.027000000000000017
2000 : 0.08700000000000006
3000 : 0.2770000000000002
4000 : 0.5980000000000004
5000 : 0.6950000000000005
6000 : 0.7460000000000006
7000 : 0.7530000000000006
8000 : 0.7710000000000006
9000 : 0.7490000000000006
10000 : 0.7870000000000006
**********Q-table**********
[[0.56152929 0.52858721 0.52380312 0.52891231]
[0.25312457 0.25966285 0.19375211 0.51240278]
[0.37648444 0.38706739 0.39410148 0.47019449]
[0.26373164 0.29498732 0.28893064 0.4563159 ]
[0.5795818 0.46690116 0.45234356 0.36347191]
[0. 0. 0. 0. ]
[0.34992959 0.15112735 0.24753333 0.07792382]
[0. 0. 0. 0. ]
[0.39069635 0.4385058 0.37673981 0.61005618]
[0.48619616 0.66728611 0.5007645 0.43743515]
[0.57924865 0.45193664 0.30692979 0.27656299]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]
[0.47638143 0.59430757 0.75652722 0.48823196]
[0.73234353 0.89196141 0.74558653 0.76624524]
[0. 0. 0. 0. ]]
2.4 Comparing is_slippery = True and is_slippery = False
We can see that the slippery run (is_slippery=True) no longer reaches the goal ~99% of the time: the agent can slide off its intended path, so the goal is much harder to reach reliably.
Note also that the only hyperparameter we changed is exploration_decay_rate: 0.005 for is_slippery=False versus 0.001 for is_slippery=True. The non-slippery environment does not need much exploration, because nothing unexpected happens and the correct route is easy to find, so we can shift to exploitation quickly. The slippery environment is stochastic: accidents do happen, and a near-optimal policy is harder to find, so we decay the exploration rate more slowly and give the agent more time to explore before it settles into exploitation.
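To make the comparison concrete, here is a hedged sketch (my own addition) that evaluates a trained setup acting purely greedily, with no exploration, and reports the success rate; env and q_table are assumed to be the environment and Q-table from the training above:

import numpy as np

n_eval = 1000
successes = 0
for _ in range(n_eval):
    state, _ = env.reset()
    done = False
    while not done:
        # Always take the greedy action; the TimeLimit wrapper ends stuck episodes
        action = np.argmax(q_table[state, :])
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    successes += reward          # reward is 1 only when the goal is reached
print(f"Greedy success rate: {successes / n_eval:.3f}")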
2.5 Visualization
Below is the code for a text-based visualization:
# Watch the trained agent play Frozen Lake
for episode in range(3):
    state, _ = env.reset()
    done = False
    print("**********Episode ", episode + 1, "**********\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        print(env.render())
        time.sleep(0.3)

        # Always act greedily w.r.t. the learned Q-table
        action = np.argmax(q_table[state, :])
        new_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        if done:
            clear_output(wait=True)
            print(env.render())
            if reward == 1:
                print("*****You got the frisbee!*****")
            else:
                print("*****You fell into the hole! XD*****")
            time.sleep(3)
            clear_output(wait=True)
            break

        state = new_state

env.close()
Output:
(Right)
SFFF
FHFH
FFFH
HFFG
*****You got the frisbee!*****
Summary
This was a quick refresher on Reinforcement Learning and a slightly deeper look at Q-Learning.
Keep going, keep learning!
References
[1] Reinforcement Learning - Developing Intelligent Agents
[2] Watch Q-learning Agent Play Game with Python - Reinforcement Learning Code Project
[3] Github - gym/gym/envs/toy_text/frozen_lake.py