Q-Learning in Python

17 Mar 2025 | 5 min read

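Reinforcement learning is a learning paradigm in which an agent improves its behavior over time by continually interacting with a specific environment. While learning, the agent encounters different situations in that environment, called states. In each state, the agent can choose among several permitted actions, each of which can lead to different rewards (or penalties). Over time, the agent learns to maximize these rewards so that it behaves as well as possible in whatever state it finds itself.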

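Q-Learning is a basic form of reinforcement learning that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.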

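Q-values, also called action values: Q-values are defined for pairs of states and actions. Q(S, A) is an estimate of how good it is to take action A in state S, that is, the expected discounted reward that follows from doing so. This estimate of Q(S, A) is computed iteratively with the TD update rule described below.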
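Episodes and rewards: Over its lifetime, an agent starts from an initial state and makes a series of transitions from its current state to a next state, depending on the actions it chooses and the environment it interacts with. At each step, the agent takes an action in its current state, receives a reward from the environment, and moves to a new state. If the agent reaches one of the terminating states, no further transitions are possible. This marks the end of an episode.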
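Temporal difference or TD update: The temporal difference (TD) update rule can be written as follows:
Q(S, A) ← Q(S, A) + α (R + γ Q(S', A') − Q(S, A))
This update rule is applied at every step of the agent's interaction with its environment. The terms used are explained below:
- S: the current state of the agent.
- A: the current action, chosen according to the current policy.
- S': the next state the agent ends up in.
- A': the best next action according to the current Q-value estimates, i.e., the action with the highest Q-value in the next state.
- R: the reward the agent receives from the environment for the current action.
- γ (> 0 and <= 1): the discount factor for future rewards. Future rewards are worth less than immediate ones, so they are discounted. Because a Q-value estimates the expected reward from a given state, the discount applies here as well.
- α: the step size (learning rate) used to update the estimate of Q(S, A).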
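
To make the update rule concrete, here is a minimal sketch of a single TD update in plain Python. The function name `td_update` and all the numbers are illustrative assumptions, not part of the gym example in this article; `max_q_next` stands for Q(S', A'), the highest Q-value in the next state.

```python
def td_update(q_sa, reward, max_q_next, alpha, gamma):
    """Return the new value of Q(S, A) after one Q-learning TD step."""
    td_target = reward + gamma * max_q_next  # R + gamma * Q(S', A')
    td_error = td_target - q_sa              # gap between target and old estimate
    return q_sa + alpha * td_error           # move a fraction alpha toward the target


# Illustrative values (assumed): old estimate 2.0, reward 1.0,
# best next-state value 4.0, step size 0.5, discount factor 0.9.
new_q = td_update(q_sa=2.0, reward=1.0, max_q_next=4.0, alpha=0.5, gamma=0.9)
print(round(new_q, 3))
```

Here the target is 1.0 + 0.9 * 4.0 = 4.6, the error is 4.6 - 2.0 = 2.6, and the estimate moves halfway toward the target, to about 3.3.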

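Selecting actions with an ϵ-greedy policy: An ϵ-greedy policy is a simple way to select actions from the current Q-value estimates. It works as follows:
- With probability (1 - ϵ), choose the action with the highest Q-value.
- With probability ϵ (usually a small value), choose an action uniformly at random.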
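
As a quick worked example of these two rules, the sketch below computes the resulting action probabilities for one state and samples an action from them. It uses only the standard library; the helper names `epsilon_greedy_probs` and `choose_action` and the Q-values are assumptions made for illustration.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return the probability of picking each action under epsilon-greedy."""
    n = len(q_values)
    probs = [epsilon / n] * n                        # random choice: equal share for all
    best = max(range(n), key=lambda a: q_values[a])  # greedy action (first one on ties)
    probs[best] += 1.0 - epsilon                     # greedy choice: remaining probability
    return probs


def choose_action(q_values, epsilon, rng=random):
    """Sample an action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


# Four actions with assumed Q-values; action 1 is the greedy one.
probs = epsilon_greedy_probs([0.1, 0.5, 0.2, 0.0], epsilon=0.1)
print([round(p, 3) for p in probs])
```

With ϵ = 0.1 and four actions, each action gets 0.1 / 4 = 0.025 from the random branch, and the greedy action gets an extra 0.9, so it is chosen with probability of about 0.925.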

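With the necessary background in place, let's work through an example. We'll build the Q-Learning algorithm on top of an environment from gym, the toolkit created by OpenAI.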

Installing gym

We can install gym with the following command:

pip install gym

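Before starting the example, we need some helper code to visualize how the algorithm progresses. Two helper files need to be downloaded into the working directory.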

Step 1: Import all the required libraries and modules.

import gym
import itertools
import matplotlib as MPLOT
import matplotlib.style as MPLOTS
import numpy as nmp
import pandas as pnd
import sys
from collections import defaultdict as DD

# Assumption: the plotting helper (one of the helper files mentioned above) is
# saved as plotting.py in the working directory; adjust the module name if not.
import plotting as PLOTT

Step 2: Instantiate the environment.

env = gym.make("FrozenLake-v1")
n_observations1 = env.observation_space.n
n_actions1 = env.action_space.n

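Step 3: Create the ϵ-greedy policy. (The Q-table itself is initialized to zeros inside the Q-Learning function in Step 4.)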

def createEpsilonGreedyPolicy1(Q1, epsilon1, num_actions1):
    """
    Create an epsilon-greedy policy based on a given
    Q-function and epsilon.

    Returns a function that takes a state as input and
    returns the probability of each action as a numpy
    array whose length is the size of the action space
    (the set of possible actions).
    """
    def policyFunction1(state):
        # Every action gets an equal share of the exploration probability.
        Action_probabilities1 = nmp.ones(num_actions1,
                dtype = float) * epsilon1 / num_actions1

        # The greedy action also gets the remaining (1 - epsilon) probability.
        best_action = nmp.argmax(Q1[state])
        Action_probabilities1[best_action] += (1.0 - epsilon1)
        return Action_probabilities1

    return policyFunction1

Step 4: Build the Q-Learning model.

def qLearning1(env, num_episodes1, discount_factor = 1.0,
               alpha = 0.6, epsilon1 = 0.1):
    """
    Q-Learning algorithm: off-policy TD control.
    Finds the optimal greedy policy while following
    (and improving) an epsilon-greedy policy.
    """

    # Action-value function: a nested dictionary that maps
    # state -> (action -> action-value). Unseen states start at zero.
    Q1 = DD(lambda: nmp.zeros(env.action_space.n))

    # Keeps track of useful statistics.
    # PLOTT refers to the helper module from the files mentioned earlier
    # (an assumption; it must be imported under this alias).
    stats = PLOTT.EpisodeStats(
        episode_lengths = nmp.zeros(num_episodes1),
        episode_rewards = nmp.zeros(num_episodes1))

    # Create an epsilon-greedy policy function suited to
    # the environment's action space.
    policy = createEpsilonGreedyPolicy1(Q1, epsilon1, env.action_space.n)

    # For every episode
    for Kth_episode in range(num_episodes1):

        # Reset the environment to get the initial state.
        # Note: this uses the classic gym API (gym < 0.26), where reset()
        # returns only the state and step() returns four values.
        state = env.reset()

        for J in itertools.count():

            # Get the probabilities of all actions from the current state.
            action_probabilities1 = policy(state)

            # Choose an action according to that probability distribution.
            action = nmp.random.choice(nmp.arange(
                    len(action_probabilities1)),
                    p = action_probabilities1)

            # Take the action, receive the reward,
            # and move to the next state.
            next_state, reward, done, _ = env.step(action)

            # Update statistics.
            stats.episode_rewards[Kth_episode] += reward
            stats.episode_lengths[Kth_episode] = J

            # TD update
            best_next_action = nmp.argmax(Q1[next_state])
            td_target = reward + discount_factor * Q1[next_state][best_next_action]
            td_delta = td_target - Q1[state][action]
            Q1[state][action] += alpha * td_delta

            # done is True once the episode has terminated.
            if done:
                break

            state = next_state

    return Q1, stats
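
To see the same learning loop run end to end without gym or the helper files, here is a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the five-state corridor environment, the `step` and `q_learning` functions, and the hyperparameters. It is not the environment or the code used in this article.

```python
import random
from collections import defaultdict

N_STATES = 5      # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: reward 1 on reaching the right end, else 0."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking among best actions."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def q_learning(num_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0, 0.0])  # Q-table, zero for unseen states
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            action = choose_action(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # TD update toward R + gamma * max over a of Q(S', a)
            td_target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
# Greedy action in each non-terminal state after training.
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

With these settings the learned greedy policy should be to move right in every state, since that is the shortest way to the only reward. The structure mirrors Step 4: reset, pick an ϵ-greedy action, step, apply the TD update, and repeat until the episode ends.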

Step 5: Train the model.

Step 6: Finally, plot the important statistics.

Output