Q-learning Algorithm
Introduction
Q-learning is a model-free, value-based, off-policy reinforcement learning algorithm that is used to find the optimal action-selection policy for any given finite Markov decision process (MDP).
- Model-free: Q-learning does not require a model of the environment, and it learns the optimal policy by interacting with the environment and observing the rewards.
- Value-based: Q-learning estimates the value of taking an action in a given state and uses these value estimates to select the best action in each state.
- Off-policy: Q-learning learns the optimal policy independently of the policy used to generate the data.
Q-learning Algorithm
[Figure: flowchart of the Q-learning algorithm]
The Q-learning algorithm is based on the Bellman equation for the optimal action-value function Q*(s, a), which represents the expected return when taking action a in state s and following the optimal policy thereafter. The Q-learning update rule is given by:
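Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') − Q(s, a) ]

where:
- α (alpha) is the learning rate,
- γ (gamma) is the discount factor,
- r is the reward received after taking action a in state s,
- s' is the next state, and max_a' Q(s', a') is the value of the best action available from s'.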
Q-Learning Example with FrozenLake
In this article, we explore a Q-learning example using the FrozenLake-v1 environment from the gymnasium library.
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
import pickle
"""
States: P, V, A, AV
Actions : 0, 1
There will be infinitely possible values -> chopped it up in ranges
"""
# Initialize the environment. The 'FrozenLake-v1' environment is a grid-based game where the agent must navigate from start to goal.
# 'is_slippery=True' makes the environment stochastic, adding randomness to the agent's actions.
env = gym.make('FrozenLake-v1', map_name="8x8", is_slippery=True, render_mode='human') # Slippery -> doesn't always honor the step action
# Initialize the Q-table to zeros. The shape is (number of states, number of actions).
q_table = np.zeros((env.observation_space.n, env.action_space.n)) # init a 64 x 4 array
# Set the learning rate (alpha), discount factor (gamma), and number of episodes.
alpha = 0.9
gamma = 0.9
episodes = 5  # Kept tiny so the rendered demo finishes quickly; meaningful learning needs far more episodes.
# Array to store the total rewards per episode.
rewards_per_episode = np.zeros(episodes)
# Epsilon is the probability of exploring (taking a random action). It decays over time to favor exploitation (using learned actions).
epsilon = 1
epsilon_decay_rate = 0.001 # Epsilon decreases linearly and reaches 0 after 1/0.001 = 1,000 episodes.
# Initialize a random number generator.
rng = np.random.default_rng()
# Training loop over the specified number of episodes.
for i in range(episodes):
    state = env.reset()[0]  # Reset the environment to the initial state.
    terminated = False  # Becomes true when the episode ends (goal reached or fell in a hole).
    truncated = False  # Becomes true when the episode is truncated (more than 200 steps).

    # Loop until the episode is finished.
    while not terminated and not truncated:
        # Epsilon-greedy action selection: either explore (random action) or exploit (best known action).
        if rng.random() < epsilon:
            action = env.action_space.sample()  # Random action: 0-left, 1-down, 2-right, 3-up
        else:
            action = np.argmax(q_table[state, :])

        # Take the action and observe the new state and reward.
        new_state, reward, terminated, truncated, _ = env.step(action)

        # Update the Q-value using the Q-learning formula.
        q_table[state, action] = q_table[state, action] + alpha * (reward + gamma * np.max(q_table[new_state, :]) - q_table[state, action])

        # Transition to the new state.
        state = new_state

    # Record whether the goal was reached in this episode (reward is 1 only on success).
    if reward == 1:
        rewards_per_episode[i] = 1

    # Decay epsilon to reduce the probability of random actions as learning progresses.
    epsilon = max(epsilon - epsilon_decay_rate, 0)
# Close the environment after training.
env.close()
# Calculate the cumulative sum of rewards for plotting.
sum_rewards = np.zeros(episodes)
for t in range(episodes):
    sum_rewards[t] = np.sum(rewards_per_episode[0:t+1])
# Plot the cumulative rewards over episodes.
plt.plot(sum_rewards)
plt.savefig('frozen_lake8x8.png')
# Save the Q-table to a file for later use.
with open('frozen_lake_q_table.pkl', 'wb') as f:
    pickle.dump(q_table, f)

print("Q Table saved")