Alienpenguin


REINFORCE Algorithm


Introduction

The REINFORCE algorithm, short for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility" (Williams, 1992), is a foundational method in deep reinforcement learning. REINFORCE is an on-policy algorithm that optimises the policy directly, rather than relying on value functions (as value-based methods like Q-learning do) to estimate the quality of states or actions. Recall:

  • on-policy: the training data is generated by the agent's current policy

Key characteristics

The REINFORCE algorithm has three components:

  1. A parameterised policy
  2. An objective function to be maximised
  3. A method for updating the policy parameters

Policy $\pi_{\theta}$

The policy is a function that maps states to action probabilities: it takes the state of the environment as input and outputs a probability distribution over all possible actions. A good policy should maximise the cumulative discounted reward. We will use a neural network to approximate the policy function, since neural networks are, at the end of the day, excellent general-purpose function approximators. We create a deep neural network with learnable parameters $\theta$ that represents the policy. The policy $\pi_{\theta}$ is parameterised by $\theta$, which means each set of possible values for $\theta$ represents a separate policy, e.g. $\pi_{\theta_{1}}, \pi_{\theta_{2}}, \pi_{\theta_{3}}, \dots$ A single neural network can therefore represent many policies.
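
As a concrete illustration, here is a minimal sketch of such a policy network in PyTorch; the two-layer architecture, hidden size, and CartPole-sized input/output dimensions are illustrative assumptions, not something specified in this post:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.net(state)
        # Softmax turns raw scores into a valid probability distribution pi_theta(a|s)
        return torch.softmax(logits, dim=-1)

# Different values of theta (the network weights) correspond to different policies
policy = PolicyNetwork(state_dim=4, n_actions=2)
```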

Objective function $J(\pi_{\theta})$

The objective function is the expected return over all complete trajectories generated by the agent:

$$J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}[R(\tau)]$$

A trajectory consists of the elements generated as the agent moves from one state to the next, $\tau = (s_{0}, a_{0}, r_{0}, \dots, s_{3}, a_{3}, r_{3})$. Also note that an episode is a trajectory that starts at the initial state of the task and ends at the terminal state, $\tau = (s_{0}, a_{0}, r_{0}, \dots, s_{T}, r_{T})$. The return of a trajectory is defined as the discounted sum of rewards from time step $t$ to the end of the trajectory:

$$R_{t}(\tau) = \sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$$
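
A small helper that computes these discounted returns from a list of rewards could look like the following sketch (assuming the reward-to-go form defined above):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t(tau) = sum_{t'=t}^{T} gamma^(t'-t) * r_t' for every time step t."""
    returns = []
    running = 0.0
    # Walk backwards so each step's return builds on the return of the next step
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```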

Let's calculate the expected return for the following game:

< - Add - >
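
The original worked example is not reproduced here, so as a purely illustrative stand-in, assume a three-step episode with rewards $r_{0} = 1$, $r_{1} = 0$, $r_{2} = 2$ and discount factor $\gamma = 0.9$. The return from the initial time step is

$$R_{0}(\tau) = 1 + 0.9 \cdot 0 + 0.9^{2} \cdot 2 = 1 + 0 + 1.62 = 2.62$$

and the expected return $J(\pi_{\theta})$ is the average of such returns over many episodes played with the current policy.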

Policy Gradient $\nabla_{\theta}J(\pi_{\theta})$

The goal of a policy gradient algorithm is to positively reinforce actions that lead to good outcomes, and to do the opposite for actions that lead to bad outcomes. This is done by performing gradient ascent on the policy parameters $\theta$. Using the policy and objective function defined above, the policy gradient update is:

$$R_{t}(\tau) = \sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$$
$$\nabla_{\theta}J(\pi_{\theta}) = \nabla_{\theta}E_{\tau \sim \pi_{\theta}}[R(\tau)] = E_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T}R_{t}(\tau)\,\nabla_{\theta}\log\pi_{\theta}(a_{t} \mid s_{t})\right]$$
$$\theta \leftarrow \theta + \alpha\nabla_{\theta}J(\pi_{\theta})$$

If $R_{t}(\tau) > 0$, the probability $\pi_{\theta}(a_{t} \mid s_{t})$ is increased; if $R_{t}(\tau) < 0$, it is decreased. Here $\pi_{\theta}(a_{t} \mid s_{t})$ is the probability of the action $a_{t}$ taken by the agent at time step $t$.
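
In practice, deep learning libraries minimise a loss rather than maximise an objective, so the gradient ascent step is usually implemented by descending on the negative of the term inside the expectation. A minimal sketch of that surrogate loss in PyTorch (the function and variable names are my own, not from the original post):

```python
import torch

# log_probs: log pi_theta(a_t | s_t) for each step of one episode, shape (T,)
# returns:   R_t(tau) for each step, shape (T,)
def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # Minimising this loss is equivalent to gradient ascent on J(pi_theta)
    return -(log_probs * returns).sum()
```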

Many policy gradient algorithms have been proposed in recent years, and there is no way to cover them all here. In this and future posts we will explore two fundamental ones: REINFORCE and Actor-Critic (a combined value and policy algorithm).

Monte Carlo Sampling

Monte Carlo sampling is the name for any method that uses random sampling to generate data for approximating a function. Here we use Monte Carlo sampling to estimate the gradient of the expected return with respect to the policy parameters: we sample trajectories with the current policy and average over them. As more trajectories $\tau$ are sampled using policy $\pi_{\theta}$ and averaged, the estimate approaches the actual policy gradient $\nabla_{\theta}J(\pi_{\theta})$.
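
Concretely, with $N$ sampled trajectories the gradient is approximated by the sample average

$$\nabla_{\theta}J(\pi_{\theta}) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T_{i}} R_{t}(\tau_{i})\,\nabla_{\theta}\log\pi_{\theta}(a_{t}^{i} \mid s_{t}^{i})$$

In the simplest form of REINFORCE, $N = 1$: the policy is updated after every single episode.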

REINFORCE Algorithm

At a high level, one iteration of REINFORCE proceeds as follows:

  1. Observe the state of the environment
    1. $a \sim \pi(s)$ : an action is sampled from the action probabilities output by the policy network
  2. Take an action and observe the reward
  3. Continue taking actions until the episode ends
    1. Also keep track of log probabilities of the actions and rewards observed
  4. Update the policy
    1. loss ← negative of the log probabilities scaled by the returns (so that minimising the loss performs gradient ascent on the objective)
    2. back-propagate through this to update the policy
  5. Repeat the process

REINFORCE example

< - Add - >
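
The original example is not included in this post, so the following is only a minimal sketch of REINFORCE on Gymnasium's CartPole-v1 with PyTorch; the environment choice, network size, learning rate, and number of episodes are illustrative assumptions rather than the post's own settings:

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

gamma = 0.99
env = gym.make("CartPole-v1")

# Policy network: state -> action probabilities (pi_theta)
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False

    # Steps 1-3: act with the current policy until the episode ends,
    # keeping the log probabilities of the chosen actions and the rewards
    while not done:
        probs = torch.softmax(policy(torch.as_tensor(state, dtype=torch.float32)), dim=-1)
        dist = Categorical(probs)
        action = dist.sample()                      # a ~ pi(s)
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Step 4: compute the discounted returns-to-go R_t(tau)
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    returns = torch.as_tensor(returns)

    # Loss: negative log probabilities scaled by returns (gradient ascent on J)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (episode + 1) % 50 == 0:
        print(f"episode {episode + 1}, total reward {sum(rewards):.0f}")
```

With settings like these, CartPole typically improves noticeably within a few hundred episodes, though vanilla REINFORCE updates after a single episode and therefore has high variance from run to run.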