Alienpenguin


REINFORCE Algorithm


Introduction

The REINFORCE algorithm, short for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility" (Williams, 1992), is a foundational method in deep reinforcement learning. REINFORCE is an on-policy algorithm that optimises the policy directly, rather than relying on value functions (as value-based methods like Q-learning do) to estimate the quality of states or actions. Recall:

  • on-policy: the training data is generated by the agent's current policy

Key characteristics

The REINFORCE algorithm has three components:

  1. A parameterised policy
  2. An objective function to be maximised
  3. A method for updating the policy parameters

Policy $\pi_{\theta}$

The policy is a function that maps states to action probabilities: it takes the state of the environment as input and outputs a probability distribution over all possible actions. A good policy should maximise the cumulative discounted reward. We will use a neural network to approximate the policy function, since neural networks are, at the end of the day, excellent general-purpose function approximators. We create a deep neural network with learnable parameters $\theta$ that represents the policy. The policy $\pi_{\theta}$ is parameterised by $\theta$, which means each set of possible values for $\theta$ represents a separate policy, e.g. $\pi_{\theta_{1}}, \pi_{\theta_{2}}, \pi_{\theta_{3}}, \dots$ A single neural network can therefore represent many policies.
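
As a concrete illustration, here is a minimal sketch of such a policy network in PyTorch; the two-layer architecture, hidden size, and CartPole-sized input/output dimensions are illustrative assumptions, not something specified in this post:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.net(state)
        # Softmax turns raw scores into a valid probability distribution pi_theta(a|s)
        return torch.softmax(logits, dim=-1)

# Different values of theta (the network weights) correspond to different policies
policy = PolicyNetwork(state_dim=4, n_actions=2)
```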

Objective function $J(\pi_{\theta})$

The objective function is the expected return over all complete trajectories generated by the agent:

$$J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}[R(\tau)]$$

A trajectory consists of the elements generated as the agent moves from one state to the next, $\tau = (s_{0}, a_{0}, r_{0}, \dots, s_{3}, a_{3}, r_{3})$. Also note that an episode is a trajectory that starts at the initial state of the task and ends at the terminal state, $\tau = (s_{0}, a_{0}, r_{0}, \dots, s_{T}, r_{T})$. The return of a trajectory is defined as the discounted sum of rewards from time step $t$ to the end of the trajectory:

$$R_{t}(\tau) = \sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$$
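
A small helper that computes these discounted returns from a list of rewards could look like the following sketch (assuming the reward-to-go form defined above):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t(tau) = sum_{t'=t}^{T} gamma^(t'-t) * r_t' for every time step t."""
    returns = []
    running = 0.0
    # Walk backwards so each step's return builds on the return of the next step
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```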

Let's calculate the expected return for the following game:

< - Add - >
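
The original worked example is not reproduced here, so as a purely illustrative stand-in, assume a three-step episode with rewards $r_{0} = 1$, $r_{1} = 0$, $r_{2} = 2$ and discount factor $\gamma = 0.9$. The return from the initial time step is

$$R_{0}(\tau) = 1 + 0.9 \cdot 0 + 0.9^{2} \cdot 2 = 1 + 0 + 1.62 = 2.62$$

and the expected return $J(\pi_{\theta})$ is the average of such returns over many episodes played with the current policy.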

Policy Gradient $\nabla_{\theta}J(\pi_{\theta})$

The goal of a policy gradient algorithm is to positively reinforce actions that lead to good outcomes, and to do the opposite for actions that lead to bad outcomes. This is done by performing gradient ascent on the policy parameters $\theta$. Using the policy and objective function defined above, the policy gradient update is:

$$R_{t}(\tau) = \sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$$
$$\nabla_{\theta}J(\pi_{\theta}) = \nabla_{\theta}E_{\tau \sim \pi_{\theta}}[R(\tau)] = E_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T}R_{t}(\tau)\,\nabla_{\theta}\log\pi_{\theta}(a_{t} \mid s_{t})\right]$$
$$\theta \leftarrow \theta + \alpha\nabla_{\theta}J(\pi_{\theta})$$

If $R_{t}(\tau) > 0$, the probability $\pi_{\theta}(a_{t} \mid s_{t})$ is increased; if $R_{t}(\tau) < 0$, it is decreased. Here $\pi_{\theta}(a_{t} \mid s_{t})$ is the probability of the action $a_{t}$ taken by the agent at time step $t$.
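
In practice, deep learning libraries minimise a loss rather than maximise an objective, so the gradient ascent step is usually implemented by descending on the negative of the term inside the expectation. A minimal sketch of that surrogate loss in PyTorch (the function and variable names are my own, not from the original post):

```python
import torch

# log_probs: log pi_theta(a_t | s_t) for each step of one episode, shape (T,)
# returns:   R_t(tau) for each step, shape (T,)
def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # Minimising this loss is equivalent to gradient ascent on J(pi_theta)
    return -(log_probs * returns).sum()
```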

Many policy gradient algorithms have been proposed in recent years, and there is no way to cover them all here. In this and future posts we will explore two fundamental ones: REINFORCE and Actor-Critic (a combined value and policy algorithm).

Monte Carlo Sampling

Monte Carlo sampling is the name for any method that uses random sampling to generate data for approximating a function. Here we use Monte Carlo sampling to estimate the gradient of the expected return with respect to the policy parameters: we sample trajectories with the current policy and average over them. As more trajectories $\tau$ are sampled using policy $\pi_{\theta}$ and averaged, the estimate approaches the actual policy gradient $\nabla_{\theta}J(\pi_{\theta})$.
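
Concretely, with $N$ sampled trajectories the gradient is approximated by the sample average

$$\nabla_{\theta}J(\pi_{\theta}) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T_{i}} R_{t}(\tau_{i})\,\nabla_{\theta}\log\pi_{\theta}(a_{t}^{i} \mid s_{t}^{i})$$

In the simplest form of REINFORCE, $N = 1$: the policy is updated after every single episode.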

REINFORCE Algorithm

At a high level, one iteration of REINFORCE proceeds as follows:

  1. Observe the state of the environment
    1. $a \sim \pi(s)$ : an action is sampled from the action probabilities output by the policy network
  2. Take an action and observe the reward
  3. Continue taking actions until the episode ends
    1. Also keep track of log probabilities of the actions and rewards observed
  4. Update the policy
    1. loss ← negative of the log probabilities scaled by the returns (so that minimising the loss performs gradient ascent on the objective)
    2. back-propagate through this to update the policy
  5. Repeat the process

REINFORCE example

< - Add - >
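
The original example is not included in this post, so the following is only a minimal sketch of REINFORCE on Gymnasium's CartPole-v1 with PyTorch; the environment choice, network size, learning rate, and number of episodes are illustrative assumptions rather than the post's own settings:

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

gamma = 0.99
env = gym.make("CartPole-v1")

# Policy network: state -> action probabilities (pi_theta)
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False

    # Steps 1-3: act with the current policy until the episode ends,
    # keeping the log probabilities of the chosen actions and the rewards
    while not done:
        probs = torch.softmax(policy(torch.as_tensor(state, dtype=torch.float32)), dim=-1)
        dist = Categorical(probs)
        action = dist.sample()                      # a ~ pi(s)
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Step 4: compute the discounted returns-to-go R_t(tau)
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    returns = torch.as_tensor(returns)

    # Loss: negative log probabilities scaled by returns (gradient ascent on J)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (episode + 1) % 50 == 0:
        print(f"episode {episode + 1}, total reward {sum(rewards):.0f}")
```

With settings like these, CartPole typically improves noticeably within a few hundred episodes, though vanilla REINFORCE updates after a single episode and therefore has high variance from run to run.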