REINFORCE Algorithm
Introduction
The REINFORCE algorithm, whose name stands for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility", is a foundational method in deep reinforcement learning. It is an on-policy algorithm that optimises the policy directly, rather than relying on value functions to estimate the quality of states or actions the way value-based methods such as Q-learning do. Recall:
- on-policy: the update depends on data generated by the current policy
Key characteristics
The REINFORCE algorithm has three components:
- A parameterised policy
- An objective function to be maximised
- A method for updating the policy parameters
Policy
A policy is a function that maps states to action probabilities: it takes the state of the environment as input and outputs a probability distribution over all possible actions. A good policy should maximise the cumulative discounted reward. We will use a neural network to approximate the policy function, since neural networks are powerful, general-purpose function approximators. Concretely, we build a deep neural network with learnable parameters $\theta$ that represents the policy $\pi_\theta$. Each set of values of $\theta$ corresponds to a different policy, e.g. $\pi_{\theta_1}$ and $\pi_{\theta_2}$ are distinct policies. A single neural network can therefore represent many policies.
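As a concrete illustration, here is a minimal sketch of such a policy network in PyTorch, assuming a discrete action space; the layer sizes and hidden dimension are illustrative choices rather than part of the algorithm.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over actions, i.e. pi_theta."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        # theta = the learnable weights of these layers
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.net(state)
        # Softmax turns raw scores into a valid probability distribution over actions
        return torch.softmax(logits, dim=-1)
```

Sampling an action then amounts to drawing from this distribution, for example with `torch.distributions.Categorical`.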
Objective function
The objective function is the expected return over all complete trajectories generated by the agent:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$
A trajectory consists of the elements generated as the agent moves from one state to the next, $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$. An episode is a trajectory that starts from the initial state of the task and ends at the terminal state. The return of a trajectory is the discounted sum of rewards from time step $t$ to the end of the trajectory:

$$R_t(\tau) = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}$$

The objective above uses the return of the full trajectory, $R(\tau) = R_0(\tau)$.
Let's calculate the return for an example game, worked through below with assumed values.
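Suppose, purely for illustration, an episode lasts three steps with rewards $r_0 = 1$, $r_1 = 0$, $r_2 = 2$ and the discount factor is $\gamma = 0.9$. The return from the start of the episode is then

$$R_0(\tau) = \gamma^{0} r_0 + \gamma^{1} r_1 + \gamma^{2} r_2 = 1 + 0.9 \cdot 0 + 0.81 \cdot 2 = 2.62$$

The expected return $J(\theta)$ is the average of such values over many episodes played with the policy $\pi_\theta$.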
Policy Gradient
The goal of the policy gradient algorithm is to positively reinforce actions that lead to good outcomes, and to do the opposite for actions that lead to bad outcomes. This is done by performing gradient ascent on the policy parameters $\theta$:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Using the policy and objective function defined above, the policy gradient can be derived as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} R_t(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$

Here $\pi_\theta(a_t \mid s_t)$ is the probability of the action taken by the agent at time step $t$. When $R_t(\tau) > 0$, the probability $\pi_\theta(a_t \mid s_t)$ is increased; when $R_t(\tau) < 0$, it is decreased.
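With automatic differentiation we do not compute this gradient by hand. Instead we build a surrogate loss whose gradient is the negated policy gradient, so that minimising the loss performs gradient ascent on $J(\theta)$. A minimal sketch, assuming `log_probs` holds $\log \pi_\theta(a_t \mid s_t)$ for one episode and `returns` holds the corresponding $R_t(\tau)$:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # Returns are treated as constants (no gradient flows through them);
    # the minus sign converts gradient ascent on J into loss minimisation.
    return -(log_probs * returns.detach()).sum()
```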
Many policy gradient algorithms have been proposed in recent years, far too many to cover here. In this and future posts we will explore two fundamental ones: REINFORCE and Actor-Critic (a combined value and policy algorithm).
Monte Carlo Sampling
Monte Carlo sampling refers to any method that uses random sampling to generate data for approximating a quantity. Here we use Monte Carlo sampling to estimate the policy gradient, i.e. the gradient of the expected return with respect to the policy parameters: trajectories are sampled with the current policy and averaged. As more trajectories are sampled using the policy $\pi_\theta$, the average approaches the true policy gradient.
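Concretely, the expectation in the policy gradient is approximated by an average over $N$ sampled trajectories:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T_n} R_t(\tau_n)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

In the simplest form of REINFORCE, $N = 1$: each parameter update uses a single sampled trajectory.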
REINFORCE Algorithm
One cycle of the algorithm proceeds as follows:
- Observe the state of the environment
- Sample an action from the action probabilities output by the policy network, $a_t \sim \pi_\theta(\cdot \mid s_t)$
- Take an action and observe the reward
- Continue taking actions until the episode ends
- Also keep track of log probabilities of the actions and rewards observed
- Update the policy
- loss ← negative sum of the log probabilities, each scaled by its return (negated because optimisers minimise, while we want to maximise the objective)
- back-propagate through this loss to update the policy parameters
- Repeat the process
REINFORCE example
Below is a minimal, self-contained sketch of REINFORCE in PyTorch. The environment (Gymnasium's CartPole-v1), network sizes, and hyperparameters are illustrative assumptions rather than a definitive implementation.
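```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

gamma = 0.99          # discount factor (assumed value)
learning_rate = 1e-2  # assumed value
num_episodes = 500    # assumed value

env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]  # 4 for CartPole
action_dim = env.action_space.n             # 2 for CartPole

# Same structure as the policy network sketched in the Policy section above
policy = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, action_dim),
    nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=learning_rate)

for episode in range(num_episodes):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False

    # 1. Generate one trajectory with the current policy (on-policy)
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)
        action = dist.sample()                  # a_t ~ pi_theta(. | s_t)
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # 2. Compute the discounted return R_t for every time step (backwards pass)
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # 3. Policy gradient update: maximise sum_t R_t * log pi_theta(a_t | s_t)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A common refinement is to normalise the returns before computing the loss, which reduces the variance of the gradient estimate.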