DQN Algorithm
Introduction
DQN is a value-based temporal difference algorithm that approximates the Q-function. It is an off-policy algorithm, which means that any policy can be used to generate training data for DQN, unlike SARSA, which learns the Q-function of the current policy only. Because DQN learns the optimal Q-function, its update considers all actions available in the next state rather than only the action the behaviour policy happened to take, which in turn improves the stability and speed of learning. Note that DQN only works with discrete action spaces. There are also cases where DQN won't learn the optimal Q-function:
- The hypothesis space covered by the neural network might not contain the optimal Q-function.
- The optimisation problem is non-convex.
- There are practical time constraints on how long we can train.

We also need to be aware of two components of DQN that are not in SARSA:
- Boltzmann policy: good data-gathering policy
- Experience replay
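To make this concrete, below is a minimal sketch of a Q-network and the off-policy TD target that DQN regresses towards. It assumes PyTorch; the architecture, function names, and hyperparameters are illustrative rather than taken from this post.

```python
import torch
import torch.nn as nn

# A small Q-network: maps a state to one Q-value per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def dqn_target(q_net, reward, next_state, done, gamma=0.99):
    """Off-policy TD target: r + gamma * max_a' Q(s', a').

    The max over next actions is what lets DQN learn the optimal
    Q-function regardless of which policy gathered the data
    (SARSA would instead use the action the behaviour policy took).
    """
    with torch.no_grad():
        next_q = q_net(next_state).max(dim=-1).values
    return reward + gamma * (1.0 - done) * next_q
```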
Action selection
There are three components to the REINFORCE algorithm:
- A parameterised policy
- Objective function to be maximised
- A method for updating the policy parameters
Boltzmann policy
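The Boltzmann policy is not spelled out in the text above, so the snippet below is a sketch of the usual formulation: actions are sampled from a softmax over the Q-values with a temperature parameter (here called `tau`, an assumed name), so higher-value actions are chosen more often while exploration is still possible.

```python
import torch

def boltzmann_action(q_values: torch.Tensor, tau: float = 1.0) -> int:
    """Sample an action with probability proportional to exp(Q(s, a) / tau).

    High tau -> closer to uniform exploration; low tau -> closer to greedy.
    """
    probs = torch.softmax(q_values / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```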
Experience replay
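Experience replay is likewise not described in the surrounding text; the following is a minimal sketch with illustrative class and method names. Transitions are stored as the agent acts, and random mini-batches are sampled later for updates, which breaks the correlation between consecutive samples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```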
The objective function is the expected return over all complete trajectories generated by an agent.
< - Add - >
A trajectory of a game is the sequence of elements generated as the agent moves from one state to another, τ = (s_0, a_0, r_0, s_1, a_1, r_1, …). Also note that an episode is a trajectory that starts from the initial state of the task and ends at the terminal state, τ = (s_0, a_0, r_0, …, s_T). The return of a trajectory is defined as the discounted sum of rewards from time step t to the end of the trajectory.
< - Add - >
Let's calculate the expected return for the following game:
< - Add - >
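Since the worked example above is not reproduced here, the sketch below only shows the computation in general, using made-up rewards and an illustrative discount factor:

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum over t' >= t of gamma^(t' - t) * r_t', computed here for t = 0."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Example rewards for a short episode (illustrative values only).
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```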
Policy Gradient
The goal of a policy gradient algorithm is to positively reinforce the actions that lead to good outcomes and to discourage the actions that lead to bad ones. This is done by performing gradient ascent on the policy parameters θ. We will use the policy and objective function defined above to derive the policy gradient update:
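The update being derived is presumably the standard score-function (REINFORCE) gradient estimator, written here in LaTeX as a sketch:

```latex
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} R_t(\tau)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]
```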
R_t(τ) > 0 means π_θ(a_t | s_t) is increased; R_t(τ) < 0 means π_θ(a_t | s_t) is decreased; and π_θ(a_t | s_t) means the probability of the action taken by the agent at time step t.
Many policy gradient algorithms have been proposed in recent years, and there is no way to cover them all. In this and future posts we will explore two fundamental policy gradient algorithms: REINFORCE and Actor-Critic (a combined value and policy algorithm).
DQN Algorithm
< - Add - >
- Observe the state of the environment
- Select an action by sampling from the Boltzmann distribution over the Q-values output by the network
- Take the action and observe the reward and the next state
- Store the transition (state, action, reward, next state, done) in the experience replay buffer
- Update the Q-network (a code sketch of this update follows the list):
- sample a random mini-batch of transitions from the replay buffer
- loss ← mean squared error between Q(s, a) and the TD target r + γ max_a' Q(s', a')
- back-propagate through this loss to update the network parameters
- Repeat the process
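Putting the pieces together, here is a minimal sketch of a single DQN update step in PyTorch. It reuses the hypothetical QNetwork and ReplayBuffer helpers sketched earlier in this post; the batch size, discount factor, and the use of one network for both prediction and target are illustrative choices, not prescriptions from the original text.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One gradient step on a mini-batch sampled from the replay buffer."""
    if len(buffer) < batch_size:
        return None  # not enough experience gathered yet

    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s, a) for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Off-policy TD target: r + gamma * max_a' Q(s', a').
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
    target = rewards + gamma * (1.0 - dones) * q_next

    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```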
DQN example
< - Add - >