Alienpenguin

Published

- 3 min read

DQN Algorithm


Introduction

DQN is a value-based temporal-difference algorithm that approximates the Q-function. It is an off-policy algorithm, which means that any policy can be used to generate training data for DQN, unlike SARSA, which learns the Q-function only for the current policy. Because DQN learns the optimal Q-function, all potential actions available in a state can be used for learning, which in turn improves the stability and speed of learning. Note that it only works in discrete action spaces. There are also some cases where DQN won't learn the optimal Q-function:

  • The hypothesis space covered by the neural network might not contain the optimal Q-function.
  • The optimisation problem is non-convex.
  • There are time constraints and limits on how long we can train.

We also need to be aware of two components of DQN that are not present in SARSA:

  • A Boltzmann policy: a good data-gathering policy
  • Experience replay
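
For reference, the temporal-difference target that DQN regresses its Q-value estimates towards has the standard one-step form (with discount factor $\gamma$):

$y = r + \gamma \max_{a'} Q(s', a'; \theta)$

The $\max$ over next actions is what lets DQN learn about the optimal policy regardless of which policy gathered the data.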

Action selection

There are three components to the REINFORCE algorithm:

  1. A parameterised policy
  2. An objective function to be maximised
  3. A method for updating the policy parameters

Boltzmann policy
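
A minimal sketch of Boltzmann action selection, which samples actions with probability proportional to $\exp(Q(s,a)/\tau)$; the temperature $\tau$ and the example Q-values below are assumptions for illustration:

```python
# Boltzmann (softmax) action selection over Q-values.
# The temperature tau and the example Q-values are assumed for illustration.
import torch

def boltzmann_action(q_values: torch.Tensor, tau: float = 1.0) -> int:
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    return torch.distributions.Categorical(logits=q_values / tau).sample().item()

# Lower tau -> closer to greedy; higher tau -> closer to uniform exploration.
action = boltzmann_action(torch.tensor([1.0, 2.0, 0.5]), tau=0.5)
```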

Experience replay
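
A minimal sketch of an experience replay buffer; the capacity and batch size are assumed values. Storing past transitions and sampling random batches from them breaks the correlation between consecutive experiences and lets each transition be reused for multiple updates:

```python
# A minimal experience replay buffer; capacity and batch size are assumed values.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        """Return a random batch of stored transitions, breaking temporal correlation."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```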

The objective function is the expected return over all complete trajectories generated by the agent.

$J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}[R(\tau)]$

The trajectory of a game consists of the elements generated as the agent moves from one state to another, $\tau = s_{0}, a_{0}, r_{0}, \dots, s_{3}, a_{3}, r_{3}$. Also note that an episode is a trajectory that starts at the initial state of the task and ends at the terminal state, $\tau = s_{0}, a_{0}, r_{0}, \dots, s_{T}, r_{T}$. The return of a trajectory is defined as the discounted sum of rewards from time step $t$ to the end of the trajectory.

$R_{t}(\tau) = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$

Let's calculate the expected return for the following game:

< - Add - >
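
For instance, suppose a short episode yields rewards $r_{0} = 1$, $r_{1} = 0$, $r_{2} = 2$ with discount factor $\gamma = 0.9$ (the numbers are assumed for illustration). The return from each time step is:

$R_{0}(\tau) = 1 + 0.9 \cdot 0 + 0.9^{2} \cdot 2 = 2.62$
$R_{1}(\tau) = 0 + 0.9 \cdot 2 = 1.8$
$R_{2}(\tau) = 2$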

Policy Gradient $\nabla_{\theta}J(\pi_{\theta})$

The goal of a policy gradient algorithm is to positively reinforce the actions that result in good outcomes, and to do the opposite for actions with bad outcomes. This is done by performing gradient ascent on the policy parameters $\theta$. We will use the policy and objective function defined above to derive the policy gradient algorithm:

$R_{t}(\tau) = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$
$\nabla_{\theta}J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} R_{t}(\tau)\, \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\right]$
$\theta \leftarrow \theta + \alpha \nabla_{\theta}J(\pi_{\theta})$

If $R_{t}(\tau) > 0$, then $\pi_{\theta}(a_{t} \mid s_{t})$ is increased; if $R_{t}(\tau) < 0$, then $\pi_{\theta}(a_{t} \mid s_{t})$ is decreased. Here $\pi_{\theta}(a_{t} \mid s_{t})$ is the probability of the action $a_{t}$ taken by the agent at time step $t$.

Many policy gradient algorithms have been proposed in recent years, and there is no way for us to cover them all. In this and future posts we will explore two fundamental policy gradient algorithms: REINFORCE and Actor-Critic (a combined value and policy algorithm).

REINFORCE Algorithm

< - Add - >

  1. Observe the state of the environment
    1. $a \sim \pi(s)$: the action is sampled from the action probabilities output by the policy network
  2. Take the action and observe the reward
  3. Continue taking actions until the episode ends
    1. Also keep track of the log probabilities of the actions taken and the rewards observed
  4. Update the policy
    1. loss ← log probabilities scaled by the returns
    2. Back-propagate through this loss to update the policy
  5. Repeat the process
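
A minimal sketch of these steps in PyTorch on a Gymnasium environment; the network architecture, hyperparameters, and choice of CartPole-v1 are assumptions for illustration:

```python
# REINFORCE sketch (illustrative; architecture, hyperparameters and env are assumed).
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(                                  # parameterised policy pi_theta
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:                                      # steps 1-3: act until the episode ends
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                           # a ~ pi_theta(. | s)
        log_probs.append(dist.log_prob(action))          # keep track of log pi_theta(a_t | s_t)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # step 4: discounted returns R_t and the policy-gradient loss
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    # minimising -sum_t R_t * log pi_theta(a_t | s_t) performs gradient ascent on J
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()                                      # back-propagate to update theta
    optimizer.step()
```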

DQN example

< - Add - >
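
As a stand-in sketch, here is a compact DQN loop that combines the pieces above: a Q-network, Boltzmann action selection, and experience replay. The architecture, hyperparameters, and environment are assumptions, and the separate target network used in standard DQN is omitted for brevity:

```python
# DQN sketch (illustrative; architecture, hyperparameters and env are assumed;
# the target network used in standard DQN is left out to keep the sketch short).
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
n_obs, n_act = env.observation_space.shape[0], env.action_space.n
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                    # experience replay buffer
gamma, tau, batch_size = 0.99, 1.0, 64           # tau is the Boltzmann temperature

def act(obs):
    """Boltzmann policy: sample an action with probability softmax(Q(s, .) / tau)."""
    with torch.no_grad():
        q = q_net(torch.as_tensor(obs, dtype=torch.float32))
    return torch.distributions.Categorical(logits=q / tau).sample().item()

for episode in range(300):
    obs, _ = env.reset()
    done = False
    while not done:
        action = act(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay.append((obs, action, reward, next_obs, float(done)))
        obs = next_obs

        if len(replay) >= batch_size:            # learn from a random batch of past data
            s, a, r, s2, d = zip(*random.sample(replay, batch_size))
            s = torch.as_tensor(np.array(s), dtype=torch.float32)
            s2 = torch.as_tensor(np.array(s2), dtype=torch.float32)
            a = torch.as_tensor(a)
            r = torch.as_tensor(r, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)

            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():                # TD target: r + gamma * max_a' Q(s', a')
                target = r + gamma * (1 - d) * q_net(s2).max(dim=1).values
            loss = nn.functional.mse_loss(q_sa, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```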