Alienpenguin

SARSA Algorithm

Introduction

SARSA is a value-based, on-policy algorithm. Recall the difference between policy-based and value-based algorithms. Policy-based algorithms build a representation of a policy $\pi : s \rightarrow a$. Value-based methods evaluate state-action pairs $(s, a)$ by learning one of the following value functions:

  • $V^{\pi}(s)$, used by the Actor-Critic algorithm
  • $Q^{\pi}(s,a)$, used by SARSA and DQN

On-policy algorithms are those where the information used to improve the current policy depends only on the policy used to gather the data. There are two main ideas behind SARSA:

  1. Learning the Q-function, known as Temporal Difference (TD) learning
  2. Generating actions using the Q-function

Q vs V function

Q function $Q^{\pi}(s,a)$

Measures the value of state-action pairs $(s, a)$ under a particular policy $\pi$. It stores a one-step lookahead value for every action $a$ in every state $s$: the expected cumulative discounted reward from taking action $a$ in state $s$ and then continuing to act under policy $\pi$. The value given by the Q-function can be used as a quantitative score for each action. For example, in chess this can be used to decide on the best move (action) to make in a particular position (state). The advantage is that no one-step lookahead is needed at decision time. The disadvantage is that we need data covering all $(s, a)$ pairs, which means more data is needed to learn a good Q-function estimate.
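
As a toy illustration (the state names, actions, and values below are made up, not from the post), acting from a learned Q-function is just an argmax over the stored values for the current state; no environment model or extra lookahead is needed:

```python
# Hypothetical Q-values for a single state "s0" with two legal actions.
Q = {
    ("s0", "left"):  0.2,
    ("s0", "right"): 1.5,
}

def act_from_q(state, legal_actions):
    # Pick the action with the highest estimated value in this state.
    return max(legal_actions, key=lambda a: Q[(state, a)])

print(act_from_q("s0", ["left", "right"]))  # -> "right"
```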

V function $V^{\pi}(s)$

Measures the value of the state $s$ under a particular policy $\pi$. For example, in chess this can be used to quantify the intuition of how good or bad a board position is. For each of the next states $s'$ that can be reached by choosing a legal action $a$, we calculate $V^{\pi}(s')$, and then select the action that leads to the best $s'$. The problem with this method is that it is time-consuming and relies on knowing the transition function. If we were to use $V^{\pi}(s)$ to choose actions without such a model, the agent would need to take each of the actions $a \in A_{s}$ in state $s$ and observe the next state $s'$ and reward $r$ in order to estimate $V^{\pi}(s')$. If the environment is stochastic, the agent needs to repeat this process many times to get a good estimate of the expected value of taking a particular action. However, an advantage of the V function is that, for the same amount of data, the V-function estimate will likely be better than the Q-function estimate, since it needs less data to learn.
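
For contrast, here is a sketch of choosing actions from $V^{\pi}(s')$ via a one-step lookahead. It assumes a known, deterministic transition and reward model; all names and numbers are hypothetical:

```python
gamma = 0.99

# Hypothetical one-step model: (state, action) -> next state / reward.
transitions = {("s0", "left"): "s1", ("s0", "right"): "s2"}
rewards = {("s0", "left"): 0.0, ("s0", "right"): 1.0}
V = {"s1": 0.2, "s2": 0.5}  # learned state values V^pi(s')

def act_from_v(state, legal_actions):
    # Evaluate r + gamma * V(s') for each legal action and pick the best one.
    def lookahead(a):
        s_next = transitions[(state, a)]
        return rewards[(state, a)] + gamma * V[s_next]
    return max(legal_actions, key=lookahead)

print(act_from_v("s0", ["left", "right"]))  # -> "right"
```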

Evaluating Actions: Temporal Difference Learning

Temporal difference (TD) learning is an iterative method by which value-based RL algorithms learn $V^{\pi}(s)$ or $Q^{\pi}(s,a)$.
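
Concretely, the standard SARSA TD update (the "SARSA Bellman equation" referenced in the algorithm below) nudges the current estimate toward a one-step bootstrapped target, where $\alpha$ is the learning rate, $\gamma$ the discount factor, and $a'$ the action actually taken in $s'$ under the current policy:

$$Q^{\pi}(s, a) \leftarrow Q^{\pi}(s, a) + \alpha \left[ r + \gamma \, Q^{\pi}(s', a') - Q^{\pi}(s, a) \right]$$

Because $a'$ comes from the same policy that is being improved, the update is on-policy.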

SARSA Algorithm

  1. Randomly initialise a neural network with parameters $\theta$ to represent the Q-function $Q^{\pi}(s, a; \theta)$.
  2. Repeat:
    1. Use $Q^{\pi}(s, a; \theta)$ to act greedily in the environment.
    2. Store all experienced $(s, a, r, s', a')$ tuples.
    3. Use the stored experiences to update $Q^{\pi}(s, a; \theta)$ using the SARSA Bellman equation.
  3. Until convergence: the agent stops improving, i.e. the total undiscounted cumulative reward received during an episode stops changing.
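
To make the loop concrete, here is a minimal sketch in Python. It is not a full implementation: it assumes a Gymnasium-style environment interface (`reset()` returning `(obs, info)` and `step()` returning `(obs, reward, terminated, truncated, info)`), replaces the neural network with a simple linear Q-function with parameters `theta`, and uses an ε-greedy policy rather than a purely greedy one so that the agent keeps exploring:

```python
import numpy as np

def make_q(theta):
    # Q(s, a; theta) = theta[a] . s  -- one weight vector per action.
    return lambda s, a: float(theta[a] @ s)

def epsilon_greedy(q, s, n_actions, eps=0.1):
    # Mostly greedy with respect to the current Q, with occasional random exploration.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax([q(s, a) for a in range(n_actions)]))

def sarsa(env, state_dim, n_actions, episodes=500, alpha=0.01, gamma=0.99):
    theta = np.random.randn(n_actions, state_dim) * 0.01  # step 1: random initialisation
    q = make_q(theta)  # the closure sees in-place updates to theta
    for _ in range(episodes):  # step 2: repeat until convergence
        s, _ = env.reset()
        a = epsilon_greedy(q, s, n_actions)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)  # step 2.1: act in the environment
            done = terminated or truncated
            a_next = epsilon_greedy(q, s_next, n_actions)
            # Steps 2.2 and 2.3: use the (s, a, r, s', a') experience for the SARSA update.
            target = r if done else r + gamma * q(s_next, a_next)
            td_error = target - q(s, a)
            theta[a] += alpha * td_error * np.asarray(s)  # semi-gradient step: dQ/dtheta[a] = s
            s, a = s_next, a_next
    return theta
```

Because the action $a'$ used in the update is the one actually taken in $s'$, the update remains on-policy.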

SARSA Example