Alienpenguin

SARSA Algorithm

Introduction

SARSA is a value-based, on-policy algorithm. Recall the difference between policy-based and value-based algorithms. Policy-based algorithms build a representation of a policy $\pi : s \rightarrow a$. Value-based methods evaluate state-action pairs $(s, a)$ by learning one of the following value functions:

  • $V^{\pi}(s)$, used by the Actor-Critic algorithm
  • $Q^{\pi}(s,a)$, used by SARSA and DQN

On-policy algorithms are those where the information used to improve the current policy depends only on the policy used to gather the data. There are two main ideas behind SARSA:

  1. Learning the Q-function, known as Temporal Difference (TD) learning
  2. Generating actions using the Q-function

Q vs V function

Q function $Q^{\pi}(s,a)$

Measures the value of state-action pairs $(s, a)$ under a particular policy $\pi$. It stores a one-step lookahead value for every action $a$ in every state $s$: the expected cumulative discounted reward from taking action $a$ in state $s$ and then continuing to act under policy $\pi$. The value given by the Q-function can be used as a quantitative score for each action. For example, in chess this can be used to decide on the best move (action) to make in a particular position (state). The advantage is that no one-step lookahead is needed at decision time. The disadvantage is that we need data covering all $(s, a)$ pairs, which means more data is needed to learn a good Q-function estimate.
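
As a toy illustration (the state names, actions, and values below are made up, not from the post), acting from a learned Q-function is just an argmax over the stored values for the current state; no environment model or extra lookahead is needed:

```python
# Hypothetical Q-values for a single state "s0" with two legal actions.
Q = {
    ("s0", "left"):  0.2,
    ("s0", "right"): 1.5,
}

def act_from_q(state, legal_actions):
    # Pick the action with the highest estimated value in this state.
    return max(legal_actions, key=lambda a: Q[(state, a)])

print(act_from_q("s0", ["left", "right"]))  # -> "right"
```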

V function $V^{\pi}(s)$

Measures the value of the state $s$ under a particular policy $\pi$. For example, in chess this can be used to quantify the intuition of how good or bad a board position is. For each of the next states $s'$ that can be reached by choosing a legal action $a$, we calculate $V^{\pi}(s')$, and then select the action that leads to the best $s'$. The problem with this method is that it is time-consuming and relies on knowing the transition function. If we were to use $V^{\pi}(s)$ to choose actions without such a model, the agent would need to take each of the actions $a \in A_{s}$ in state $s$ and observe the next state $s'$ and reward $r$ in order to estimate $V^{\pi}(s')$. If the environment is stochastic, the agent needs to repeat this process many times to get a good estimate of the expected value of taking a particular action. However, an advantage of the V function is that, for the same amount of data, the V-function estimate will likely be better than the Q-function estimate, since it needs less data to learn.
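
For contrast, here is a sketch of choosing actions from $V^{\pi}(s')$ via a one-step lookahead. It assumes a known, deterministic transition and reward model; all names and numbers are hypothetical:

```python
gamma = 0.99

# Hypothetical one-step model: (state, action) -> next state / reward.
transitions = {("s0", "left"): "s1", ("s0", "right"): "s2"}
rewards = {("s0", "left"): 0.0, ("s0", "right"): 1.0}
V = {"s1": 0.2, "s2": 0.5}  # learned state values V^pi(s')

def act_from_v(state, legal_actions):
    # Evaluate r + gamma * V(s') for each legal action and pick the best one.
    def lookahead(a):
        s_next = transitions[(state, a)]
        return rewards[(state, a)] + gamma * V[s_next]
    return max(legal_actions, key=lookahead)

print(act_from_v("s0", ["left", "right"]))  # -> "right"
```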

Evaluating Actions: Temporal Difference Learning

Temporal difference (TD) learning is an iterative method by which value-based RL algorithms learn $V^{\pi}(s)$ or $Q^{\pi}(s,a)$.
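
Concretely, the standard SARSA TD update (the "SARSA Bellman equation" referenced in the algorithm below) nudges the current estimate toward a one-step bootstrapped target, where $\alpha$ is the learning rate, $\gamma$ the discount factor, and $a'$ the action actually taken in $s'$ under the current policy:

$$Q^{\pi}(s, a) \leftarrow Q^{\pi}(s, a) + \alpha \left[ r + \gamma \, Q^{\pi}(s', a') - Q^{\pi}(s, a) \right]$$

Because $a'$ comes from the same policy that is being improved, the update is on-policy.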

SARSA Algorithm

  1. Randomly initialise a neural network with parameters $\theta$ to represent the Q-function $Q^{\pi}(s, a; \theta)$.
  2. Repeat:
    1. Use $Q^{\pi}(s, a; \theta)$ to act greedily in the environment.
    2. Store all experienced $(s, a, r, s', a')$ tuples.
    3. Use the stored experiences to update $Q^{\pi}(s, a; \theta)$ using the SARSA Bellman equation.
  3. Until convergence: the agent stops improving, i.e. the total undiscounted cumulative reward received during an episode stops changing.
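
To make the loop concrete, here is a minimal sketch in Python. It is not a full implementation: it assumes a Gymnasium-style environment interface (`reset()` returning `(obs, info)` and `step()` returning `(obs, reward, terminated, truncated, info)`), replaces the neural network with a simple linear Q-function with parameters `theta`, and uses an ε-greedy policy rather than a purely greedy one so that the agent keeps exploring:

```python
import numpy as np

def make_q(theta):
    # Q(s, a; theta) = theta[a] . s  -- one weight vector per action.
    return lambda s, a: float(theta[a] @ s)

def epsilon_greedy(q, s, n_actions, eps=0.1):
    # Mostly greedy with respect to the current Q, with occasional random exploration.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax([q(s, a) for a in range(n_actions)]))

def sarsa(env, state_dim, n_actions, episodes=500, alpha=0.01, gamma=0.99):
    theta = np.random.randn(n_actions, state_dim) * 0.01  # step 1: random initialisation
    q = make_q(theta)  # the closure sees in-place updates to theta
    for _ in range(episodes):  # step 2: repeat until convergence
        s, _ = env.reset()
        a = epsilon_greedy(q, s, n_actions)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)  # step 2.1: act in the environment
            done = terminated or truncated
            a_next = epsilon_greedy(q, s_next, n_actions)
            # Steps 2.2 and 2.3: use the (s, a, r, s', a') experience for the SARSA update.
            target = r if done else r + gamma * q(s_next, a_next)
            td_error = target - q(s, a)
            theta[a] += alpha * td_error * np.asarray(s)  # semi-gradient step: dQ/dtheta[a] = s
            s, a = s_next, a_next
    return theta
```

Because the action $a'$ used in the update is the one actually taken in $s'$, the update remains on-policy.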

SARSA Example