SARSA Algorithm
Introduction
SARSA is a value-based, on-policy algorithm. Recall the difference between policy-based and value-based algorithms. Policy-based algorithms build a representation of a policy $\pi(a \mid s)$. Value-based methods evaluate state-action pairs $(s, a)$ by learning one of the following value functions:
- $V^\pi(s)$ - used by the Actor-Critic algorithm
- $Q^\pi(s, a)$ - used by SARSA and DQN

Also, an on-policy algorithm is one where the information used to improve the current policy depends only on the policy used to gather the data. Two main ideas of SARSA:
- Learning the Q-function via a technique known as Temporal Difference (TD) learning
- Generating actions using the Q-function
Q vs V function
Q function
$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t r_t \;\middle|\; s_0 = s, a_0 = a\right]$$
Measures the value of state-action pairs (s, a) under a particular policy $\pi$. It stores a one-step lookahead value for every action a in every state s: the expected cumulative discounted reward from taking action a in state s and then continuing to act under the policy $\pi$. The value given by the Q-function can be used as a quantitative score for each action. For example, in chess this can be used to decide on the best move (action) to make in a particular position (state). The advantage is that the agent does not need to perform the one-step lookahead itself. However, the disadvantage is that we need data covering all (s, a) pairs, which means more data is required to learn a good Q-function estimate.
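As an illustration (not from the original text), here is a minimal Python sketch of a tabular Q-function and greedy action selection; the state and action names are hypothetical placeholders.

```python
# Minimal sketch: a tabular Q-function stored as a dictionary keyed by
# (state, action), and greedy action selection that needs no lookahead
# or model of the environment.
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)] -> estimated return, defaults to 0.0

def greedy_action(state, actions):
    """Pick the action with the highest Q-value in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])

# With Q-values already learned for a (hypothetical) state "s0",
# choosing an action is a simple lookup and comparison.
Q[("s0", "left")] = 1.2
Q[("s0", "right")] = 0.7
print(greedy_action("s0", ["left", "right"]))  # -> left
```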
V function
$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t r_t \;\middle|\; s_0 = s\right]$$
Measures the value of the state s under a particular policy $\pi$. For example, in chess this can be used to quantify the intuition of how good or bad a board position is. For each of the next states s', that can be reached from s by choosing a legal action a, we calculate $V^\pi(s')$. We then select the action leading to the next state with the best $V^\pi(s')$. The problem with this method is that it is time-consuming and relies on knowing the transition function. If we were to use $V^\pi(s)$ to choose actions, the agent would need to take each of the actions a available in state s and observe the next state s' and reward r in order to estimate $V^\pi(s')$. If the environment is stochastic, the agent needs to repeat this process many times to get a good estimate of the expected value of taking a particular action. However, an advantage of the V-function is that, for the same amount of data, its estimate will likely be better than a Q-function estimate, since it only needs to cover states rather than all (s, a) pairs.
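For contrast, here is a minimal sketch of choosing an action with a V-function via a one-step lookahead; the deterministic `transition` and `reward` functions are hypothetical stand-ins for a known model, which the Q-function approach above does not need.

```python
# Minimal sketch: acting with a V-function requires a one-step lookahead
# through a known (here deterministic and hypothetical) model.
GAMMA = 0.99

V = {"s1": 2.0, "s2": 5.0}                # learned state values (made up)

def transition(state, action):            # known model: next state
    return {"left": "s1", "right": "s2"}[action]

def reward(state, action):                # known model: reward
    return -1.0

def lookahead_action(state, actions):
    """Pick the action maximizing r(s, a) + gamma * V(s')."""
    return max(actions, key=lambda a: reward(state, a) + GAMMA * V[transition(state, a)])

print(lookahead_action("s0", ["left", "right"]))  # -> right
```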
Evaluating Actions: Temporal Difference Learning
Temporal difference (TD) learning is the iterative method that value-based RL algorithms use to learn a value function from experience. In SARSA, after observing a transition $(s, a, r, s', a')$, the estimate $Q^\pi(s, a)$ is moved towards the TD target $r + \gamma Q^\pi(s', a')$, built from the observed reward and the estimated value of the next state-action pair.
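As a minimal illustration (tabular, with assumed hyperparameters `alpha` and `gamma`), a single SARSA TD update could look like this:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA TD update: move Q(s, a) towards the target r + gamma * Q(s', a')."""
    td_target = r + gamma * Q[(s_next, a_next)]   # bootstrapped one-step target
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])  # step towards the target

Q = defaultdict(float)
sarsa_update(Q, s=0, a=1, r=-1.0, s_next=2, a_next=0)
print(Q[(0, 1)])  # -> -0.1
```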
SARSA Algorithm
$$Q^\pi_{\text{tar}}(s, a) = r + \gamma\, Q^\pi(s', a')$$
- Randomly initialise a neural network with parameters $\theta$ to represent the Q-function $Q_\theta$.
- Repeat:
- Use $Q_\theta$ to act $\varepsilon$-greedily in the environment.
- Store all experiences.
- Use the stored experiences to update $Q_\theta$ using the SARSA Bellman equation.
- Until convergence: the agent stops improving, i.e. the total undiscounted cumulative reward received during an episode stops changing.
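To make these steps concrete, here is a minimal sketch of the loop using a tabular Q-function rather than a neural network, assuming the `gymnasium` package and its `CliffWalking-v0` environment are available; the hyperparameters and episode counts are illustrative choices, not values from this post.

```python
# Sketch of the SARSA loop above, with a tabular Q instead of a neural network.
import random
from collections import defaultdict

import gymnasium as gym  # assumed dependency

GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1

env = gym.make("CliffWalking-v0")
n_actions = env.action_space.n
Q = defaultdict(float)  # Q[(state, action)] -> estimated return

def epsilon_greedy(state):
    """Act epsilon-greedily with respect to the current Q estimate."""
    if random.random() < EPSILON:
        return env.action_space.sample()                       # explore
    return max(range(n_actions), key=lambda a: Q[(state, a)])  # exploit

for episode in range(500):
    state, _ = env.reset()
    action = epsilon_greedy(state)
    experiences, done, steps = [], False, 0

    # Act in the environment and store the (s, a, r, s', a') experiences.
    while not done and steps < 500:
        next_state, reward, terminated, truncated, _ = env.step(action)
        next_action = epsilon_greedy(next_state)
        experiences.append((state, action, reward, next_state, next_action, terminated))
        state, action = next_state, next_action
        done = terminated or truncated
        steps += 1

    # Update Q towards the SARSA target r + gamma * Q(s', a').
    for s, a, r, s2, a2, terminal in experiences:
        target = r if terminal else r + GAMMA * Q[(s2, a2)]
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    if episode % 50 == 0:
        print(f"episode {episode}: undiscounted return = {sum(e[2] for e in experiences)}")
```

Because SARSA is on-policy, the experiences used in each update were generated by the same epsilon-greedy policy derived from the current Q estimate, which is why they are collected and consumed within the same iteration.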