Reinforcement Learning

If we are trying to control a robot or an autonomous helicopter, supervised learning doesn't work well, because we don't have labeled examples of the right action to take in every situation. Reinforcement learning is the way to do it.

We tell the agent what to do rather than how to do it, and we use the reward function to signal how well it is doing, like training a puppy.

Mars Rover Example

  • If we assign a reward value to each box (state), how do we get the rover to reach state 1?
  • The terminal states are 1 and 6; that's where the road ends for now.
  • It could go left, left, … or right, left, left, right, … in any combination.
  • s = state
  • a = action (left or right)
  • R = reward
  • The reward R(s) is associated with the state s the rover is in (see the sketch after this list).
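
As a concrete anchor, here is a minimal sketch of this setup in Python; the reward values are taken from the lab code at the end of these notes (100 in state 1, 40 in state 6, 0 everywhere else), and the names are just illustrative.

# Mars rover example: 6 states in a row, numbered 1..6
STATES = [1, 2, 3, 4, 5, 6]
ACTIONS = ["left", "right"]
TERMINAL_STATES = {1, 6}

# R(s): reward associated with each state (values taken from the lab code below)
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}

def step(s, a):
    """Deterministic transition: the next state s' after taking action a in state s."""
    if s in TERMINAL_STATES:
        return s                      # the episode has ended
    return s - 1 if a == "left" else s + 1

print(step(4, "left"))   # 3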

Return

  • So we have to evaluate the total reward (the return) for the steps we take.
  • If we set the discount factor to, say, 0.9, each step's reward is multiplied by the discount factor raised to the number of steps taken so far (the worked example below uses \(\gamma = 0.5\)).
  • So the discounting shrinks the value of a reward the further in the future we receive it, as shown in the sketch below.
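
In symbols, the return is \(R_1 + \gamma R_2 + \gamma^2 R_3 + \dots\) until a terminal state is reached. Here is a minimal sketch of that sum in Python, using the \(\gamma = 0.5\) from the worked example later in these notes; the function name is just illustrative.

def compute_return(rewards, gamma):
    """Discounted return for a sequence of rewards R1, R2, R3, ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Rewards seen when starting in state 4 and always going left: 0, 0, 0, then 100
print(compute_return([0, 0, 0, 100], gamma=0.5))   # 12.5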

  • Let's say we will always go to the left: if we start in each of the states, the return can be calculated and placed in the corresponding box (see the sketch below for the numbers).
  • If we reverse the movement (always go to the right), the returns for each starting state change accordingly.

  • If we mix our movement between left and right, we get a different return for each state/box.
  • For example, if we start in box 5 and go right, we end up with a return of 0.5 × 40 = 20.
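
The per-state returns described above can be reproduced directly, assuming the terminal rewards of 100 and 40 and \(\gamma = 0.5\) from the lab code at the end of these notes (the helper name is just illustrative):

def return_from(s, policy, rewards, gamma=0.5):
    """Return collected when starting in state s and following a fixed left/right policy."""
    total, discount = 0.0, 1.0
    while True:
        total += discount * rewards[s]
        if s in (1, 6):                     # terminal states: the episode ends here
            return total
        discount *= gamma
        s = s - 1 if policy(s) == "left" else s + 1

rewards = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
print([return_from(s, lambda s: "left", rewards) for s in range(1, 7)])
# [100.0, 50.0, 25.0, 12.5, 6.25, 40.0]  -> always-left return from each starting state
print(return_from(5, lambda s: "right", rewards))   # 20.0, as noted above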

Policy

  • We could always go to the nearest reward,
  • always go to the larger reward,
  • or always go left….
  • More generally, a policy \(\pi\) is a function that takes a state s and maps it to an action a = \(\pi(s)\); the goal is to find a policy that maximizes the return (a small sketch follows this list).
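
For instance, the "go left from state 4, go right from state 5" idea discussed in the State Action Value section below can be written as a simple lookup; this is just a sketch with illustrative names:

# A policy is a mapping from state s to action a = pi(s)
pi = {2: "left", 3: "left", 4: "left", 5: "right"}   # one possible policy for the non-terminal states

def policy_action(s):
    return pi[s]

print(policy_action(4))   # 'left'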

MDP


Markov Decision Process - the future depends only on the current state, not on how you got there.

State-Action Value Function Q(s, a)


  • Q(s, a) = the return if you start in state s, take action a once, and then follow the optimal policy after that.
  • For example, suppose the good policy is to go left from state 4 and right from state 5 (if we think that's the optimal policy/solution).

  • If we clean up the picture, each box shows two numbers: Q(s, left) on the left and Q(s, right) on the right, so we can see what the best move would be in each state.
  • For example, if you look at state 4, going left gives us a return of 12.5,
  • and going right gives us 10.
  • So if we are at state 4, the optimal move would be left.
  • Once you have Q(s, a) for every state, then whenever you arrive in a state you have one optimal move: the action with the largest Q(s, a) (see the sketch after this list).
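
With a Q table in hand, picking the optimal move is just an argmax over the two actions. A minimal sketch, using the state-4 numbers quoted above (12.5 and 10); the remaining entries follow from the same setup with \(\gamma = 0.5\):

# Q(s, a) for the non-terminal states, computed with gamma = 0.5
Q = {
    2: {"left": 50.0,  "right": 12.5},
    3: {"left": 25.0,  "right": 6.25},
    4: {"left": 12.5,  "right": 10.0},
    5: {"left": 6.25,  "right": 20.0},
}

def best_action(s):
    """Optimal move in state s: the action with the largest Q(s, a)."""
    return max(Q[s], key=Q[s].get)

print(best_action(4))   # 'left'  (12.5 > 10)
print(best_action(5))   # 'right' (20 > 6.25)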

Bellman Equation


How do we compute the Q(s, a) values? We use the Bellman equation: \(Q(s,a) = R(s) + \gamma \max_{a'} Q(s',a')\), where s' is the state you land in after taking action a in state s.

  • Remember that R(s) is the reward in that state.
  • Let's set the discount factor \(\gamma = 0.5\).

Example

  • Note that in a terminal state we get Q(s, a) = R(s), the immediate reward, because there is no next state.
  • The second part of the equation, \(\gamma \max_{a'} Q(s',a')\), is the discounted return from behaving optimally starting from the next state s' (see the sketch below).
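
To make this concrete, here is a small sketch that computes every Q(s, a) for the rover by repeatedly applying the Bellman equation (value iteration), using the same rewards and \(\gamma = 0.5\) as the lab code below; the variable names are illustrative.

# Bellman equation: Q(s, a) = R(s) + gamma * max_a' Q(s', a'); in terminal states Q(s, a) = R(s)
rewards = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
gamma = 0.5
terminal = {1, 6}

Q = {s: {"left": 0.0, "right": 0.0} for s in rewards}
for _ in range(100):                              # sweep until the values settle
    for s in rewards:
        for a in ("left", "right"):
            if s in terminal:
                Q[s][a] = rewards[s]              # terminal state: just the immediate reward
            else:
                s_next = s - 1 if a == "left" else s + 1
                Q[s][a] = rewards[s] + gamma * max(Q[s_next].values())

print(Q[4])   # {'left': 12.5, 'right': 10.0} -> matches the state-4 values above

The lab code below produces the same Q values and the optimal policy using the generate_visualization helper from utils.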

import numpy as np
from utils import *   # local helper module that provides generate_visualization

# Mars rover example: 6 states, 2 actions (left, right)
num_states = 6
num_actions = 2

# Rewards: 100 in the left terminal state, 40 in the right terminal state, 0 per step elsewhere
terminal_left_reward = 100
terminal_right_reward = 40
each_step_reward = 0

# Discount factor
gamma = 0.5

# Probability of going in the wrong direction
misstep_prob = 0

# Compute and display the optimal policy and Q(s, a) values for this setup
generate_visualization(terminal_left_reward, terminal_right_reward, each_step_reward, gamma, misstep_prob)