Deep $Q$-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use $Q$-learning to train an agent to play a game called Cart-Pole. In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

Cart-Pole

We can simulate this game using OpenAI Gym. First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import gym
import numpy as np

# Create the Cart-Pole game environment
env = gym.make('CartPole-v1')

# Number of possible actions
print('Number of possible actions:', env.action_space.n)
Number of possible actions: 2

[2018-01-22 23:10:02,350] Making new env: CartPole-v1


We interact with the simulation through env. You can see how many actions are possible from env.action_space.n, and to get a random action you can use env.action_space.sample(). Passing in an action as an integer to env.step will generate the next step in the simulation. This is general to all Gym games.

In the Cart-Pole game, there are two possible actions: pushing the cart to the left or to the right, encoded as 0 and 1 respectively.

Run the code below to interact with the environment.

In [2]:
actions = [] # actions that the agent selects
rewards = [] # obtained rewards
state = env.reset()

while True:
    action = env.action_space.sample()  # choose a random action
    state, reward, done, _ = env.step(action) 
    rewards.append(reward)
    actions.append(action)
    if done:
        break

We can look at the actions and rewards:

In [3]:
print('Actions:', actions)
print('Rewards:', rewards)
Actions: [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
Rewards: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


An episode ends once the pole has fallen past a certain angle. For each step while the game is running, the environment returns a reward of 1.0, so the longer the game runs, the more reward we get. Our network's goal is therefore to maximize the reward by keeping the pole vertical, which it does by moving the cart to the left and the right.

$Q$-Network

To keep track of the action values, we'll use a neural network that accepts a state $s$ as input. The output will be $Q$-values for each available action $a$ (i.e., the output is all action values $Q(s,a)$ corresponding to the input state $s$).

For this Cart-Pole game, the state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. Thus, the neural network has four inputs, one for each value in the state, and two outputs, one for each possible action.

As explored in the lesson, to get the training target, we'll first use the context provided by the state $s$ to choose an action $a$, then simulate the game using that action. This will get us the next state, $s'$, and the reward $r$. With that, we can calculate $\hat{Q}(s,a) = r + \gamma \max_{a'}{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
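
As a quick numerical illustration of the target above, here is a minimal sketch for a single transition. The reward, discount, and $Q$-values are made-up numbers chosen for the example, not values from a trained network.

```python
import numpy as np

gamma = 0.99                     # discount factor (hypothetical value for this example)
reward = 1.0                     # reward r observed after taking action a in state s
q_next = np.array([2.0, 3.5])    # hypothetical Q(s', a') for the two actions
q_current = 4.0                  # hypothetical current estimate Q(s, a)

# Target: r + gamma * max_a' Q(s', a')
target = reward + gamma * np.max(q_next)

# Squared error that the weight update tries to reduce
loss = (target - q_current) ** 2

print('Target:', target)
print('Loss:', loss)
```

Only the larger of the two next-state values enters the target, because the update assumes the agent acts greedily from $s'$ onward.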

Below is one implementation of the $Q$-network. It uses two fully connected layers with ReLU activations. Two hidden layers seem to be good enough, though three might work better. Feel free to try it out.

In [4]:
import tensorflow as tf

class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, 
                 action_size=2, hidden_size=10, 
                 name='QNetwork'):
        # state inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')
            
            # One hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)
            
            # Target Q values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')
            
            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size, 
                                                            activation_fn=None)
            
            ### Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. This next line chooses
            # one value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)
            
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)
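
The one-hot trick used to pick out $Q(s,a)$ for the action that was actually taken can be easier to follow in plain NumPy. The sketch below mirrors the multiply-and-sum step with hypothetical network outputs for a batch of three states; it is only an illustration and does not touch the TensorFlow graph.

```python
import numpy as np

# Hypothetical network outputs for a batch of 3 states, 2 actions each
output = np.array([[0.5, 1.5],
                   [2.0, 0.1],
                   [0.3, 0.7]])
actions = np.array([1, 0, 1])    # action taken in each state

# One-hot encode the actions: row i has a 1 in column actions[i]
one_hot_actions = np.eye(output.shape[1])[actions]

# Multiply and sum over the action axis, keeping one Q-value per row
q_selected = np.sum(output * one_hot_actions, axis=1)

# The same selection written with integer indexing
q_indexed = output[np.arange(len(actions)), actions]

print('Selected Q-values:', q_selected)
```

Because every entry of the mask except one is zero, the loss only depends on the output for the chosen action, so gradients flow through that output alone.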

Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on.

Here, we'll create a Memory object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a Memory object. If you're unfamiliar with deque, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [5]:
from collections import deque

class Memory():
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
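
To see the eviction behavior of deque described above, here is a small standalone sketch with a deliberately tiny capacity. It also shows that drawing more items than are stored, without replacement, raises an error, which is one reason to fill the memory before sampling from it.

```python
from collections import deque
import numpy as np

buffer = deque(maxlen=3)         # tiny capacity so the eviction is easy to see
for experience in range(5):
    buffer.append(experience)    # once full, each append drops the oldest item

contents = list(buffer)
print('Buffer contents:', contents)

# Sampling without replacement needs at least `size` stored items
try:
    np.random.choice(np.arange(len(buffer)), size=5, replace=False)
    sample_failed = False
except ValueError:
    sample_failed = True
print('Oversized sample failed:', sample_failed)
```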

$Q$-Learning training algorithm

We will use the algorithm below to train the network. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

  • Initialize the memory $D$
  • Initialize the action-value network $Q$ with random weights
  • For episode $\leftarrow 1$ to $M$ do
    • Observe $s_0$
    • For $t \leftarrow 0$ to $T-1$ do
      • With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
      • Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
      • Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
      • Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
      • Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
      • Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
    • endfor
  • endfor
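
Two steps of the loop above, the $\epsilon$-greedy action choice and the mini-batch targets, can be sketched in NumPy as follows. This is an illustration under assumptions: the helper names epsilon_greedy and batch_targets and the explicit dones array are hypothetical, and the $Q$-values are made-up numbers rather than network outputs.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, used only for this illustration

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def batch_targets(rewards, next_q, dones, gamma):
    """r_j for transitions that end an episode, r_j + gamma * max_a' Q(s'_j, a') otherwise."""
    max_next_q = np.max(next_q, axis=1)
    return rewards + gamma * max_next_q * (1.0 - dones)

# Hypothetical mini-batch of three transitions; the last one ends its episode
rewards = np.array([1.0, 1.0, 1.0])
next_q = np.array([[0.2, 0.8],
                   [1.5, 0.5],
                   [3.0, 4.0]])
dones = np.array([0.0, 0.0, 1.0])

targets = batch_targets(rewards, next_q, dones, gamma=0.99)
greedy_action = epsilon_greedy(np.array([0.1, 0.9]), epsilon=0.0, rng=rng)

print('Targets:', targets)
print('Greedy action:', greedy_action)
```

Multiplying by (1 - dones) removes the bootstrapped term for terminal transitions, so their target is just the reward.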

You are welcome (and encouraged!) to take the time to extend this code to implement some of the improvements that we discussed in the lesson, including fixed $Q$ targets, double DQNs, prioritized replay, and/or dueling networks.
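
As a starting point for one of those extensions, here is a rough sketch of how a double DQN target can be computed. It is not part of this notebook's implementation: next_q_online, next_q_target, and dones are hypothetical arrays standing in for the outputs of an online network, the outputs of a separate target network, and episode-end flags.

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma):
    """Choose the next action with the online network, evaluate it with the target network."""
    best_actions = np.argmax(next_q_online, axis=1)
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * evaluated * (1.0 - dones)

# Hypothetical values for a mini-batch of two transitions; the second ends its episode
rewards = np.array([1.0, 1.0])
next_q_online = np.array([[0.2, 0.8],
                          [1.5, 0.5]])
next_q_target = np.array([[0.6, 0.4],
                          [1.0, 2.0]])
dones = np.array([0.0, 1.0])

targets = double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.9)
print('Double DQN targets:', targets)
```

Separating action selection from action evaluation is meant to reduce the overestimation that comes from taking a max over noisy $Q$-values.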

Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [6]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory

In [7]:
tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)
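
The exploration parameters set above feed an exponentially decaying schedule in the training loop. To get a feel for how quickly exploration fades, the sketch below evaluates that schedule at a few step counts. The helper name explore_probability is hypothetical; the three constants are copied from the hyperparameter cell.

```python
import numpy as np

explore_start = 1.0              # exploration probability at start
explore_stop = 0.01              # minimum exploration probability
decay_rate = 0.0001              # exponential decay rate for exploration prob

def explore_probability(step):
    """Exploration probability after a given number of total training steps."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * step)

for step in (0, 1000, 10000, 50000):
    print('Step {:>6}: explore P = {:.4f}'.format(step, explore_probability(step)))
```

With this decay rate, the gap between the current probability and explore_stop halves roughly every 6,900 steps (ln 2 / decay_rate), so the agent still explores heavily during the first few hundred short episodes.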

Populate the experience memory

Here we re-initialize the simulation and pre-populate the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [8]:
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

Training

Below we'll train our agent.

In [9]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    
    step = 0
    loss = float('nan')  # placeholder in case an episode ends before the first training step
    for ep in range(1, train_episodes + 1):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            total_reward += reward
            
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)
            
            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                                feed_dict={mainQN.inputs_: states,
                                           mainQN.targetQs_: targets,
                                           mainQN.actions_: actions})
        
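    # Make sure the checkpoint directory exists before saving
    import os
    os.makedirs("checkpoints", exist_ok=True)
    saver.save(sess, "checkpoints/cartpole.ckpt")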
Episode: 1 Total reward: 6.0 Training loss: 1.1387 Explore P: 0.9994
Episode: 2 Total reward: 11.0 Training loss: 1.2013 Explore P: 0.9983
Episode: 3 Total reward: 19.0 Training loss: 1.2373 Explore P: 0.9964
Episode: 4 Total reward: 15.0 Training loss: 1.2343 Explore P: 0.9950
Episode: 5 Total reward: 25.0 Training loss: 1.2207 Explore P: 0.9925
Episode: 6 Total reward: 12.0 Training loss: 1.3273 Explore P: 0.9913
Episode: 7 Total reward: 15.0 Training loss: 1.2084 Explore P: 0.9899
Episode: 8 Total reward: 15.0 Training loss: 1.2850 Explore P: 0.9884
Episode: 9 Total reward: 27.0 Training loss: 1.2649 Explore P: 0.9857
Episode: 10 Total reward: 14.0 Training loss: 1.1658 Explore P: 0.9844
Episode: 11 Total reward: 15.0 Training loss: 1.2982 Explore P: 0.9829
Episode: 12 Total reward: 49.0 Training loss: 1.2688 Explore P: 0.9782
Episode: 13 Total reward: 27.0 Training loss: 1.5220 Explore P: 0.9756
Episode: 14 Total reward: 14.0 Training loss: 1.4908 Explore P: 0.9742
Episode: 15 Total reward: 8.0 Training loss: 1.5186 Explore P: 0.9734
Episode: 16 Total reward: 24.0 Training loss: 1.6087 Explore P: 0.9711
Episode: 17 Total reward: 27.0 Training loss: 1.8511 Explore P: 0.9685
Episode: 18 Total reward: 19.0 Training loss: 1.9817 Explore P: 0.9667
Episode: 19 Total reward: 10.0 Training loss: 1.7264 Explore P: 0.9658
Episode: 20 Total reward: 17.0 Training loss: 2.5957 Explore P: 0.9641
Episode: 21 Total reward: 22.0 Training loss: 2.1989 Explore P: 0.9620
Episode: 22 Total reward: 27.0 Training loss: 2.3368 Explore P: 0.9595
Episode: 23 Total reward: 32.0 Training loss: 2.6216 Explore P: 0.9564
Episode: 24 Total reward: 33.0 Training loss: 3.9113 Explore P: 0.9533
Episode: 25 Total reward: 16.0 Training loss: 4.2263 Explore P: 0.9518
Episode: 26 Total reward: 14.0 Training loss: 2.0959 Explore P: 0.9505
Episode: 27 Total reward: 12.0 Training loss: 2.8947 Explore P: 0.9494
Episode: 28 Total reward: 20.0 Training loss: 4.0879 Explore P: 0.9475
Episode: 29 Total reward: 10.0 Training loss: 8.9183 Explore P: 0.9466
Episode: 30 Total reward: 67.0 Training loss: 2.8243 Explore P: 0.9403
Episode: 31 Total reward: 14.0 Training loss: 4.5533 Explore P: 0.9390
Episode: 32 Total reward: 15.0 Training loss: 2.7835 Explore P: 0.9376
Episode: 33 Total reward: 9.0 Training loss: 2.8971 Explore P: 0.9368
Episode: 34 Total reward: 16.0 Training loss: 4.5465 Explore P: 0.9353
Episode: 35 Total reward: 24.0 Training loss: 3.9791 Explore P: 0.9331
Episode: 36 Total reward: 11.0 Training loss: 3.3499 Explore P: 0.9321
Episode: 37 Total reward: 16.0 Training loss: 5.9114 Explore P: 0.9306
Episode: 38 Total reward: 16.0 Training loss: 10.6974 Explore P: 0.9291
Episode: 39 Total reward: 28.0 Training loss: 13.1287 Explore P: 0.9265
Episode: 40 Total reward: 9.0 Training loss: 11.0986 Explore P: 0.9257
Episode: 41 Total reward: 12.0 Training loss: 7.3433 Explore P: 0.9246
Episode: 42 Total reward: 19.0 Training loss: 4.8519 Explore P: 0.9229
Episode: 43 Total reward: 11.0 Training loss: 3.8401 Explore P: 0.9219
Episode: 44 Total reward: 12.0 Training loss: 5.0834 Explore P: 0.9208
Episode: 45 Total reward: 14.0 Training loss: 4.7337 Explore P: 0.9195
Episode: 46 Total reward: 14.0 Training loss: 9.3819 Explore P: 0.9182
Episode: 47 Total reward: 16.0 Training loss: 4.9428 Explore P: 0.9168
Episode: 48 Total reward: 15.0 Training loss: 15.5070 Explore P: 0.9154
Episode: 49 Total reward: 35.0 Training loss: 24.8480 Explore P: 0.9123
Episode: 50 Total reward: 17.0 Training loss: 5.4612 Explore P: 0.9107
Episode: 51 Total reward: 13.0 Training loss: 35.1359 Explore P: 0.9096
Episode: 52 Total reward: 28.0 Training loss: 12.6045 Explore P: 0.9070
Episode: 53 Total reward: 11.0 Training loss: 5.8519 Explore P: 0.9061
Episode: 54 Total reward: 38.0 Training loss: 45.2712 Explore P: 0.9027
Episode: 55 Total reward: 14.0 Training loss: 19.3384 Explore P: 0.9014
Episode: 56 Total reward: 33.0 Training loss: 4.9964 Explore P: 0.8985
Episode: 57 Total reward: 13.0 Training loss: 6.8854 Explore P: 0.8973
Episode: 58 Total reward: 17.0 Training loss: 17.4773 Explore P: 0.8958
Episode: 59 Total reward: 39.0 Training loss: 5.6547 Explore P: 0.8924
Episode: 60 Total reward: 11.0 Training loss: 6.6163 Explore P: 0.8914
Episode: 61 Total reward: 20.0 Training loss: 59.5690 Explore P: 0.8896
Episode: 62 Total reward: 10.0 Training loss: 6.6274 Explore P: 0.8888
Episode: 63 Total reward: 24.0 Training loss: 92.5373 Explore P: 0.8866
Episode: 64 Total reward: 22.0 Training loss: 6.9516 Explore P: 0.8847
Episode: 65 Total reward: 16.0 Training loss: 43.4159 Explore P: 0.8833
Episode: 66 Total reward: 15.0 Training loss: 27.0670 Explore P: 0.8820
Episode: 67 Total reward: 16.0 Training loss: 38.7565 Explore P: 0.8806
Episode: 68 Total reward: 15.0 Training loss: 48.5991 Explore P: 0.8793
Episode: 69 Total reward: 19.0 Training loss: 27.3788 Explore P: 0.8777
Episode: 70 Total reward: 27.0 Training loss: 27.6228 Explore P: 0.8753
Episode: 71 Total reward: 27.0 Training loss: 8.1982 Explore P: 0.8730
Episode: 72 Total reward: 16.0 Training loss: 148.6657 Explore P: 0.8716
Episode: 73 Total reward: 12.0 Training loss: 9.4558 Explore P: 0.8706
Episode: 74 Total reward: 27.0 Training loss: 91.3385 Explore P: 0.8683
Episode: 75 Total reward: 26.0 Training loss: 53.7565 Explore P: 0.8660
Episode: 76 Total reward: 14.0 Training loss: 10.4252 Explore P: 0.8648
Episode: 77 Total reward: 27.0 Training loss: 23.1290 Explore P: 0.8625
Episode: 78 Total reward: 14.0 Training loss: 54.4627 Explore P: 0.8613
Episode: 79 Total reward: 20.0 Training loss: 44.6119 Explore P: 0.8596
Episode: 80 Total reward: 11.0 Training loss: 9.3377 Explore P: 0.8587
Episode: 81 Total reward: 14.0 Training loss: 12.0756 Explore P: 0.8575
Episode: 82 Total reward: 14.0 Training loss: 66.1741 Explore P: 0.8563
Episode: 83 Total reward: 14.0 Training loss: 101.3480 Explore P: 0.8551
Episode: 84 Total reward: 22.0 Training loss: 40.9549 Explore P: 0.8533
Episode: 85 Total reward: 26.0 Training loss: 11.6479 Explore P: 0.8511
Episode: 86 Total reward: 8.0 Training loss: 29.7001 Explore P: 0.8504
Episode: 87 Total reward: 7.0 Training loss: 14.6046 Explore P: 0.8498
Episode: 88 Total reward: 22.0 Training loss: 68.4965 Explore P: 0.8480
Episode: 89 Total reward: 24.0 Training loss: 9.7683 Explore P: 0.8460
Episode: 90 Total reward: 17.0 Training loss: 14.4485 Explore P: 0.8446
Episode: 91 Total reward: 14.0 Training loss: 79.8466 Explore P: 0.8434
Episode: 92 Total reward: 10.0 Training loss: 158.3719 Explore P: 0.8426
Episode: 93 Total reward: 11.0 Training loss: 13.6722 Explore P: 0.8416
Episode: 94 Total reward: 57.0 Training loss: 48.9059 Explore P: 0.8369
Episode: 95 Total reward: 13.0 Training loss: 103.8669 Explore P: 0.8358
Episode: 96 Total reward: 9.0 Training loss: 15.9803 Explore P: 0.8351
Episode: 97 Total reward: 41.0 Training loss: 303.9713 Explore P: 0.8317
Episode: 98 Total reward: 20.0 Training loss: 60.2845 Explore P: 0.8301
Episode: 99 Total reward: 24.0 Training loss: 11.5217 Explore P: 0.8281
Episode: 100 Total reward: 8.0 Training loss: 114.5364 Explore P: 0.8275
Episode: 101 Total reward: 17.0 Training loss: 39.8671 Explore P: 0.8261
Episode: 102 Total reward: 32.0 Training loss: 72.7662 Explore P: 0.8235
Episode: 103 Total reward: 24.0 Training loss: 50.9641 Explore P: 0.8215
Episode: 104 Total reward: 34.0 Training loss: 123.0512 Explore P: 0.8188
Episode: 105 Total reward: 9.0 Training loss: 13.5239 Explore P: 0.8180
Episode: 106 Total reward: 11.0 Training loss: 13.0276 Explore P: 0.8171
Episode: 107 Total reward: 16.0 Training loss: 153.6839 Explore P: 0.8159
Episode: 108 Total reward: 18.0 Training loss: 159.0586 Explore P: 0.8144
Episode: 109 Total reward: 14.0 Training loss: 14.3653 Explore P: 0.8133
Episode: 110 Total reward: 16.0 Training loss: 68.9230 Explore P: 0.8120
Episode: 111 Total reward: 12.0 Training loss: 13.6382 Explore P: 0.8110
Episode: 112 Total reward: 18.0 Training loss: 246.2749 Explore P: 0.8096
Episode: 113 Total reward: 15.0 Training loss: 358.5706 Explore P: 0.8084
Episode: 114 Total reward: 18.0 Training loss: 62.2344 Explore P: 0.8070
Episode: 115 Total reward: 22.0 Training loss: 318.1645 Explore P: 0.8052
Episode: 116 Total reward: 16.0 Training loss: 11.2065 Explore P: 0.8039
Episode: 117 Total reward: 24.0 Training loss: 12.3606 Explore P: 0.8020
Episode: 118 Total reward: 17.0 Training loss: 173.6949 Explore P: 0.8007
Episode: 119 Total reward: 14.0 Training loss: 13.8844 Explore P: 0.7996
Episode: 120 Total reward: 31.0 Training loss: 57.0642 Explore P: 0.7971
Episode: 121 Total reward: 24.0 Training loss: 74.7098 Explore P: 0.7953
Episode: 122 Total reward: 20.0 Training loss: 77.0540 Explore P: 0.7937
Episode: 123 Total reward: 17.0 Training loss: 115.4260 Explore P: 0.7924
Episode: 124 Total reward: 7.0 Training loss: 9.0179 Explore P: 0.7918
Episode: 125 Total reward: 19.0 Training loss: 90.5463 Explore P: 0.7903
Episode: 126 Total reward: 8.0 Training loss: 176.0065 Explore P: 0.7897
Episode: 127 Total reward: 17.0 Training loss: 72.9808 Explore P: 0.7884
Episode: 128 Total reward: 13.0 Training loss: 97.5244 Explore P: 0.7874
Episode: 129 Total reward: 17.0 Training loss: 68.5440 Explore P: 0.7860
Episode: 130 Total reward: 16.0 Training loss: 166.4149 Explore P: 0.7848
Episode: 131 Total reward: 17.0 Training loss: 8.0810 Explore P: 0.7835
Episode: 132 Total reward: 12.0 Training loss: 202.1730 Explore P: 0.7826
Episode: 133 Total reward: 16.0 Training loss: 98.1482 Explore P: 0.7813
Episode: 134 Total reward: 11.0 Training loss: 124.1973 Explore P: 0.7805
Episode: 135 Total reward: 23.0 Training loss: 93.7302 Explore P: 0.7787
Episode: 136 Total reward: 22.0 Training loss: 6.7072 Explore P: 0.7770
Episode: 137 Total reward: 14.0 Training loss: 5.6885 Explore P: 0.7759
Episode: 138 Total reward: 12.0 Training loss: 4.9377 Explore P: 0.7750
Episode: 139 Total reward: 10.0 Training loss: 6.2922 Explore P: 0.7743
Episode: 140 Total reward: 16.0 Training loss: 139.9118 Explore P: 0.7730
Episode: 141 Total reward: 13.0 Training loss: 4.8236 Explore P: 0.7720
Episode: 142 Total reward: 18.0 Training loss: 6.1760 Explore P: 0.7707
Episode: 143 Total reward: 19.0 Training loss: 4.0596 Explore P: 0.7692
Episode: 144 Total reward: 10.0 Training loss: 92.4380 Explore P: 0.7685
Episode: 145 Total reward: 37.0 Training loss: 281.7591 Explore P: 0.7657
Episode: 146 Total reward: 27.0 Training loss: 3.5860 Explore P: 0.7636
Episode: 147 Total reward: 15.0 Training loss: 105.0311 Explore P: 0.7625
Episode: 148 Total reward: 17.0 Training loss: 4.3904 Explore P: 0.7612
Episode: 149 Total reward: 9.0 Training loss: 296.8312 Explore P: 0.7605
Episode: 150 Total reward: 19.0 Training loss: 152.9619 Explore P: 0.7591
Episode: 151 Total reward: 11.0 Training loss: 89.5227 Explore P: 0.7583
Episode: 152 Total reward: 12.0 Training loss: 2.8611 Explore P: 0.7574
Episode: 153 Total reward: 9.0 Training loss: 260.1629 Explore P: 0.7567
Episode: 154 Total reward: 38.0 Training loss: 76.3871 Explore P: 0.7539
Episode: 155 Total reward: 9.0 Training loss: 76.4022 Explore P: 0.7532
Episode: 156 Total reward: 12.0 Training loss: 83.1049 Explore P: 0.7523
Episode: 157 Total reward: 19.0 Training loss: 83.7814 Explore P: 0.7509
Episode: 158 Total reward: 13.0 Training loss: 72.9796 Explore P: 0.7500
Episode: 159 Total reward: 12.0 Training loss: 85.7109 Explore P: 0.7491
Episode: 160 Total reward: 27.0 Training loss: 2.4435 Explore P: 0.7471
Episode: 161 Total reward: 8.0 Training loss: 3.0761 Explore P: 0.7465
Episode: 162 Total reward: 12.0 Training loss: 2.3322 Explore P: 0.7456
Episode: 163 Total reward: 53.0 Training loss: 59.9252 Explore P: 0.7417
Episode: 164 Total reward: 32.0 Training loss: 112.4796 Explore P: 0.7394
Episode: 165 Total reward: 21.0 Training loss: 1.8170 Explore P: 0.7379
Episode: 166 Total reward: 14.0 Training loss: 68.7368 Explore P: 0.7368
Episode: 167 Total reward: 27.0 Training loss: 1.7118 Explore P: 0.7349
Episode: 168 Total reward: 40.0 Training loss: 72.7503 Explore P: 0.7320
Episode: 169 Total reward: 15.0 Training loss: 108.1811 Explore P: 0.7309
Episode: 170 Total reward: 10.0 Training loss: 61.5947 Explore P: 0.7302
Episode: 171 Total reward: 13.0 Training loss: 112.5797 Explore P: 0.7292
Episode: 172 Total reward: 18.0 Training loss: 106.6816 Explore P: 0.7280
Episode: 173 Total reward: 21.0 Training loss: 56.3213 Explore P: 0.7264
Episode: 174 Total reward: 23.0 Training loss: 47.5368 Explore P: 0.7248
Episode: 175 Total reward: 11.0 Training loss: 1.1486 Explore P: 0.7240
Episode: 176 Total reward: 15.0 Training loss: 47.4509 Explore P: 0.7229
Episode: 177 Total reward: 9.0 Training loss: 0.8484 Explore P: 0.7223
Episode: 178 Total reward: 21.0 Training loss: 1.0979 Explore P: 0.7208
Episode: 179 Total reward: 10.0 Training loss: 1.2675 Explore P: 0.7201
Episode: 180 Total reward: 20.0 Training loss: 155.9581 Explore P: 0.7187
Episode: 181 Total reward: 9.0 Training loss: 104.7243 Explore P: 0.7180
Episode: 182 Total reward: 11.0 Training loss: 0.9699 Explore P: 0.7173
Episode: 183 Total reward: 12.0 Training loss: 40.7418 Explore P: 0.7164
Episode: 184 Total reward: 9.0 Training loss: 53.7699 Explore P: 0.7158
Episode: 185 Total reward: 8.0 Training loss: 45.5634 Explore P: 0.7152
Episode: 186 Total reward: 19.0 Training loss: 0.7347 Explore P: 0.7139
Episode: 187 Total reward: 14.0 Training loss: 1.0818 Explore P: 0.7129
Episode: 188 Total reward: 13.0 Training loss: 79.5866 Explore P: 0.7120
Episode: 189 Total reward: 14.0 Training loss: 1.2672 Explore P: 0.7110
Episode: 190 Total reward: 14.0 Training loss: 173.9336 Explore P: 0.7100
Episode: 191 Total reward: 16.0 Training loss: 36.5148 Explore P: 0.7089
Episode: 192 Total reward: 19.0 Training loss: 34.7869 Explore P: 0.7076
Episode: 193 Total reward: 13.0 Training loss: 34.4246 Explore P: 0.7067
Episode: 194 Total reward: 11.0 Training loss: 97.9411 Explore P: 0.7059
Episode: 195 Total reward: 24.0 Training loss: 1.2390 Explore P: 0.7042
Episode: 196 Total reward: 13.0 Training loss: 96.4879 Explore P: 0.7033
Episode: 197 Total reward: 14.0 Training loss: 1.3219 Explore P: 0.7024
Episode: 198 Total reward: 18.0 Training loss: 1.1630 Explore P: 0.7011
Episode: 199 Total reward: 19.0 Training loss: 31.4953 Explore P: 0.6998
Episode: 200 Total reward: 13.0 Training loss: 65.7182 Explore P: 0.6989
Episode: 201 Total reward: 13.0 Training loss: 29.7972 Explore P: 0.6980
Episode: 202 Total reward: 12.0 Training loss: 60.6254 Explore P: 0.6972
Episode: 203 Total reward: 10.0 Training loss: 56.3095 Explore P: 0.6965
Episode: 204 Total reward: 17.0 Training loss: 87.0896 Explore P: 0.6953
Episode: 205 Total reward: 9.0 Training loss: 31.1107 Explore P: 0.6947
Episode: 206 Total reward: 45.0 Training loss: 81.9446 Explore P: 0.6916
Episode: 207 Total reward: 16.0 Training loss: 134.4349 Explore P: 0.6906
Episode: 208 Total reward: 9.0 Training loss: 93.7610 Explore P: 0.6899
Episode: 209 Total reward: 9.0 Training loss: 97.8263 Explore P: 0.6893
Episode: 210 Total reward: 16.0 Training loss: 1.2216 Explore P: 0.6882
Episode: 211 Total reward: 38.0 Training loss: 27.7472 Explore P: 0.6857
Episode: 212 Total reward: 10.0 Training loss: 55.8109 Explore P: 0.6850
Episode: 213 Total reward: 21.0 Training loss: 1.4390 Explore P: 0.6836
Episode: 214 Total reward: 18.0 Training loss: 48.7403 Explore P: 0.6824
Episode: 215 Total reward: 20.0 Training loss: 58.0110 Explore P: 0.6810
Episode: 216 Total reward: 23.0 Training loss: 61.1624 Explore P: 0.6795
Episode: 217 Total reward: 17.0 Training loss: 63.1253 Explore P: 0.6783
Episode: 218 Total reward: 12.0 Training loss: 63.2139 Explore P: 0.6775
Episode: 219 Total reward: 14.0 Training loss: 1.7457 Explore P: 0.6766
Episode: 220 Total reward: 10.0 Training loss: 74.2714 Explore P: 0.6759
Episode: 221 Total reward: 11.0 Training loss: 24.6472 Explore P: 0.6752
Episode: 222 Total reward: 20.0 Training loss: 1.5005 Explore P: 0.6739
Episode: 223 Total reward: 8.0 Training loss: 212.6887 Explore P: 0.6734
Episode: 224 Total reward: 21.0 Training loss: 49.0527 Explore P: 0.6720
Episode: 225 Total reward: 17.0 Training loss: 107.9951 Explore P: 0.6708
Episode: 226 Total reward: 9.0 Training loss: 44.6592 Explore P: 0.6702
Episode: 227 Total reward: 17.0 Training loss: 75.2804 Explore P: 0.6691
Episode: 228 Total reward: 10.0 Training loss: 1.5486 Explore P: 0.6685
Episode: 229 Total reward: 23.0 Training loss: 77.4791 Explore P: 0.6669
Episode: 230 Total reward: 37.0 Training loss: 1.4560 Explore P: 0.6645
Episode: 231 Total reward: 12.0 Training loss: 24.0227 Explore P: 0.6637
Episode: 232 Total reward: 10.0 Training loss: 64.6423 Explore P: 0.6631
Episode: 233 Total reward: 18.0 Training loss: 99.3664 Explore P: 0.6619
Episode: 234 Total reward: 20.0 Training loss: 0.8861 Explore P: 0.6606
Episode: 235 Total reward: 12.0 Training loss: 84.8819 Explore P: 0.6598
Episode: 236 Total reward: 13.0 Training loss: 1.5716 Explore P: 0.6590
Episode: 237 Total reward: 16.0 Training loss: 114.6781 Explore P: 0.6579
Episode: 238 Total reward: 20.0 Training loss: 47.8078 Explore P: 0.6566
Episode: 239 Total reward: 9.0 Training loss: 22.5753 Explore P: 0.6561
Episode: 240 Total reward: 14.0 Training loss: 20.1552 Explore P: 0.6552
Episode: 241 Total reward: 23.0 Training loss: 160.0203 Explore P: 0.6537
Episode: 242 Total reward: 11.0 Training loss: 40.8776 Explore P: 0.6530
Episode: 243 Total reward: 12.0 Training loss: 1.4787 Explore P: 0.6522
Episode: 244 Total reward: 12.0 Training loss: 1.1535 Explore P: 0.6514
Episode: 245 Total reward: 15.0 Training loss: 1.2797 Explore P: 0.6505
Episode: 246 Total reward: 21.0 Training loss: 20.8654 Explore P: 0.6491
Episode: 247 Total reward: 13.0 Training loss: 1.4394 Explore P: 0.6483
Episode: 248 Total reward: 10.0 Training loss: 64.5682 Explore P: 0.6477
Episode: 249 Total reward: 15.0 Training loss: 1.5086 Explore P: 0.6467
Episode: 250 Total reward: 21.0 Training loss: 22.9507 Explore P: 0.6454
Episode: 251 Total reward: 17.0 Training loss: 37.1998 Explore P: 0.6443
Episode: 252 Total reward: 10.0 Training loss: 43.1350 Explore P: 0.6437
Episode: 253 Total reward: 11.0 Training loss: 57.2139 Explore P: 0.6430
Episode: 254 Total reward: 17.0 Training loss: 1.3840 Explore P: 0.6419
Episode: 255 Total reward: 22.0 Training loss: 21.3623 Explore P: 0.6405
Episode: 256 Total reward: 16.0 Training loss: 1.2095 Explore P: 0.6395
Episode: 257 Total reward: 16.0 Training loss: 87.6040 Explore P: 0.6385
Episode: 258 Total reward: 9.0 Training loss: 0.9032 Explore P: 0.6379
Episode: 259 Total reward: 12.0 Training loss: 70.8250 Explore P: 0.6372
Episode: 260 Total reward: 27.0 Training loss: 33.9968 Explore P: 0.6355
Episode: 261 Total reward: 19.0 Training loss: 1.3007 Explore P: 0.6343
Episode: 262 Total reward: 12.0 Training loss: 69.3933 Explore P: 0.6335
Episode: 263 Total reward: 13.0 Training loss: 22.9803 Explore P: 0.6327
Episode: 264 Total reward: 14.0 Training loss: 1.1484 Explore P: 0.6319
Episode: 265 Total reward: 16.0 Training loss: 30.5876 Explore P: 0.6309
Episode: 266 Total reward: 17.0 Training loss: 19.6197 Explore P: 0.6298
Episode: 267 Total reward: 7.0 Training loss: 36.4002 Explore P: 0.6294
Episode: 268 Total reward: 32.0 Training loss: 0.8536 Explore P: 0.6274
Episode: 269 Total reward: 31.0 Training loss: 62.2378 Explore P: 0.6255
Episode: 270 Total reward: 7.0 Training loss: 0.9168 Explore P: 0.6251
Episode: 271 Total reward: 15.0 Training loss: 28.9445 Explore P: 0.6241
Episode: 272 Total reward: 17.0 Training loss: 27.3975 Explore P: 0.6231
Episode: 273 Total reward: 12.0 Training loss: 32.5440 Explore P: 0.6224
Episode: 274 Total reward: 20.0 Training loss: 1.2314 Explore P: 0.6211
Episode: 275 Total reward: 30.0 Training loss: 26.6066 Explore P: 0.6193
Episode: 276 Total reward: 16.0 Training loss: 17.9376 Explore P: 0.6183
Episode: 277 Total reward: 24.0 Training loss: 53.8057 Explore P: 0.6169
Episode: 278 Total reward: 28.0 Training loss: 15.8050 Explore P: 0.6152
Episode: 279 Total reward: 36.0 Training loss: 24.7406 Explore P: 0.6130
Episode: 280 Total reward: 9.0 Training loss: 30.5545 Explore P: 0.6125
Episode: 281 Total reward: 19.0 Training loss: 22.8018 Explore P: 0.6113
Episode: 282 Total reward: 21.0 Training loss: 40.6639 Explore P: 0.6100
Episode: 283 Total reward: 22.0 Training loss: 15.2307 Explore P: 0.6087
Episode: 284 Total reward: 60.0 Training loss: 0.6390 Explore P: 0.6051
Episode: 285 Total reward: 55.0 Training loss: 0.6185 Explore P: 0.6019
Episode: 286 Total reward: 74.0 Training loss: 0.8345 Explore P: 0.5975
Episode: 287 Total reward: 30.0 Training loss: 18.9515 Explore P: 0.5958
Episode: 288 Total reward: 20.0 Training loss: 51.8657 Explore P: 0.5946
Episode: 289 Total reward: 29.0 Training loss: 0.7666 Explore P: 0.5929
Episode: 290 Total reward: 20.0 Training loss: 0.8019 Explore P: 0.5917
Episode: 291 Total reward: 47.0 Training loss: 17.1135 Explore P: 0.5890
Episode: 292 Total reward: 21.0 Training loss: 0.8164 Explore P: 0.5878
Episode: 293 Total reward: 34.0 Training loss: 13.7301 Explore P: 0.5858
Episode: 294 Total reward: 43.0 Training loss: 43.7469 Explore P: 0.5834
Episode: 295 Total reward: 45.0 Training loss: 17.0710 Explore P: 0.5808
Episode: 296 Total reward: 21.0 Training loss: 48.7797 Explore P: 0.5796
Episode: 297 Total reward: 22.0 Training loss: 12.2816 Explore P: 0.5783
Episode: 298 Total reward: 55.0 Training loss: 0.9589 Explore P: 0.5752
Episode: 299 Total reward: 39.0 Training loss: 28.3409 Explore P: 0.5730
Episode: 300 Total reward: 21.0 Training loss: 13.8164 Explore P: 0.5718
Episode: 301 Total reward: 14.0 Training loss: 0.9789 Explore P: 0.5710
Episode: 302 Total reward: 23.0 Training loss: 12.4849 Explore P: 0.5698
Episode: 303 Total reward: 19.0 Training loss: 24.8881 Explore P: 0.5687
Episode: 304 Total reward: 42.0 Training loss: 12.7454 Explore P: 0.5664
Episode: 305 Total reward: 40.0 Training loss: 1.0806 Explore P: 0.5641
Episode: 306 Total reward: 28.0 Training loss: 37.3317 Explore P: 0.5626
Episode: 307 Total reward: 48.0 Training loss: 13.4385 Explore P: 0.5599
Episode: 308 Total reward: 37.0 Training loss: 12.2602 Explore P: 0.5579
Episode: 309 Total reward: 26.0 Training loss: 14.5040 Explore P: 0.5565
Episode: 310 Total reward: 28.0 Training loss: 1.0628 Explore P: 0.5550
Episode: 311 Total reward: 13.0 Training loss: 26.3329 Explore P: 0.5542
Episode: 312 Total reward: 45.0 Training loss: 21.1833 Explore P: 0.5518
Episode: 313 Total reward: 42.0 Training loss: 0.8648 Explore P: 0.5495
Episode: 314 Total reward: 15.0 Training loss: 11.8648 Explore P: 0.5487
Episode: 315 Total reward: 61.0 Training loss: 24.5170 Explore P: 0.5454
Episode: 316 Total reward: 37.0 Training loss: 11.9717 Explore P: 0.5435
Episode: 317 Total reward: 46.0 Training loss: 10.9650 Explore P: 0.5410
Episode: 318 Total reward: 45.0 Training loss: 22.0438 Explore P: 0.5386
Episode: 319 Total reward: 11.0 Training loss: 9.8664 Explore P: 0.5381
Episode: 320 Total reward: 24.0 Training loss: 0.7313 Explore P: 0.5368
Episode: 321 Total reward: 12.0 Training loss: 0.6706 Explore P: 0.5362
Episode: 322 Total reward: 23.0 Training loss: 1.1406 Explore P: 0.5350
Episode: 323 Total reward: 14.0 Training loss: 22.1524 Explore P: 0.5342
Episode: 324 Total reward: 23.0 Training loss: 23.1386 Explore P: 0.5330
Episode: 325 Total reward: 90.0 Training loss: 1.1859 Explore P: 0.5283
Episode: 326 Total reward: 41.0 Training loss: 13.5423 Explore P: 0.5262
Episode: 327 Total reward: 55.0 Training loss: 10.0944 Explore P: 0.5234
Episode: 328 Total reward: 24.0 Training loss: 43.2777 Explore P: 0.5221
Episode: 329 Total reward: 105.0 Training loss: 1.0654 Explore P: 0.5168
Episode: 330 Total reward: 23.0 Training loss: 12.5466 Explore P: 0.5156
Episode: 331 Total reward: 25.0 Training loss: 17.5919 Explore P: 0.5144
Episode: 332 Total reward: 51.0 Training loss: 10.5448 Explore P: 0.5118
Episode: 333 Total reward: 17.0 Training loss: 36.8923 Explore P: 0.5109
Episode: 334 Total reward: 60.0 Training loss: 35.1556 Explore P: 0.5080
Episode: 335 Total reward: 59.0 Training loss: 10.2744 Explore P: 0.5050
Episode: 336 Total reward: 22.0 Training loss: 9.2883 Explore P: 0.5039
Episode: 337 Total reward: 67.0 Training loss: 1.5024 Explore P: 0.5006
Episode: 338 Total reward: 68.0 Training loss: 0.7419 Explore P: 0.4973
Episode: 339 Total reward: 35.0 Training loss: 45.8844 Explore P: 0.4956
Episode: 340 Total reward: 15.0 Training loss: 1.1310 Explore P: 0.4949
Episode: 341 Total reward: 40.0 Training loss: 1.3678 Explore P: 0.4929
Episode: 342 Total reward: 39.0 Training loss: 19.6442 Explore P: 0.4911
Episode: 343 Total reward: 78.0 Training loss: 1.0643 Explore P: 0.4873
Episode: 344 Total reward: 28.0 Training loss: 26.2047 Explore P: 0.4860
Episode: 345 Total reward: 54.0 Training loss: 26.6397 Explore P: 0.4834
Episode: 346 Total reward: 47.0 Training loss: 11.6520 Explore P: 0.4812
Episode: 347 Total reward: 58.0 Training loss: 16.9289 Explore P: 0.4785
Episode: 348 Total reward: 41.0 Training loss: 1.3338 Explore P: 0.4766
Episode: 349 Total reward: 38.0 Training loss: 1.3062 Explore P: 0.4748
Episode: 350 Total reward: 31.0 Training loss: 39.7934 Explore P: 0.4734
Episode: 351 Total reward: 60.0 Training loss: 1.1360 Explore P: 0.4706
Episode: 352 Total reward: 50.0 Training loss: 12.1699 Explore P: 0.4683
Episode: 353 Total reward: 29.0 Training loss: 34.2566 Explore P: 0.4670
Episode: 354 Total reward: 25.0 Training loss: 31.3179 Explore P: 0.4658
Episode: 355 Total reward: 107.0 Training loss: 1.2605 Explore P: 0.4610
Episode: 356 Total reward: 71.0 Training loss: 11.1077 Explore P: 0.4578
Episode: 357 Total reward: 52.0 Training loss: 1.4389 Explore P: 0.4555
Episode: 358 Total reward: 132.0 Training loss: 49.9257 Explore P: 0.4496
Episode: 359 Total reward: 183.0 Training loss: 1.2168 Explore P: 0.4416
Episode: 360 Total reward: 58.0 Training loss: 1.3403 Explore P: 0.4391
Episode: 361 Total reward: 34.0 Training loss: 1.3896 Explore P: 0.4377
Episode: 362 Total reward: 47.0 Training loss: 86.0726 Explore P: 0.4357
Episode: 363 Total reward: 47.0 Training loss: 22.6297 Explore P: 0.4337
Episode: 364 Total reward: 34.0 Training loss: 35.7950 Explore P: 0.4323
Episode: 365 Total reward: 24.0 Training loss: 39.7296 Explore P: 0.4312
Episode: 366 Total reward: 79.0 Training loss: 1.1521 Explore P: 0.4279
Episode: 367 Total reward: 73.0 Training loss: 22.8703 Explore P: 0.4249
Episode: 368 Total reward: 49.0 Training loss: 20.8242 Explore P: 0.4229
Episode: 369 Total reward: 29.0 Training loss: 1.0323 Explore P: 0.4217
Episode: 370 Total reward: 62.0 Training loss: 47.5436 Explore P: 0.4191
Episode: 371 Total reward: 65.0 Training loss: 18.2278 Explore P: 0.4165
Episode: 372 Total reward: 25.0 Training loss: 64.9173 Explore P: 0.4155
Episode: 373 Total reward: 26.0 Training loss: 92.7751 Explore P: 0.4144
Episode: 374 Total reward: 32.0 Training loss: 1.1315 Explore P: 0.4131
Episode: 375 Total reward: 25.0 Training loss: 1.2099 Explore P: 0.4121
Episode: 376 Total reward: 72.0 Training loss: 1.0920 Explore P: 0.4092
Episode: 377 Total reward: 60.0 Training loss: 1.4395 Explore P: 0.4068
Episode: 378 Total reward: 43.0 Training loss: 27.1649 Explore P: 0.4051
Episode: 379 Total reward: 56.0 Training loss: 24.6124 Explore P: 0.4029
Episode: 380 Total reward: 38.0 Training loss: 0.8332 Explore P: 0.4014
Episode: 381 Total reward: 34.0 Training loss: 31.2533 Explore P: 0.4001
Episode: 382 Total reward: 50.0 Training loss: 60.6818 Explore P: 0.3982
Episode: 383 Total reward: 78.0 Training loss: 16.9956 Explore P: 0.3951
Episode: 384 Total reward: 40.0 Training loss: 21.6409 Explore P: 0.3936
Episode: 385 Total reward: 49.0 Training loss: 28.8969 Explore P: 0.3917
Episode: 386 Total reward: 105.0 Training loss: 39.4656 Explore P: 0.3877
Episode: 387 Total reward: 58.0 Training loss: 1.0702 Explore P: 0.3856
Episode: 388 Total reward: 36.0 Training loss: 1.6365 Explore P: 0.3842
Episode: 389 Total reward: 66.0 Training loss: 23.7121 Explore P: 0.3817
Episode: 390 Total reward: 98.0 Training loss: 45.1203 Explore P: 0.3781
Episode: 391 Total reward: 44.0 Training loss: 1.6504 Explore P: 0.3765
Episode: 392 Total reward: 117.0 Training loss: 1.3644 Explore P: 0.3722
Episode: 393 Total reward: 35.0 Training loss: 1.5169 Explore P: 0.3710
Episode: 394 Total reward: 55.0 Training loss: 22.6399 Explore P: 0.3690
Episode: 395 Total reward: 41.0 Training loss: 29.8454 Explore P: 0.3675
Episode: 396 Total reward: 38.0 Training loss: 1.9906 Explore P: 0.3662
Episode: 397 Total reward: 39.0 Training loss: 63.6314 Explore P: 0.3648
Episode: 398 Total reward: 41.0 Training loss: 29.4477 Explore P: 0.3633
Episode: 399 Total reward: 116.0 Training loss: 1.3461 Explore P: 0.3593
Episode: 400 Total reward: 133.0 Training loss: 36.9494 Explore P: 0.3546
Episode: 401 Total reward: 85.0 Training loss: 26.3764 Explore P: 0.3517
Episode: 402 Total reward: 106.0 Training loss: 2.1950 Explore P: 0.3481
Episode: 403 Total reward: 93.0 Training loss: 1.6908 Explore P: 0.3450
Episode: 404 Total reward: 83.0 Training loss: 47.5147 Explore P: 0.3422
Episode: 405 Total reward: 80.0 Training loss: 1.7274 Explore P: 0.3396
Episode: 406 Total reward: 43.0 Training loss: 2.4234 Explore P: 0.3382
Episode: 407 Total reward: 56.0 Training loss: 1.4740 Explore P: 0.3363
Episode: 408 Total reward: 49.0 Training loss: 3.7124 Explore P: 0.3347
Episode: 409 Total reward: 65.0 Training loss: 39.2932 Explore P: 0.3326
Episode: 410 Total reward: 36.0 Training loss: 146.6746 Explore P: 0.3315
Episode: 411 Total reward: 47.0 Training loss: 21.7338 Explore P: 0.3300
Episode: 412 Total reward: 54.0 Training loss: 1.2765 Explore P: 0.3282
Episode: 413 Total reward: 55.0 Training loss: 2.4019 Explore P: 0.3265
Episode: 414 Total reward: 123.0 Training loss: 32.6547 Explore P: 0.3226
Episode: 415 Total reward: 28.0 Training loss: 2.3248 Explore P: 0.3218
Episode: 416 Total reward: 103.0 Training loss: 2.7918 Explore P: 0.3186
Episode: 417 Total reward: 165.0 Training loss: 1.6300 Explore P: 0.3135
Episode: 418 Total reward: 66.0 Training loss: 1.4146 Explore P: 0.3115
Episode: 419 Total reward: 47.0 Training loss: 11.5925 Explore P: 0.3101
Episode: 420 Total reward: 73.0 Training loss: 45.0749 Explore P: 0.3079
Episode: 421 Total reward: 131.0 Training loss: 2.6746 Explore P: 0.3040
Episode: 422 Total reward: 108.0 Training loss: 40.4880 Explore P: 0.3009
Episode: 423 Total reward: 64.0 Training loss: 0.8858 Explore P: 0.2990
Episode: 424 Total reward: 76.0 Training loss: 1.5335 Explore P: 0.2968
Episode: 425 Total reward: 137.0 Training loss: 87.9757 Explore P: 0.2929
Episode: 426 Total reward: 62.0 Training loss: 1.9621 Explore P: 0.2912
Episode: 427 Total reward: 76.0 Training loss: 42.6316 Explore P: 0.2891
Episode: 428 Total reward: 115.0 Training loss: 2.5584 Explore P: 0.2859
Episode: 429 Total reward: 76.0 Training loss: 2.3628 Explore P: 0.2838
Episode: 430 Total reward: 82.0 Training loss: 1.8800 Explore P: 0.2815
Episode: 431 Total reward: 99.0 Training loss: 12.6375 Explore P: 0.2789
Episode: 432 Total reward: 105.0 Training loss: 1.2985 Explore P: 0.2761
Episode: 433 Total reward: 76.0 Training loss: 1.6480 Explore P: 0.2740
Episode: 435 Total reward: 24.0 Training loss: 1.7097 Explore P: 0.2682
Episode: 436 Total reward: 49.0 Training loss: 19.4165 Explore P: 0.2669
Episode: 437 Total reward: 72.0 Training loss: 58.6929 Explore P: 0.2651
Episode: 438 Total reward: 79.0 Training loss: 46.7626 Explore P: 0.2631
Episode: 439 Total reward: 116.0 Training loss: 1.5448 Explore P: 0.2602
Episode: 440 Total reward: 177.0 Training loss: 1.3924 Explore P: 0.2558
Episode: 441 Total reward: 115.0 Training loss: 29.7960 Explore P: 0.2530
Episode: 442 Total reward: 183.0 Training loss: 2.8165 Explore P: 0.2486
Episode: 443 Total reward: 45.0 Training loss: 2.2988 Explore P: 0.2475
Episode: 444 Total reward: 133.0 Training loss: 2.0468 Explore P: 0.2443
Episode: 445 Total reward: 145.0 Training loss: 2.2157 Explore P: 0.2410
Episode: 446 Total reward: 119.0 Training loss: 2.6986 Explore P: 0.2382
Episode: 447 Total reward: 146.0 Training loss: 0.5917 Explore P: 0.2349
Episode: 449 Total reward: 181.0 Training loss: 1.7708 Explore P: 0.2265
Episode: 450 Total reward: 97.0 Training loss: 1.0074 Explore P: 0.2244
Episode: 451 Total reward: 56.0 Training loss: 2.0266 Explore P: 0.2232
Episode: 452 Total reward: 108.0 Training loss: 0.9192 Explore P: 0.2209
Episode: 453 Total reward: 109.0 Training loss: 129.6474 Explore P: 0.2187
Episode: 454 Total reward: 104.0 Training loss: 1.4420 Explore P: 0.2165
Episode: 455 Total reward: 96.0 Training loss: 47.8020 Explore P: 0.2145
Episode: 456 Total reward: 100.0 Training loss: 157.8227 Explore P: 0.2125
Episode: 457 Total reward: 86.0 Training loss: 287.3430 Explore P: 0.2108
Episode: 458 Total reward: 59.0 Training loss: 1.8444 Explore P: 0.2096
Episode: 459 Total reward: 77.0 Training loss: 1.0973 Explore P: 0.2080
Episode: 460 Total reward: 115.0 Training loss: 1.6887 Explore P: 0.2058
Episode: 461 Total reward: 170.0 Training loss: 1.3495 Explore P: 0.2025
Episode: 462 Total reward: 58.0 Training loss: 60.7734 Explore P: 0.2014
Episode: 463 Total reward: 82.0 Training loss: 63.9114 Explore P: 0.1998
Episode: 464 Total reward: 112.0 Training loss: 56.2753 Explore P: 0.1977
Episode: 465 Total reward: 91.0 Training loss: 1.6639 Explore P: 0.1960
Episode: 466 Total reward: 67.0 Training loss: 1.3043 Explore P: 0.1948
Episode: 467 Total reward: 198.0 Training loss: 1.1603 Explore P: 0.1911
Episode: 468 Total reward: 176.0 Training loss: 1.5790 Explore P: 0.1880
Episode: 469 Total reward: 145.0 Training loss: 1.3807 Explore P: 0.1854
Episode: 470 Total reward: 190.0 Training loss: 1.8027 Explore P: 0.1821
Episode: 471 Total reward: 80.0 Training loss: 150.5503 Explore P: 0.1807
Episode: 472 Total reward: 89.0 Training loss: 75.9849 Explore P: 0.1792
Episode: 473 Total reward: 77.0 Training loss: 0.8801 Explore P: 0.1779
Episode: 474 Total reward: 99.0 Training loss: 0.9483 Explore P: 0.1763
Episode: 475 Total reward: 57.0 Training loss: 1.5641 Explore P: 0.1753
Episode: 476 Total reward: 65.0 Training loss: 1.0448 Explore P: 0.1743
Episode: 477 Total reward: 57.0 Training loss: 1.7952 Explore P: 0.1733
Episode: 478 Total reward: 151.0 Training loss: 61.0750 Explore P: 0.1709
Episode: 479 Total reward: 82.0 Training loss: 221.0070 Explore P: 0.1696
Episode: 480 Total reward: 70.0 Training loss: 1.5740 Explore P: 0.1684
Episode: 481 Total reward: 50.0 Training loss: 137.1074 Explore P: 0.1677
Episode: 482 Total reward: 57.0 Training loss: 138.6258 Explore P: 0.1668
Episode: 483 Total reward: 105.0 Training loss: 1.8258 Explore P: 0.1651
Episode: 484 Total reward: 54.0 Training loss: 1.5417 Explore P: 0.1643
Episode: 485 Total reward: 72.0 Training loss: 174.4070 Explore P: 0.1632
Episode: 486 Total reward: 62.0 Training loss: 1.0475 Explore P: 0.1622
Episode: 487 Total reward: 81.0 Training loss: 0.8770 Explore P: 0.1610
Episode: 488 Total reward: 61.0 Training loss: 90.5182 Explore P: 0.1601
Episode: 489 Total reward: 73.0 Training loss: 1.1320 Explore P: 0.1590
Episode: 490 Total reward: 69.0 Training loss: 0.8124 Explore P: 0.1580
Episode: 491 Total reward: 76.0 Training loss: 1.1775 Explore P: 0.1568
Episode: 492 Total reward: 71.0 Training loss: 1.3814 Explore P: 0.1558
Episode: 493 Total reward: 131.0 Training loss: 71.6750 Explore P: 0.1539
Episode: 494 Total reward: 124.0 Training loss: 2.4675 Explore P: 0.1521
Episode: 495 Total reward: 196.0 Training loss: 1.2396 Explore P: 0.1494
Episode: 497 Total reward: 42.0 Training loss: 1.6223 Explore P: 0.1460
Episode: 499 Total reward: 92.0 Training loss: 2.2810 Explore P: 0.1421
Episode: 500 Total reward: 112.0 Training loss: 0.9054 Explore P: 0.1407
Episode: 501 Total reward: 123.0 Training loss: 1.3876 Explore P: 0.1391
Episode: 502 Total reward: 175.0 Training loss: 1.4707 Explore P: 0.1368
Episode: 503 Total reward: 83.0 Training loss: 1.3259 Explore P: 0.1358
Episode: 504 Total reward: 56.0 Training loss: 0.8514 Explore P: 0.1351
Episode: 505 Total reward: 142.0 Training loss: 1.0626 Explore P: 0.1333
Episode: 506 Total reward: 153.0 Training loss: 0.5246 Explore P: 0.1314
Episode: 508 Total reward: 118.0 Training loss: 1.0166 Explore P: 0.1276
Episode: 510 Total reward: 21.0 Training loss: 0.7353 Explore P: 0.1251
Episode: 512 Total reward: 54.0 Training loss: 135.0852 Explore P: 0.1222
Episode: 513 Total reward: 134.0 Training loss: 1.0268 Explore P: 0.1207
Episode: 515 Total reward: 19.0 Training loss: 0.7410 Explore P: 0.1183
Episode: 516 Total reward: 118.0 Training loss: 0.5163 Explore P: 0.1170
Episode: 519 Total reward: 4.0 Training loss: 0.8928 Explore P: 0.1128
Episode: 521 Total reward: 22.0 Training loss: 1.0069 Explore P: 0.1105
Episode: 523 Total reward: 34.0 Training loss: 0.9730 Explore P: 0.1082
Episode: 524 Total reward: 135.0 Training loss: 0.3510 Explore P: 0.1069
Episode: 525 Total reward: 139.0 Training loss: 0.5986 Explore P: 0.1055
Episode: 527 Total reward: 110.0 Training loss: 0.5984 Explore P: 0.1026
Episode: 530 Total reward: 98.0 Training loss: 0.3360 Explore P: 0.0981
Episode: 531 Total reward: 161.0 Training loss: 0.6806 Explore P: 0.0967
Episode: 532 Total reward: 168.0 Training loss: 0.7826 Explore P: 0.0953
Episode: 534 Total reward: 87.0 Training loss: 0.2465 Explore P: 0.0929
Episode: 536 Total reward: 5.0 Training loss: 0.4531 Explore P: 0.0912
Episode: 538 Total reward: 34.0 Training loss: 0.5407 Explore P: 0.0893
Episode: 540 Total reward: 38.0 Training loss: 0.2771 Explore P: 0.0874
Episode: 541 Total reward: 159.0 Training loss: 0.3383 Explore P: 0.0862
Episode: 542 Total reward: 156.0 Training loss: 0.3379 Explore P: 0.0850
Episode: 543 Total reward: 157.0 Training loss: 0.6217 Explore P: 0.0839
Episode: 544 Total reward: 147.0 Training loss: 0.3310 Explore P: 0.0828
Episode: 545 Total reward: 169.0 Training loss: 0.4353 Explore P: 0.0816
Episode: 546 Total reward: 147.0 Training loss: 0.2057 Explore P: 0.0805
Episode: 548 Total reward: 56.0 Training loss: 0.5273 Explore P: 0.0787
Episode: 550 Total reward: 28.0 Training loss: 0.2389 Explore P: 0.0772
Episode: 552 Total reward: 72.0 Training loss: 0.4221 Explore P: 0.0754
Episode: 554 Total reward: 2.0 Training loss: 0.2235 Explore P: 0.0741
Episode: 557 Total reward: 99.0 Training loss: 0.3120 Explore P: 0.0710
Episode: 560 Total reward: 99.0 Training loss: 13.9660 Explore P: 0.0680
Episode: 563 Total reward: 99.0 Training loss: 0.3912 Explore P: 0.0652
Episode: 566 Total reward: 1.0 Training loss: 0.2408 Explore P: 0.0630
Episode: 569 Total reward: 31.0 Training loss: 0.3153 Explore P: 0.0608
Episode: 571 Total reward: 191.0 Training loss: 0.2866 Explore P: 0.0588
Episode: 573 Total reward: 157.0 Training loss: 0.2586 Explore P: 0.0571
Episode: 575 Total reward: 180.0 Training loss: 0.2640 Explore P: 0.0554
Episode: 577 Total reward: 185.0 Training loss: 0.2582 Explore P: 0.0536
Episode: 580 Total reward: 39.0 Training loss: 0.0927 Explore P: 0.0518
Episode: 582 Total reward: 85.0 Training loss: 0.2413 Explore P: 0.0506
Episode: 585 Total reward: 99.0 Training loss: 0.1683 Explore P: 0.0486
Episode: 588 Total reward: 99.0 Training loss: 0.1337 Explore P: 0.0467
Episode: 590 Total reward: 187.0 Training loss: 0.1451 Explore P: 0.0453
Episode: 593 Total reward: 99.0 Training loss: 0.1293 Explore P: 0.0436
Episode: 596 Total reward: 64.0 Training loss: 0.0895 Explore P: 0.0421
Episode: 599 Total reward: 99.0 Training loss: 0.1937 Explore P: 0.0405
Episode: 602 Total reward: 99.0 Training loss: 0.0591 Explore P: 0.0390
Episode: 604 Total reward: 190.0 Training loss: 0.1655 Explore P: 0.0379
Episode: 607 Total reward: 42.0 Training loss: 0.0920 Explore P: 0.0367
Episode: 610 Total reward: 16.0 Training loss: 12.9225 Explore P: 0.0356
Episode: 613 Total reward: 31.0 Training loss: 0.1913 Explore P: 0.0346
Episode: 616 Total reward: 39.0 Training loss: 0.1018 Explore P: 0.0335
Episode: 619 Total reward: 64.0 Training loss: 0.1359 Explore P: 0.0324
Episode: 621 Total reward: 119.0 Training loss: 0.2111 Explore P: 0.0317
Episode: 623 Total reward: 161.0 Training loss: 0.1244 Explore P: 0.0310
Episode: 626 Total reward: 99.0 Training loss: 0.1015 Explore P: 0.0299
Episode: 629 Total reward: 66.0 Training loss: 0.1291 Explore P: 0.0290
Episode: 631 Total reward: 78.0 Training loss: 0.1448 Explore P: 0.0285
Episode: 634 Total reward: 55.0 Training loss: 0.0770 Explore P: 0.0277
Episode: 637 Total reward: 74.0 Training loss: 0.1036 Explore P: 0.0269
Episode: 639 Total reward: 157.0 Training loss: 0.1069 Explore P: 0.0263
Episode: 641 Total reward: 92.0 Training loss: 0.0868 Explore P: 0.0258
Episode: 643 Total reward: 154.0 Training loss: 0.1335 Explore P: 0.0253
Episode: 645 Total reward: 110.0 Training loss: 0.3041 Explore P: 0.0248
Episode: 647 Total reward: 32.0 Training loss: 0.1299 Explore P: 0.0245
Episode: 649 Total reward: 170.0 Training loss: 0.1660 Explore P: 0.0239
Episode: 651 Total reward: 23.0 Training loss: 0.4568 Explore P: 0.0236
Episode: 653 Total reward: 8.0 Training loss: 70.9714 Explore P: 0.0233
Episode: 655 Total reward: 51.0 Training loss: 0.1018 Explore P: 0.0230
Episode: 657 Total reward: 68.0 Training loss: 0.3788 Explore P: 0.0227
Episode: 658 Total reward: 191.0 Training loss: 0.1898 Explore P: 0.0224
Episode: 660 Total reward: 16.0 Training loss: 0.1112 Explore P: 0.0222
Episode: 662 Total reward: 10.0 Training loss: 0.2301 Explore P: 0.0219
Episode: 663 Total reward: 153.0 Training loss: 0.3646 Explore P: 0.0217
Episode: 664 Total reward: 135.0 Training loss: 0.2004 Explore P: 0.0216
Episode: 665 Total reward: 128.0 Training loss: 0.1092 Explore P: 0.0214
Episode: 666 Total reward: 156.0 Training loss: 0.2850 Explore P: 0.0212
Episode: 667 Total reward: 141.0 Training loss: 0.2363 Explore P: 0.0211
Episode: 668 Total reward: 150.0 Training loss: 0.3727 Explore P: 0.0209
Episode: 669 Total reward: 20.0 Training loss: 0.2531 Explore P: 0.0209
Episode: 670 Total reward: 109.0 Training loss: 194.9313 Explore P: 0.0208
Episode: 671 Total reward: 120.0 Training loss: 0.2359 Explore P: 0.0207
Episode: 672 Total reward: 23.0 Training loss: 0.3019 Explore P: 0.0206
Episode: 673 Total reward: 117.0 Training loss: 0.2973 Explore P: 0.0205
Episode: 674 Total reward: 108.0 Training loss: 0.3829 Explore P: 0.0204
Episode: 675 Total reward: 20.0 Training loss: 0.2249 Explore P: 0.0204
Episode: 676 Total reward: 114.0 Training loss: 0.3772 Explore P: 0.0203
Episode: 677 Total reward: 22.0 Training loss: 0.2850 Explore P: 0.0202
Episode: 678 Total reward: 17.0 Training loss: 0.2365 Explore P: 0.0202
Episode: 679 Total reward: 114.0 Training loss: 0.2449 Explore P: 0.0201
Episode: 680 Total reward: 129.0 Training loss: 0.2844 Explore P: 0.0200
Episode: 681 Total reward: 94.0 Training loss: 0.3728 Explore P: 0.0199
Episode: 682 Total reward: 105.0 Training loss: 0.1877 Explore P: 0.0198
Episode: 683 Total reward: 29.0 Training loss: 239.2706 Explore P: 0.0197
Episode: 684 Total reward: 102.0 Training loss: 0.3954 Explore P: 0.0196
Episode: 685 Total reward: 23.0 Training loss: 0.1637 Explore P: 0.0196
Episode: 686 Total reward: 88.0 Training loss: 0.3507 Explore P: 0.0195
Episode: 687 Total reward: 21.0 Training loss: 0.3079 Explore P: 0.0195
Episode: 688 Total reward: 105.0 Training loss: 0.3115 Explore P: 0.0194
Episode: 689 Total reward: 33.0 Training loss: 0.1647 Explore P: 0.0194
Episode: 690 Total reward: 117.0 Training loss: 0.1787 Explore P: 0.0193
Episode: 691 Total reward: 112.0 Training loss: 0.2234 Explore P: 0.0192
Episode: 692 Total reward: 122.0 Training loss: 0.0658 Explore P: 0.0191
Episode: 693 Total reward: 111.0 Training loss: 0.2356 Explore P: 0.0190
Episode: 694 Total reward: 111.0 Training loss: 95.4112 Explore P: 0.0189
Episode: 695 Total reward: 80.0 Training loss: 0.2037 Explore P: 0.0188
Episode: 696 Total reward: 144.0 Training loss: 0.3725 Explore P: 0.0187
Episode: 697 Total reward: 148.0 Training loss: 0.2163 Explore P: 0.0185
Episode: 698 Total reward: 89.0 Training loss: 0.4122 Explore P: 0.0185
Episode: 699 Total reward: 121.0 Training loss: 0.2926 Explore P: 0.0184
Episode: 700 Total reward: 107.0 Training loss: 0.3590 Explore P: 0.0183
Episode: 701 Total reward: 103.0 Training loss: 0.1444 Explore P: 0.0182
Episode: 702 Total reward: 120.0 Training loss: 0.3014 Explore P: 0.0181
Episode: 703 Total reward: 97.0 Training loss: 0.1769 Explore P: 0.0180
Episode: 704 Total reward: 110.0 Training loss: 0.3429 Explore P: 0.0179
Episode: 705 Total reward: 120.0 Training loss: 0.3564 Explore P: 0.0178
Episode: 706 Total reward: 182.0 Training loss: 0.3298 Explore P: 0.0177
Episode: 708 Total reward: 23.0 Training loss: 0.6545 Explore P: 0.0175
Episode: 709 Total reward: 173.0 Training loss: 0.3012 Explore P: 0.0174
Episode: 710 Total reward: 72.0 Training loss: 32.6551 Explore P: 0.0173
Episode: 711 Total reward: 59.0 Training loss: 0.2249 Explore P: 0.0173
Episode: 714 Total reward: 99.0 Training loss: 0.3136 Explore P: 0.0169
Episode: 715 Total reward: 115.0 Training loss: 0.1990 Explore P: 0.0169
Episode: 717 Total reward: 19.0 Training loss: 0.2099 Explore P: 0.0167
Episode: 719 Total reward: 77.0 Training loss: 0.3851 Explore P: 0.0165
Episode: 722 Total reward: 99.0 Training loss: 0.2740 Explore P: 0.0162
Episode: 724 Total reward: 171.0 Training loss: 0.2369 Explore P: 0.0160
Episode: 726 Total reward: 70.0 Training loss: 0.1993 Explore P: 0.0158
Episode: 728 Total reward: 107.0 Training loss: 0.2841 Explore P: 0.0157
Episode: 731 Total reward: 88.0 Training loss: 0.2364 Explore P: 0.0154
Episode: 732 Total reward: 125.0 Training loss: 44.3792 Explore P: 0.0153
Episode: 735 Total reward: 99.0 Training loss: 37.4648 Explore P: 0.0151
Episode: 738 Total reward: 29.0 Training loss: 0.2645 Explore P: 0.0148
Episode: 739 Total reward: 154.0 Training loss: 0.3200 Explore P: 0.0148
Episode: 740 Total reward: 89.0 Training loss: 0.4510 Explore P: 0.0147
Episode: 743 Total reward: 99.0 Training loss: 0.2747 Explore P: 0.0145
Episode: 745 Total reward: 176.0 Training loss: 0.2086 Explore P: 0.0143
Episode: 747 Total reward: 181.0 Training loss: 0.2194 Explore P: 0.0142
Episode: 748 Total reward: 95.0 Training loss: 0.3165 Explore P: 0.0141
Episode: 749 Total reward: 130.0 Training loss: 0.2426 Explore P: 0.0141
Episode: 750 Total reward: 98.0 Training loss: 0.2175 Explore P: 0.0140
Episode: 753 Total reward: 5.0 Training loss: 0.1845 Explore P: 0.0139
Episode: 756 Total reward: 90.0 Training loss: 0.2009 Explore P: 0.0137
Episode: 759 Total reward: 99.0 Training loss: 0.1291 Explore P: 0.0135
Episode: 761 Total reward: 173.0 Training loss: 31.9495 Explore P: 0.0134
Episode: 763 Total reward: 176.0 Training loss: 0.1235 Explore P: 0.0133
Episode: 765 Total reward: 1.0 Training loss: 0.1320 Explore P: 0.0132
Episode: 766 Total reward: 123.0 Training loss: 28.5099 Explore P: 0.0132
Episode: 767 Total reward: 175.0 Training loss: 0.1042 Explore P: 0.0131
Episode: 768 Total reward: 80.0 Training loss: 0.1515 Explore P: 0.0131
Episode: 769 Total reward: 165.0 Training loss: 0.1613 Explore P: 0.0130
Episode: 770 Total reward: 159.0 Training loss: 0.2139 Explore P: 0.0130
Episode: 771 Total reward: 87.0 Training loss: 0.0934 Explore P: 0.0130
Episode: 772 Total reward: 62.0 Training loss: 0.1585 Explore P: 0.0129
Episode: 773 Total reward: 71.0 Training loss: 0.4366 Explore P: 0.0129
Episode: 774 Total reward: 61.0 Training loss: 0.2759 Explore P: 0.0129
Episode: 775 Total reward: 27.0 Training loss: 0.2224 Explore P: 0.0129
Episode: 776 Total reward: 59.0 Training loss: 0.2889 Explore P: 0.0129
Episode: 777 Total reward: 87.0 Training loss: 0.2248 Explore P: 0.0128
Episode: 778 Total reward: 71.0 Training loss: 0.2960 Explore P: 0.0128
Episode: 779 Total reward: 111.0 Training loss: 0.2196 Explore P: 0.0128
Episode: 780 Total reward: 92.0 Training loss: 0.1363 Explore P: 0.0128
Episode: 781 Total reward: 109.0 Training loss: 0.1535 Explore P: 0.0127
Episode: 782 Total reward: 108.0 Training loss: 0.1038 Explore P: 0.0127
Episode: 783 Total reward: 102.0 Training loss: 0.8576 Explore P: 0.0127
Episode: 784 Total reward: 98.0 Training loss: 0.1379 Explore P: 0.0127
Episode: 786 Total reward: 4.0 Training loss: 0.2088 Explore P: 0.0126
Episode: 788 Total reward: 103.0 Training loss: 0.1200 Explore P: 0.0125
Episode: 790 Total reward: 152.0 Training loss: 0.1336 Explore P: 0.0124
Episode: 793 Total reward: 99.0 Training loss: 0.1233 Explore P: 0.0123
Episode: 796 Total reward: 99.0 Training loss: 0.2121 Explore P: 0.0122
Episode: 797 Total reward: 160.0 Training loss: 0.2136 Explore P: 0.0122
Episode: 800 Total reward: 99.0 Training loss: 0.2207 Explore P: 0.0121
Episode: 802 Total reward: 62.0 Training loss: 0.1164 Explore P: 0.0120
Episode: 805 Total reward: 99.0 Training loss: 0.0791 Explore P: 0.0119
Episode: 808 Total reward: 99.0 Training loss: 0.1899 Explore P: 0.0118
Episode: 811 Total reward: 99.0 Training loss: 0.1971 Explore P: 0.0117
Episode: 814 Total reward: 54.0 Training loss: 0.1965 Explore P: 0.0117
Episode: 816 Total reward: 161.0 Training loss: 2.6983 Explore P: 0.0116
Episode: 819 Total reward: 99.0 Training loss: 0.0919 Explore P: 0.0115
Episode: 822 Total reward: 99.0 Training loss: 0.1213 Explore P: 0.0114
Episode: 825 Total reward: 39.0 Training loss: 0.2334 Explore P: 0.0114
Episode: 827 Total reward: 194.0 Training loss: 0.1064 Explore P: 0.0113
Episode: 829 Total reward: 141.0 Training loss: 0.2325 Explore P: 0.0113
Episode: 831 Total reward: 137.0 Training loss: 0.3233 Explore P: 0.0112
Episode: 833 Total reward: 131.0 Training loss: 0.1438 Explore P: 0.0112
Episode: 836 Total reward: 99.0 Training loss: 0.0831 Explore P: 0.0111
Episode: 839 Total reward: 99.0 Training loss: 0.0977 Explore P: 0.0111
Episode: 842 Total reward: 99.0 Training loss: 0.0716 Explore P: 0.0110
Episode: 845 Total reward: 99.0 Training loss: 0.0761 Explore P: 0.0110
Episode: 848 Total reward: 99.0 Training loss: 0.2727 Explore P: 0.0109
Episode: 851 Total reward: 99.0 Training loss: 0.1141 Explore P: 0.0109
Episode: 854 Total reward: 99.0 Training loss: 0.0524 Explore P: 0.0108
Episode: 857 Total reward: 99.0 Training loss: 0.2143 Explore P: 0.0108
Episode: 860 Total reward: 99.0 Training loss: 0.0634 Explore P: 0.0108
Episode: 863 Total reward: 99.0 Training loss: 0.0815 Explore P: 0.0107
Episode: 866 Total reward: 99.0 Training loss: 0.0371 Explore P: 0.0107
Episode: 869 Total reward: 99.0 Training loss: 0.1138 Explore P: 0.0107
Episode: 872 Total reward: 99.0 Training loss: 0.1220 Explore P: 0.0106
Episode: 875 Total reward: 99.0 Training loss: 0.0836 Explore P: 0.0106
Episode: 878 Total reward: 99.0 Training loss: 0.0765 Explore P: 0.0106
Episode: 881 Total reward: 99.0 Training loss: 0.1283 Explore P: 0.0105
Episode: 884 Total reward: 99.0 Training loss: 0.0442 Explore P: 0.0105
Episode: 887 Total reward: 99.0 Training loss: 0.0542 Explore P: 0.0105
Episode: 890 Total reward: 99.0 Training loss: 0.1142 Explore P: 0.0105
Episode: 893 Total reward: 99.0 Training loss: 0.1113 Explore P: 0.0104
Episode: 896 Total reward: 99.0 Training loss: 0.0647 Explore P: 0.0104
Episode: 899 Total reward: 99.0 Training loss: 0.0557 Explore P: 0.0104
Episode: 902 Total reward: 99.0 Training loss: 0.1452 Explore P: 0.0104
Episode: 905 Total reward: 99.0 Training loss: 0.0683 Explore P: 0.0104
Episode: 908 Total reward: 99.0 Training loss: 0.1025 Explore P: 0.0103
Episode: 911 Total reward: 81.0 Training loss: 0.0442 Explore P: 0.0103
Episode: 914 Total reward: 99.0 Training loss: 0.0459 Explore P: 0.0103
Episode: 917 Total reward: 99.0 Training loss: 0.1007 Explore P: 0.0103
Episode: 920 Total reward: 99.0 Training loss: 0.0952 Explore P: 0.0103
Episode: 923 Total reward: 99.0 Training loss: 0.0556 Explore P: 0.0103
Episode: 926 Total reward: 99.0 Training loss: 0.0998 Explore P: 0.0103
Episode: 929 Total reward: 99.0 Training loss: 0.0980 Explore P: 0.0102
Episode: 932 Total reward: 99.0 Training loss: 0.0960 Explore P: 0.0102
Episode: 935 Total reward: 99.0 Training loss: 0.0682 Explore P: 0.0102
Episode: 938 Total reward: 99.0 Training loss: 0.0566 Explore P: 0.0102
Episode: 941 Total reward: 99.0 Training loss: 0.1344 Explore P: 0.0102
Episode: 944 Total reward: 99.0 Training loss: 0.1308 Explore P: 0.0102
Episode: 947 Total reward: 99.0 Training loss: 0.1866 Explore P: 0.0102
Episode: 950 Total reward: 99.0 Training loss: 154.5262 Explore P: 0.0102
Episode: 953 Total reward: 99.0 Training loss: 0.1494 Explore P: 0.0102
Episode: 956 Total reward: 99.0 Training loss: 0.1616 Explore P: 0.0102
Episode: 959 Total reward: 99.0 Training loss: 0.1093 Explore P: 0.0101
Episode: 962 Total reward: 99.0 Training loss: 0.1271 Explore P: 0.0101
Episode: 965 Total reward: 99.0 Training loss: 0.0980 Explore P: 0.0101
Episode: 968 Total reward: 99.0 Training loss: 0.0972 Explore P: 0.0101
Episode: 969 Total reward: 26.0 Training loss: 0.1646 Explore P: 0.0101
Episode: 972 Total reward: 99.0 Training loss: 0.1114 Explore P: 0.0101
Episode: 975 Total reward: 99.0 Training loss: 0.0857 Explore P: 0.0101
Episode: 978 Total reward: 99.0 Training loss: 0.1527 Explore P: 0.0101
Episode: 981 Total reward: 99.0 Training loss: 0.0602 Explore P: 0.0101
Episode: 984 Total reward: 99.0 Training loss: 0.1246 Explore P: 0.0101
Episode: 987 Total reward: 99.0 Training loss: 182.0812 Explore P: 0.0101
Episode: 990 Total reward: 99.0 Training loss: 0.1364 Explore P: 0.0101
Episode: 993 Total reward: 99.0 Training loss: 0.1976 Explore P: 0.0101
Episode: 996 Total reward: 99.0 Training loss: 0.0921 Explore P: 0.0101
Episode: 999 Total reward: 99.0 Training loss: 0.1152 Explore P: 0.0101
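
The Explore P column in these logs is consistent with an exponentially decaying exploration probability that falls from about 1.0 toward a floor near 0.01 as training steps accumulate. The sketch below is one way such a schedule could be written. The function name and the constants `explore_start`, `explore_stop` and `decay_rate` are assumptions chosen for illustration, not values confirmed by the notebook, and the step count is assumed to equal the cumulative reward, since Cart-Pole gives a reward of 1.0 per step.

```python
import math

def explore_probability(step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    """Exponentially decay the exploration probability with the global step count.

    All three constants are assumed values for this sketch (hypothetical defaults).
    """
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)

# Cumulative step counts after episodes that last 13, 13 and 9 steps
for total_steps in (13, 26, 35):
    print('Steps: {} Explore P: {:.4f}'.format(total_steps, explore_probability(total_steps)))
```

Under these assumed constants, the three step counts give 0.9987, 0.9974 and 0.9965, which agrees with the first three episodes logged for a run that starts from scratch. Because the decay depends on steps rather than episodes, long episodes late in training move Explore P very little once it is close to the floor.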
In [ ]:
Episode: 1 Total reward: 13.0 Training loss: 1.0202 Explore P: 0.9987
Episode: 2 Total reward: 13.0 Training loss: 1.0752 Explore P: 0.9974
Episode: 3 Total reward: 9.0 Training loss: 1.0600 Explore P: 0.9965
Episode: 4 Total reward: 17.0 Training loss: 1.0429 Explore P: 0.9949
Episode: 5 Total reward: 16.0 Training loss: 1.0519 Explore P: 0.9933
Episode: 6 Total reward: 15.0 Training loss: 1.0574 Explore P: 0.9918
Episode: 7 Total reward: 12.0 Training loss: 1.0889 Explore P: 0.9906
Episode: 8 Total reward: 27.0 Training loss: 1.0859 Explore P: 0.9880
Episode: 9 Total reward: 24.0 Training loss: 1.2007 Explore P: 0.9857
Episode: 10 Total reward: 17.0 Training loss: 1.1116 Explore P: 0.9840
Episode: 11 Total reward: 12.0 Training loss: 1.0739 Explore P: 0.9828
Episode: 12 Total reward: 25.0 Training loss: 1.0805 Explore P: 0.9804
Episode: 13 Total reward: 23.0 Training loss: 1.0628 Explore P: 0.9782
Episode: 14 Total reward: 31.0 Training loss: 1.0248 Explore P: 0.9752
Episode: 15 Total reward: 15.0 Training loss: 0.9859 Explore P: 0.9737
Episode: 16 Total reward: 12.0 Training loss: 1.0983 Explore P: 0.9726
Episode: 17 Total reward: 16.0 Training loss: 1.4343 Explore P: 0.9710
Episode: 18 Total reward: 21.0 Training loss: 1.2696 Explore P: 0.9690
Episode: 19 Total reward: 15.0 Training loss: 1.3542 Explore P: 0.9676
Episode: 20 Total reward: 15.0 Training loss: 1.2635 Explore P: 0.9661
Episode: 21 Total reward: 16.0 Training loss: 1.3648 Explore P: 0.9646
Episode: 22 Total reward: 43.0 Training loss: 1.6088 Explore P: 0.9605
Episode: 23 Total reward: 7.0 Training loss: 1.5027 Explore P: 0.9599
Episode: 24 Total reward: 13.0 Training loss: 1.7275 Explore P: 0.9586
Episode: 25 Total reward: 18.0 Training loss: 1.3902 Explore P: 0.9569
Episode: 26 Total reward: 27.0 Training loss: 2.5874 Explore P: 0.9544
Episode: 27 Total reward: 32.0 Training loss: 1.5907 Explore P: 0.9513
Episode: 28 Total reward: 17.0 Training loss: 2.1144 Explore P: 0.9497
Episode: 29 Total reward: 34.0 Training loss: 1.7340 Explore P: 0.9466
Episode: 30 Total reward: 18.0 Training loss: 2.5100 Explore P: 0.9449
Episode: 31 Total reward: 15.0 Training loss: 2.0166 Explore P: 0.9435
Episode: 32 Total reward: 11.0 Training loss: 1.8675 Explore P: 0.9424
Episode: 33 Total reward: 18.0 Training loss: 4.0481 Explore P: 0.9408
Episode: 34 Total reward: 10.0 Training loss: 4.0895 Explore P: 0.9398
Episode: 35 Total reward: 15.0 Training loss: 2.1252 Explore P: 0.9384
Episode: 36 Total reward: 14.0 Training loss: 4.7765 Explore P: 0.9371
Episode: 37 Total reward: 16.0 Training loss: 3.3848 Explore P: 0.9357
Episode: 38 Total reward: 21.0 Training loss: 3.9125 Explore P: 0.9337
Episode: 39 Total reward: 16.0 Training loss: 2.6183 Explore P: 0.9322
Episode: 40 Total reward: 20.0 Training loss: 5.4929 Explore P: 0.9304
Episode: 41 Total reward: 18.0 Training loss: 3.6606 Explore P: 0.9287
Episode: 42 Total reward: 17.0 Training loss: 4.5812 Explore P: 0.9272
Episode: 43 Total reward: 10.0 Training loss: 3.7633 Explore P: 0.9263
Episode: 44 Total reward: 8.0 Training loss: 4.6176 Explore P: 0.9255
Episode: 45 Total reward: 39.0 Training loss: 4.2732 Explore P: 0.9220
Episode: 46 Total reward: 18.0 Training loss: 4.0041 Explore P: 0.9203
Episode: 47 Total reward: 11.0 Training loss: 4.4035 Explore P: 0.9193
Episode: 48 Total reward: 25.0 Training loss: 5.4287 Explore P: 0.9171
Episode: 49 Total reward: 19.0 Training loss: 9.6972 Explore P: 0.9153
Episode: 50 Total reward: 11.0 Training loss: 16.3460 Explore P: 0.9143
Episode: 51 Total reward: 11.0 Training loss: 13.4854 Explore P: 0.9133
Episode: 52 Total reward: 12.0 Training loss: 12.8016 Explore P: 0.9123
Episode: 53 Total reward: 13.0 Training loss: 5.8589 Explore P: 0.9111
Episode: 54 Total reward: 12.0 Training loss: 8.5924 Explore P: 0.9100
Episode: 55 Total reward: 19.0 Training loss: 8.6204 Explore P: 0.9083
Episode: 56 Total reward: 36.0 Training loss: 14.2701 Explore P: 0.9051
Episode: 57 Total reward: 9.0 Training loss: 4.5481 Explore P: 0.9043
Episode: 58 Total reward: 22.0 Training loss: 12.9695 Explore P: 0.9023
Episode: 59 Total reward: 36.0 Training loss: 11.2639 Explore P: 0.8991
Episode: 60 Total reward: 16.0 Training loss: 7.7648 Explore P: 0.8977
Episode: 61 Total reward: 31.0 Training loss: 4.6997 Explore P: 0.8949
Episode: 62 Total reward: 13.0 Training loss: 5.9755 Explore P: 0.8938
Episode: 63 Total reward: 10.0 Training loss: 39.1040 Explore P: 0.8929
Episode: 64 Total reward: 14.0 Training loss: 23.2767 Explore P: 0.8917
Episode: 65 Total reward: 12.0 Training loss: 9.3477 Explore P: 0.8906
Episode: 66 Total reward: 20.0 Training loss: 6.4336 Explore P: 0.8888
Episode: 67 Total reward: 29.0 Training loss: 17.1522 Explore P: 0.8863
Episode: 68 Total reward: 13.0 Training loss: 39.3250 Explore P: 0.8852
Episode: 69 Total reward: 20.0 Training loss: 6.2099 Explore P: 0.8834
Episode: 70 Total reward: 15.0 Training loss: 20.9229 Explore P: 0.8821
Episode: 71 Total reward: 27.0 Training loss: 24.7817 Explore P: 0.8797
Episode: 72 Total reward: 12.0 Training loss: 20.7842 Explore P: 0.8787
Episode: 73 Total reward: 15.0 Training loss: 12.3202 Explore P: 0.8774
Episode: 74 Total reward: 31.0 Training loss: 9.2270 Explore P: 0.8747
Episode: 75 Total reward: 13.0 Training loss: 19.8264 Explore P: 0.8736
Episode: 76 Total reward: 20.0 Training loss: 72.9411 Explore P: 0.8719
Episode: 77 Total reward: 27.0 Training loss: 5.2214 Explore P: 0.8695
Episode: 78 Total reward: 14.0 Training loss: 39.3913 Explore P: 0.8683
Episode: 79 Total reward: 16.0 Training loss: 7.9491 Explore P: 0.8670
Episode: 80 Total reward: 18.0 Training loss: 10.8364 Explore P: 0.8654
Episode: 81 Total reward: 16.0 Training loss: 22.2031 Explore P: 0.8641
Episode: 82 Total reward: 21.0 Training loss: 23.6590 Explore P: 0.8623
Episode: 83 Total reward: 13.0 Training loss: 8.4819 Explore P: 0.8612
Episode: 84 Total reward: 10.0 Training loss: 13.3548 Explore P: 0.8603
Episode: 85 Total reward: 13.0 Training loss: 18.0272 Explore P: 0.8592
Episode: 86 Total reward: 24.0 Training loss: 42.1243 Explore P: 0.8572
Episode: 87 Total reward: 9.0 Training loss: 30.8526 Explore P: 0.8564
Episode: 88 Total reward: 22.0 Training loss: 36.6084 Explore P: 0.8546
Episode: 89 Total reward: 7.0 Training loss: 10.5430 Explore P: 0.8540
Episode: 90 Total reward: 12.0 Training loss: 25.5808 Explore P: 0.8529
Episode: 91 Total reward: 17.0 Training loss: 47.3073 Explore P: 0.8515
Episode: 92 Total reward: 21.0 Training loss: 7.9998 Explore P: 0.8498
Episode: 93 Total reward: 15.0 Training loss: 66.6464 Explore P: 0.8485
Episode: 94 Total reward: 17.0 Training loss: 95.6354 Explore P: 0.8471
Episode: 95 Total reward: 23.0 Training loss: 57.4714 Explore P: 0.8451
Episode: 96 Total reward: 11.0 Training loss: 40.7717 Explore P: 0.8442
Episode: 97 Total reward: 13.0 Training loss: 43.3380 Explore P: 0.8431
Episode: 98 Total reward: 9.0 Training loss: 10.8368 Explore P: 0.8424
Episode: 99 Total reward: 21.0 Training loss: 57.7325 Explore P: 0.8406
Episode: 100 Total reward: 11.0 Training loss: 9.7291 Explore P: 0.8397
Episode: 101 Total reward: 10.0 Training loss: 10.4052 Explore P: 0.8389
Episode: 102 Total reward: 26.0 Training loss: 60.4829 Explore P: 0.8368
Episode: 103 Total reward: 34.0 Training loss: 9.0924 Explore P: 0.8339
Episode: 104 Total reward: 30.0 Training loss: 178.0664 Explore P: 0.8315
Episode: 105 Total reward: 14.0 Training loss: 9.0423 Explore P: 0.8303
Episode: 106 Total reward: 18.0 Training loss: 126.9380 Explore P: 0.8289
Episode: 107 Total reward: 11.0 Training loss: 58.9921 Explore P: 0.8280
Episode: 108 Total reward: 20.0 Training loss: 9.1945 Explore P: 0.8263
Episode: 109 Total reward: 9.0 Training loss: 9.2887 Explore P: 0.8256
Episode: 110 Total reward: 29.0 Training loss: 20.7970 Explore P: 0.8232
Episode: 111 Total reward: 17.0 Training loss: 144.6258 Explore P: 0.8218
Episode: 112 Total reward: 15.0 Training loss: 82.4089 Explore P: 0.8206
Episode: 113 Total reward: 15.0 Training loss: 39.9963 Explore P: 0.8194
Episode: 114 Total reward: 8.0 Training loss: 9.8394 Explore P: 0.8188
Episode: 115 Total reward: 29.0 Training loss: 76.9930 Explore P: 0.8164
Episode: 116 Total reward: 21.0 Training loss: 25.0172 Explore P: 0.8147
Episode: 117 Total reward: 24.0 Training loss: 143.5481 Explore P: 0.8128
Episode: 118 Total reward: 35.0 Training loss: 86.5429 Explore P: 0.8100
Episode: 119 Total reward: 28.0 Training loss: 8.4315 Explore P: 0.8078
Episode: 120 Total reward: 13.0 Training loss: 25.7062 Explore P: 0.8067
Episode: 121 Total reward: 9.0 Training loss: 6.5005 Explore P: 0.8060
Episode: 122 Total reward: 32.0 Training loss: 90.7984 Explore P: 0.8035
Episode: 123 Total reward: 21.0 Training loss: 130.2779 Explore P: 0.8018
Episode: 124 Total reward: 15.0 Training loss: 167.6294 Explore P: 0.8006
Episode: 125 Total reward: 15.0 Training loss: 74.7611 Explore P: 0.7994
Episode: 126 Total reward: 20.0 Training loss: 119.3178 Explore P: 0.7978
Episode: 127 Total reward: 18.0 Training loss: 196.5175 Explore P: 0.7964
Episode: 128 Total reward: 8.0 Training loss: 45.2131 Explore P: 0.7958
Episode: 129 Total reward: 24.0 Training loss: 86.0374 Explore P: 0.7939
Episode: 130 Total reward: 11.0 Training loss: 7.8129 Explore P: 0.7931
Episode: 131 Total reward: 11.0 Training loss: 76.8442 Explore P: 0.7922
Episode: 132 Total reward: 28.0 Training loss: 196.6863 Explore P: 0.7900
Episode: 133 Total reward: 9.0 Training loss: 45.7586 Explore P: 0.7893
Episode: 134 Total reward: 21.0 Training loss: 5.8484 Explore P: 0.7877
Episode: 135 Total reward: 10.0 Training loss: 7.3919 Explore P: 0.7869
Episode: 136 Total reward: 17.0 Training loss: 12.3142 Explore P: 0.7856
Episode: 137 Total reward: 16.0 Training loss: 75.7170 Explore P: 0.7843
Episode: 138 Total reward: 12.0 Training loss: 145.3568 Explore P: 0.7834
Episode: 139 Total reward: 27.0 Training loss: 121.1114 Explore P: 0.7813
Episode: 140 Total reward: 26.0 Training loss: 7.3243 Explore P: 0.7793
Episode: 141 Total reward: 40.0 Training loss: 10.6523 Explore P: 0.7762
Episode: 142 Total reward: 14.0 Training loss: 6.9482 Explore P: 0.7752
Episode: 143 Total reward: 24.0 Training loss: 137.7784 Explore P: 0.7733
Episode: 144 Total reward: 12.0 Training loss: 98.8381 Explore P: 0.7724
Episode: 145 Total reward: 8.0 Training loss: 14.1739 Explore P: 0.7718
Episode: 146 Total reward: 51.0 Training loss: 69.1545 Explore P: 0.7679
Episode: 147 Total reward: 29.0 Training loss: 249.9989 Explore P: 0.7657
Episode: 148 Total reward: 9.0 Training loss: 140.8663 Explore P: 0.7651
Episode: 149 Total reward: 13.0 Training loss: 141.5930 Explore P: 0.7641
Episode: 150 Total reward: 19.0 Training loss: 12.6228 Explore P: 0.7627
Episode: 151 Total reward: 19.0 Training loss: 136.3315 Explore P: 0.7612
Episode: 152 Total reward: 10.0 Training loss: 110.3699 Explore P: 0.7605
Episode: 153 Total reward: 18.0 Training loss: 8.3900 Explore P: 0.7591
Episode: 154 Total reward: 18.0 Training loss: 96.3717 Explore P: 0.7578
Episode: 155 Total reward: 7.0 Training loss: 6.0889 Explore P: 0.7573
Episode: 156 Total reward: 15.0 Training loss: 126.7419 Explore P: 0.7561
Episode: 157 Total reward: 15.0 Training loss: 67.2544 Explore P: 0.7550
Episode: 158 Total reward: 26.0 Training loss: 12.2839 Explore P: 0.7531
Episode: 159 Total reward: 20.0 Training loss: 5.8118 Explore P: 0.7516
Episode: 160 Total reward: 10.0 Training loss: 96.4570 Explore P: 0.7509
Episode: 161 Total reward: 23.0 Training loss: 7.6207 Explore P: 0.7492
Episode: 162 Total reward: 18.0 Training loss: 66.5249 Explore P: 0.7478
Episode: 163 Total reward: 18.0 Training loss: 111.3273 Explore P: 0.7465
Episode: 164 Total reward: 21.0 Training loss: 11.5292 Explore P: 0.7450
Episode: 165 Total reward: 17.0 Training loss: 6.3130 Explore P: 0.7437
Episode: 166 Total reward: 22.0 Training loss: 153.8167 Explore P: 0.7421
Episode: 167 Total reward: 17.0 Training loss: 7.0915 Explore P: 0.7408
Episode: 168 Total reward: 34.0 Training loss: 228.3831 Explore P: 0.7384
Episode: 169 Total reward: 13.0 Training loss: 8.5996 Explore P: 0.7374
Episode: 170 Total reward: 13.0 Training loss: 90.8898 Explore P: 0.7365
Episode: 171 Total reward: 20.0 Training loss: 4.8179 Explore P: 0.7350
Episode: 172 Total reward: 9.0 Training loss: 6.2508 Explore P: 0.7344
Episode: 173 Total reward: 14.0 Training loss: 5.2401 Explore P: 0.7334
Episode: 174 Total reward: 12.0 Training loss: 3.9268 Explore P: 0.7325
Episode: 175 Total reward: 12.0 Training loss: 5.6376 Explore P: 0.7316
Episode: 176 Total reward: 11.0 Training loss: 44.5308 Explore P: 0.7308
Episode: 177 Total reward: 12.0 Training loss: 4.9717 Explore P: 0.7300
Episode: 178 Total reward: 9.0 Training loss: 181.0085 Explore P: 0.7293
Episode: 179 Total reward: 11.0 Training loss: 73.1134 Explore P: 0.7285
Episode: 180 Total reward: 13.0 Training loss: 87.3085 Explore P: 0.7276
Episode: 181 Total reward: 12.0 Training loss: 121.6627 Explore P: 0.7267
Episode: 182 Total reward: 12.0 Training loss: 58.1967 Explore P: 0.7259
Episode: 183 Total reward: 13.0 Training loss: 85.1540 Explore P: 0.7249
Episode: 184 Total reward: 16.0 Training loss: 5.1214 Explore P: 0.7238
Episode: 185 Total reward: 19.0 Training loss: 69.1839 Explore P: 0.7224
Episode: 186 Total reward: 7.0 Training loss: 63.2256 Explore P: 0.7219
Episode: 187 Total reward: 17.0 Training loss: 73.2788 Explore P: 0.7207
Episode: 188 Total reward: 15.0 Training loss: 78.6213 Explore P: 0.7197
Episode: 189 Total reward: 11.0 Training loss: 88.5211 Explore P: 0.7189
Episode: 190 Total reward: 14.0 Training loss: 60.1332 Explore P: 0.7179
Episode: 191 Total reward: 15.0 Training loss: 135.7724 Explore P: 0.7168
Episode: 192 Total reward: 15.0 Training loss: 156.9691 Explore P: 0.7158
Episode: 193 Total reward: 17.0 Training loss: 93.3756 Explore P: 0.7146
Episode: 194 Total reward: 12.0 Training loss: 3.0462 Explore P: 0.7137
Episode: 195 Total reward: 9.0 Training loss: 119.2650 Explore P: 0.7131
Episode: 196 Total reward: 13.0 Training loss: 66.6383 Explore P: 0.7122
Episode: 197 Total reward: 9.0 Training loss: 113.7849 Explore P: 0.7116
Episode: 198 Total reward: 13.0 Training loss: 54.6072 Explore P: 0.7106
Episode: 199 Total reward: 19.0 Training loss: 54.8980 Explore P: 0.7093
Episode: 200 Total reward: 20.0 Training loss: 155.5480 Explore P: 0.7079
Episode: 201 Total reward: 10.0 Training loss: 45.8685 Explore P: 0.7072
Episode: 202 Total reward: 14.0 Training loss: 53.5145 Explore P: 0.7062
Episode: 203 Total reward: 8.0 Training loss: 107.9623 Explore P: 0.7057
Episode: 204 Total reward: 21.0 Training loss: 40.2749 Explore P: 0.7042
Episode: 205 Total reward: 32.0 Training loss: 43.2627 Explore P: 0.7020
Episode: 206 Total reward: 9.0 Training loss: 55.5398 Explore P: 0.7014
Episode: 207 Total reward: 15.0 Training loss: 1.9959 Explore P: 0.7004
Episode: 208 Total reward: 12.0 Training loss: 105.3751 Explore P: 0.6995
Episode: 209 Total reward: 11.0 Training loss: 40.8319 Explore P: 0.6988
Episode: 210 Total reward: 10.0 Training loss: 89.7147 Explore P: 0.6981
Episode: 211 Total reward: 10.0 Training loss: 1.1946 Explore P: 0.6974
Episode: 212 Total reward: 10.0 Training loss: 80.6916 Explore P: 0.6967
Episode: 213 Total reward: 24.0 Training loss: 88.0977 Explore P: 0.6951
Episode: 214 Total reward: 14.0 Training loss: 46.4105 Explore P: 0.6941
Episode: 215 Total reward: 11.0 Training loss: 40.3726 Explore P: 0.6933
Episode: 216 Total reward: 11.0 Training loss: 3.0770 Explore P: 0.6926
Episode: 217 Total reward: 8.0 Training loss: 1.7495 Explore P: 0.6921
Episode: 218 Total reward: 16.0 Training loss: 1.5615 Explore P: 0.6910
Episode: 219 Total reward: 17.0 Training loss: 2.0250 Explore P: 0.6898
Episode: 220 Total reward: 18.0 Training loss: 37.8432 Explore P: 0.6886
Episode: 221 Total reward: 17.0 Training loss: 1.9049 Explore P: 0.6874
Episode: 222 Total reward: 16.0 Training loss: 1.9652 Explore P: 0.6863
Episode: 223 Total reward: 16.0 Training loss: 1.4384 Explore P: 0.6853
Episode: 224 Total reward: 27.0 Training loss: 66.0615 Explore P: 0.6834
Episode: 225 Total reward: 9.0 Training loss: 0.8478 Explore P: 0.6828
Episode: 226 Total reward: 14.0 Training loss: 1.0319 Explore P: 0.6819
Episode: 227 Total reward: 17.0 Training loss: 97.6957 Explore P: 0.6808
Episode: 228 Total reward: 8.0 Training loss: 68.0521 Explore P: 0.6802
Episode: 229 Total reward: 9.0 Training loss: 110.8437 Explore P: 0.6796
Episode: 230 Total reward: 19.0 Training loss: 1.6856 Explore P: 0.6783
Episode: 231 Total reward: 9.0 Training loss: 2.0634 Explore P: 0.6777
Episode: 232 Total reward: 11.0 Training loss: 32.0714 Explore P: 0.6770
Episode: 233 Total reward: 9.0 Training loss: 2.0387 Explore P: 0.6764
Episode: 234 Total reward: 13.0 Training loss: 66.9349 Explore P: 0.6755
Episode: 235 Total reward: 14.0 Training loss: 110.6725 Explore P: 0.6746
Episode: 236 Total reward: 18.0 Training loss: 1.0585 Explore P: 0.6734
Episode: 237 Total reward: 11.0 Training loss: 117.0101 Explore P: 0.6727
Episode: 238 Total reward: 7.0 Training loss: 2.6115 Explore P: 0.6722
Episode: 239 Total reward: 10.0 Training loss: 124.7320 Explore P: 0.6716
Episode: 240 Total reward: 18.0 Training loss: 2.5475 Explore P: 0.6704
Episode: 241 Total reward: 37.0 Training loss: 2.1454 Explore P: 0.6679
Episode: 242 Total reward: 11.0 Training loss: 23.6042 Explore P: 0.6672
Episode: 243 Total reward: 32.0 Training loss: 1.4344 Explore P: 0.6651
Episode: 244 Total reward: 9.0 Training loss: 1.5328 Explore P: 0.6645
Episode: 245 Total reward: 14.0 Training loss: 84.7870 Explore P: 0.6636
Episode: 246 Total reward: 12.0 Training loss: 2.7292 Explore P: 0.6628
Episode: 247 Total reward: 26.0 Training loss: 40.6692 Explore P: 0.6611
Episode: 248 Total reward: 12.0 Training loss: 22.0901 Explore P: 0.6603
Episode: 249 Total reward: 15.0 Training loss: 37.9304 Explore P: 0.6594
Episode: 250 Total reward: 20.0 Training loss: 1.4137 Explore P: 0.6581
Episode: 251 Total reward: 16.0 Training loss: 1.7831 Explore P: 0.6570
Episode: 252 Total reward: 9.0 Training loss: 38.0640 Explore P: 0.6565
Episode: 253 Total reward: 17.0 Training loss: 21.7703 Explore P: 0.6554
Episode: 254 Total reward: 24.0 Training loss: 40.3204 Explore P: 0.6538
Episode: 255 Total reward: 30.0 Training loss: 43.4179 Explore P: 0.6519
Episode: 256 Total reward: 11.0 Training loss: 60.9330 Explore P: 0.6512
Episode: 257 Total reward: 14.0 Training loss: 66.6886 Explore P: 0.6503
Episode: 258 Total reward: 15.0 Training loss: 2.5639 Explore P: 0.6493
Episode: 259 Total reward: 19.0 Training loss: 2.6969 Explore P: 0.6481
Episode: 260 Total reward: 10.0 Training loss: 2.6837 Explore P: 0.6475
Episode: 261 Total reward: 30.0 Training loss: 20.6603 Explore P: 0.6456
Episode: 262 Total reward: 17.0 Training loss: 32.1585 Explore P: 0.6445
Episode: 263 Total reward: 15.0 Training loss: 1.0833 Explore P: 0.6435
Episode: 264 Total reward: 13.0 Training loss: 81.4551 Explore P: 0.6427
Episode: 265 Total reward: 17.0 Training loss: 3.3823 Explore P: 0.6416
Episode: 266 Total reward: 11.0 Training loss: 36.3942 Explore P: 0.6409
Episode: 267 Total reward: 11.0 Training loss: 1.6628 Explore P: 0.6402
Episode: 268 Total reward: 33.0 Training loss: 26.9925 Explore P: 0.6382
Episode: 269 Total reward: 18.0 Training loss: 45.8608 Explore P: 0.6370
Episode: 270 Total reward: 20.0 Training loss: 2.7911 Explore P: 0.6358
Episode: 271 Total reward: 10.0 Training loss: 35.9215 Explore P: 0.6352
Episode: 272 Total reward: 14.0 Training loss: 2.5923 Explore P: 0.6343
Episode: 273 Total reward: 16.0 Training loss: 41.2339 Explore P: 0.6333
Episode: 274 Total reward: 18.0 Training loss: 46.7318 Explore P: 0.6322
Episode: 275 Total reward: 14.0 Training loss: 2.7245 Explore P: 0.6313
Episode: 276 Total reward: 8.0 Training loss: 16.2681 Explore P: 0.6308
Episode: 277 Total reward: 10.0 Training loss: 21.6856 Explore P: 0.6302
Episode: 278 Total reward: 12.0 Training loss: 1.7879 Explore P: 0.6294
Episode: 279 Total reward: 10.0 Training loss: 97.1567 Explore P: 0.6288
Episode: 280 Total reward: 16.0 Training loss: 3.4710 Explore P: 0.6278
Episode: 281 Total reward: 14.0 Training loss: 65.8457 Explore P: 0.6270
Episode: 282 Total reward: 21.0 Training loss: 32.4442 Explore P: 0.6257
Episode: 283 Total reward: 17.0 Training loss: 48.0136 Explore P: 0.6246
Episode: 284 Total reward: 11.0 Training loss: 2.8833 Explore P: 0.6239
Episode: 285 Total reward: 16.0 Training loss: 92.6062 Explore P: 0.6230
Episode: 286 Total reward: 16.0 Training loss: 19.1051 Explore P: 0.6220
Episode: 287 Total reward: 7.0 Training loss: 1.8220 Explore P: 0.6216
Episode: 288 Total reward: 16.0 Training loss: 41.3844 Explore P: 0.6206
Episode: 289 Total reward: 18.0 Training loss: 50.0580 Explore P: 0.6195
Episode: 290 Total reward: 13.0 Training loss: 83.2142 Explore P: 0.6187
Episode: 291 Total reward: 14.0 Training loss: 70.2605 Explore P: 0.6178
Episode: 292 Total reward: 16.0 Training loss: 53.9664 Explore P: 0.6169
Episode: 293 Total reward: 17.0 Training loss: 3.2764 Explore P: 0.6158
Episode: 294 Total reward: 18.0 Training loss: 17.7963 Explore P: 0.6147
Episode: 295 Total reward: 17.0 Training loss: 32.3772 Explore P: 0.6137
Episode: 296 Total reward: 32.0 Training loss: 18.3755 Explore P: 0.6118
Episode: 297 Total reward: 29.0 Training loss: 17.1377 Explore P: 0.6100
Episode: 298 Total reward: 12.0 Training loss: 14.2922 Explore P: 0.6093
Episode: 299 Total reward: 14.0 Training loss: 29.2226 Explore P: 0.6085
Episode: 300 Total reward: 17.0 Training loss: 38.9089 Explore P: 0.6075
Episode: 301 Total reward: 9.0 Training loss: 62.2483 Explore P: 0.6069
Episode: 302 Total reward: 22.0 Training loss: 2.3240 Explore P: 0.6056
Episode: 303 Total reward: 16.0 Training loss: 0.9979 Explore P: 0.6047
Episode: 304 Total reward: 8.0 Training loss: 67.9231 Explore P: 0.6042
Episode: 305 Total reward: 13.0 Training loss: 33.0928 Explore P: 0.6034
Episode: 306 Total reward: 20.0 Training loss: 1.3173 Explore P: 0.6022
Episode: 307 Total reward: 23.0 Training loss: 50.2106 Explore P: 0.6009
Episode: 308 Total reward: 17.0 Training loss: 52.5245 Explore P: 0.5999
Episode: 309 Total reward: 20.0 Training loss: 32.5832 Explore P: 0.5987
Episode: 310 Total reward: 19.0 Training loss: 29.0224 Explore P: 0.5976
Episode: 311 Total reward: 19.0 Training loss: 29.8863 Explore P: 0.5965
Episode: 312 Total reward: 27.0 Training loss: 34.4016 Explore P: 0.5949
Episode: 313 Total reward: 9.0 Training loss: 1.1433 Explore P: 0.5944
Episode: 314 Total reward: 20.0 Training loss: 28.8137 Explore P: 0.5932
Episode: 315 Total reward: 24.0 Training loss: 48.5379 Explore P: 0.5918
Episode: 316 Total reward: 28.0 Training loss: 45.2671 Explore P: 0.5902
Episode: 317 Total reward: 13.0 Training loss: 45.9822 Explore P: 0.5894
Episode: 318 Total reward: 12.0 Training loss: 86.3972 Explore P: 0.5887
Episode: 319 Total reward: 10.0 Training loss: 11.2909 Explore P: 0.5881
Episode: 320 Total reward: 11.0 Training loss: 36.5474 Explore P: 0.5875
Episode: 321 Total reward: 13.0 Training loss: 1.1439 Explore P: 0.5867
Episode: 322 Total reward: 8.0 Training loss: 12.6978 Explore P: 0.5863
Episode: 323 Total reward: 20.0 Training loss: 31.7664 Explore P: 0.5851
Episode: 324 Total reward: 8.0 Training loss: 29.4243 Explore P: 0.5847
Episode: 325 Total reward: 13.0 Training loss: 12.2373 Explore P: 0.5839
Episode: 326 Total reward: 19.0 Training loss: 24.2228 Explore P: 0.5828
Episode: 327 Total reward: 68.0 Training loss: 0.7256 Explore P: 0.5790
Episode: 328 Total reward: 11.0 Training loss: 1.2313 Explore P: 0.5783
Episode: 329 Total reward: 15.0 Training loss: 1.3319 Explore P: 0.5775
Episode: 330 Total reward: 53.0 Training loss: 9.9350 Explore P: 0.5745
Episode: 331 Total reward: 74.0 Training loss: 1.4366 Explore P: 0.5703
Episode: 332 Total reward: 16.0 Training loss: 11.2724 Explore P: 0.5694
Episode: 333 Total reward: 34.0 Training loss: 10.6128 Explore P: 0.5675
Episode: 334 Total reward: 27.0 Training loss: 14.9559 Explore P: 0.5660
Episode: 335 Total reward: 31.0 Training loss: 16.6541 Explore P: 0.5643
Episode: 336 Total reward: 49.0 Training loss: 23.3966 Explore P: 0.5616
Episode: 337 Total reward: 40.0 Training loss: 45.3419 Explore P: 0.5594
Episode: 338 Total reward: 71.0 Training loss: 0.8244 Explore P: 0.5555
Episode: 339 Total reward: 56.0 Training loss: 41.4562 Explore P: 0.5525
Episode: 340 Total reward: 18.0 Training loss: 1.2548 Explore P: 0.5515
Episode: 341 Total reward: 56.0 Training loss: 1.5400 Explore P: 0.5485
Episode: 342 Total reward: 34.0 Training loss: 12.0206 Explore P: 0.5466
Episode: 343 Total reward: 67.0 Training loss: 1.4189 Explore P: 0.5430
Episode: 344 Total reward: 27.0 Training loss: 1.3138 Explore P: 0.5416
Episode: 345 Total reward: 42.0 Training loss: 1.1650 Explore P: 0.5394
Episode: 346 Total reward: 23.0 Training loss: 23.1743 Explore P: 0.5382
Episode: 347 Total reward: 54.0 Training loss: 0.6971 Explore P: 0.5353
Episode: 348 Total reward: 34.0 Training loss: 27.2789 Explore P: 0.5335
Episode: 349 Total reward: 25.0 Training loss: 37.4133 Explore P: 0.5322
Episode: 350 Total reward: 20.0 Training loss: 1.6443 Explore P: 0.5312
Episode: 351 Total reward: 26.0 Training loss: 12.6839 Explore P: 0.5298
Episode: 352 Total reward: 40.0 Training loss: 13.3593 Explore P: 0.5278
Episode: 353 Total reward: 18.0 Training loss: 1.7079 Explore P: 0.5268
Episode: 354 Total reward: 47.0 Training loss: 32.5788 Explore P: 0.5244
Episode: 355 Total reward: 20.0 Training loss: 1.6101 Explore P: 0.5234
Episode: 356 Total reward: 53.0 Training loss: 2.5321 Explore P: 0.5207
Episode: 357 Total reward: 15.0 Training loss: 1.6396 Explore P: 0.5199
Episode: 358 Total reward: 76.0 Training loss: 20.8058 Explore P: 0.5160
Episode: 359 Total reward: 12.0 Training loss: 13.0315 Explore P: 0.5154
Episode: 360 Total reward: 42.0 Training loss: 10.1313 Explore P: 0.5133
Episode: 361 Total reward: 53.0 Training loss: 25.4319 Explore P: 0.5106
Episode: 362 Total reward: 33.0 Training loss: 26.4256 Explore P: 0.5090
Episode: 363 Total reward: 85.0 Training loss: 20.2429 Explore P: 0.5048
Episode: 364 Total reward: 23.0 Training loss: 16.1083 Explore P: 0.5036
Episode: 365 Total reward: 30.0 Training loss: 1.6888 Explore P: 0.5022
Episode: 366 Total reward: 66.0 Training loss: 2.0408 Explore P: 0.4989
Episode: 367 Total reward: 37.0 Training loss: 18.6438 Explore P: 0.4971
Episode: 368 Total reward: 50.0 Training loss: 20.1544 Explore P: 0.4947
Episode: 369 Total reward: 78.0 Training loss: 23.8497 Explore P: 0.4909
Episode: 370 Total reward: 83.0 Training loss: 20.6897 Explore P: 0.4869
Episode: 371 Total reward: 44.0 Training loss: 25.4317 Explore P: 0.4849
Episode: 372 Total reward: 44.0 Training loss: 1.5212 Explore P: 0.4828
Episode: 373 Total reward: 14.0 Training loss: 1.5019 Explore P: 0.4821
Episode: 374 Total reward: 31.0 Training loss: 1.8348 Explore P: 0.4806
Episode: 375 Total reward: 25.0 Training loss: 19.7533 Explore P: 0.4795
Episode: 376 Total reward: 51.0 Training loss: 1.5433 Explore P: 0.4771
Episode: 377 Total reward: 23.0 Training loss: 12.9174 Explore P: 0.4760
Episode: 378 Total reward: 67.0 Training loss: 27.2318 Explore P: 0.4729
Episode: 379 Total reward: 26.0 Training loss: 1.9319 Explore P: 0.4717
Episode: 380 Total reward: 35.0 Training loss: 43.2445 Explore P: 0.4701
Episode: 381 Total reward: 33.0 Training loss: 1.5195 Explore P: 0.4686
Episode: 382 Total reward: 30.0 Training loss: 15.4622 Explore P: 0.4672
Episode: 383 Total reward: 12.0 Training loss: 1.8349 Explore P: 0.4666
Episode: 384 Total reward: 25.0 Training loss: 47.7600 Explore P: 0.4655
Episode: 385 Total reward: 36.0 Training loss: 29.6753 Explore P: 0.4639
Episode: 386 Total reward: 50.0 Training loss: 1.1244 Explore P: 0.4616
Episode: 387 Total reward: 35.0 Training loss: 1.0955 Explore P: 0.4600
Episode: 388 Total reward: 52.0 Training loss: 24.9624 Explore P: 0.4577
Episode: 389 Total reward: 52.0 Training loss: 28.2028 Explore P: 0.4554
Episode: 390 Total reward: 132.0 Training loss: 30.5190 Explore P: 0.4495
Episode: 391 Total reward: 25.0 Training loss: 10.3908 Explore P: 0.4484
Episode: 392 Total reward: 56.0 Training loss: 14.1483 Explore P: 0.4460
Episode: 393 Total reward: 110.0 Training loss: 2.0169 Explore P: 0.4412
Episode: 394 Total reward: 78.0 Training loss: 1.2122 Explore P: 0.4379
Episode: 395 Total reward: 44.0 Training loss: 56.4728 Explore P: 0.4360
Episode: 396 Total reward: 90.0 Training loss: 65.6667 Explore P: 0.4322
Episode: 397 Total reward: 36.0 Training loss: 0.9032 Explore P: 0.4307
Episode: 398 Total reward: 40.0 Training loss: 0.8414 Explore P: 0.4290
Episode: 399 Total reward: 109.0 Training loss: 42.5467 Explore P: 0.4244
Episode: 400 Total reward: 37.0 Training loss: 2.6053 Explore P: 0.4229
Episode: 401 Total reward: 62.0 Training loss: 1.2301 Explore P: 0.4203
Episode: 402 Total reward: 42.0 Training loss: 1.1384 Explore P: 0.4186
Episode: 403 Total reward: 71.0 Training loss: 1.6765 Explore P: 0.4157
Episode: 404 Total reward: 88.0 Training loss: 2.4000 Explore P: 0.4122
Episode: 405 Total reward: 55.0 Training loss: 24.7748 Explore P: 0.4100
Episode: 406 Total reward: 33.0 Training loss: 13.5934 Explore P: 0.4087
Episode: 407 Total reward: 44.0 Training loss: 14.6865 Explore P: 0.4069
Episode: 408 Total reward: 40.0 Training loss: 2.0898 Explore P: 0.4053
Episode: 409 Total reward: 98.0 Training loss: 2.1043 Explore P: 0.4015
Episode: 410 Total reward: 63.0 Training loss: 11.3562 Explore P: 0.3990
Episode: 411 Total reward: 50.0 Training loss: 14.1151 Explore P: 0.3971
Episode: 412 Total reward: 44.0 Training loss: 1.2370 Explore P: 0.3954
Episode: 413 Total reward: 56.0 Training loss: 2.1136 Explore P: 0.3932
Episode: 414 Total reward: 61.0 Training loss: 2.2578 Explore P: 0.3909
Episode: 415 Total reward: 49.0 Training loss: 1.3966 Explore P: 0.3890
Episode: 416 Total reward: 55.0 Training loss: 10.2836 Explore P: 0.3869
Episode: 417 Total reward: 121.0 Training loss: 2.2477 Explore P: 0.3824
Episode: 418 Total reward: 46.0 Training loss: 2.3118 Explore P: 0.3807
Episode: 419 Total reward: 70.0 Training loss: 27.3952 Explore P: 0.3781
Episode: 420 Total reward: 72.0 Training loss: 45.7570 Explore P: 0.3755
Episode: 421 Total reward: 41.0 Training loss: 59.1887 Explore P: 0.3740
Episode: 422 Total reward: 67.0 Training loss: 27.0998 Explore P: 0.3716
Episode: 423 Total reward: 46.0 Training loss: 43.1971 Explore P: 0.3699
Episode: 424 Total reward: 52.0 Training loss: 2.0718 Explore P: 0.3680
Episode: 425 Total reward: 92.0 Training loss: 96.7074 Explore P: 0.3647
Episode: 426 Total reward: 60.0 Training loss: 2.0684 Explore P: 0.3626
Episode: 427 Total reward: 106.0 Training loss: 54.1831 Explore P: 0.3589
Episode: 428 Total reward: 76.0 Training loss: 1.9612 Explore P: 0.3563
Episode: 429 Total reward: 42.0 Training loss: 1.6153 Explore P: 0.3548
Episode: 430 Total reward: 77.0 Training loss: 3.9801 Explore P: 0.3522
Episode: 431 Total reward: 123.0 Training loss: 2.0505 Explore P: 0.3480
Episode: 432 Total reward: 150.0 Training loss: 19.4217 Explore P: 0.3430
Episode: 433 Total reward: 57.0 Training loss: 1.5850 Explore P: 0.3411
Episode: 434 Total reward: 74.0 Training loss: 2.4292 Explore P: 0.3386
Episode: 435 Total reward: 97.0 Training loss: 23.5709 Explore P: 0.3354
Episode: 436 Total reward: 99.0 Training loss: 2.0727 Explore P: 0.3322
Episode: 437 Total reward: 101.0 Training loss: 22.3250 Explore P: 0.3290
Episode: 438 Total reward: 46.0 Training loss: 2.0320 Explore P: 0.3275
Episode: 439 Total reward: 51.0 Training loss: 4.8099 Explore P: 0.3259
Episode: 440 Total reward: 111.0 Training loss: 68.3524 Explore P: 0.3224
Episode: 441 Total reward: 167.0 Training loss: 2.3045 Explore P: 0.3173
Episode: 442 Total reward: 80.0 Training loss: 0.8798 Explore P: 0.3148
Episode: 443 Total reward: 170.0 Training loss: 48.6270 Explore P: 0.3097
Episode: 444 Total reward: 77.0 Training loss: 2.2555 Explore P: 0.3074
Episode: 445 Total reward: 84.0 Training loss: 3.0428 Explore P: 0.3049
Episode: 447 Total reward: 12.0 Training loss: 2.8022 Explore P: 0.2987
Episode: 448 Total reward: 66.0 Training loss: 120.3442 Explore P: 0.2968
Episode: 449 Total reward: 152.0 Training loss: 6.2880 Explore P: 0.2925
Episode: 450 Total reward: 141.0 Training loss: 2.8015 Explore P: 0.2885
Episode: 451 Total reward: 99.0 Training loss: 64.0921 Explore P: 0.2858
Episode: 452 Total reward: 79.0 Training loss: 1.5581 Explore P: 0.2836
Episode: 453 Total reward: 49.0 Training loss: 113.2557 Explore P: 0.2823
Episode: 455 Total reward: 106.0 Training loss: 210.4934 Explore P: 0.2741
Episode: 456 Total reward: 109.0 Training loss: 81.6662 Explore P: 0.2712
Episode: 457 Total reward: 56.0 Training loss: 3.2287 Explore P: 0.2697
Episode: 458 Total reward: 138.0 Training loss: 2.5795 Explore P: 0.2662
Episode: 459 Total reward: 93.0 Training loss: 3.4260 Explore P: 0.2638
Episode: 460 Total reward: 71.0 Training loss: 139.3341 Explore P: 0.2620
Episode: 461 Total reward: 106.0 Training loss: 2.6074 Explore P: 0.2594
Episode: 462 Total reward: 63.0 Training loss: 2.8252 Explore P: 0.2578
Episode: 463 Total reward: 71.0 Training loss: 25.8917 Explore P: 0.2560
Episode: 464 Total reward: 79.0 Training loss: 3.8067 Explore P: 0.2541
Episode: 465 Total reward: 86.0 Training loss: 1.6050 Explore P: 0.2520
Episode: 466 Total reward: 88.0 Training loss: 44.2827 Explore P: 0.2499
Episode: 467 Total reward: 72.0 Training loss: 0.7160 Explore P: 0.2482
Episode: 468 Total reward: 152.0 Training loss: 75.7239 Explore P: 0.2446
Episode: 469 Total reward: 122.0 Training loss: 7.4345 Explore P: 0.2417
Episode: 470 Total reward: 81.0 Training loss: 101.0922 Explore P: 0.2399
Episode: 471 Total reward: 38.0 Training loss: 1.6301 Explore P: 0.2390
Episode: 472 Total reward: 79.0 Training loss: 72.4920 Explore P: 0.2372
Episode: 473 Total reward: 190.0 Training loss: 1.3869 Explore P: 0.2329
Episode: 474 Total reward: 197.0 Training loss: 1.5386 Explore P: 0.2286
Episode: 476 Total reward: 42.0 Training loss: 0.8364 Explore P: 0.2233
Episode: 477 Total reward: 134.0 Training loss: 88.3979 Explore P: 0.2205
Episode: 478 Total reward: 128.0 Training loss: 94.1007 Explore P: 0.2178
Episode: 479 Total reward: 79.0 Training loss: 103.1366 Explore P: 0.2162
Episode: 480 Total reward: 169.0 Training loss: 58.8788 Explore P: 0.2127
Episode: 481 Total reward: 160.0 Training loss: 1.1934 Explore P: 0.2095
Episode: 482 Total reward: 81.0 Training loss: 2.9244 Explore P: 0.2079
Episode: 484 Total reward: 68.0 Training loss: 77.9688 Explore P: 0.2027
Episode: 486 Total reward: 17.0 Training loss: 0.6864 Explore P: 0.1985
Episode: 488 Total reward: 62.0 Training loss: 1.9978 Explore P: 0.1937
Episode: 489 Total reward: 178.0 Training loss: 228.6335 Explore P: 0.1904
Episode: 490 Total reward: 114.0 Training loss: 0.4453 Explore P: 0.1884
Episode: 491 Total reward: 127.0 Training loss: 1.6523 Explore P: 0.1861
Episode: 492 Total reward: 124.0 Training loss: 120.2207 Explore P: 0.1840
Episode: 493 Total reward: 184.0 Training loss: 0.5913 Explore P: 0.1808
Episode: 494 Total reward: 129.0 Training loss: 56.3829 Explore P: 0.1786
Episode: 495 Total reward: 95.0 Training loss: 1.9883 Explore P: 0.1770
Episode: 496 Total reward: 129.0 Training loss: 1.2513 Explore P: 0.1749
Episode: 497 Total reward: 176.0 Training loss: 1.0322 Explore P: 0.1720
Episode: 498 Total reward: 132.0 Training loss: 0.9320 Explore P: 0.1699
Episode: 499 Total reward: 146.0 Training loss: 289.4379 Explore P: 0.1675
Episode: 500 Total reward: 147.0 Training loss: 0.5124 Explore P: 0.1652
Episode: 501 Total reward: 166.0 Training loss: 0.9444 Explore P: 0.1627
Episode: 503 Total reward: 31.0 Training loss: 1.4756 Explore P: 0.1592
Episode: 504 Total reward: 129.0 Training loss: 0.6077 Explore P: 0.1573
Episode: 505 Total reward: 127.0 Training loss: 1.2508 Explore P: 0.1554
Episode: 506 Total reward: 123.0 Training loss: 0.8265 Explore P: 0.1537
Episode: 507 Total reward: 159.0 Training loss: 260.4604 Explore P: 0.1514
Episode: 508 Total reward: 136.0 Training loss: 0.9311 Explore P: 0.1495
Episode: 509 Total reward: 198.0 Training loss: 0.9262 Explore P: 0.1467
Episode: 511 Total reward: 40.0 Training loss: 2.3126 Explore P: 0.1435
Episode: 512 Total reward: 130.0 Training loss: 1.2985 Explore P: 0.1418
Episode: 514 Total reward: 25.0 Training loss: 1.1655 Explore P: 0.1388
Episode: 515 Total reward: 149.0 Training loss: 214.9246 Explore P: 0.1369
Episode: 516 Total reward: 200.0 Training loss: 0.8085 Explore P: 0.1344
Episode: 517 Total reward: 172.0 Training loss: 24.9451 Explore P: 0.1323
Episode: 519 Total reward: 52.0 Training loss: 61.7503 Explore P: 0.1293
Episode: 520 Total reward: 176.0 Training loss: 0.4361 Explore P: 0.1272
Episode: 521 Total reward: 160.0 Training loss: 1.8377 Explore P: 0.1253
Episode: 522 Total reward: 180.0 Training loss: 1.5684 Explore P: 0.1233
Episode: 523 Total reward: 174.0 Training loss: 58.6258 Explore P: 0.1213
Episode: 525 Total reward: 10.0 Training loss: 222.1836 Explore P: 0.1190
Episode: 527 Total reward: 32.0 Training loss: 0.4007 Explore P: 0.1165
Episode: 529 Total reward: 33.0 Training loss: 0.3054 Explore P: 0.1140
Episode: 530 Total reward: 185.0 Training loss: 0.7425 Explore P: 0.1121
Episode: 531 Total reward: 171.0 Training loss: 0.3441 Explore P: 0.1104
Episode: 532 Total reward: 199.0 Training loss: 0.2333 Explore P: 0.1084
Episode: 534 Total reward: 95.0 Training loss: 0.4929 Explore P: 0.1056
Episode: 536 Total reward: 8.0 Training loss: 0.5416 Explore P: 0.1036
Episode: 538 Total reward: 42.0 Training loss: 163.3946 Explore P: 0.1014
Episode: 539 Total reward: 180.0 Training loss: 0.2803 Explore P: 0.0997
Episode: 540 Total reward: 193.0 Training loss: 0.4929 Explore P: 0.0980
Episode: 542 Total reward: 36.0 Training loss: 0.7983 Explore P: 0.0960
Episode: 544 Total reward: 152.0 Training loss: 41.4165 Explore P: 0.0930
Episode: 546 Total reward: 30.0 Training loss: 0.7570 Explore P: 0.0911
Episode: 548 Total reward: 50.0 Training loss: 0.3215 Explore P: 0.0891
Episode: 550 Total reward: 79.0 Training loss: 0.5349 Explore P: 0.0869
Episode: 552 Total reward: 38.0 Training loss: 0.3635 Explore P: 0.0851
Episode: 553 Total reward: 131.0 Training loss: 0.3965 Explore P: 0.0841
Episode: 554 Total reward: 135.0 Training loss: 0.2453 Explore P: 0.0831
Episode: 555 Total reward: 111.0 Training loss: 0.9434 Explore P: 0.0823
Episode: 556 Total reward: 136.0 Training loss: 0.7058 Explore P: 0.0814
Episode: 557 Total reward: 106.0 Training loss: 0.4755 Explore P: 0.0806
Episode: 558 Total reward: 98.0 Training loss: 0.4107 Explore P: 0.0799
Episode: 559 Total reward: 102.0 Training loss: 62.3874 Explore P: 0.0792
Episode: 560 Total reward: 92.0 Training loss: 0.4026 Explore P: 0.0786
Episode: 561 Total reward: 86.0 Training loss: 0.3649 Explore P: 0.0780
Episode: 562 Total reward: 83.0 Training loss: 0.5843 Explore P: 0.0774
Episode: 563 Total reward: 100.0 Training loss: 0.1493 Explore P: 0.0768
Episode: 564 Total reward: 102.0 Training loss: 0.4021 Explore P: 0.0761
Episode: 565 Total reward: 60.0 Training loss: 0.3445 Explore P: 0.0757
Episode: 566 Total reward: 63.0 Training loss: 0.2461 Explore P: 0.0753
Episode: 567 Total reward: 59.0 Training loss: 0.2115 Explore P: 0.0749
Episode: 568 Total reward: 73.0 Training loss: 3.2738 Explore P: 0.0744
Episode: 569 Total reward: 70.0 Training loss: 0.4267 Explore P: 0.0740
Episode: 570 Total reward: 53.0 Training loss: 0.8779 Explore P: 0.0736
Episode: 571 Total reward: 88.0 Training loss: 187.5536 Explore P: 0.0731
Episode: 572 Total reward: 54.0 Training loss: 0.3208 Explore P: 0.0727
Episode: 573 Total reward: 87.0 Training loss: 0.2894 Explore P: 0.0722
Episode: 574 Total reward: 58.0 Training loss: 0.2578 Explore P: 0.0718
Episode: 575 Total reward: 85.0 Training loss: 0.3401 Explore P: 0.0713
Episode: 576 Total reward: 73.0 Training loss: 0.2245 Explore P: 0.0709
Episode: 577 Total reward: 114.0 Training loss: 0.3640 Explore P: 0.0702
Episode: 578 Total reward: 94.0 Training loss: 0.7954 Explore P: 0.0696
Episode: 579 Total reward: 114.0 Training loss: 0.2615 Explore P: 0.0689
Episode: 580 Total reward: 80.0 Training loss: 0.4812 Explore P: 0.0685
Episode: 581 Total reward: 125.0 Training loss: 0.8818 Explore P: 0.0677
Episode: 582 Total reward: 109.0 Training loss: 0.2953 Explore P: 0.0671
Episode: 583 Total reward: 98.0 Training loss: 0.4371 Explore P: 0.0665
Episode: 584 Total reward: 119.0 Training loss: 0.4685 Explore P: 0.0659
Episode: 585 Total reward: 96.0 Training loss: 0.3440 Explore P: 0.0653
Episode: 586 Total reward: 172.0 Training loss: 0.1414 Explore P: 0.0644
Episode: 587 Total reward: 104.0 Training loss: 0.3309 Explore P: 0.0638
Episode: 588 Total reward: 85.0 Training loss: 0.2262 Explore P: 0.0634
Episode: 590 Total reward: 104.0 Training loss: 0.3231 Explore P: 0.0618
Episode: 591 Total reward: 148.0 Training loss: 0.4431 Explore P: 0.0610
Episode: 592 Total reward: 135.0 Training loss: 0.1894 Explore P: 0.0603
Episode: 595 Total reward: 99.0 Training loss: 0.2376 Explore P: 0.0579
Episode: 597 Total reward: 172.0 Training loss: 0.4083 Explore P: 0.0561
Episode: 600 Total reward: 99.0 Training loss: 0.1152 Explore P: 0.0539
Episode: 603 Total reward: 99.0 Training loss: 0.3594 Explore P: 0.0518
Episode: 604 Total reward: 149.0 Training loss: 0.1398 Explore P: 0.0511
Episode: 607 Total reward: 99.0 Training loss: 0.3337 Explore P: 0.0491
Episode: 608 Total reward: 180.0 Training loss: 7.4786 Explore P: 0.0484
Episode: 611 Total reward: 28.0 Training loss: 0.1953 Explore P: 0.0468
Episode: 614 Total reward: 38.0 Training loss: 0.3152 Explore P: 0.0453
Episode: 617 Total reward: 50.0 Training loss: 0.4420 Explore P: 0.0437
Episode: 620 Total reward: 9.0 Training loss: 0.3347 Explore P: 0.0423
Episode: 623 Total reward: 99.0 Training loss: 0.1405 Explore P: 0.0408
Episode: 626 Total reward: 78.0 Training loss: 0.1488 Explore P: 0.0393
Episode: 628 Total reward: 198.0 Training loss: 0.9185 Explore P: 0.0382
Episode: 631 Total reward: 99.0 Training loss: 0.3505 Explore P: 0.0368
Episode: 633 Total reward: 142.0 Training loss: 0.1654 Explore P: 0.0359
Episode: 635 Total reward: 134.0 Training loss: 0.3178 Explore P: 0.0351
Episode: 637 Total reward: 104.0 Training loss: 0.3331 Explore P: 0.0343
Episode: 639 Total reward: 73.0 Training loss: 66.8497 Explore P: 0.0337
Episode: 641 Total reward: 35.0 Training loss: 0.1411 Explore P: 0.0331
Episode: 643 Total reward: 83.0 Training loss: 0.2136 Explore P: 0.0325
Episode: 645 Total reward: 62.0 Training loss: 0.2303 Explore P: 0.0319
Episode: 647 Total reward: 28.0 Training loss: 0.2552 Explore P: 0.0314
Episode: 649 Total reward: 4.0 Training loss: 0.1967 Explore P: 0.0310
Episode: 650 Total reward: 194.0 Training loss: 0.1424 Explore P: 0.0306
Episode: 651 Total reward: 170.0 Training loss: 0.1509 Explore P: 0.0302
Episode: 652 Total reward: 150.0 Training loss: 0.2699 Explore P: 0.0299
Episode: 653 Total reward: 161.0 Training loss: 0.2821 Explore P: 0.0296
Episode: 654 Total reward: 148.0 Training loss: 0.4859 Explore P: 0.0293
Episode: 655 Total reward: 142.0 Training loss: 0.1541 Explore P: 0.0290
Episode: 656 Total reward: 140.0 Training loss: 0.0963 Explore P: 0.0288
Episode: 657 Total reward: 144.0 Training loss: 0.3165 Explore P: 0.0285
Episode: 658 Total reward: 160.0 Training loss: 0.2059 Explore P: 0.0282
Episode: 659 Total reward: 127.0 Training loss: 0.0918 Explore P: 0.0280
Episode: 660 Total reward: 124.0 Training loss: 431.4700 Explore P: 0.0278
Episode: 661 Total reward: 127.0 Training loss: 0.2660 Explore P: 0.0275
Episode: 662 Total reward: 133.0 Training loss: 0.4122 Explore P: 0.0273
Episode: 663 Total reward: 119.0 Training loss: 0.2070 Explore P: 0.0271
Episode: 664 Total reward: 114.0 Training loss: 0.3453 Explore P: 0.0269
Episode: 665 Total reward: 130.0 Training loss: 0.3865 Explore P: 0.0267
Episode: 666 Total reward: 125.0 Training loss: 0.2518 Explore P: 0.0265
Episode: 667 Total reward: 138.0 Training loss: 0.1668 Explore P: 0.0263
Episode: 669 Total reward: 42.0 Training loss: 0.3241 Explore P: 0.0259
Episode: 671 Total reward: 105.0 Training loss: 0.1787 Explore P: 0.0254
Episode: 674 Total reward: 99.0 Training loss: 0.2393 Explore P: 0.0246
Episode: 677 Total reward: 99.0 Training loss: 0.2190 Explore P: 0.0239
Episode: 680 Total reward: 99.0 Training loss: 2.5996 Explore P: 0.0232
Episode: 683 Total reward: 99.0 Training loss: 0.3376 Explore P: 0.0226
Episode: 686 Total reward: 99.0 Training loss: 0.5884 Explore P: 0.0220
Episode: 689 Total reward: 99.0 Training loss: 0.1356 Explore P: 0.0214
Episode: 692 Total reward: 99.0 Training loss: 0.0920 Explore P: 0.0208
Episode: 695 Total reward: 99.0 Training loss: 0.1880 Explore P: 0.0203
Episode: 698 Total reward: 99.0 Training loss: 0.0951 Explore P: 0.0198
Episode: 701 Total reward: 99.0 Training loss: 0.1050 Explore P: 0.0193
Episode: 704 Total reward: 99.0 Training loss: 0.1234 Explore P: 0.0189
Episode: 707 Total reward: 99.0 Training loss: 0.1407 Explore P: 0.0185
Episode: 709 Total reward: 160.0 Training loss: 0.0913 Explore P: 0.0182
Episode: 712 Total reward: 99.0 Training loss: 0.1815 Explore P: 0.0178
Episode: 715 Total reward: 99.0 Training loss: 0.1191 Explore P: 0.0174
Episode: 718 Total reward: 99.0 Training loss: 0.1073 Explore P: 0.0170
Episode: 721 Total reward: 99.0 Training loss: 0.1133 Explore P: 0.0167
Episode: 724 Total reward: 99.0 Training loss: 0.0898 Explore P: 0.0164
Episode: 727 Total reward: 99.0 Training loss: 0.1217 Explore P: 0.0160
Episode: 730 Total reward: 99.0 Training loss: 0.2150 Explore P: 0.0158
Episode: 733 Total reward: 99.0 Training loss: 0.0678 Explore P: 0.0155
Episode: 736 Total reward: 99.0 Training loss: 0.0872 Explore P: 0.0152
Episode: 739 Total reward: 99.0 Training loss: 0.1330 Explore P: 0.0150
Episode: 742 Total reward: 99.0 Training loss: 0.1116 Explore P: 0.0147
Episode: 745 Total reward: 99.0 Training loss: 0.1611 Explore P: 0.0145
Episode: 748 Total reward: 99.0 Training loss: 0.1307 Explore P: 0.0143
Episode: 751 Total reward: 99.0 Training loss: 0.0875 Explore P: 0.0141
Episode: 754 Total reward: 99.0 Training loss: 0.1344 Explore P: 0.0139
Episode: 757 Total reward: 99.0 Training loss: 0.0911 Explore P: 0.0137
Episode: 760 Total reward: 99.0 Training loss: 0.1224 Explore P: 0.0135
Episode: 763 Total reward: 99.0 Training loss: 0.0572 Explore P: 0.0133
Episode: 766 Total reward: 99.0 Training loss: 0.0757 Explore P: 0.0132
Episode: 769 Total reward: 99.0 Training loss: 0.0381 Explore P: 0.0130
Episode: 772 Total reward: 99.0 Training loss: 0.1698 Explore P: 0.0129
Episode: 775 Total reward: 99.0 Training loss: 0.0365 Explore P: 0.0127
Episode: 778 Total reward: 99.0 Training loss: 0.1805 Explore P: 0.0126
Episode: 781 Total reward: 99.0 Training loss: 0.1017 Explore P: 0.0125
Episode: 784 Total reward: 99.0 Training loss: 0.1112 Explore P: 0.0123
Episode: 787 Total reward: 99.0 Training loss: 0.0930 Explore P: 0.0122
Episode: 790 Total reward: 99.0 Training loss: 0.0693 Explore P: 0.0121
Episode: 793 Total reward: 99.0 Training loss: 0.0431 Explore P: 0.0120
Episode: 796 Total reward: 99.0 Training loss: 0.1168 Explore P: 0.0119
Episode: 799 Total reward: 99.0 Training loss: 0.1071 Explore P: 0.0118
Episode: 802 Total reward: 99.0 Training loss: 0.1360 Explore P: 0.0117
Episode: 805 Total reward: 99.0 Training loss: 0.0872 Explore P: 0.0117
Episode: 808 Total reward: 99.0 Training loss: 0.1197 Explore P: 0.0116
Episode: 811 Total reward: 99.0 Training loss: 0.0848 Explore P: 0.0115
Episode: 814 Total reward: 99.0 Training loss: 0.0515 Explore P: 0.0114
Episode: 817 Total reward: 99.0 Training loss: 0.1590 Explore P: 0.0114
Episode: 820 Total reward: 99.0 Training loss: 0.2080 Explore P: 0.0113
Episode: 823 Total reward: 99.0 Training loss: 0.1532 Explore P: 0.0112
Episode: 826 Total reward: 99.0 Training loss: 0.0622 Explore P: 0.0112
Episode: 829 Total reward: 99.0 Training loss: 0.0553 Explore P: 0.0111
Episode: 832 Total reward: 99.0 Training loss: 0.0746 Explore P: 0.0111
Episode: 835 Total reward: 99.0 Training loss: 0.1045 Explore P: 0.0110
Episode: 838 Total reward: 99.0 Training loss: 0.0929 Explore P: 0.0110
Episode: 841 Total reward: 99.0 Training loss: 0.1053 Explore P: 0.0109
Episode: 844 Total reward: 99.0 Training loss: 0.0877 Explore P: 0.0109
Episode: 847 Total reward: 99.0 Training loss: 0.0783 Explore P: 0.0108
Episode: 850 Total reward: 99.0 Training loss: 0.0724 Explore P: 0.0108
Episode: 853 Total reward: 99.0 Training loss: 0.1745 Explore P: 0.0107
Episode: 856 Total reward: 99.0 Training loss: 0.0334 Explore P: 0.0107
Episode: 859 Total reward: 99.0 Training loss: 0.0205 Explore P: 0.0107
Episode: 862 Total reward: 99.0 Training loss: 0.0674 Explore P: 0.0106
Episode: 865 Total reward: 99.0 Training loss: 0.1149 Explore P: 0.0106
Episode: 868 Total reward: 99.0 Training loss: 191.0773 Explore P: 0.0106
Episode: 871 Total reward: 99.0 Training loss: 0.1013 Explore P: 0.0106
Episode: 873 Total reward: 156.0 Training loss: 0.2140 Explore P: 0.0105
Episode: 874 Total reward: 185.0 Training loss: 0.1837 Explore P: 0.0105
Episode: 876 Total reward: 45.0 Training loss: 0.1855 Explore P: 0.0105
Episode: 877 Total reward: 55.0 Training loss: 0.0789 Explore P: 0.0105
Episode: 880 Total reward: 99.0 Training loss: 0.0890 Explore P: 0.0105
Episode: 882 Total reward: 185.0 Training loss: 0.0788 Explore P: 0.0105
Episode: 885 Total reward: 99.0 Training loss: 0.1397 Explore P: 0.0104
Episode: 888 Total reward: 99.0 Training loss: 0.0400 Explore P: 0.0104
Episode: 891 Total reward: 99.0 Training loss: 0.0962 Explore P: 0.0104
Episode: 894 Total reward: 99.0 Training loss: 0.1356 Explore P: 0.0104
Episode: 897 Total reward: 99.0 Training loss: 0.2037 Explore P: 0.0104
Episode: 900 Total reward: 99.0 Training loss: 0.0486 Explore P: 0.0103
Episode: 903 Total reward: 99.0 Training loss: 0.2492 Explore P: 0.0103
Episode: 906 Total reward: 99.0 Training loss: 0.1467 Explore P: 0.0103
Episode: 909 Total reward: 99.0 Training loss: 0.2217 Explore P: 0.0103
Episode: 912 Total reward: 99.0 Training loss: 0.1772 Explore P: 0.0103
Episode: 915 Total reward: 99.0 Training loss: 0.0898 Explore P: 0.0103
Episode: 918 Total reward: 99.0 Training loss: 0.0552 Explore P: 0.0103
Episode: 921 Total reward: 99.0 Training loss: 0.1267 Explore P: 0.0102
Episode: 924 Total reward: 99.0 Training loss: 0.3037 Explore P: 0.0102
Episode: 927 Total reward: 99.0 Training loss: 0.1654 Explore P: 0.0102
Episode: 930 Total reward: 99.0 Training loss: 0.1975 Explore P: 0.0102
Episode: 933 Total reward: 99.0 Training loss: 0.2122 Explore P: 0.0102
Episode: 936 Total reward: 99.0 Training loss: 0.0754 Explore P: 0.0102
Episode: 939 Total reward: 99.0 Training loss: 0.1481 Explore P: 0.0102
Episode: 942 Total reward: 99.0 Training loss: 0.0895 Explore P: 0.0102
Episode: 945 Total reward: 99.0 Training loss: 0.0690 Explore P: 0.0102
Episode: 948 Total reward: 99.0 Training loss: 0.0942 Explore P: 0.0102
Episode: 951 Total reward: 99.0 Training loss: 0.0567 Explore P: 0.0101
Episode: 954 Total reward: 99.0 Training loss: 0.0665 Explore P: 0.0101
Episode: 957 Total reward: 99.0 Training loss: 0.0645 Explore P: 0.0101
Episode: 960 Total reward: 99.0 Training loss: 224.4461 Explore P: 0.0101
Episode: 963 Total reward: 99.0 Training loss: 0.0508 Explore P: 0.0101
Episode: 966 Total reward: 99.0 Training loss: 0.0792 Explore P: 0.0101
Episode: 969 Total reward: 99.0 Training loss: 0.0754 Explore P: 0.0101
Episode: 972 Total reward: 99.0 Training loss: 0.0655 Explore P: 0.0101
Episode: 975 Total reward: 99.0 Training loss: 0.0686 Explore P: 0.0101
Episode: 978 Total reward: 99.0 Training loss: 0.0361 Explore P: 0.0101
Episode: 981 Total reward: 99.0 Training loss: 0.1777 Explore P: 0.0101
Episode: 984 Total reward: 99.0 Training loss: 0.0633 Explore P: 0.0101
Episode: 987 Total reward: 99.0 Training loss: 0.0559 Explore P: 0.0101
Episode: 990 Total reward: 99.0 Training loss: 0.0543 Explore P: 0.0101
Episode: 993 Total reward: 99.0 Training loss: 0.0833 Explore P: 0.0101
Episode: 996 Total reward: 99.0 Training loss: 0.1037 Explore P: 0.0101
Episode: 997 Total reward: 45.0 Training loss: 0.0619 Explore P: 0.0101
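
The Explore P column in the log above shrinks steadily and levels off just above 0.01. As a rough sketch of how a schedule like that can be produced, the snippet below anneals the exploration probability exponentially with the number of training steps. The function name `explore_probability` and the values `explore_start=1.0`, `explore_stop=0.01` and `decay_rate=0.0001` are assumptions chosen for illustration, not values confirmed by this section.

```python
import math

def explore_probability(step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    """Exponentially anneal the exploration probability toward explore_stop.

    All default values are illustrative assumptions.
    """
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)

# Print the exploration probability at a few training-step counts
for step in (0, 10_000, 50_000, 100_000):
    print(step, round(explore_probability(step), 4))
```

With a schedule of this shape the agent acts almost entirely at random early on and almost entirely greedily later, which would fit the pattern in the log: total rewards climb as Explore P falls.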

Visualizing training

Below we plot the total reward for each episode in grey, with a 10-episode running mean overlaid in blue.

In [10]:
%matplotlib inline
import matplotlib.pyplot as plt

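def running_mean(x, N):
    """Return the length-N moving average of x (len(x) - N + 1 points)."""
    # Prepend 0 so that cumsum[i] is the sum of the first i elements of x
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Each window sum is the difference of two cumulative sums N apart
    return (cumsum[N:] - cumsum[:-N]) / N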
In [11]:
eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
Out[11]:
Text(0,0.5,'Total Reward')

[Plot: total reward per episode (grey) with its 10-episode running mean; x-axis: Episode, y-axis: Total Reward]
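
To see what the `running_mean` helper does, here is a small worked example on a made-up reward sequence (the numbers are illustrative, not taken from the training run). It also compares the cumulative-sum approach against `np.convolve` as a sanity check.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum trick as in the cell above
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical per-episode rewards, for illustration only
rewards = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

smoothed = running_mean(rewards, 3)

# Reference: a plain moving average computed by convolution
reference = np.convolve(rewards, np.ones(3) / 3, mode='valid')

print(smoothed)
print(np.allclose(smoothed, reference))
```

The smoothed array is shorter than the input by `N - 1` points, which is why the plotting cell pairs it with `eps[-len(smoothed_rews):]`.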

Playing Atari Games

Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a small vector of state values like the one we're using here, though, you'd want to use convolutional layers to extract the state from the screen images.
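
Before screen images reach any convolutional layers, they are usually shrunk and stacked so the network can see motion. The sketch below shows one simple way to do that with NumPy only. The names `preprocess_frame` and `FrameStack`, the channel-averaging grayscale, the stride-2 downsampling and the choice of four frames are all assumptions made for illustration; this is not the exact preprocessing from any particular paper or library.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame):
    """Turn an RGB frame (H, W, 3) into a half-resolution grayscale array in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2)  # naive grayscale: average the channels
    small = gray[::2, ::2]                        # crude downsampling by striding
    return small / 255.0

class FrameStack:
    """Keep the most recent frames and expose them as one (H, W, num_frames) state."""

    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame
        processed = preprocess_frame(frame)
        for _ in range(self.num_frames):
            self.frames.append(processed)
        return self.state()

    def step(self, frame):
        # Append the newest frame; the deque drops the oldest automatically
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        return np.stack(self.frames, axis=-1)

# Random stand-ins for 210 x 160 RGB screen images
rng = np.random.default_rng(0)
first_frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
next_frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)

stack = FrameStack()
initial_state = stack.reset(first_frame)
next_state = stack.step(next_frame)
print(initial_state.shape, next_state.shape)
```

A stacked array like this would take the place of the Cart-Pole state as the network input, with convolutional layers in front of the fully connected ones.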

Deep Q-Learning Atari

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper (Mnih et al., "Human-level control through deep reinforcement learning", Nature, 2015), which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.