In this paper I detail my implementation of a reinforcement learning agent for the Lunar Lander environment provided by OpenAI Gym. The goal is to train an agent that achieves a 200-point rolling average reward by interacting with the environment. I will discuss the adjustments and hyperparameter tuning required to reproduce the results of my deep Q-learning model.

Reinforcement learning is well suited to the Lunar Lander environment because it involves learning in a simulation with a state-action space large enough that traditional tabular MDP methods are infeasible. The off-policy Q-learning algorithm allows us to solve the control problem. Off-policy means we update the Q-function toward a greedy policy while following a more exploratory behavior scheme, epsilon-greedy. There are many non-optimal routes to the landing pad, so it is beneficial for our agent to keep exploring even as a candidate optimal value function begins to emerge.
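For reference, the standard tabular Q-learning update that this approach generalizes is the temporal-difference rule

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] $$

where the max over the next state's actions reflects the greedy target policy, while the action actually taken in state s is chosen epsilon-greedily.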

The environment is defined by an eight-dimensional state space, six of whose variables are continuous, and a discrete action space with four actions. Due to the continuous state space, a basic tabular implementation of Q-learning is infeasible: the memory required to store every state-action pair in a Q-table is too high, and the agent cannot hope to visit each state. The solution is to use function approximation to estimate the optimal value function, q*(s,a), instead of a Q-table.
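A minimal sketch, assuming the LunarLander-v2 environment id from OpenAI Gym, confirms these dimensions:

```python
import gym

# Assumes the LunarLander-v2 environment id from OpenAI Gym.
env = gym.make("LunarLander-v2")

print(env.observation_space.shape)  # (8,)  eight-dimensional state
print(env.action_space.n)           # 4     discrete actions: noop, left engine, main engine, right engine
```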

A neural network allows us to approximate the value function. For backpropagation, we use stochastic gradient descent to update the weights of the network, a computationally feasible variant of full-batch gradient descent. I use the built-in Adam optimizer to handle the computation of the error gradient in my implementation, with mean squared error as the loss function. Once we have the optimal value function, it follows that the optimal policy is to take the argmax of q(s, a) over the action space for a given state.
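A minimal sketch of such a Q-network is shown below. The hidden-layer sizes (64, 64) and the Keras API are my assumptions here; the paper only specifies the Adam optimizer and the MSE loss.

```python
import tensorflow as tf

def build_q_network(state_dim=8, n_actions=4, learning_rate=1e-3):
    """Sketch of a Q-network: maps a state to one Q-value per action."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),  # hidden sizes are assumptions
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),  # one output per action
    ])
    # Adam optimizer and MSE loss, as described in the text.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse")
    return model
```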

Our agent employs an epsilon-greedy exploration strategy. This means it takes the optimal learned (greedy) action with probability 1 - epsilon and a random action otherwise. In other words, we explore novel actions a fraction epsilon of the time in order to address the explore-vs-exploit tradeoff.
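A sketch of this selection step, assuming a q_network like the one above and NumPy for the random draws, might look as follows:

```python
import numpy as np

def select_action(q_network, state, epsilon, n_actions=4):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)                          # explore
    q_values = q_network.predict(state[np.newaxis], verbose=0)       # shape (1, n_actions)
    return int(np.argmax(q_values[0]))                               # exploit
```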

When updating the Q-function in deep Q-learning, unlike traditional Q-learning, you update the value for many state-action pairs within the MDP at once. This can lead to unstable learning, particularly when an update affects the action values of the very next state. This becomes clear when we parameterize the weights and biases of the Q-network as $\theta$ and include them in the update rule. Note how the target and the current value use the same network weights.

$$ Q(s,a,\theta) \leftarrow R + \gamma \max_{a'} Q(s',a',\theta) $$

To solve this problem, I employ a target Q-network that receives soft updates from my local Q-network, allowing it to move slightly in the direction of the local network. To address the explore-vs-exploit tradeoff, I decay epsilon over training. Lastly, I implemented experience replay with mini-batch learning to allow the agent to learn multiple times from past transitions and improve convergence. This does increase runtime, as the agent performs a weight update over a batch_size-sample mini-batch every update_rate iterations.
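The soft update itself is a simple interpolation controlled by tau; a sketch, assuming Keras-style get_weights/set_weights accessors on both networks, is shown below.

```python
def soft_update(local_network, target_network, tau):
    """Move the target network's weights a fraction tau toward the local (online) network."""
    local_weights = local_network.get_weights()
    target_weights = target_network.get_weights()
    new_weights = [tau * lw + (1.0 - tau) * tw
                   for lw, tw in zip(local_weights, target_weights)]
    target_network.set_weights(new_weights)
```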

Fig 4: Reward per Episode for Alpha=.001, Gamma=.99, tau=.2, Update Rate=4, Average Reward=240.499, Average Steps=203.62, Replay Buffer=1000


With this implementation, my agent was able to achieve the desired 200-point rolling average. Fig 1 shows the reward per episode during training with our best hyperparameters. Fig 2 shows the testing reward per episode for one hundred episodes of the trained agent. When testing, we do not make any updates to the weights of the Q-networks and use an epsilon of zero, as we are only looking to exploit our best policy. The initial rise observed in the training graph can be attributed to the decaying exploratory epsilon value. The jagged shape of both graphs can be explained by the uneven reward structure of the environment: the agent receives +100 points for a landing and -100 for a crash.
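An evaluation loop along these lines might look like the following sketch; it assumes the select_action helper above and the classic Gym step API returning (state, reward, done, info).

```python
def evaluate(env, q_network, n_episodes=100):
    """Run the trained agent greedily (epsilon = 0) without any weight updates."""
    rewards = []
    for _ in range(n_episodes):
        state = env.reset()
        done, episode_reward = False, 0.0
        while not done:
            action = select_action(q_network, state, epsilon=0.0)  # purely greedy
            state, reward, done, _ = env.step(action)
            episode_reward += reward
        rewards.append(episode_reward)
    return sum(rewards) / len(rewards)
```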

In the process of finding the hyperparameters that result in the highest average reward for my agent, I focused on the learning rate of stochastic gradient descent (alpha), the rate at which the target network catches up with the local network (tau), and the discount rate (gamma).

Fig 3: Reward per Episode for Alpha=.001, Gamma=.99, tau=.1, Update Rate=4, Average Reward=219.968, Average Steps=273.544, Replay Buffer=10000


Fig 5: Reward per Episode for Alpha=.001, Gamma=.99, tau=.3, Update Rate=4, Average Reward=-288.506, Average Steps=200.174, Replay Buffer=1000


Fig 7: Reward per Episode for Alpha=.001, Gamma=.99, tau=.5, Update Rate=4, Average Reward=-47.482, Average Steps=972.9687, Replay Buffer=1000


Fig 8: Reward per Episode for Alpha=.001, Gamma=.99, tau=.5, Update Rate=4, Average Reward=-47.482, Average Steps=972.9687, Replay Buffer=1000


As described above, the reason for soft updates of our target network is to prevent the agent from forgetting previously learned actions while updating the value of a state. Thus, we see regressions in training performance at the upper end of our tau range and infeasibly slow learning at the lower end.

Fig 9: Reward per Episode for Alpha=.001, Gamma=.999, tau=.1, Update Rate=4, Average Reward=-883.53, Average Steps=244.41, Replay Buffer=1000


Fig 10: Reward per Episode for Alpha=.001, Gamma=.999, tau=.1, Update Rate=4, Average Reward=-705.06, Average Steps=103.41, Replay Buffer=1000


Gamma discounts the value the algorithm places on estimated future rewards, in our case the Q-function targets. A high discount factor, corresponding to a low discount of the value of the succeeding state, prevents the agent from learning an effective strategy. On the other hand, a low discount factor increases training time, as the weight updates to the network are reduced. I was surprised how pronounced this effect was, with the lowest value taking significantly longer than any other hyperparameter setting.
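To make the scale of this difference concrete, consider how much a reward 100 steps in the future contributes to a target under the two gamma values tested (a back-of-the-envelope illustration, not a result from the experiments):

$$ 0.99^{100} \approx 0.37, \qquad 0.999^{100} \approx 0.90 $$

so with gamma = 0.999 the targets are dominated far more by distant, noisy future estimates than with gamma = 0.99.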