In this paper, I recreate results from the Correlated-Q Learning paper. The experiment in question is a zero-sum soccer game for which no deterministic equilibrium policy exists. The game setup is as follows. The soccer field is a grid, and the circle in the figure marks which player has possession of the ball. The two agents can move north, south, east, west, or stick, and their actions are executed in random order. If the chosen actions cause a collision, only the first player to act actually moves; if the player with the ball caused the collision, possession transfers to the other player. If the player with the ball moves into the opposing goal, it receives a reward of +100; if it moves into its own goal, it receives -100 (the opponent receives the negation, since the game is zero-sum). The game ends in either case.
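To make these rules concrete, below is a minimal sketch (in Python) of how the transition dynamics could be implemented. The 2x4 field size, the goal columns, and the exact handling of walls and collisions are my assumptions filling in details the description above leaves open; only the stated rules come from the paper.

import random

# Grid-soccer dynamics sketch. ROWS, COLS, and TARGET_COL are assumptions.
ROWS, COLS = 2, 4
ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1), 'stick': (0, 0)}
TARGET_COL = {'A': 0, 'B': COLS - 1}  # column each player scores in (assumed)

def step(pos, ball, moves):
    """pos: {'A': (row, col), 'B': (row, col)}; ball: 'A' or 'B'; moves: {'A': name, 'B': name}."""
    rewards, done = {'A': 0, 'B': 0}, False
    for player in random.sample(['A', 'B'], 2):   # actions execute in random order
        other = 'B' if player == 'A' else 'A'
        dr, dc = ACTIONS[moves[player]]
        r, c = pos[player]
        new = (min(max(r + dr, 0), ROWS - 1), min(max(c + dc, 0), COLS - 1))
        if new == pos[other]:
            # Collision: the blocked player stays put, and if it carried
            # the ball, possession transfers to the other player.
            if ball == player:
                ball = other
        else:
            pos[player] = new
        if ball == player and pos[player][1] in TARGET_COL.values():
            own_goal = pos[player][1] != TARGET_COL[player]
            rewards[player] = -100 if own_goal else 100
            rewards[other] = -rewards[player]
            done = True
            break
    return pos, ball, rewards, done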
Notice that the reason no deterministic equilibrium policy exists is that some states have no deterministic equilibrium. The figure below shows one such state, and it is the one we examine in our experiments. A stochastic policy gives B a chance to slip around player A, and this is exactly what the variants of the Q-learning algorithm provide.

For hyperparameters we use a discount factor gamma of 0.9 and decay the learning rate alpha to 0.001 and epsilon to 0.01. Q-learning is the only agent that uses the epsilon parameter, since all of the actions taken by our off-policy agents during training are random.
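Since the exact decay schedule is not pinned down above, one simple choice is a multiplicative decay that reaches the stated floors by the end of training; a sketch follows, where the starting values and the step count are placeholders of mine rather than values from the paper.

# Multiplicative decay toward the stated floors. NUM_STEPS and the
# starting values of 1.0 are placeholders; gamma and the floors are as above.
NUM_STEPS = 1_000_000
GAMMA = 0.9
alpha, ALPHA_MIN = 1.0, 0.001
epsilon, EPSILON_MIN = 1.0, 0.01
alpha_decay = (ALPHA_MIN / alpha) ** (1.0 / NUM_STEPS)
epsilon_decay = (EPSILON_MIN / epsilon) ** (1.0 / NUM_STEPS)

for t in range(NUM_STEPS):
    # ... play one step of the game and update the Q-tables here ...
    alpha = max(alpha * alpha_decay, ALPHA_MIN)
    epsilon = max(epsilon * epsilon_decay, EPSILON_MIN)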
The error values in the figure we are trying to recreate are the absolute changes, between successive updates, in player A's Q-value at the pictured state for the joint action in which A moves south (S) and B sticks. All of our agents except the Q-learning agent are off-policy, meaning they learn an optimal policy while following another. The paper was unclear about whether iterations represent individual steps or full simulations of the game; I decided to use full simulations. Lastly, the Q-learning agent does not consider its opponent's action when selecting its own, and I was unsure whether to include the opponent's action as a dimension of its Q-table. Ultimately, I obtained better results when I did, which suggests the researchers included that dimension in their original implementation.
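As a sketch of what including the opponent's action dimension means in practice, player A's Q-table below is keyed on (state, A's action, B's action), and the plotted error is the absolute change in one monitored entry between successive updates to it. The state identifier is a placeholder of mine for the state pictured earlier.

from collections import defaultdict

# Player A's Q-table over joint actions, plus the tracked error.
Q_A = defaultdict(float)                 # keyed by (state, a_action, b_action)
figure_state = 'B-beside-A-with-ball'    # placeholder id for the pictured state
monitored = (figure_state, 'S', 'stick') # A moves south, B sticks

errors = []
prev_q = Q_A[monitored]

def record_error():
    """Call after each update of Q_A; logs |Q_t(s, a) - Q_{t-1}(s, a)| for the monitored entry."""
    global prev_q
    errors.append(abs(Q_A[monitored] - prev_q))
    prev_q = Q_A[monitored]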

Figure 1: Q-Learning

Figure 2: Friend-Q

Figure 3: Foe-Q

Figure 4: Correlated-Q
Many of the graphs in my recreation have spikes near the end of the simulation. The error is relatively small, so I do not think this indicates a problem with the implementation; rather, it is due to the stochastic nature of the optimal policy.
As expected, our Correlated-Q and Foe-Q learners converge to the optimal stochastic policy, which is split between sticking and moving south. Our Friend-Q learner converges to a policy of sticking in the corner to avoid collecting the negative reward for its opponent. It reaches this policy quickly and shows the same error spike at the beginning as Greenwald's figure, which reassures me that their iterations were full simulations of the game. Lastly, our regular Q-learner fails to converge to an optimal policy; as in the paper, the reduction in error over time comes from the decay of the learning and exploration rates. This makes sense, since no deterministic equilibrium policy exists.
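For reference, the Foe-Q learner backs up the minimax value of each stage game, which is what produces the stochastic stick-or-south mixture at the pictured state. The sketch below computes that value with the standard linear program; using scipy's linprog is my choice here, not necessarily how the original implementation solved it, and Correlated-Q would replace this program with one over joint-action distributions.

import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Minimax (Foe-Q) value of a zero-sum stage game.

    Q_s[i, j] is player A's Q-value when A plays action i and B plays action j.
    Returns (value, A's mixed strategy). Standard LP: maximize v subject to
    sum_i pi_i * Q_s[i, j] >= v for every opponent action j.
    """
    n_a, n_b = Q_s.shape
    c = np.zeros(n_a + 1)
    c[-1] = -1.0                                   # maximize v == minimize -v
    A_ub = np.hstack([-Q_s.T, np.ones((n_b, 1))])  # v - sum_i pi_i Q[i, j] <= 0
    b_ub = np.zeros(n_b)
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])  # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_a + [(None, None)]    # pi_i >= 0, v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]

In the Foe-Q update, V(s') is then taken to be this minimax value of A's Q-table at s'.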