Pong 001

Let's Get Started!

Adam Price

March 20th 2020

If you are like me, one of your first programming projects was to produce a clone of the video game 'Pong'. While reproducing Pong, a young novice can learn some very important early programming lessons, like game clocks and object orientation. The task is a fulfilling one, and in the end, you have a game to play with a friend. But now, I hear you say, "My novice days are far behind me! What interest can I possibly find in such a simple programming problem?" Well, I would argue that the game of 'Pong' makes a great testbed for some slightly more advanced lessons, namely Reinforcement Learning!

Despite the apparent simplicity of Pong, it instantly becomes more complex when you consider the two players competing against each other. Even the simplest of skill-based games will always be won by the superior player, which makes the search for the perfect policy for a Pong agent far more difficult than it might seem at first. And so, this is my goal! Over the next few posts, I hope to create an AI that reigns supreme at Pong! ...or at least beats me over and over again...

Right, let's get into this.

To start off, we will need a Pong 'environment' to run our experiments in. The obvious option would be to use OpenAI's implementation of the Atari game 'Pong'. However, that version gives an RGB image as its state observation (a 210x160 frame, which is 100,800 pixel values per observation). This is far too much information for simple algorithms to handle, and, as I don't have a large GPU at my disposal, we will need a version of Pong that can give us more information-dense observations. Personally, I see no other option: I will have to sit down and reproduce the first homework from my first OO course again. This way, I also don't need to compete with the OpenAI leaderboards.

I won't waste too much time going over the specifics of my implementation, but it is important to mention the observations that an agent can make in this environment:

  • The Y coordinates of both paddles [0, 720]
  • The X and Y coordinates of the ball [0, 780], [0, 720]
  • The X and Y speed of the ball [-very fast, +very fast]

Together, this gives us just 6 values per observation. This is a far more compact state vector than in the OpenAI environment, so hopefully we will be able to use nice, simple algorithms that won't take as long to train.
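To make that concrete, here is a minimal sketch of what a single observation might look like, assuming a Python implementation (the `Observation` class and its field names are my own illustration, not the actual structure used in my code):

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One frame's worth of information available to an agent.

    Field names are illustrative; the real environment may expose
    these values as a plain tuple or list instead.
    """
    left_paddle_y: float   # [0, 720]
    right_paddle_y: float  # [0, 720]
    ball_x: float          # [0, 780]
    ball_y: float          # [0, 720]
    ball_vx: float         # horizontal speed, sign gives direction
    ball_vy: float         # vertical speed, sign gives direction


# Example: the ball is mid-court, drifting towards the right paddle.
obs = Observation(360.0, 360.0, 390.0, 360.0, 4.0, -2.0)
```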

We can't expect to solve this problem straight away, so for now, I have made the game nearly as simple as possible. The ball will slowly increase in speed as it is rallied, but the players can't do anything to influence the direction of the ball. This essentially removes any competition from the game (you might as well play against a wall), but let's not worry about that for now. It is time for our first challenge!

If you look at the Pong code, you will see that I have already programmed an AI into the game. This is a 'conditional AI': it runs on preset conditional logic (line 64), and it is the simplest opponent I could think of. The 'AI' moves the paddle to follow the Y coordinate of the ball, a plan that works at first but, as the ball speeds up, the paddle can no longer keep up with its movement. So, for our next task, we will train a reinforcement learning algorithm to beat this hardcoded opponent.
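Paraphrased in Python (the function and variable names here are my own; the actual line-64 implementation in the repo may differ), the follow-the-ball rule boils down to something like this:

```python
def conditional_ai_move(paddle_y, paddle_height, ball_y, paddle_speed):
    """Move the paddle's centre towards the ball's Y coordinate.

    Returns the new paddle Y position. Because the step is capped at
    paddle_speed, a fast-moving ball will eventually outrun the paddle.
    """
    paddle_centre = paddle_y + paddle_height / 2
    if ball_y < paddle_centre:
        paddle_y -= min(paddle_speed, paddle_centre - ball_y)
    elif ball_y > paddle_centre:
        paddle_y += min(paddle_speed, ball_y - paddle_centre)
    return paddle_y
```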

To face this first trial we will be utilising Q-Learning. If you have never heard of Q-Learning (or any other type of Temporal Difference Learning algorithm), then hopefully I can shed some light on the subject:

Q-Learning Agent Environment Interaction Cycle. (From a book!)

A Q-Learning agent learns by taking 'actions' in an environment (moving the paddle up or down). After an action is taken, it observes the 'state' of the environment (the ball's location, etc.). Sometimes, when a certain state occurs, the agent is given a positive or negative reward (winning or losing a point). The long-term goal of the agent is to maximise the reward it receives. It does this by estimating the 'value' of every action it can take given the state of the environment. If the true value of all the possible state-action pairs is known, an optimal policy can be found (it can become the ultimate Pong AI).

Now that is criminally oversimplifying things, but there is plenty of information out there if you want it. Even whole books.
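For the curious, the heart of tabular Q-Learning is a single update applied after every step. Here is a minimal sketch, assuming a Python implementation; the hyperparameter values are placeholders and `q_table` is just a dictionary keyed by (state, action) pairs, not the exact structure my code uses:

```python
from collections import defaultdict

ALPHA = 0.1    # learning rate (placeholder value)
GAMMA = 0.99   # discount factor (placeholder value)
ACTIONS = ("up", "down", "stay")

q_table = defaultdict(float)  # (state, action) -> estimated value


def q_update(state, action, reward, next_state):
    """One temporal-difference update of the Q-value for (state, action)."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])
```

In words: nudge the current estimate towards the reward just received plus the discounted value of the best action available from the next state.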

Okay, we now have only one hurdle left to overcome before we can start training our agent: a very large discrete state space. Q-Learning works by assigning a 'value' to every possible state-action pair, and our ball can be centred on any pixel of a 780x720 pixel surface. That is a lot of possible states (and that's just for the ball). To overcome this, we will group many of the possible states together! By splitting the game surface into a grid of 40 by 40 pixel cells, the ball can now only be in 342 possible locations! (Far fewer than the 561,600 possible locations we had before.)
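In code, that grouping is just integer division of the pixel coordinates by the cell size. A rough sketch (the 40-pixel cell size comes from the text above; the function name is my own):

```python
CELL_SIZE = 40  # pixels per grid cell, as described above


def discretise_position(ball_x, ball_y):
    """Map exact pixel coordinates to a coarse (column, row) grid cell."""
    return int(ball_x) // CELL_SIZE, int(ball_y) // CELL_SIZE


# Two nearby pixel positions fall into the same cell:
print(discretise_position(401, 355))  # -> (10, 8)
print(discretise_position(439, 359))  # -> (10, 8), same cell despite different pixels
```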

Ball location segments

It is worth noting that this isn't a perfect solution. We have reduced the state space down to something this laptop can handle, but in doing so we have introduced a level of uncertainty into the problem: the agent no longer knows exactly where the ball is in the environment. For this simple problem it won't matter too much, but it could become an issue as we make the game more advanced.

We do similar grouping for the paddle positions and the ball's X and Y speeds, and we are ready to go. We set the agent to play 5,000 games and then see how it gets on!
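Put together, the training loop looks roughly like the sketch below. It reuses the hypothetical `q_table`, `ACTIONS`, and `q_update` from the earlier sketch and assumes a Gym-style `reset()`/`step()` interface returning discretised states, which my environment may not expose in exactly this form:

```python
import random

EPSILON = 0.1      # exploration rate (placeholder value)
NUM_GAMES = 5000   # number of training games, as mentioned above


def choose_action(state):
    """Epsilon-greedy: mostly exploit the Q-table, occasionally explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


def train(env):
    for _ in range(NUM_GAMES):
        state = env.reset()  # discretised observation
        done = False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)
            q_update(state, action, reward, next_state)
            state = next_state
```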

Q-Learning Agent Training

We can see in the graph that our agent slowly gets better and better over thousands of games against the conditional AI, until the rally reaches about 12 returns. It is at this point that the conditional AI can't keep up and the Q-Learning agent starts winning consistently.

Q-Learning Agent Vs Conditional AI

We can see that the Q-Learning agent is, in fact, superior to the simple conditional AI. Huzzah! Our cunning agent learnt to predict the path of the ball, letting it breeze past its first opponent.

However, this isn't to say that a conditional AI couldn't outperform our RL algorithm. The environment is very predictable, so it would be easy to write an algorithm that calculates where to place the paddle at any time. So, next time we will look into making the environment a bit more complex.

If all this information has sparked your interest, then feel free to experiment yourself. You can find all the code I have produced for this post here.