The cartpole problem is a prominent benchmark in Reinforcement Learning, with the implementation in OpenAI’s Gym being the best known. It is also the most popular environment on their leaderboards, which makes sense: it is one of the easier problems in classic control. But just how easy is it?
As it stands right now, a solution needs to balance the pole for an average of 195 (or more) steps over 100 episodes. Keep in mind that each episode ends after 200 steps, so the solution does need to be somewhat stable.
For a while now, I have thought that balancing the pole for only 200 steps is too easy. So, in an attempt to confirm my suspicions, I have brute-forced the problem.
If you want to follow along with all the code that I used, please feel free to view it here.
In cryptography, a brute-force attack enumerates all possible passwords and tries them one by one until it finds the correct one. To use a similar approach in control, we need to enumerate a set of candidate policies for the cartpole. This would be impossible over a continuous state space, so we will have to discretise it.
The cartpole problem has 4 observations: ‘cart position’, ‘cart velocity’, ‘pole angle’, and ‘pole velocity’. We will be completely disregarding ‘cart position’ and ‘cart velocity’, since 200 timesteps isn’t enough for the cart to move far enough to end the episode. This leaves us with ‘pole angle’ and ‘pole velocity’, each of which will be discretised into 3 bins, creating 9 possible states of the environment.
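As a minimal sketch of that discretisation (the bin edges and the helper name `discretise` below are illustrative assumptions, not the exact values from the original script):

```python
import numpy as np

# Illustrative bin edges; the exact thresholds are a design choice.
ANGLE_BINS = np.array([-0.05, 0.05])     # pole angle (radians): 3 bins
VELOCITY_BINS = np.array([-0.5, 0.5])    # pole angular velocity: 3 bins

def discretise(observation):
    """Map a continuous observation to a pair of bin indices in {0, 1, 2}."""
    _, _, pole_angle, pole_velocity = observation  # cart position and velocity are ignored
    angle_bin = int(np.digitize(pole_angle, ANGLE_BINS))
    velocity_bin = int(np.digitize(pole_velocity, VELOCITY_BINS))
    return angle_bin, velocity_bin
```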
The cartpole problem only has 2 actions, so with 9 states we have a total of 2⁹ = 512 deterministic policies (one action choice per state). All we need now is a way to generate these policies; then we can attempt a brute-force solution.
In each state, the policy will move the cart either left or right (0 or 1). So, to create a policy, we simply take the 9-bit binary representation of a number from 0 to 511. The string then gets reshaped into a 3 by 3 matrix for easier pipelining.
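A sketch of how that enumeration might look (the function name `generate_policies` is my own):

```python
import numpy as np

def generate_policies():
    """Enumerate all 512 deterministic policies as 3x3 tables of actions."""
    policies = []
    for n in range(512):
        bits = format(n, '09b')                      # 9-bit binary string, e.g. '000101101'
        actions = np.array([int(b) for b in bits])   # one action (0 = left, 1 = right) per state
        policies.append(actions.reshape(3, 3))       # rows: angle bins, columns: velocity bins
    return policies
```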
It would be computationally expensive to fully evaluate all 512 policies over 100 episodes each, so, to start, we ran each policy on the problem once and kept track of those that scored at least 195.
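Using the sketches above, the screening pass could look something like this (written against the classic Gym reset/step API; `run_episode` is an assumed helper name, not the original code):

```python
import gym

def run_episode(env, policy):
    """Run one episode with a fixed policy table; return the total reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        angle_bin, velocity_bin = discretise(observation)
        action = int(policy[angle_bin, velocity_bin])
        observation, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

env = gym.make('CartPole-v0')
# Screening pass: one episode per policy, keep anything that reaches the 195 threshold.
candidates = [policy for policy in generate_policies() if run_episode(env, policy) >= 195]
```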
When we ran this script, we found that 34 of the initial 512 policies were potential solutions. To examine them further, we need to run these potential solutions for 100 episodes each and average their performance.
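Reusing `run_episode` from the sketch above, the evaluation step might look something like this:

```python
import numpy as np

def evaluate(env, policy, episodes=100):
    """Average a policy's score over a number of episodes."""
    return np.mean([run_episode(env, policy) for _ in range(episodes)])

# A candidate counts as a solution if its 100-episode average is at least 195.
solutions = [policy for policy in candidates if evaluate(env, policy) >= 195]
```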
When we ran this, we found 8 solutions in 39.1 seconds. So, there we have it. We successfully brute-forced the cartpole problem!
Now, this is clearly a terrible method for solving control problems. The solutions we found are very precarious: none of them could balance the pole for more than 300 timesteps. To find better solutions, we would need to use more buckets in the state space, and since the number of deterministic policies grows exponentially with the number of states, this would be quite foolish to try.
Yet, we have found valid solutions. Is this a damning indictment of cartpole as a benchmark problem? I don’t think so. The problem can still be difficult; you just need to balance the pole for longer. Raising the required average reward to 490, with a cap of 500 steps per episode, would likely make a brute-force method like this redundant.
Besides, do benchmarks even need to be difficult? Algorithms solving cartpole are usually compared by the number of episodes they need before solving the environment. This brute-force approach needs 512 screening episodes before its final evaluation, which would place it last on Gym’s leaderboard…