Lessons from AlphaZero: Part 1


Reading time: 8 minutes.

I recently came across this book from Bertsekas that looked very interesting. It promises to provide a novel (geometric) interpretation of the methods that underlie the success of AlphaZero, AlphaGo, and TD-Gammon. I thought it would be interesting to take this as an opportunity to read the book, and have a blog post series on the key takeaways.

I think this will be a 3-part blog series with the following contents:

With that, let’s kick off part 1.


Before I dive into the book, let me lay out an overview of the methodology underlying AlphaZero and TD-Gammon. Both were trained extensively off-line, using neural networks and an approximate version of policy iteration. However, the AlphaZero player obtained off-line is not used directly during on-line play. Instead, a separate on-line player selects moves via multi-step lookahead minimization, using a terminal position evaluator that was trained with experience from the off-line player. In a similar fashion, TD-Gammon uses a terminal position evaluator trained off-line with a neural network, and extends its on-line lookahead by rollout with the one-step lookahead player based on that position evaluator.
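To make the on-line player concrete, here is a minimal sketch of one-step lookahead against a previously trained terminal evaluator, for a small finite-state problem. All names and numbers (`P`, `g`, `J_tilde`, the state/control counts) are made up for illustration — `J_tilde` merely stands in for whatever evaluator off-line training produced.

```python
import numpy as np

# Hypothetical finite problem: n states, m controls (all data illustrative).
rng = np.random.default_rng(1)
n, m, alpha = 4, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(m, n))   # P[u, x, x'] transition probs
g = rng.uniform(size=(m, n))                 # g[u, x] expected stage cost
J_tilde = rng.uniform(size=n)                # stand-in for the off-line-trained evaluator

def lookahead_policy(x):
    """On-line player: pick the u minimizing g(x,u) + alpha * E[J_tilde(next state)]."""
    q = np.array([g[u, x] + alpha * P[u, x] @ J_tilde for u in range(m)])
    return int(np.argmin(q))

# Select a move from every state with the on-line lookahead player.
moves = [lookahead_policy(x) for x in range(n)]
assert all(0 <= u < m for u in moves)
```

Multi-step lookahead and rollout refine this same idea: instead of a single minimization against `J_tilde`, the on-line player searches several moves deep (or simulates the one-step policy) before consulting the evaluator.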

Bertsekas observes that the on-line player performs far better than the off-line player used on its own, and this book sets out to answer why.

The answer can be summarized as follows: on-line play amounts to a step of Newton’s method for solving Bellman’s equation, and the cost function obtained by off-line training serves as the starting point for that Newton step.

Bertsekas also says the following which I believe is the major claim made in this book:

The major determinant of the quality of the on-line policy is the Newton step that is performed on-line, while off-line training plays a secondary role by comparison.

With the motivation and the main contributions out of the way, it’s time for some notation to understand the insights behind the contributions.


The following notation will be used in all of the blog posts concerning this book; it is the same notation that Bertsekas uses. It differs somewhat from the usual RL notation, so I will try to draw parallels in this section.

The dynamics are of the form $$ x_{k+1} = f(x_k, u_k, w_k) $$ where $x_k$ is the state (usually $s$ in RL notation), $u_k$ is the control (the action $a$), and $w_k$ is a random disturbance.

We are primarily interested in finding a policy $\mu$ that minimizes the expected discounted cost $$ J_\mu(x_0) = \lim_{N \rightarrow \infty} \mathbb{E}\left[\sum_{k=0}^{N-1} \alpha^k g(x_k, \mu(x_k), w_k)\right] $$ where $g$ is the stage cost (the negative of the RL reward) and $\alpha \in (0, 1)$ is the discount factor (usually $\gamma$ in RL).
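As a toy instance of this setup, consider a deterministic scalar linear system with a linear policy, where the infinite-horizon cost has a closed form we can check against. All constants below are made up for illustration, and $w_k$ is taken to be zero.

```python
# Hypothetical scalar example: x_{k+1} = a*x_k + u_k (w_k = 0),
# policy mu(x) = -L*x, stage cost g(x, u, w) = x**2.  Constants are made up.
alpha, a, L, x0 = 0.95, 1.2, 0.7, 3.0

# Truncated discounted sum approximates the infinite-horizon limit.
x, J, N = x0, 0.0, 2000
for k in range(N):
    u = -L * x                  # u_k = mu(x_k)
    J += alpha**k * x * x       # alpha^k * g(x_k, mu(x_k), w_k)
    x = a * x + u               # x_{k+1} = f(x_k, u_k, w_k)

# Closed form: x_k = (a - L)^k * x0, a geometric series, so
# J_mu(x0) = x0^2 / (1 - alpha * (a - L)^2) when alpha * (a - L)^2 < 1.
J_exact = x0**2 / (1 - alpha * (a - L)**2)
assert abs(J - J_exact) < 1e-9
```

The closed-loop factor here is $a - L = 0.5$, so the discounted costs shrink geometrically and the truncated sum matches the limit to high precision.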

Bellman Operators

To avoid unwieldy notation, and to encourage thinking of Bellman’s equations as operators, we define the following Bellman operators, which map a cost function $J$ to a new cost function: $$ (T_\mu J)(x) = \mathbb{E}[g(x, \mu(x), w) + \alpha J(f(x, \mu(x), w))] $$ $$ (TJ)(x) = \min_u \mathbb{E}[g(x, u, w) + \alpha J(f(x, u, w))] $$ Understanding this notation and the properties of these operators is crucial for understanding the main takeaways presented in these blog posts.

These two operators have the following properties:

- Monotonicity: if $J \leq J'$ (componentwise), then $T_\mu J \leq T_\mu J'$ and $TJ \leq TJ'$.
- $T_\mu$ is a linear (more precisely, affine) operator in $J$.
- $T$ is concave in $J$, since $(TJ)(x) = \min_\mu (T_\mu J)(x)$ is a pointwise minimum of linear operators.
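These properties can be checked numerically. Below is a sketch on a small made-up finite MDP (the transition probabilities and costs are random, purely for illustration) verifying that $T_\mu$ is affine and that $T$ is concave.

```python
import numpy as np

# Hypothetical tiny MDP: n states, m controls (all numbers illustrative).
rng = np.random.default_rng(0)
n, m, alpha = 3, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(m, n))   # P[u, x, x'] transition probs
g = rng.uniform(size=(m, n))                 # g[u, x] expected stage cost

def T_mu(J, mu):
    """(T_mu J)(x) = g(x, mu(x)) + alpha * sum_x' P(x'|x, mu(x)) J(x')."""
    return np.array([g[mu[x], x] + alpha * P[mu[x], x] @ J for x in range(n)])

def T(J):
    """(T J)(x) = min_u [ g(x, u) + alpha * sum_x' P(x'|x, u) J(x') ]."""
    return np.min(g.T + alpha * np.einsum('uxy,y->xu', P, J), axis=1)

mu = np.array([0, 1, 0])
J1, J2 = rng.normal(size=n), rng.normal(size=n)
a = 0.3

# T_mu is affine: it commutes with convex combinations of cost functions.
lhs = T_mu(a * J1 + (1 - a) * J2, mu)
rhs = a * T_mu(J1, mu) + (1 - a) * T_mu(J2, mu)
assert np.allclose(lhs, rhs)

# T is concave: a min of affine maps lies above the chord (componentwise).
assert np.all(T(a * J1 + (1 - a) * J2) >= a * T(J1) + (1 - a) * T(J2) - 1e-12)
```

The concavity check works for any choice of `J1`, `J2`, and `a`, since a pointwise minimum of affine functions is always concave.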

The linearity of $T_\mu$ and the concavity of $T$ are the two main properties that enable the interpretation of policy improvement in AlphaZero-like approaches as Newton steps. They will also be crucial in our geometric interpretations of popular RL methods.

Geometric Interpretation of Value Iteration

Before we end part 1, I thought it would be nice for the readers to see how the above properties lend us a new (geometric) way of interpreting popular RL methods like value iteration. First, let’s use the operator notation to describe the value iteration algorithm.

We describe two versions of value iteration: the policy evaluation version, which iterates $J_{k+1} = T_\mu J_k$, and the policy improvement version, which iterates $J_{k+1} = T J_k$.

Both of the above value iteration approaches admit a staircase-like geometric construction, shown below. First, we look at the geometric interpretation of policy evaluation VI.

Policy Evaluation Value Iteration

The above graph is the first of many in this blog post series. In all of these graphs, both the x- and y-axes represent the space of cost functions $J$, and the (blue) line through the origin is always the $y = x$ line.

The above graph shows two iterations of policy evaluation VI, starting from $J_0$ and ending at $J_2$. The black line depicts the operator $T_\mu$; note that it is a line, as we know $T_\mu$ is a linear operator. The point at which it intersects the $y = x$ line is $J_\mu$, the cost of the policy $\mu$.

Starting from $J_0$, the first iteration applies $T_\mu$ to get the next iterate $J_1 = T_\mu J_0$, which in turn undergoes the same operation to give $J_2 = T_\mu J_1$. This repeated application converges to $J_\mu$, the fixed point of the equation $J = T_\mu J$.
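The staircase iteration is easy to reproduce numerically. Here is a sketch on an assumed 2-state example (the transition matrix and stage costs are made up), comparing repeated application of $T_\mu$ against the fixed point obtained by solving the linear system $J = T_\mu J$ directly.

```python
import numpy as np

# Assumed 2-state example under a fixed policy mu (numbers illustrative).
alpha = 0.9
P_mu = np.array([[0.8, 0.2],
                 [0.3, 0.7]])          # transition matrix under mu
g_mu = np.array([1.0, 2.0])            # expected stage costs under mu

def T_mu(J):
    # T_mu is affine: T_mu J = g_mu + alpha * P_mu @ J
    return g_mu + alpha * P_mu @ J

# The fixed point J_mu solves J = T_mu J, i.e. (I - alpha P_mu) J_mu = g_mu.
J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)

# The staircase iteration J_{k+1} = T_mu J_k converges to the same point.
J = np.zeros(2)
for _ in range(200):
    J = T_mu(J)
assert np.allclose(J, J_mu)
```

Because $T_\mu$ is a contraction with modulus $\alpha$, the iterates approach $J_\mu$ geometrically regardless of the starting cost function $J_0$.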

Now, let’s see the geometric interpretation of policy improvement VI.

Policy Improvement Value Iteration

A couple of things to note here: the black curve now depicts the operator $T$, which is concave (a pointwise minimum of the linear operators $T_\mu$), and the point where it intersects the $y = x$ line is the optimal cost $J^*$.

The above graph shows two iterations of policy improvement VI. Starting from $J_0$, the first iteration applies the operator $T$ to get the next iterate $J_1$, which the next iteration maps to $J_2 = TJ_1$. This repeated application now converges to the optimal cost $J^*$, the fixed point of the equation $J = TJ$.
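Policy improvement VI can be sketched the same way; the only change from the policy evaluation version is the minimization over controls inside the operator. The 2-state, 2-control data below are again made up for illustration.

```python
import numpy as np

# Assumed 2-state, 2-control example (all numbers illustrative).
alpha = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[u][x][x'] transition probs
              [[0.5, 0.5], [0.6, 0.4]]])
g = np.array([[2.0, 1.0],                    # g[u][x] expected stage costs
              [1.5, 3.0]])

def T(J):
    # (T J)(x) = min_u [ g(x, u) + alpha * E[ J(x') ] ]
    Q = g + alpha * P @ J                    # Q[u, x]
    return Q.min(axis=0)

# The staircase iteration J_{k+1} = T J_k converges to the optimal cost J*.
J = np.zeros(2)
for _ in range(500):
    J = T(J)

# At convergence, J is (numerically) the fixed point of J = T J.
assert np.allclose(T(J), J)
```

Note the contrast with the previous sketch: iterating $T_\mu$ finds the cost of one fixed policy, while iterating $T$ folds the minimization over controls into every step and so converges to $J^*$ instead.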



This is a good point to end this part of the series. The next two posts get to the meat of the insights from the book, and build upon these geometric interpretations. Stay tuned!
