Lessons from AlphaZero: Part 3

2023/12/13

Reading time: 5 minutes.

We continue with part 3 (the final part) of the blog post series on this book by Bertsekas. As always, I would highly recommend that readers start with part 1 and part 2 before continuing with part 3, as all of the notation used in this post is introduced in the previous parts.

As a recap, here’s the breakdown for the 3-part blog series:

Let’s begin!

Policy Iteration

Another major class of RL approaches is based on Policy Iteration (PI), which involves repeated application of two steps:

- Policy evaluation, which computes the cost function $J_{\mu^k}$ of the current policy $\mu^k$, and
- Policy improvement, which obtains a new policy $\mu^{k+1}$ by one-step lookahead minimization using $J_{\mu^k}$.

The above procedure is summarized in the following figure from the book:

Thus, PI generates a sequence of policies ${\mu^{k}}$ through:
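In the operator notation from parts 1 and 2, these updates take their standard form:

$$ J_{\mu^k} = T_{\mu^k} J_{\mu^k} \qquad \text{(policy evaluation)}, $$

$$ T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k} \qquad \text{(policy improvement)}. $$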

Note that the policy improvement update is the one-step lookahead minimization that we explored in part 2. Thus, policy improvement can be interpreted as the Newton step starting from $J_{\mu^k}$ that yields policy $\mu^{k+1}$ as the one-step lookahead policy.
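To make these two steps concrete, below is a minimal sketch of exact PI for a finite-state, finite-action $\alpha$-discounted MDP. The array layout is a hypothetical choice for illustration (a transition tensor `P[u, x, x']`, a stage-cost array `g[u, x]`, and a discount factor `alpha`), not notation from the book:

```python
import numpy as np

def policy_evaluation(P, g, alpha, mu):
    """Solve the Bellman equation J_mu = T_mu J_mu for the fixed policy mu."""
    n = P.shape[1]                       # number of states
    P_mu = P[mu, np.arange(n), :]        # transitions under mu, shape (n, n)
    g_mu = g[mu, np.arange(n)]           # stage costs under mu, shape (n,)
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def policy_improvement(P, g, alpha, J):
    """One-step lookahead minimization on J."""
    Q = g + alpha * P @ J                # Q[u, x] = g(x, u) + alpha * E[J(next state)]
    return np.argmin(Q, axis=0)          # greedy lookahead policy, one control per state

def policy_iteration(P, g, alpha, mu0, max_iters=100):
    """Alternate policy evaluation and policy improvement until the policy stops changing."""
    mu = mu0
    for _ in range(max_iters):
        J_mu = policy_evaluation(P, g, alpha, mu)        # policy evaluation
        mu_new = policy_improvement(P, g, alpha, J_mu)   # policy improvement
        if np.array_equal(mu_new, mu):
            break
        mu = mu_new
    return mu, policy_evaluation(P, g, alpha, mu)
```

Here `policy_improvement` is exactly the one-step lookahead minimization, i.e., the Newton step applied at $J_{\mu^k}$.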

Geometric Interpretation of Policy Iteration

Since policy improvement is the same as one-step lookahead minimization, and policy evaluation can be done by solving the Bellman equation $J_\mu = T_\mu J_\mu$ for the base policy $\mu$, we can borrow the tools we built in parts 1 and 2 to arrive at the following geometric interpretation.

The above graph can be understood through the following:

Note that if the curve $T$ is nearly linear (i.e., has small curvature), then the lookahead policy's performance $J_{\tilde{\mu}}$ is close to the optimal $J^*$, even if the base policy $\mu$ is far from optimal.

Furthermore, the Newton step interpretation gives us the error rate

$$ \lim_{J_\mu \rightarrow J^*} \frac{J_{\tilde{\mu}}(x) - J^*(x)}{J_\mu(x) - J^*(x)} = 0, $$

i.e., the convergence is superlinear: the suboptimality of the lookahead policy $\tilde{\mu}$ vanishes faster than that of the base policy $\mu$ as $J_\mu$ approaches $J^*$.

Truncated Rollout

Truncated rollout is the final piece of the puzzle that enables the use of any cost approximation $\tilde{J}$ along with a base policy $\mu$ to approximate the cost $J_\mu$, thus avoiding the possibly expensive step of policy evaluation.

Going back to the AlphaZero approach figure that we saw in part 2 (figure is from the book):

Observe that rather than using the cost $J_\mu$ of the base policy at the end of the lookahead minimization, we perform $m$ steps of rollout with the base policy $\mu$, after which we apply the terminal cost approximation $\tilde{J}$. In this section, we examine the effect of this approximation and its geometric interpretation.

Note that the cost obtained by using the base policy $\mu$ for $m$ steps, followed by the terminal cost approximation $\tilde{J}$, is given by $T_{\mu}^m \tilde{J}$, i.e., it is the result of applying $m$ iterations of the policy evaluation VI for $\mu$, starting from $\tilde{J}$.
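For instance, in an $\alpha$-discounted problem this cost can be written out explicitly as

$$ (T_\mu^m \tilde{J})(x_0) = E\left[ \sum_{k=0}^{m-1} \alpha^k g(x_k, \mu(x_k)) + \alpha^m \tilde{J}(x_m) \right], $$

i.e., $m$ stages of cost accumulated under the base policy, plus the terminal cost approximation evaluated at the state reached after $m$ steps.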

Thus, a one-step lookahead policy with $m$ steps of truncated rollout results in the policy $\tilde{\mu}$ that satisfies:

$$ T_{\tilde{\mu}} (T_\mu^m \tilde{J}) = T (T_\mu^m \tilde{J}) $$

This can also be extended to the general case of $\ell$-step lookahead (from part 2) as follows:

$$ T_{\tilde{\mu}} (T^{\ell-1} T_\mu^m \tilde{J}) = T (T^{\ell - 1} T_\mu^m \tilde{J}) $$
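Continuing the finite-MDP sketch from the PI section above (same hypothetical `P`, `g`, `alpha` layout), the $\ell = 1$ case $T_{\tilde{\mu}} (T_\mu^m \tilde{J}) = T (T_\mu^m \tilde{J})$ can be computed by composing the Bellman operators directly:

```python
import numpy as np

def T_mu(P, g, alpha, mu, J):
    """Bellman operator for policy mu: (T_mu J)(x) = g(x, mu(x)) + alpha * E[J(next state)]."""
    n = P.shape[1]
    return g[mu, np.arange(n)] + alpha * P[mu, np.arange(n), :] @ J

def truncated_rollout_lookahead(P, g, alpha, mu, J_tilde, m):
    """One-step lookahead policy on top of m rollout steps with base policy mu,
    i.e. a policy mu_tilde attaining the minimum in T(T_mu^m J_tilde)."""
    J = J_tilde
    for _ in range(m):                  # m iterations of policy-evaluation VI: J <- T_mu J
        J = T_mu(P, g, alpha, mu, J)
    Q = g + alpha * P @ J               # one-step lookahead minimization on T_mu^m J_tilde
    return np.argmin(Q, axis=0)
```

Setting $m = 0$ recovers plain one-step lookahead on $\tilde{J}$.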

This can be interpreted geometrically using the following figure (for $m = 3$ and $\ell = 1$):

Observe that $\ell$-step lookahead followed by $m$ steps of truncated rollout with the base policy $\mu$ is an economical alternative to $(\ell+m)$-step lookahead.

Final Remarks

To conclude this blog post series, I wanted to list the main takeaways that I had from reading Bertsekas’ book:

Interesting Questions

Is lookahead by rollout an economical substitute for lookahead by minimization?

YES. For a given computational budget, judiciously balancing $m$ and $\ell$ tends to give a much better lookahead policy than simply increasing $\ell$ as much as possible, while setting $m=0$.
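A rough cost accounting (an illustrative back-of-the-envelope estimate, not a figure from the book) explains why: exhaustive $\ell$-step lookahead over $n_u$ controls per state examines on the order of $n_u^{\ell}$ control sequences, so each extra lookahead step multiplies the minimization work by roughly $n_u$, whereas each extra rollout step adds only about one simulation step with the base policy per leaf of the lookahead tree. With $n_u = 10$, going from $\ell = 3$ to $\ell = 4$ costs roughly $10\times$ more minimization, while going from $m = 0$ to $m = 10$ adds only about $10$ cheap rollout steps per leaf.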

Why not just train a policy network and use it without online play?

In other words, the question is asking whether using only approximation in policy space to learn a policy that is then used online is good enough. This does have the advantage of fast online inference (directly invoking the learned policy). However, the offline-trained policy may not perform nearly as well as the corresponding one-step or multi-step lookahead policy, because it lacks the power of the associated exact Newton step.

This concludes the blog post series. Please reach out in the comments or directly to me if you have any questions or corrections.
