Frank-Wolfe, a.k.a. Conditional Gradient Descent

2020/06/27

Reading time: 2 minutes.

Frank-Wolfe is an alternative to projected gradient descent for minimizing a smooth convex function $f$ over a compact, convex set $K$.

At the $t$-th step, the algorithm considers the first-order Taylor approximation $\hat{f}$ of $f$ around $x_t$ and minimizes $\hat{f}$ over $K$, obtaining a minimizer $y_t$: $$ \hat{f}(y) = f(x_t) + \langle \nabla f(x_t), y - x_t \rangle $$ $$ y_t = \arg\min_{y \in K} \hat{f}(y) = \arg\min_{y \in K} \langle \nabla f(x_t), y \rangle $$

Then, it takes the new point $x_{t+1}$ to be the convex combination of $x_t$ and $y_t$ with parameter $\gamma_t \in [0, 1]$ $$ x_{t+1} = (1 - \gamma_t) x_t + \gamma_t y_t $$ Because this is a convex combination, $x_{t+1} \in K$.

This way, instead of projecting onto $K$ at each iteration, we only need to minimize the linear function $\hat{f}(y)$ over $K$, which may be easier.
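
Here is a minimal sketch of the iteration in Python (the function and argument names are my own; `lmo` stands in for whatever linear minimization oracle $K$ admits, and the step size follows the schedule from the theorem below):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, num_steps=100):
    """Conditional gradient descent sketch.

    grad : callable, x -> gradient of f at x
    lmo  : callable, g -> argmin_{y in K} <g, y>  (linear minimization oracle)
    x0   : starting point, assumed to lie in K
    """
    x = np.asarray(x0, dtype=float)
    for t in range(1, num_steps + 1):
        y = lmo(grad(x))                    # minimize the linearization over K
        gamma = 2.0 / (t + 1)               # step-size schedule gamma_t = 2/(t+1)
        x = (1.0 - gamma) * x + gamma * y   # convex combination, so x stays in K
    return x
```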

A major advantage of conditional gradient descent over projected gradient descent (besides the lack of a projection step) is that the former can adapt to the smoothness of $f$ (i.e., the Lipschitzness of its gradient) in an arbitrary norm.

Theorem: Let $f$ be convex and $\beta$-smooth w.r.t. some norm $||\cdot||$, let $R = \sup_{x, y \in K} ||x - y||$, and take $\gamma_t = \frac{2}{t+1}$ for $t \geq 1$. Then for any $t \geq 2$ we have $$ f(x_t) - f(x^*) \leq \frac{2\beta R^2}{t+1} $$

Proof: Can be found in Bubeck’s Monograph, Theorem 3.8. It is short and relies mostly on the smoothness property, the fact that $y_t$ is a minimizer of $\langle \nabla f(x_t), y\rangle$ over $K$, and the choice $\gamma_t = \frac{2}{t+1}$.
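
To give a flavor of the argument (a sketch, not a full proof): smoothness and $x_{t+1} - x_t = \gamma_t (y_t - x_t)$ give $$ f(x_{t+1}) \leq f(x_t) + \gamma_t \langle \nabla f(x_t), y_t - x_t \rangle + \frac{\beta}{2} \gamma_t^2 ||y_t - x_t||^2 $$ By the optimality of $y_t$ and convexity, $\langle \nabla f(x_t), y_t - x_t \rangle \leq \langle \nabla f(x_t), x^* - x_t \rangle \leq f(x^*) - f(x_t)$, and $||y_t - x_t|| \leq R$, so writing $\delta_t = f(x_t) - f(x^*)$ we get $$ \delta_{t+1} \leq (1 - \gamma_t)\,\delta_t + \frac{\beta}{2} \gamma_t^2 R^2 $$ Induction over $t$ with $\gamma_t = \frac{2}{t+1}$ then yields the stated bound.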

In addition to being projection-free and “norm-free”, conditional gradient descent produces sparse iterates in the vertex representation of $K$: since the linear minimization over $K$ can always be taken to return a vertex (extreme point) of $K$, any iterate $x_t$ is a convex combination of at most $t$ vertices (assuming $x_1$ is itself a vertex), and typically $t \ll n$ where $K \subset \mathbb{R}^n$.
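
For a concrete toy illustration (the quadratic objective and dimensions below are made up): when $K$ is the probability simplex, the oracle just returns a standard basis vector, and counting nonzeros of the iterates makes the sparsity visible.

```python
import numpy as np

np.random.seed(0)
n = 1000
A = np.random.randn(50, n)
b = np.random.randn(50)
grad = lambda x: A.T @ (A @ x - b)      # gradient of 0.5 * ||Ax - b||^2

def simplex_lmo(g):
    """argmin_{y in simplex} <g, y> is the vertex e_i with i = argmin_i g_i."""
    y = np.zeros_like(g)
    y[np.argmin(g)] = 1.0
    return y

x = np.zeros(n)
x[0] = 1.0                              # x_1 is a vertex of the simplex
for t in range(1, 20):
    gamma = 2.0 / (t + 1)
    x = (1.0 - gamma) * x + gamma * simplex_lmo(grad(x))
    print(t, np.count_nonzero(x))       # support grows by at most one per step
```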

Thus, conditional gradient descent is dimension-free, projection-free, and norm-free, and it produces sparse iterates in the vertex representation of $K$.

Bubeck presents a cool application of Frank-Wolfe to a least squares regression problem with structured sparsity (Section 3.3).
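
As a rough stand-in for the flavor of that example (not necessarily the monograph's exact setup; the problem sizes, `radius`, and sparsity level below are made up), here is Frank-Wolfe on least squares over an $\ell_1$ ball, reusing the `frank_wolfe` sketch from above. The oracle simply puts all its mass on the coordinate with the largest-magnitude gradient entry.

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(200, 500)
x_true = np.zeros(500)
x_true[:10] = 1.0                        # sparse ground truth
b = A @ x_true + 0.01 * np.random.randn(200)
radius = 10.0                            # constraint set {x : ||x||_1 <= radius}

grad = lambda x: A.T @ (A @ x - b)       # gradient of 0.5 * ||Ax - b||^2

def l1_ball_lmo(g):
    """argmin_{||y||_1 <= radius} <g, y>: a signed vertex at the largest |g_i|."""
    i = np.argmax(np.abs(g))
    y = np.zeros_like(g)
    y[i] = -radius * np.sign(g[i])
    return y

# uses the frank_wolfe sketch defined earlier
x_hat = frank_wolfe(grad, l1_ball_lmo, x0=np.zeros(500), num_steps=300)
print(np.count_nonzero(x_hat), 0.5 * np.linalg.norm(A @ x_hat - b) ** 2)
```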

References

  1. Bubeck’s Monograph on Convex Optimization Section 3.3
  2. Nisheeth Vishnoi’s notes on Convex Optimization Section 1.7.8