Online Gradient Descent

2020/07/01


Online Gradient Descent (OGD) is similar to SGD, with one key difference: the loss function can change at every time step.

Algorithm

Projected Online Gradient Descent: given an initial point $x_1 \in V \subseteq \mathbb{R}^d$, where $V$ is a closed convex set, and step sizes $\eta_1, \cdots, \eta_T > 0$ (a code sketch follows the pseudocode):

  1. for $t = 1, \cdots, T$
  2. Play $x_t$
  3. Receive $\ell_t: \mathbb{R}^d \rightarrow (-\infty, \infty]$ and pay $\ell_t(x_t)$
  4. Set $g_t = \nabla \ell_t(x_t)$
  5. $x_{t+1} = \Pi_V(x_t - \eta_t g_t) = \arg\min_{y \in V}||x_t - \eta_tg_t - y||_2$
  6. end
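
To make the pseudocode concrete, here is a minimal NumPy sketch (my own illustration, not from the original post); the projection onto an $\ell_2$ ball is just one convenient choice of $V$, and the function names are made up for this example.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius (one convenient choice of V)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_ogd(x1, grad_fns, step_sizes, project=project_l2_ball):
    """Run projected OGD and return the points x_1, ..., x_T that were played.

    x1         : initial point in V
    grad_fns   : grad_fns[t](x) returns a (sub)gradient of loss_t at x
    step_sizes : positive step sizes eta_1, ..., eta_T
    project    : Euclidean projection onto the feasible set V
    """
    x = np.asarray(x1, dtype=float)
    played = []
    for grad_fn, eta in zip(grad_fns, step_sizes):
        played.append(x.copy())              # play x_t
        g = grad_fn(x)                       # g_t = (sub)gradient of loss_t at x_t
        x = project(x - eta * g)             # x_{t+1} = Pi_V(x_t - eta_t * g_t)
    return played
```

For example, `projected_ogd(np.zeros(3), [lambda x: 2 * x] * 10, [0.1] * 10)` runs ten rounds against the fixed loss $\ell_t(x) = ||x||_2^2$, whose gradient is $2x$.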

Regret Analysis

Crucial to the analysis is the following simple lemma:

Projection Lemma: $\forall x \in \mathbb{R}^d, y \in V$ where $V$ is a closed convex set, we have $$ ||\Pi_V(x) - y||_2 \leq ||x - y||_2 $$ In other words, projecting a point onto $V$ never increases its distance to any point of $V$.

Proof: The proof is short and relies on the first-order optimality condition for constrained convex problems; see Francesco Orabona’s Notes.
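
As a quick sanity check of the lemma (my own illustration, taking $V$ to be the unit $\ell_2$ ball, for which the projection has a closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

def project_l2_ball(x, radius=1.0):
    # Euclidean projection onto the L2 ball: rescale if the point lies outside
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

x = 3.0 * rng.normal(size=5)          # an arbitrary point, typically outside V
y = rng.normal(size=5)
y = 0.5 * y / np.linalg.norm(y)       # a point inside the unit ball V

# ||Pi_V(x) - y|| <= ||x - y||
assert np.linalg.norm(project_l2_ball(x) - y) <= np.linalg.norm(x - y) + 1e-12
```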

Next, we bound the instantaneous regret of OGD.

Instantaneous Regret Lemma: Let $\ell_t: V \rightarrow \mathbb{R}$ be convex and differentiable, and let $g_t = \nabla \ell_t(x_t)$. Then $\forall u \in V$ we have $$ \eta_t(\ell_t(x_t) - \ell_t(u)) \leq \eta_t \langle g_t, x_t - u \rangle \leq \frac{1}{2}||x_t - u||_2^2 - \frac{1}{2}||x_{t+1} - u||_2^2 + \frac{\eta_t^2}{2}||g_t||_2^2 $$

Proof: Regret analyses typically involve constructing a potential term and observing how it evolves over iterations. For OGD, let’s take $||x_t - u||_2^2$ as the potential and see how it changes in one iteration: \begin{align} ||x_{t+1} - u||_2^2 - ||x_t - u||_2^2 &\leq ||x_t - \eta_t g_t - u||_2^2 - ||x_t - u||_2^2 \\ &= -2\eta_t \langle g_t, x_t - u \rangle + \eta_t^2 ||g_t||_2^2 \\ &\leq -2\eta_t (\ell_t(x_t) - \ell_t(u)) + \eta_t^2 ||g_t||_2^2 \end{align} Rearranging the above inequality gives the statement of the lemma. The first inequality uses the projection lemma (since $u \in V$) and the second one uses the first-order convexity condition. $\square$
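
Both inequalities in the lemma can be checked numerically; here is a small sketch of my own, using a random convex quadratic loss and the unit $\ell_2$ ball as $V$ (the names and the choice of loss are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
d, eta = 5, 0.1

def project_l2_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

# A random convex quadratic loss_t(x) = 0.5 * ||A x - b||^2
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
loss = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x_t = project_l2_ball(rng.normal(size=d))      # current iterate in V
u = project_l2_ball(rng.normal(size=d))        # arbitrary comparator in V
g_t = grad(x_t)
x_next = project_l2_ball(x_t - eta * g_t)      # one projected OGD step

lhs = eta * (loss(x_t) - loss(u))
middle = eta * g_t @ (x_t - u)
rhs = (0.5 * np.sum((x_t - u) ** 2)
       - 0.5 * np.sum((x_next - u) ** 2)
       + 0.5 * eta ** 2 * np.sum(g_t ** 2))
assert lhs <= middle + 1e-12 and middle <= rhs + 1e-12
```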

Now, telescoping the instantaneous regret lemma yields the following regret guarantee.

Regret Guarantee for OGD: Let $V$ be a closed convex set with diameter $D$, i.e. $\max_{x, y \in V}||x - y||_2 \leq D$, and let $\ell_1, \cdots, \ell_T$ be differentiable convex loss functions. Pick any $x_1 \in V$ and non-increasing step sizes $\eta_{t+1} \leq \eta_t$ for all $t=1, \cdots, T-1$. Then for all $u \in V$ we have

$$ R_T(u) = \sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) \leq \frac{D^2}{2\eta_T} + \sum_{t=1}^T \frac{\eta_t}{2} ||g_t||_2^2$$

If $\eta_t = \eta$ for all $t = 1, \cdots, T$, then we have

$$ \sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) \leq \frac{||u - x_1||_2^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T ||g_t||_2^2$$

Proof: Using the instantaneous regret lemma and telescoping we get, $$\begin{aligned} \sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) &\leq \frac{1}{2\eta_1}||x_1 - u||_2^2 + \sum_{t=2}^T \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right) ||x_t - u||_2^2 + \sum_{t=1}^T \frac{\eta_t}{2} ||g_t||_2^2 \\ &\leq \frac{D^2}{2\eta_1} + \sum_{t=2}^T \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right) D^2 + \sum_{t=1}^T \frac{\eta_t}{2} ||g_t||_2^2 \\ &= \frac{D^2}{2\eta_1} + \frac{D^2}{2\eta_T} - \frac{D^2}{2\eta_1} + \sum_{t=1}^T \frac{\eta_t}{2} ||g_t||_2^2 \\ &= \frac{D^2}{2\eta_T} + \sum_{t=1}^T \frac{\eta_t}{2} ||g_t||_2^2 \end{aligned}$$ Here the second inequality uses $\eta_t \leq \eta_{t-1}$ (so each coefficient is non-negative) together with the diameter bound. If $\eta_t = \eta$ for all $t$, the first step above directly gives $$ \sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) \leq \frac{||x_1 - u||_2^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T ||g_t||_2^2$$ $\square$

In the case of a constant learning rate, we can minimize the upper bound above by choosing $\eta = \frac{||x_1 - u||_2}{\sqrt{\sum_{t=1}^T ||g_t||_2^2}}$, giving the bound $R_T(u) \leq ||u - x_1||_2 \sqrt{\sum_{t=1}^T ||g_t||_2^2}$. But this choice of $\eta$ requires knowledge of all future gradients (and of the comparator $u$)!

We can obtain a looser but usable bound by assuming that the gradient norms are bounded by $L$ (i.e. the $\ell_t$ are $L$-Lipschitz continuous) and that the domain $V$ is bounded (i.e. $\max_{x, y \in V}||x-y||_2 \leq D$), which gives $$\eta^* = \arg\min_{\eta} \frac{D^2}{2\eta} + \frac{\eta L^2 T}{2} = \frac{D}{L\sqrt{T}} $$ and hence the regret bound $R_T(u) \leq DL\sqrt{T}$, which is sublinear in $T$!

Since the regret grows only as $\sqrt{T}$, the average regret $R_T(u)/T \rightarrow 0$. Hence, OGD is no-regret.
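
To see the no-regret behaviour empirically, here is a small simulation of my own (not from the post): linear losses $\ell_t(x) = \langle z_t, x \rangle$ with $||z_t||_2 \leq 1$ on the unit $\ell_2$ ball (so $D = 2$, $L = 1$), run with the tuned constant step size $\eta = D/(L\sqrt{T})$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 10_000
D, L = 2.0, 1.0                                   # diameter of the unit ball, Lipschitz bound
eta = D / (L * np.sqrt(T))                        # tuned constant step size

def project_l2_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

z = rng.normal(size=(T, d))
z /= np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1.0)   # ensure ||z_t||_2 <= L = 1

x, alg_loss = np.zeros(d), 0.0
for t in range(T):
    alg_loss += z[t] @ x                          # pay loss_t(x_t) = <z_t, x_t>
    x = project_l2_ball(x - eta * z[t])           # the gradient of a linear loss is z_t

s = z.sum(axis=0)
u = -s / np.linalg.norm(s)                        # best fixed point in hindsight on the unit ball
regret = alg_loss - s @ u
print(f"regret = {regret:.1f} <= D*L*sqrt(T) = {D * L * np.sqrt(T):.1f}, "
      f"average regret = {regret / T:.4f}")
```

Linear losses are convenient here because the best fixed comparator on the unit ball has the closed form $u = -s/||s||_2$ with $s = \sum_t z_t$, so the regret can be measured exactly.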

Observations

Orabona makes a few interesting observations:

Online Subgradient Method

Note that the only property of $g_t$ we used is the first-order convexity condition, which also holds when $g_t$ is merely a subgradient of $\ell_t$ at $x_t$. Thus, the same regret guarantees hold for the online subgradient method!

In particular, OGD can also be applied to non-differentiable convex losses by using subgradients in place of gradients.
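
For instance, here is a minimal sketch of my own using the non-differentiable losses $\ell_t(x) = |x - \theta_t|$ on $V = [-1, 1]$ (so $D = 2$, $L = 1$), where $\mathrm{sign}(x - \theta_t)$ is a valid subgradient:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5_000
eta = 2.0 / np.sqrt(T)                          # eta = D / (L * sqrt(T)) with D = 2, L = 1

theta = rng.uniform(-1.0, 1.0, size=T)          # per-round targets
x, total_loss = 0.0, 0.0
for t in range(T):
    total_loss += abs(x - theta[t])             # loss_t(x) = |x - theta_t|, non-differentiable at theta_t
    g = np.sign(x - theta[t])                   # a valid subgradient (np.sign returns 0 at the kink)
    x = float(np.clip(x - eta * g, -1.0, 1.0))  # projection onto V = [-1, 1]

u_grid = np.linspace(-1.0, 1.0, 201)
best_fixed = min(np.abs(u - theta).sum() for u in u_grid)   # best fixed point in hindsight (grid search)
print(f"regret ~ {total_loss - best_fixed:.1f}, bound D*L*sqrt(T) = {2 * np.sqrt(T):.1f}")
```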

References

  1. Martin Zinkevich, “Online Convex Programming and Generalized Infinitesimal Gradient Ascent”, ICML 2003 (the original OGD paper)
  2. Elad Hazan, “Introduction to Online Convex Optimization”
  3. Francesco Orabona, “A Modern Introduction to Online Learning”