Online Gradient Descent with Strongly Convex Functions

2020/07/03

Reading time: 2 minutes.

It turns out we can do better than the $\sqrt{T}$ regret we saw in the OGD analysis for convex, Lipschitz losses if the losses possess some curvature that can be exploited.

Strong Convexity

$f$ is $\mu$-strongly convex over a convex set $V$ w.r.t. a norm $||\cdot||$ if $\forall x, y \in V$ and $g \in \partial f(x)$ we have $$ f(y) \geq f(x) + \langle g, y - x\rangle + \frac{\mu}{2}||x - y||^2$$

In other words, a strongly convex function is lower bounded by a quadratic (instead of a linear function, as for merely convex functions), so we have a tighter lower bound. In fact, there can be several such quadratic lower bounds at a point, since there can be more than one subgradient.
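For example, take $f(x) = \frac{\mu}{2}||x||_2^2$. With $g = \nabla f(x) = \mu x$, $$ f(y) - f(x) - \langle g, y - x\rangle = \frac{\mu}{2}||y||_2^2 - \frac{\mu}{2}||x||_2^2 - \mu\langle x, y - x\rangle = \frac{\mu}{2}||x - y||_2^2, $$ so the defining inequality holds with equality and $f$ is $\mu$-strongly convex w.r.t. $||\cdot||_2$.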

Lemma: If $f$ is twice-differentiable, then a sufficient condition for $\mu$-strong convexity is that $\forall x, y \in V$ we have $$ \langle \nabla^2 f(x) y, y\rangle \geq \mu||y||^2 $$
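For instance, for a quadratic $f(x) = \frac{1}{2}\langle Ax, x\rangle$ with $A$ symmetric, we have $\nabla^2 f(x) = A$, so $\langle \nabla^2 f(x) y, y\rangle \geq \lambda_{\min}(A)||y||_2^2$; hence $f$ is $\lambda_{\min}(A)$-strongly convex w.r.t. $||\cdot||_2$ whenever $\lambda_{\min}(A) > 0$.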

Regret Analysis

Theorem: Assume the losses $\ell_t$ are $\mu_t$-strongly convex w.r.t. $||\cdot||_2$ with $\mu_t > 0$, and run OGD with stepsizes $\eta_t = \frac{1}{\sum_{i=1}^t \mu_i}$. Then for any $u \in V$, $$\sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) \leq \frac{1}{2}\sum_{t=1}^T \eta_t ||g_t||_2^2$$

Proof: Observe that \begin{align} \eta_1 &= \frac{1}{\mu_1} \\ \implies \frac{1}{2\eta_1} - \frac{\mu_1}{2} &= 0 \\ \implies \frac{1}{2\eta_t} - \frac{\mu_t}{2} &= \frac{1}{2\eta_{t-1}} \end{align} for $t=2, \cdots, T$. We can use the instantaneous regret lemma (refer to this note) and the strong-convexity condition to get \begin{align} \sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) &\leq \sum_{t=1}^T \left[\left(\frac{1}{2\eta_t} - \frac{\mu_t}{2}\right)||x_t - u||_2^2 - \frac{1}{2\eta_t}||x_{t+1} - u||_2^2 + \frac{\eta_t}{2}||g_t||_2^2\right] \\ &= -\frac{1}{2\eta_1}||x_2 - u||_2^2 + \sum_{t=2}^T \left(\frac{1}{2\eta_{t-1}}||x_t - u||_2^2 - \frac{1}{2\eta_t}||x_{t+1} - u||_2^2\right) + \sum_{t=1}^T\frac{\eta_t}{2}||g_t||_2^2 \\ &= -\frac{1}{2\eta_T}||x_{T+1} - u||_2^2 + \frac{1}{2}\sum_{t=1}^T \eta_t||g_t||_2^2 \\ &\leq \frac{1}{2}\sum_{t=1}^T \eta_t||g_t||_2^2 \end{align} where the second equality telescopes the middle sum and the last inequality drops the non-positive term. $\square$

We can obtain a more interpretable regret upper bound if the loss functions are Lipschitz in addition to being strongly convex.

Corollary: Suppose $\mu_t = \mu > 0$ for all $t$ and each $\ell_t$ is $L$-Lipschitz w.r.t. $||\cdot||_2$ for $t=1, \cdots, T$. Then $$\sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) \leq \frac{L^2}{2\mu} (1 + \log T) $$

Proof: Since $\ell_t$ is $L$-Lipschitz, we have $||g_t||_2^2 \leq L^2$, and with $\mu_t = \mu$ the stepsizes become $\eta_t = \frac{1}{\mu t}$. Thus, the above theorem gives $$ \sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) \leq \frac{L^2}{2\mu}\sum_{t=1}^T \frac{1}{t} $$ Bounding the harmonic sum by comparison with the integral $\int_1^T \frac{dt}{t}$ gives $\sum_{t=1}^T \frac{1}{t} \leq 1 + \log T$, and therefore $$ \sum_{t=1}^T (\ell_t(x_t) - \ell_t(u)) \leq \frac{L^2}{2\mu}(1 + \log T) $$ $\square$

Thus, OGD with strongly convex and Lipschitz loss functions has regret with a logarithmic dependence on $T$! This is much better than the $\sqrt{T}$ dependence for convex, Lipschitz loss functions.
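To get a feel for the gap: at $T = 10^6$ we have $1 + \log T \approx 14.8$, while $\sqrt{T} = 1000$.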

It is important to note that the corollary implicitly requires a bounded domain: a function that is strongly convex on an unbounded domain cannot also be Lipschitz on it, since strong convexity forces the (sub)gradient norms to grow without bound.
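To see the logarithmic behaviour numerically, here is a minimal sketch (not taken from any of the references) that runs projected OGD with $\eta_t = \frac{1}{\mu t}$ on the $\mu$-strongly convex losses $\ell_t(x) = \frac{\mu}{2}||x - z_t||_2^2$ over an $\ell_2$ ball; the radius $R$, the targets $z_t$, and the crude Lipschitz estimate are choices made purely for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T, mu, R = 5, 10_000, 1.0, 1.0          # dimension, rounds, strong convexity, domain radius
z = rng.uniform(-0.5, 0.5, size=(T, d))    # targets defining l_t(x) = mu/2 * ||x - z_t||^2

def project(v, radius):
    """Euclidean projection onto the l2 ball of the given radius."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

u = project(z.mean(axis=0), R)             # comparator: minimizer of the cumulative loss (projected)
x = np.zeros(d)                            # OGD iterate x_1
regret = 0.0

for t in range(1, T + 1):
    z_t = z[t - 1]
    regret += 0.5 * mu * (np.linalg.norm(x - z_t) ** 2 - np.linalg.norm(u - z_t) ** 2)
    g = mu * (x - z_t)                     # gradient of l_t at x_t
    eta = 1.0 / (mu * t)                   # stepsize from the theorem with mu_t = mu
    x = project(x - eta * g, R)            # projected OGD update

L = mu * (R + 0.5 * np.sqrt(d))            # crude Lipschitz bound: ||g_t|| <= mu(||x_t|| + ||z_t||)
print(f"regret = {regret:.3f}")
print(f"bound  = {L**2 / (2 * mu) * (1 + np.log(T)):.3f}")
```

The printed regret should sit well below the corollary's bound, and rerunning with larger $T$ shows it growing roughly like $\log T$.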

References

  1. Logarithmic regret algorithms for online convex optimization
  2. Adaptive online gradient descent
  3. Francesco Orabona’s Notes