Francesco Orabona’s monograph on online learning has a very neat and intuitive way of looking at AdaGrad. I found it to be really useful.
Looking back at OGD
From our note on online (sub)gradient descent, we had the following regret bound
$$R_T(u) \leq \frac{D^2}{2\eta_T} + \sum_{t=1}^T \frac{\eta_t}{2} ||g_t||_2^2 $$
where $D$ is the diameter of the closed, convex set $V$. If we use a fixed learning rate $\eta_t = \eta$ for all time steps and optimize the bound over $\eta$, we get the optimal learning rate
$$ \eta^* = \frac{D}{\sqrt{\sum_{t=1}^T ||g_t||_2^2}} $$
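To see where this comes from, write $S = \sum_{t=1}^T ||g_t||_2^2$. With a fixed $\eta$, the right-hand side of the regret bound is $\frac{D^2}{2\eta} + \frac{\eta}{2}S$, and setting its derivative with respect to $\eta$ to zero gives
$$ -\frac{D^2}{2\eta^2} + \frac{S}{2} = 0 \implies \eta^* = \frac{D}{\sqrt{S}}, $$
which, plugged back into the bound, yields the oracle regret $R_T(u) \leq D\sqrt{S}$.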
But the above choice of learning rate requires us to peek into the future subgradients! A reasonable approximation is to instead use only the subgradients we have seen so far:
$$ \eta_t = \frac{D}{\sqrt{\sum_{i=1}^{t} ||g_i||_2^2}} $$
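To make the scheme concrete, here is a minimal sketch (my own, in Python/NumPy, not code from the monograph) of projected online subgradient descent with this adaptive step size. The `subgradient` and `project` callbacks are hypothetical placeholders: the first returns a subgradient of the round-$t$ loss at the current point, and the second is the Euclidean projection onto $V$.

```python
import numpy as np

def adaptive_ogd(x0, subgradient, project, D, T):
    """Projected online subgradient descent with the adaptive step size
    eta_t = D / sqrt(sum_{i<=t} ||g_i||_2^2)."""
    x = np.asarray(x0, dtype=float)
    sq_grad_sum = 0.0                      # running sum of ||g_i||_2^2
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x.copy())
        g = np.asarray(subgradient(t, x))  # subgradient of f_t at x_t
        sq_grad_sum += float(np.dot(g, g))
        if sq_grad_sum > 0:                # skip the step while all subgradients are zero
            eta = D / np.sqrt(sq_grad_sum)
            x = project(x - eta * g)       # projected subgradient step
    return iterates
```

For example, if $V$ is the Euclidean ball of radius $r$ (so $D = 2r$), `project` simply rescales any point whose norm exceeds $r$ back onto the ball.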
Let us see what regret we can achieve with this learning rate scheme. In the OGD regret bound above, the first term is unchanged, since $\eta_T = \eta^*$. For the second term, the optimal learning rate $\eta^*$ would give us
$$ \frac{1}{2}\sum_{t=1}^T \eta^* ||g_t||_2^2 = \frac{D}{2} \sqrt{\sum_{t=1}^T ||g_t||_2^2} $$
Using the adaptive learning rate scheme instead, the second term becomes
$$ \frac{1}{2}\sum_{t=1}^T \eta_t ||g_t||_2^2 = \frac{D}{2}\sum_{t=1}^T \frac{||g_t||_2^2}{\sqrt{\sum_{i=1}^t ||g_i||_2^2}} $$
The above term can be bounded using the following lemma.
Bounding Discrete Sums using Integrals: Let $a_0 \geq 0$, let $a_1, \ldots, a_T \geq 0$, and let $f:[0, \infty) \rightarrow [0, \infty)$ be a non-increasing function. Then $$ \sum_{t=1}^T a_t f\left(a_0 + \sum_{i=1}^t a_i\right) \leq \int_{a_0}^{\sum_{t=0}^T a_t} f(x) dx $$
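In our case, we apply the lemma with $a_0 = 0$, $a_t = ||g_t||_2^2$, and $f(x) = 1/\sqrt{x}$, which gives
$$ \sum_{t=1}^T \frac{||g_t||_2^2}{\sqrt{\sum_{i=1}^t ||g_i||_2^2}} \leq \int_{0}^{\sum_{t=1}^T ||g_t||_2^2} \frac{1}{\sqrt{x}} dx = 2\sqrt{\sum_{t=1}^T ||g_t||_2^2} $$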
Multiplying by $\frac{D}{2}$, we therefore bound the second term as
$$ \frac{D}{2}\sum_{t=1}^T \frac{||g_t||_2^2}{\sqrt{\sum_{i=1}^t ||g_i||_2^2}} \leq D\sqrt{\sum_{t=1}^T ||g_t||_2^2} $$
Thus, substituting this bound for the second term, and noting that the first term equals $\frac{D^2}{2\eta_T} = \frac{D}{2}\sqrt{\sum_{t=1}^T ||g_t||_2^2}$, we get the regret bound $$ R_T(u) \leq \frac{3D}{2}\sqrt{\sum_{t=1}^T ||g_t||_2^2} $$
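As a quick numerical sanity check (not a proof), the snippet below draws some random stand-in subgradients and verifies the summation bound $\sum_{t=1}^T \frac{||g_t||_2^2}{\sqrt{\sum_{i=1}^t ||g_i||_2^2}} \leq 2\sqrt{\sum_{t=1}^T ||g_t||_2^2}$ that drives the final result.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1000, 5
g = rng.normal(size=(T, d))        # stand-in subgradients g_1, ..., g_T

sq_norms = np.sum(g**2, axis=1)    # ||g_t||_2^2
cumsums = np.cumsum(sq_norms)      # sum_{i<=t} ||g_i||_2^2

lhs = np.sum(sq_norms / np.sqrt(cumsums))
rhs = 2.0 * np.sqrt(cumsums[-1])   # 2 * sqrt(sum_t ||g_t||_2^2)

print(f"sum of ratios:   {lhs:.3f}")
print(f"2 * sqrt(total): {rhs:.3f}")
assert lhs <= rhs
```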