Regret Analysis of Stochastic Bandit Problems

2020/07/23

Reading time: 9 minutes.

Assumptions

  1. Bandit Feedback: only observe reward for the arm played
  2. IID rewards: For each arm $a$, $\exists$ a reward distribution $D_a$ which is unknown
  3. Per-round rewards are bounded, i.e. $r_t \in [0, 1]$ for $t \in [T]$

Notation

Throughout, $\mathcal{A}$ denotes the set of $K$ arms, $\mu(a) = \mathbb{E}_{r \sim D_a}[r]$ the mean reward of arm $a$, $\mu^* = \max_{a \in \mathcal{A}} \mu(a)$ the best mean reward, $a^*$ an optimal arm, and $a_t$ the arm played in round $t$.

Regret

$$ R(T) = \mu^*T - \sum_{t=1}^T \mu(a_t) $$
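
To keep the algorithms below concrete, here is a minimal simulation harness (not from the original post): `BernoulliArm` and `regret` are hypothetical helper names, and Bernoulli rewards are used only because they satisfy the boundedness assumption.

```python
import random

class BernoulliArm:
    """A single arm whose reward distribution D_a is Bernoulli(mu), so r_t is always in [0, 1]."""
    def __init__(self, mu: float):
        self.mu = mu

    def pull(self) -> float:
        return 1.0 if random.random() < self.mu else 0.0

def regret(bandit: list, plays: list) -> float:
    """R(T) = mu* T - sum_t mu(a_t), computed from the true means (known only to the simulator)."""
    mu_star = max(arm.mu for arm in bandit)
    return mu_star * len(plays) - sum(bandit[a].mu for a in plays)
```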

Let us start with a naive algorithm that explores first for a fixed number of rounds and then exploits.

Explore-First

Algorithm (a code sketch follows the list):

  1. Exploration phase: try each arm $N$ times
  2. Select the arm $\hat{a}$ with the highest average reward (breaking ties arbitrarily)
  3. Exploitation phase: play arm $\hat{a}$ for the remaining rounds
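
To make the procedure concrete, here is a minimal sketch of explore-first (an illustration, not the original post's code); it assumes each arm is a zero-argument callable returning a reward in $[0, 1]$, e.g. the `BernoulliArm.pull` methods from the harness above.

```python
def explore_first(arms, T: int, N: int):
    """Explore-first: pull each arm N times, then commit to the empirical best arm."""
    K = len(arms)
    assert K * N <= T, "the exploration phase must fit within the horizon T"
    totals = [0.0] * K
    plays = []
    # 1. Exploration phase: try each arm N times.
    for a in range(K):
        for _ in range(N):
            totals[a] += arms[a]()
            plays.append(a)
    # 2. Select the arm with the highest average reward (ties broken by index).
    best = max(range(K), key=lambda a: totals[a] / N)
    # 3. Exploitation phase: play the chosen arm for the remaining rounds.
    for _ in range(T - K * N):
        arms[best]()
        plays.append(best)
    return plays
```

For example, `explore_first([BernoulliArm(m).pull for m in (0.3, 0.5, 0.7)], T=10_000, N=100)` returns the sequence of arms played.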

Let us try to analyze the regret of the above algorithm. To aid us in the analysis, it is useful to recall Hoeffding’s inequality, which here gives $$ \mathbb{P}[|\bar{\mu}(a) - \mu(a)| \leq r(a)] \geq 1 - \frac{2}{T^4} $$ where $\bar{\mu}(a)$ is the average reward for arm $a$ after the exploration phase and $r(a) = \sqrt{\frac{2\log T}{N}}$ is the confidence radius. In other words, with high probability the empirical average deviates from the true mean by at most the confidence radius.
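
To see where the $\frac{2}{T^4}$ comes from, apply the standard two-sided Hoeffding bound for $N$ i.i.d. samples bounded in $[0, 1]$ and plug in the confidence radius:

$$ \mathbb{P}\left[|\bar{\mu}(a) - \mu(a)| \geq \epsilon\right] \leq 2e^{-2N\epsilon^2}, \qquad \epsilon = r(a) = \sqrt{\frac{2\log T}{N}} \implies 2e^{-2N\cdot\frac{2\log T}{N}} = 2e^{-4\log T} = \frac{2}{T^4} $$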

Another trick that simplifies the regret analysis is the clean event trick, where we define a clean event and a bad event (which is simply the complement of the clean event). For the purposes of analyzing the regret of explore-first, we define the following clean event: $$ \mathcal{E}: |\bar{\mu}(a) - \mu(a)| \leq r(a), \forall a \in \mathcal{A} $$

Regret Guarantee: Explore-first achieves regret $\mathbb{E}[R(T)] \leq T^{\frac{2}{3}}\mathcal{O}(K\log T)^{\frac{1}{3}}$

Proof: First, suppose the clean event $\mathcal{E}$ holds. Say at the end of the exploration phase we chose $\hat{a}$ as the best arm but $\hat{a} \neq a^* $. Then at each step of the exploitation phase we incur regret. But for $\hat{a}$ to have been chosen, we must have $\bar{\mu}(\hat{a}) \geq \bar{\mu}(a^*)$.

Since it is a clean event we have, $$\begin{aligned} \mu(\hat{a}) + r(\hat{a}) &\geq \bar{\mu}(\hat{a}) \geq \bar{\mu}(a^*) \geq \mu(a^*) - r(a^*) \\ \mu^* - \mu(\hat{a}) &\leq r(\hat{a}) + r(a^*) = \mathcal{O}\left(\sqrt{\frac{\log T}{N}}\right) \end{aligned}$$

Thus, we can bound the regret as \begin{align} R(T) &\leq KN + (T - KN) \mathcal{O}\left(\sqrt{\frac{\log T}{N}}\right) \\ &\leq KN + T \mathcal{O}\left(\sqrt{\frac{\log T}{N}}\right) \end{align} where the first term in the first line comes from the exploration phase (using the worst-case per-round regret of $1$, since rewards lie in $[0, 1]$). Minimizing the right hand side over $N$ we get $N = \left(\frac{T}{K}\right)^{\frac{2}{3}}\mathcal{O}(\log T)^{\frac{1}{3}}$ and $R(T) \leq T^{\frac{2}{3}}\mathcal{O}(K\log T)^{\frac{1}{3}}$.
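
For completeness, the stated choice of $N$ comes from (approximately) balancing the two terms on the right hand side:

$$ KN \approx T\sqrt{\frac{\log T}{N}} \implies N^{3/2} \approx \frac{T\sqrt{\log T}}{K} \implies N \approx \left(\frac{T}{K}\right)^{\frac{2}{3}}(\log T)^{\frac{1}{3}}, \qquad \text{so that} \quad KN \approx T^{\frac{2}{3}}(K\log T)^{\frac{1}{3}} $$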

But we also need to consider the bad event. However, its probability is so small (thanks to Hoeffding’s!) that it does not affect the expected regret: \begin{align} \mathbb{E}[R(T)] &= \mathbb{E}[R(T)|\mathcal{E}]\mathbb{P}[\mathcal{E}] + \mathbb{E}[R(T)|\bar{\mathcal{E}}]\mathbb{P}[\bar{\mathcal{E}}] \\ &\leq \mathbb{E}[R(T)|\mathcal{E}] + T \cdot \mathcal{O}(T^{-4}) \\ &\leq T^{\frac{2}{3}} \mathcal{O}(K\log T)^{\frac{1}{3}} \end{align} $\square$

An issue with explore-first is that the regret accrued during the exploration phase can be terrible! Maybe it's better to interleave exploration with exploitation.

Epsilon-Greedy

Algorithm (a code sketch follows the list):

  1. For each round $t=1, 2, \cdots$ do
  2. Toss a coin with success probability $\epsilon_t$
  3. If success, explore: choose arm uniformly at random and play it
  4. Else, exploit: choose arm with highest average reward so far and play it
  5. End for
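
As before, here is a minimal sketch (an illustration under the same hypothetical interface of zero-argument reward callables); `log(t + 1)` is used only to avoid $\epsilon_1 = 0$, and unplayed arms are given an infinite empirical mean so that the exploit step is well defined.

```python
import math
import random

def epsilon_greedy(arms, T: int):
    """Epsilon-greedy with the schedule eps_t ~ t^(-1/3) (K log t)^(1/3), capped at 1."""
    K = len(arms)
    counts, totals, plays = [0] * K, [0.0] * K, []
    for t in range(1, T + 1):
        eps = min(1.0, (K * math.log(t + 1) / t) ** (1 / 3))
        if random.random() < eps:
            a = random.randrange(K)  # explore: pick an arm uniformly at random
        else:
            # Exploit: pick the arm with the highest empirical mean so far.
            a = max(range(K), key=lambda i: totals[i] / counts[i] if counts[i] else float("inf"))
        totals[a] += arms[a]()
        counts[a] += 1
        plays.append(a)
    return plays
```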

Since exploration is spread more uniformly over time, we can derive meaningful regret bounds even for small $t$ (unlike explore-first, which requires $T > KN$).

Let us focus on $\epsilon_t \sim t^{-\frac{1}{3}}$ so that the expected number of exploration rounds up to time $t$ is of the order of $t^{\frac{2}{3}}$, the same as in explore-first.

Regret Guarantee: Epsilon-greedy with $\epsilon_t = t^{-\frac{1}{3}}(K\log t)^{\frac{1}{3}}$ achieves regret $\mathbb{E}[R(t)] \leq t^{\frac{2}{3}}\mathcal{O}(K\log t)^{\frac{1}{3}}$ for each round $t$.

Proof: Fix round $t$ and let us compute $\mathbb{E}[\Delta(a_t)]$. Let us first define the clean event for ease of analysis. Let $n_t(a)$ be the number of times arm $a$ was played up to round $t$. If $n_t(a)$ were fixed in advance (as in explore-first) then we could simply use Hoeffding’s $$ \mathbb{P}[|\bar{\mu}_t(a) - \mu(a)| \leq r_t(a)] \geq 1 - \frac{2}{T^4} $$ where $r_t(a) = \sqrt{\frac{2\log T}{n_t(a)}}$. But $n_t(a)$ is a random variable that depends on past rewards; thus, conditioned on $n_t(a)$, the samples of arm $a$ are not necessarily independent.

We can get around this by applying Hoeffding’s inequality to each possible value $j$ of $n_t(a)$: letting $\bar{\mu}_j(a)$ be the average of the first $j$ samples of arm $a$ and $r_j(a) = \sqrt{\frac{2\log T}{j}}$, we have $$ \forall j \leq T, \quad \mathbb{P}[|\bar{\mu}_j(a) - \mu(a)| \leq r_j(a)] \geq 1 - \frac{2}{T^4} $$ Using a union bound over all arms and all $j$ (and assuming the number of arms $K \leq T$) we have $$ \mathbb{P}[\forall a, \forall j, \; |\bar{\mu}_j(a) - \mu(a)| \leq r_j(a)] \geq 1 - \frac{2}{T^2} $$

We will define this as the clean event $\mathcal{E}$: $$ \mathcal{E}: \{\forall a, \forall t, \; |\bar{\mu}_t(a) - \mu(a)| \leq r_t(a)\} $$ It is clear from the above arguments that $$ \mathbb{P}[\mathcal{E}] \geq 1 - \frac{2}{T^2} $$

Now let’s move on to the regret analysis of epsilon-greedy. Suppose the clean event $\mathcal{E}$ holds.

At any round $t$, if we exploit and the arm played $a_t \neq a^* $, then we must have $\bar{\mu}(a_t) \geq \bar{\mu}(a^*)$. Since the clean event holds, we have

\begin{align} \mu(a_t) + r(a_t) &\geq \bar{\mu}(a_t) \geq \bar{\mu}(a^*) \geq \mu(a^*) - r(a^*) \\ \mu^* - \mu(a_t) &\leq r(a_t) + r(a^*) \\ \mu^* - \mu(a_t) &\leq \mathcal{O}\left(\sqrt{\frac{\log T}{n_t(a^*)}}\right) + \mathcal{O}\left(\sqrt{\frac{\log T}{n_t(a_t)}}\right) \end{align}

Note that $\mathbb{E}[n_t(a)] \geq \frac{\epsilon_t}{K} t$ if we only count the exploration rounds (since $\epsilon_s$ is non-increasing, $\frac{1}{K}\sum_{s \leq t} \epsilon_s \geq \frac{\epsilon_t t}{K}$). Thus, we have $$ \mathbb{E}[\mu^* - \mu(a_t)] \leq \mathcal{O}\left( \sqrt{\frac{K\log T}{\epsilon_t t}}\right) $$

Now consider $\mathbb{E}[\Delta(a_t)]$, where $\Delta(a_t) = \mu^* - \mu(a_t)$ is the suboptimality gap of arm $a_t$: \begin{align} \mathbb{E}[\Delta(a_t)] &\leq \epsilon_t + (1 - \epsilon_t)\mathbb{E}[\mu^* - \mu(a_t)] \\ &\leq \epsilon_t + (1 - \epsilon_t)\mathcal{O}\left( \sqrt{\frac{K\log T}{\epsilon_t t}}\right) \end{align} The first inequality accounts for both exploration (with worst-case per-round regret of at most $1$) and exploitation.

Now we can decompose the expected regret in terms of the expected suboptimality gaps $$ \mathbb{E}[R(t)] = \sum_{s=1}^{t} \mathbb{E}[\Delta(a_s)] \leq \epsilon_t t + (1 - \epsilon_t)\mathcal{O}\left( \sqrt{\frac{Kt\log T}{\epsilon_t}}\right) $$ (the per-round terms are of order $s^{-1/3}$, and summing them over $s \leq t$ gives, up to constants, the same order as $t$ times the term at $s = t$). Optimizing over $\epsilon_t$ we get $\epsilon_t = t^{-\frac{1}{3}}(K\log t)^{\frac{1}{3}}$, which results in $$ \mathbb{E}[R(t)] \leq t^{\frac{2}{3}}\mathcal{O}(K\log t)^{\frac{1}{3}} $$
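
The schedule for $\epsilon_t$ comes from the same kind of balancing act, treating the logarithmic factor loosely as $\log t$:

$$ \epsilon_t t \approx \sqrt{\frac{Kt\log t}{\epsilon_t}} \implies \epsilon_t^3 t^2 \approx Kt\log t \implies \epsilon_t \approx t^{-\frac{1}{3}}(K\log t)^{\frac{1}{3}}, \qquad \text{so that} \quad \epsilon_t t \approx t^{\frac{2}{3}}(K\log t)^{\frac{1}{3}} $$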

Note that we have only considered the clean event, but we can easily show that the probability of the bad event is so small that it does not affect the expected regret. $\square$

Successive Elimination

Before we move onto the algorithm, let us define what upper/lower confidence bounds are: $$ \mathtt{UCB}_t(a) = \bar{\mu}_t(a) + r_t(a) $$ $$ \mathtt{LCB}_t(a) = \bar{\mu}_t(a) - r_t(a) $$ The interval $[\mathtt{LCB}_t(a), \mathtt{UCB}_t(a)]$ is called the confidence interval.
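
In code, with the same confidence radius $r_t(a) = \sqrt{\frac{2\log T}{n_t(a)}}$ used throughout, the bounds look as follows (hypothetical helpers; an unplayed arm gets an infinite radius so its confidence interval covers everything):

```python
import math

def confidence_radius(n: int, T: int) -> float:
    """r_t(a) = sqrt(2 log T / n_t(a)); infinite if the arm has never been played."""
    return math.sqrt(2 * math.log(T) / n) if n > 0 else float("inf")

def ucb(mean: float, n: int, T: int) -> float:
    return mean + confidence_radius(n, T)

def lcb(mean: float, n: int, T: int) -> float:
    return mean - confidence_radius(n, T)
```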

Algorithm (a code sketch follows the list):

  1. Initially all arms are set to “active”
  2. Each phase:
  3. Try all arms that are active
  4. Deactivate all arms $a$ such that there exists an active arm $a'$ with $\mathtt{UCB}_t(a) < \mathtt{LCB}_t(a')$
  5. Repeat until end of all rounds
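
A minimal sketch of successive elimination under the same hypothetical interface (zero-argument reward callables); the deactivation check is skipped until every active arm has been played at least once, so that the empirical means are defined.

```python
import math

def successive_elimination(arms, T: int):
    """Alternate among active arms; deactivate an arm once its UCB falls below some arm's LCB."""
    K = len(arms)
    counts, totals, plays = [0] * K, [0.0] * K, []
    active = list(range(K))

    def radius(a):
        return math.sqrt(2 * math.log(T) / counts[a])

    t = 0
    while t < T:
        # One phase: try every active arm once.
        for a in active:
            if t >= T:
                break
            totals[a] += arms[a]()
            counts[a] += 1
            plays.append(a)
            t += 1
        # Deactivate every arm a with UCB_t(a) < LCB_t(a') for some active arm a'.
        if all(counts[a] > 0 for a in active):
            best_lcb = max(totals[a] / counts[a] - radius(a) for a in active)
            active = [a for a in active
                      if totals[a] / counts[a] + radius(a) >= best_lcb]
    return plays
```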

Regret Guarantee: Successive elimination achieves regret $$ \mathbb{E}[R(t)] = \mathcal{O}(\sqrt{Kt\log T}) $$ for all rounds $t \leq T$. (Note that we achieve a $\sqrt{t}$ dependence instead of the $t^{\frac{2}{3}}$ dependence of explore-first/epsilon-greedy, and this is due to adaptive exploration.)

Proof: Consider any arm $a$ that is not optimal, i.e. $\mu(a) < \mu^* $. Let $t$ be the last round in which the deactivation rule was invoked while $a$ was still active. Define the clean event $\mathcal{E}$ the same way as we did for the epsilon-greedy regret analysis, and suppose it holds. Then, $$ \Delta(a) = \mu^* - \mu(a) \leq 2(r_t(a^*) + r_t(a)) = \mathcal{O}(r_t(a)) $$ The inequality holds since the confidence intervals of $a^*$ and $a$ must overlap while $a$ has not yet been eliminated. The final equality holds because $n_t(a)$ and $n_t(a^*)$ differ by at most $1$, since we alternate among active arms.

But arm $a$ is not played after it is deactivated, so $n_t(a) = n_T(a)$, which implies $r_t(a) = r_T(a)$. Thus we have $$ \Delta(a) \leq \mathcal{O}(r_T(a)) = \mathcal{O}\left(\sqrt{\frac{\log T}{n_T(a)}}\right) $$ for all arms $a$ such that $\mu(a) < \mu^*$.

Thus, an arm that is played many times cannot be too bad.

Consider the regret of arm $a$ up to round $t$ $$ R(t, a) = \Delta(a) n_t(a) \leq n_t(a) \mathcal{O}\left(\sqrt{\frac{\log T}{n_T(a)}}\right) \leq \mathcal{O}(\sqrt{n_t(a)\log T}) $$ where the last step uses $n_t(a) \leq n_T(a)$. Consider the set $\mathcal{A}^+ = \{a: \mu(a) < \mu^*\}$. Then $$ R(t) = \sum_{a \in \mathcal{A}^+} R(t, a) \leq \mathcal{O}(\sqrt{\log T})\sum_{a \in \mathcal{A}^+} \sqrt{n_t(a)} \leq \mathcal{O}(\sqrt{\log T}) \sum_{a \in \mathcal{A}} \sqrt{n_t(a)} $$ Note that $f(x) = \sqrt{x}$ is a concave function, so from Jensen’s inequality we have $$ \frac{1}{K}\sum_{a \in \mathcal{A}} \sqrt{n_t(a)} \leq \sqrt{\frac{1}{K}\sum_{a \in \mathcal{A}} n_t(a)} $$

Hence, $$R(t) \leq \mathcal{O}(\sqrt{\log T}) \sqrt{K\sum_{a \in \mathcal{A}}n_t(a)} = \mathcal{O}(\sqrt{Kt\log T})$$

Thus, when the clean event $\mathcal{E}$ holds we have shown that the above regret bound is true. Similar to previous analyses we can also show that the regret contribution from the bad event $\bar{\mathcal{E}}$ is small and does not affect the regret bound. $\square$

In addition to the above regret bound, which holds with an absolute constant, we can also show an instance-dependent bound for successive elimination, as follows.

Instance-dependent Regret Guarantee: Successive Elimination achieves regret \begin{align} \mathbb{E}[R(T)] &\leq \mathcal{O}(\log T) \left[ \sum_{a \in \mathcal{A}^+} \frac{1}{\Delta(a)} \right] \\ &= \mathcal{O}(\log T) \left[\sum_{a: \mu(a) < \mu^*} \frac{1}{\mu^* - \mu(a)} \right] \end{align}

Proof: From the above proof we know $$ \Delta(a) \leq \mathcal{O}\left(\sqrt{\frac{\log T}{n_T(a)}}\right) $$ which gives $$ n_T(a) \leq \mathcal{O}\left(\frac{\log T}{(\Delta(a))^2}\right) $$

So we have, $$ R(T, a) = n_T(a)\Delta(a) \leq \mathcal{O}\left(\frac{\log T}{\Delta(a)}\right) $$ and $$R(T) = \sum_{a \in \mathcal{A}^+} R(T, a) \leq \mathcal{O}(\log T)\left[\sum_{a \in \mathcal{A}^+} \frac{1}{\Delta(a)} \right]$$ $\square$

The regret bound is logarithmic in $T$ with an instance-dependent constant whereas the previous regret bound is achievable with an absolute constant.

UCB1

Algorithm (a code sketch follows the list):

  1. Try each arm once
  2. For each round $t$ do
  3. Pick arm $\arg\max_{a \in \mathcal{A}} \mathtt{UCB}_t(a)$ and play it
  4. End for
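
A minimal sketch of UCB1 under the same hypothetical interface; the first $K$ rounds implement the "try each arm once" initialization.

```python
import math

def ucb1(arms, T: int):
    """UCB1: after trying each arm once, always play the arm maximizing
    UCB_t(a) = mean_t(a) + sqrt(2 log T / n_t(a))."""
    K = len(arms)
    counts, totals, plays = [0] * K, [0.0] * K, []
    for t in range(T):
        if t < K:
            a = t  # initialization: try each arm once
        else:
            a = max(range(K),
                    key=lambda i: totals[i] / counts[i]
                    + math.sqrt(2 * math.log(T) / counts[i]))
        totals[a] += arms[a]()
        counts[a] += 1
        plays.append(a)
    return plays
```

With the Bernoulli harness sketched earlier, `bandit = [BernoulliArm(m) for m in (0.3, 0.5, 0.7)]; plays = ucb1([arm.pull for arm in bandit], T=10_000); regret(bandit, plays)` gives the realized regret of a run.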

It is quite clear that UCB1 has an implicit exploration strategy: an arm $a$ is chosen either because it has a high average reward estimate $\bar{\mu}_t(a)$ or because it has a large confidence radius $r_t(a)$, i.e. it has not been explored much.

Regret Guarantee: UCB1 achieves the following regret bound for all rounds $t \leq T$ $$ \mathbb{E}[R(t)] = \mathcal{O}(\sqrt{Kt\log T}) $$ In addition, we can also show an instance-dependent bound as follows $$\mathbb{E}[R(T)] \leq \mathcal{O}(\log T) \left[\sum_{a \in \mathcal{A}^+} \frac{1}{\Delta(a)}\right]$$

Proof: Note that the regret bounds are the same as successive elimination. We will mostly borrow the same analysis and keep it simple.

Say arm $a_t$ is chosen in round $t$, then we know $$\mathtt{UCB}_t(a_t) \geq \mathtt{UCB}_t(a^*) \geq \mu^*$$ But under the clean event $\mathcal{E}$ (defined the same way as for successive elimination) we have $$\bar{\mu}_t(a_t) \leq \mu(a_t) + r_t(a_t)$$ Therefore, $$\mu(a_t) + 2r_t(a_t) \geq \bar{\mu}_t(a_t) + r_t(a_t) = \mathtt{UCB}_t(a_t) \geq \mu^*$$ $$ \Delta(a_t) \leq 2r_t(a_t) = 2\sqrt{\frac{2\log T}{n_t(a_t)}} $$ This trick is used often in the analyses of UCB-style algorithms.

Now, choosing $t$ to be the last round in which arm $a$ was played gives us the same property that we had in the proof of successive elimination, and the rest of the analysis follows. $\square$

References

All of the above proofs, intuitions and algorithms are taken from Aleksandrs Slivkins’ great monograph, *Introduction to Multi-Armed Bandits*.
