Model the explore/exploit problem as a gambling slot machine with K arms (a “multi-armed bandit”). Each “arm” corresponds to an unknown payoff probability $p_k$.

For simplicity, assume that the probabilities don't change over time, and each event is independent of the others given these parameters.

The expected reward from bandit k is $E[X_k] = 1 \cdot p_k + 0 \cdot (1-p_k) = p_k$

At iteration N, pick the bandit that has performed best so far (the greedy strategy). This method under-explores and tends to get stuck on sub-optimal bandits.
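A minimal sketch of this greedy rule; the simulator and its payoff probabilities are made up for illustration (the agent never sees `true_probs`):

```python
import random

def greedy_choice(successes, trials):
    """Pick the arm with the highest empirical success rate so far.

    Unplayed arms default to rate 0, so this rule can lock onto the
    first arm that ever pays out (the under-exploration problem).
    """
    rates = [s / t if t > 0 else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=lambda k: rates[k])

# Simulation with made-up payoff probabilities (hidden from the agent).
random.seed(0)
true_probs = [0.3, 0.5, 0.7]
successes, trials = [0, 0, 0], [0, 0, 0]
for _ in range(1000):
    k = greedy_choice(successes, trials)
    successes[k] += 1 if random.random() < true_probs[k] else 0
    trials[k] += 1
```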

At iteration N:

- if the bandit from iteration N-1 was successful, play it again
- otherwise, select another arm uniformly at random or cycle through the arms
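This is the “win-stay, lose-shift” rule; a sketch (names and the random-shift choice are illustrative):

```python
import random

def next_arm(prev_arm, prev_reward, K, rng=random):
    """Win-stay, lose-shift: repeat a winning arm, otherwise switch.

    On a loss this version picks uniformly at random; cycling,
    i.e. (prev_arm + 1) % K, is the deterministic alternative.
    """
    if prev_reward == 1:
        return prev_arm          # win: stay on the same arm
    return rng.randrange(K)      # loss: shift to another arm
```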

At iteration N:

- select the best bandit so far with probability $1-\epsilon$
- otherwise, select another arm uniformly at random or in proportion to its performance up to this point

This uses a parameter $\epsilon$ to control the balance between exploration and exploitation. Because $\epsilon$ is fixed for the whole run, the core explore/exploit tradeoff is not handled dynamically.

Modifications may reduce $\epsilon$ over time, e.g. via a schedule or decay factor.
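A sketch of $\epsilon$-greedy with one possible decay schedule (the schedule form is an illustrative choice, not prescribed above):

```python
import random

def epsilon_greedy(successes, trials, epsilon, rng=random):
    """With probability epsilon explore uniformly at random;
    otherwise exploit the arm with the best empirical rate so far."""
    K = len(trials)
    if rng.random() < epsilon:
        return rng.randrange(K)                       # explore
    rates = [s / t if t > 0 else 0.0 for s, t in zip(successes, trials)]
    return max(range(K), key=lambda k: rates[k])      # exploit

def decayed_epsilon(epsilon0, decay, n):
    """One possible schedule: epsilon_n = epsilon_0 / (1 + decay * n)."""
    return epsilon0 / (1.0 + decay * n)
```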

a.k.a. **Thompson Sampling** or **Bayesian Bandits**. The number of pulls of each bandit k should be proportional to the probability that that bandit is optimal.

$p(p_k|r_k,N_k)$ is a Bayesian density on the value $p_k$.

- $r_k, N_k$ are the number of successes and trials with the kth bandit so far
- $p(p_k|r_k,N_k)$ is the posterior belief about $p_k$ given the data $r_k, N_k$
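Under the standard conjugate choice for Bernoulli rewards (a Beta prior, an assumption not stated explicitly above), this posterior has a closed form:

$$p(p_k \mid r_k, N_k) = \mathrm{Beta}\!\left(p_k;\ \alpha_0 + r_k,\ \beta_0 + N_k - r_k\right)$$

With a uniform prior $\alpha_0 = \beta_0 = 1$, the posterior after $r_k$ successes in $N_k$ trials is simply $\mathrm{Beta}(1 + r_k,\ 1 + N_k - r_k)$.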

Typically learned via sampling methods:

- at each iteration i:
  - sample M values of $p_k$ for each bandit k from its density $p(p_k|r_k,N_k)$
  - …
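The loop above can be sketched with Beta posteriors, assuming Bernoulli rewards, a uniform Beta(1,1) prior, and M=1 sample per arm per step (a common choice):

```python
import random

def thompson_step(successes, trials, rng=random):
    """Draw one sample from each arm's Beta posterior, play the argmax.

    Under a uniform Beta(1,1) prior, arm k's posterior is
    Beta(1 + r_k, 1 + N_k - r_k); sampling and taking the argmax
    pulls each arm in proportion to the probability it is optimal.
    """
    samples = [
        rng.betavariate(1 + s, 1 + t - s)   # one posterior draw per arm
        for s, t in zip(successes, trials)
    ]
    return max(range(len(samples)), key=lambda k: samples[k])
```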

This technique has the advantages that there are no parameters to tune and that it works well in practice. On the other hand, it is more computation-heavy, and its theoretical guarantees are less well understood than those of other methods.

Can also learn the individual distributions using logistic regression classification.

In the online learning of click-through rates example, each “arm” corresponds to an advertisement with unknown click-through rate.
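A sketch tying the last two points together: a tiny logistic regression trained by gradient descent on made-up click data, where a one-hot ad id is the only feature, so each weight's sigmoid approximates that ad's click-through rate (all data and rates here are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit P(click = 1 | x) by batch gradient descent on the log-loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)    # log-loss gradient
    return w

# Made-up data: the ad id, one-hot encoded, is the only feature.
rng = np.random.default_rng(0)
true_ctr = np.array([0.1, 0.4])             # hidden from the learner
ads = rng.integers(0, 2, size=2000)
X = np.eye(2)[ads]
y = (rng.random(2000) < true_ctr[ads]).astype(float)
est_ctr = sigmoid(fit_logistic(X, y))       # per-ad CTR estimates
```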