Model the explore/exploit problem as a gambling slot machine with K arms (a “multi-armed bandit”). Each “arm” k corresponds to an unknown payoff probability $p_k$.
For simplicity, assume that the probabilities don't change over time, and each event is independent of the others given these parameters.
The expected reward from bandit k is $E[X_k] = 1 \cdot p_k + 0 \cdot (1 - p_k) = p_k$, so the best arm is the one with the largest $p_k$.
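As a concrete reference for the strategies below, here is a minimal sketch of the Bernoulli bandit environment just described. The `BernoulliBandit` class, its arm probabilities, and the seed are illustrative assumptions, not part of the original notes.

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit: arm k pays 1 with fixed, unknown probability p_k, else 0."""
    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs)        # true p_k, hidden from the strategies
        self.rng = np.random.default_rng(seed)

    def pull(self, k):
        # Each pull is an independent Bernoulli(p_k) draw.
        return int(self.rng.random() < self.probs[k])

# Example: three arms with payoff probabilities unknown to the learner.
bandit = BernoulliBandit([0.3, 0.5, 0.7])
```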
At iteration N, pick the bandit that has performed best so far (see the sketch below). This method under-explores and tends to get stuck on sub-optimal bandits.
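A minimal greedy choice rule, written against the environment sketch above; the per-arm bookkeeping arrays are illustrative assumptions.

```python
import numpy as np

def greedy_choice(pulls, rewards):
    """Pick the arm with the best empirical mean reward so far.

    pulls[k]   = number of times arm k has been pulled (N_k)
    rewards[k] = total reward observed from arm k (r_k)
    """
    # Empirical means; never-pulled arms default to 0, so greedy may never try them.
    means = np.where(pulls > 0, rewards / np.maximum(pulls, 1), 0.0)
    return int(np.argmax(means))
```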
At iteration N: with probability $\epsilon$ explore by pulling a bandit chosen uniformly at random; otherwise exploit by pulling the bandit that has performed best so far.
This epsilon-greedy strategy uses a parameter $\epsilon$ to balance exploration and exploitation. Because $\epsilon$ is fixed for the whole run, the explore/exploit tradeoff is not adapted as evidence accumulates.
Modifications reduce $\epsilon$ over time, e.g. via a schedule or decay factor, so the strategy explores less as the estimates improve; a sketch follows.
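A minimal epsilon-greedy loop with an assumed, illustrative $\epsilon_0/\sqrt{t}$ decay schedule, reusing the environment and greedy choice sketched above (the function name and defaults are my own):

```python
import numpy as np

def epsilon_greedy(bandit, K, iterations=10_000, eps0=0.1, seed=1):
    rng = np.random.default_rng(seed)
    pulls = np.zeros(K)
    rewards = np.zeros(K)
    for t in range(1, iterations + 1):
        eps = eps0 / np.sqrt(t)                    # decayed exploration rate (one possible schedule)
        if rng.random() < eps:
            k = int(rng.integers(K))               # explore: random arm
        else:
            k = greedy_choice(pulls, rewards)      # exploit: best empirical arm
        r = bandit.pull(k)
        pulls[k] += 1
        rewards[k] += r
    return rewards.sum(), pulls

total_reward, pulls = epsilon_greedy(bandit, K=3)
```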
a.k.a. Thompson Sampling or Bayesian Bandits. The number of pulls of each bandit k should be proportional to the probability that that bandit is optimal.
$p(p_k \mid r_k, N_k)$ is the Bayesian posterior density over $p_k$, given $r_k$ rewards observed from $N_k$ pulls of bandit k. For example, with Bernoulli rewards and a uniform $\mathrm{Beta}(1,1)$ prior, this posterior is $\mathrm{Beta}(1 + r_k,\; 1 + N_k - r_k)$.
Typically implemented via posterior sampling: at each iteration, draw a sample $\hat{p}_k \sim p(p_k \mid r_k, N_k)$ for each bandit and pull the bandit with the largest sampled value.
This technique has the advantages that there are no parameters to tune and it works well in practice. Its drawbacks are that it is more computation-heavy and its theoretical guarantees are less well understood than those of other methods.
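A minimal Thompson Sampling sketch for the Bernoulli bandits above, assuming a conjugate $\mathrm{Beta}(1,1)$ prior on each $p_k$; the prior, loop structure, and function name are illustrative assumptions.

```python
import numpy as np

def thompson_sampling(bandit, K, iterations=10_000, seed=2):
    rng = np.random.default_rng(seed)
    successes = np.ones(K)   # Beta alpha = 1 (prior) + observed rewards r_k
    failures = np.ones(K)    # Beta beta  = 1 (prior) + observed non-rewards N_k - r_k
    for _ in range(iterations):
        # Draw one sample of p_k from each posterior Beta(successes, failures) ...
        samples = rng.beta(successes, failures)
        # ... and pull the bandit whose sampled value is largest.
        k = int(np.argmax(samples))
        r = bandit.pull(k)
        successes[k] += r
        failures[k] += 1 - r
    return successes - 1, successes + failures - 2   # total rewards r_k, total pulls N_k

r, N = thompson_sampling(bandit, K=3)
```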
Can also learn the individual distributions using logistic regression classification, as sketched below.
In the online learning of click-through rates example, each “arm” corresponds to an advertisement with an unknown click-through rate.
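A minimal sketch of the logistic-regression variant for the click-through-rate setting, using scikit-learn. The synthetic user/context features, the idea of one classifier per advertisement, and the `predict_ctr` helper are all illustrative assumptions, not from the original notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic logged data for one ad: context features and observed clicks.
X = rng.normal(size=(500, 3))                       # e.g. user/context features
true_w = np.array([0.8, -0.5, 0.3])                 # hidden "true" weights for simulation
clicks = (rng.random(500) < 1 / (1 + np.exp(-(X @ true_w)))).astype(int)

# One classifier per ad ("arm"); its predicted probability is the estimated CTR.
model = LogisticRegression().fit(X, clicks)

def predict_ctr(model, x):
    """Estimated click-through rate for context x under this ad's model."""
    return model.predict_proba(x.reshape(1, -1))[0, 1]

print(predict_ctr(model, rng.normal(size=3)))
```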