Michael Data

Model the explore/exploit problem as a slot machine with K arms. Each “arm” corresponds to an unknown payoff probability $p_k$.

For simplicity, assume that the probabilities don't change over time, and each event is independent of the others given these parameters.

The expected reward from bandit k is $E[X_k] = 1 \cdot p_k + 0 \cdot (1-p_k) = p_k$

Strategies

Deterministic Greedy Strategy

At iteration N, pick the bandit that has performed the best up to this time. This method under-explores and tends to get stuck on sub-optimal bandits.
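A minimal sketch of this rule in Python (the function name and the running per-arm counts of successes and pulls are illustrative, assuming Bernoulli 0/1 rewards):

```python
def greedy_choice(successes, pulls):
    """Pick the arm with the highest empirical success rate so far.

    successes[k], pulls[k]: running counts for arm k.
    Unpulled arms get rate 0.0, so this variant never forces exploration
    and can lock onto a sub-optimal arm after a few lucky pulls.
    """
    rates = [s / n if n > 0 else 0.0 for s, n in zip(successes, pulls)]
    return max(range(len(rates)), key=lambda k: rates[k])
```

For example, with counts `successes=[3, 1]`, `pulls=[4, 4]` it always returns arm 0, even though arm 1 might have the higher true $p_k$.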

Play-the-Winner Strategy

At iteration N:

  • if the bandit from iteration N-1 was successful, play it again
  • otherwise, select another arm uniformly at random or cycle through them
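The two cases above can be sketched as follows (a hypothetical helper, assuming the previous arm's index and its 0/1 reward are tracked by the caller; this version switches uniformly at random rather than cycling):

```python
import random

def play_the_winner(prev_arm, prev_reward, num_arms):
    """Repeat the previous arm after a success; otherwise switch
    uniformly at random to one of the other arms."""
    if prev_reward == 1:
        return prev_arm
    others = [k for k in range(num_arms) if k != prev_arm]
    return random.choice(others)
```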

Epsilon-Greedy Strategy

At iteration N:

  • select the best bandit so far with probability $1-\epsilon$
  • otherwise, select another arm uniformly at random or in proportion to its performance up to this point

This uses a parameter $\epsilon$ to balance exploration and exploitation. Because $\epsilon$ is fixed for the whole run, the core explore/exploit tradeoff is not handled dynamically.

Modifications may reduce $\epsilon$ over time, e.g. via a schedule or decay factor.
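A minimal sketch of the strategy, plus one possible decay schedule (function names and the schedule's form are illustrative; the exploration step here draws uniformly over all arms, a common simplification):

```python
import random

def epsilon_greedy(successes, pulls, epsilon):
    """With probability 1-epsilon exploit the empirically best arm;
    otherwise explore an arm chosen uniformly at random."""
    if random.random() < epsilon:
        return random.randrange(len(pulls))
    rates = [s / n if n > 0 else 0.0 for s, n in zip(successes, pulls)]
    return max(range(len(rates)), key=lambda k: rates[k])

def decayed_epsilon(eps0, decay, iteration):
    """One hypothetical schedule: epsilon_i = eps0 / (1 + decay * i)."""
    return eps0 / (1.0 + decay * iteration)
```

With `epsilon=0.0` this reduces to the deterministic greedy strategy; decaying $\epsilon$ shifts the policy from exploration toward exploitation as evidence accumulates.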

Randomized Probability Matching

a.k.a. Thompson Sampling or Bayesian Bandits. The number of pulls from each bandit k should be proportional to the probability that the bandit is optimal.

$p(p_k|r_k,N_k)$ is a Bayesian density on the value $p_k$.

  • $r_k, N_k$ are the number of successes and trials with the kth bandit so far
  • $p(p_k|r_k,N_k)$ is the posterior belief about $p_k$ given data $r_k, N_k$

Typically learned via sampling methods:

  • at each iteration i:
  • sample M values of $p_k$ for each bandit k from its posterior $p(p_k|r_k,N_k)$
  • play the bandit with the largest sampled value

This technique has the advantages that there are no parameters to tune and it works well in practice. On the other hand, it is more computation-heavy, and fewer theoretical results are known for it than for other methods.
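For Bernoulli rewards the Beta distribution is the conjugate prior, so the posterior $p(p_k|r_k,N_k)$ has a closed form and one iteration can be sketched as below (a sketch assuming a uniform Beta(1, 1) prior and a single sample per arm, i.e. M = 1):

```python
import random

def thompson_step(successes, pulls):
    """One iteration of Thompson sampling with a Beta(1, 1) prior:
    draw p_k ~ Beta(1 + successes_k, 1 + failures_k) for each arm
    and play the arm whose draw is largest."""
    draws = [random.betavariate(1 + s, 1 + (n - s))
             for s, n in zip(successes, pulls)]
    return max(range(len(draws)), key=lambda k: draws[k])
```

Arms with little data have wide posteriors and so still get sampled occasionally, which is how the method explores without an explicit $\epsilon$.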

Learning

Can learn individual distributions using logistic regression classification.
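As a toy illustration of the idea (not the notes' actual setup), a logistic model $p(\text{reward}\,|\,x) = \sigma(w \cdot x + b)$ can be fit to per-arm feature vectors by gradient descent; a real system would use a proper solver:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=1000):
    """Fit p(reward | x) = sigmoid(w . x + b) by batch gradient descent.

    X: (n, d) feature matrix, y: 0/1 labels. A toy stand-in for a
    production logistic regression solver."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y)) / n           # gradient of log-loss in w
        b -= lr * np.mean(p - y)                # gradient of log-loss in b
    return w, b
```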

Applications

In the online learning of click-through rates example, each “arm” corresponds to an advertisement with an unknown click-through rate.