# Michael's Wiki

Model the explore/exploit problem as a slot machine for gambling with K arms. Each “arm” corresponds to an unknown payoff probability $p_k$.

For simplicity, assume that the probabilities don't change over time, and each event is independent of the others given these parameters.

The expected reward from bandit k is $E[X_k] = 1 \cdot p_k + 0 \cdot (1-p_k) = p_k$.

#### Deterministic Greedy Strategy

At iteration N, pick the bandit that has performed the best up to this time. This method under-explores and tends to get stuck on sub-optimal bandits.
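A minimal sketch of the greedy strategy on simulated Bernoulli arms; the `true_probs` argument and the optimistic default mean for unplayed arms are illustrative choices, not part of the definition:

```python
import random

def greedy(true_probs, n_iters, seed=0):
    """Deterministic greedy: always pull the arm with the best observed mean."""
    rng = random.Random(seed)
    k = len(true_probs)
    pulls = [0] * k
    wins = [0] * k
    for _ in range(n_iters):
        # Observed success rate per arm; unplayed arms default to 1.0 (optimistic).
        means = [wins[i] / pulls[i] if pulls[i] else 1.0 for i in range(k)]
        arm = means.index(max(means))
        pulls[arm] += 1
        wins[arm] += rng.random() < true_probs[arm]
    return pulls, wins
```

Running this with a lucky early streak on a weak arm shows the failure mode: that arm's observed mean stays highest, so the strategy keeps pulling it and never gathers evidence about the others.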

#### Play-the-Winner Strategy

At iteration N:

* if the bandit from iteration N-1 was successful, play it again
* otherwise, select another arm uniformly at random or cycle through them
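The steps above can be sketched as follows, using the cycling variant for the failure case (the uniform-random variant would replace the cycle with a random draw):

```python
import random

def play_the_winner(true_probs, n_iters, seed=0):
    """Stay on an arm while it pays off; move on as soon as it fails."""
    rng = random.Random(seed)
    k = len(true_probs)
    pulls = [0] * k
    wins = [0] * k
    arm = 0  # arbitrary starting arm
    for _ in range(n_iters):
        success = rng.random() < true_probs[arm]
        pulls[arm] += 1
        wins[arm] += success
        if not success:
            arm = (arm + 1) % k  # cycle to the next arm on failure
    return pulls, wins
```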

#### Epsilon-Greedy Strategy

At iteration N:

* select the best bandit with probability $1-\epsilon$
* else select another uniformly at random or in proportion to their performance up to this point

This uses a parameter $\epsilon$ to control exploration and exploitation. It is fixed the whole time, so the core explore/exploit tradeoff is not handled dynamically.

Modifications may reduce epsilon over time e.g. via a schedule or decay factor.
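A sketch of epsilon-greedy with an optional multiplicative decay schedule; with `decay=1.0` it reduces to the fixed-epsilon version, and `decay < 1` anneals exploration over time as described above:

```python
import random

def epsilon_greedy(true_probs, n_iters, eps=0.1, decay=1.0, seed=0):
    """Explore with probability eps * decay**t, otherwise exploit the best arm."""
    rng = random.Random(seed)
    k = len(true_probs)
    pulls = [0] * k
    wins = [0] * k
    for t in range(n_iters):
        if rng.random() < eps * decay ** t:
            arm = rng.randrange(k)  # explore: uniform over all arms
        else:
            means = [wins[i] / pulls[i] if pulls[i] else 0.0 for i in range(k)]
            arm = means.index(max(means))  # exploit: best observed mean
        pulls[arm] += 1
        wins[arm] += rng.random() < true_probs[arm]
    return pulls, wins
```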

#### Randomized Probability Matching

a.k.a. Thompson Sampling or Bayesian Bandits. The number of pulls from each bandit k should be proportional to the probability that the bandit is optimal.

$p(p_k|r_k,N_k)$ is a Bayesian density on the value $p_k$.

* $r_k,N_k$ are the number of successes and trials with the kth bandit so far
* $p(p_k|r_k,N_k)$ is the posterior belief about $p_k$ given data $r_k,N_k$

Typically learned via sampling methods:

* at each iteration i:
    * sample M values of $p_k$ for each bandit k from its density $p(p_k|r_k,N_k)$ (commonly M = 1)
    * play the bandit whose sampled value is largest
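For Bernoulli arms the Beta distribution is the conjugate prior, so the posterior draw has a closed form and no general-purpose sampler is needed. A sketch with a uniform Beta(1, 1) prior on each arm and M = 1 draws per iteration:

```python
import random

def thompson(true_probs, n_iters, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""
    rng = random.Random(seed)
    k = len(true_probs)
    # (alpha, beta) track posterior counts: successes + 1, failures + 1.
    alpha = [1] * k
    beta = [1] * k
    for _ in range(n_iters):
        # One posterior draw per arm; play the arm with the largest draw.
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = draws.index(max(draws))
        if rng.random() < true_probs[arm]:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return alpha, beta
```

Because arms are chosen by sampling from the posteriors, the pull counts naturally become proportional to each arm's probability of being optimal, which is the matching property named above.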

This technique has the advantage that there are no parameters to tune, and it works well in practice. However, it is more computation-heavy, and fewer theoretical results are known for it than for other methods.

##### Learning

Can learn individual distributions using logistic regression classification.
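As a hedged illustration of the connection, an intercept-only logistic regression fit to one arm's outcomes recovers that arm's success probability (the sigmoid of the fitted intercept is the Bernoulli MLE); the learning rate and step count here are arbitrary choices:

```python
import math

def fit_logistic_intercept(successes, failures, lr=0.1, steps=500):
    """Fit sigmoid(b) to Bernoulli data by gradient ascent on the log-likelihood."""
    b = 0.0
    n = successes + failures
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-b))
        # Gradient of the mean log-likelihood w.r.t. the intercept.
        b += lr * (successes - n * p) / n
    return 1.0 / (1.0 + math.exp(-b))
```

With per-round features (e.g. user context in the advertising example below) the same model extends to feature weights, giving a contextual estimate of each arm's payoff probability.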

##### Applications

In the online learning of click-through rates example, each “arm” corresponds to an advertisement with unknown click-through rate.