Two Years of Bayesian Bandits for E-Commerce

NYC College of Technology • April, 18 2019 • @AustinRochford

About Monetate

  • Founded 2008, web optimization and personalization SaaS
  • Observed 5B impressions and $4.1B in revenue during Cyber Week 2017

Non-technical marketer-focused

About this talk

Outline

  • Web optimization
    • A/B testing
    • Multi-armed bandits
  • Bayesian bandits
    • Thompson sampling
  • Bandit bias
    • Inverse propensity weighting

Web Optimization

A/B testing

A/B testing machinery

Ronald Fisher

Sequential testing

Abraham Wald

Sequential optimization

Multi-armed bandits

Multi-armed bandit systems

Bayesian Bandits

Beta-binomial model

$$ \begin{align*} x_A, x_B & = \textrm{number of rewards from users shown variant } A, B \\ x_A & \sim \textrm{Binomial}(n_A, r_A) \\ x_B & \sim \textrm{Binomial}(n_B, r_B) \\ r_A, r_B & \sim \textrm{Beta}(1, 1) \end{align*} $$

$$ \begin{align*} r_A\ |\ n_A, x_A & \sim \textrm{Beta}(x_A + 1, n_A - x_A + 1) \\ r_B\ |\ n_B, x_B & \sim \textrm{Beta}(x_B + 1, n_B - x_B + 1) \end{align*} $$

Thompson sampling

Thompson sampling randomizes user/variant assignment according to the probabilty that each variant maximizes the posterior expected reward.

The probability that a user is assigned variant A is

$$ \begin{align*} P(r_A > r_B\ |\ \mathcal{D}) & = \int_0^1 P(r_A > r\ |\ \mathcal{D})\ \pi_B(r\ |\ \mathcal{D})\ dr \\ & = \int_0^1 \left(\int_r^1 \pi_A(s\ |\ \mathcal{D})\ ds\right)\ \pi_B(r\ |\ \mathcal{D})\ dr \\ & \propto \int_0^1 \left(\int_r^1 s^{\alpha_A - 1} (1 - s)^{\beta_A - 1}\ ds\right) r^{\alpha_B - 1} (1 - r)^{\beta_B - 1}\ dr \end{align*} $$

Monte Carlo Methods

In [6]:
N = 5000

x, y = np.random.uniform(0, 1, size=(2, N))
In [8]:
fig
Out[8]:
In [9]:
in_circle = x**2 + y**2 <= 1
In [11]:
fig
Out[11]:
In [12]:
4 * in_circle.mean()
Out[12]:
3.1743999999999999
In [14]:
fig
Out[14]:
In [15]:
%%time
N_SAMPLES = 10000

a_samples = a_post.rvs(size=N_SAMPLES)
b_samples = b_post.rvs(size=N_SAMPLES)
CPU times: user 0 ns, sys: 10 ms, total: 10 ms
Wall time: 9.12 ms
In [17]:
fig
Out[17]:
In [19]:
(a_samples > b_samples).mean()
Out[19]:
0.2457

Thompson sampling, redeemed

  1. Sample $\hat{r}_A \sim r_A\ |\ n_A, x_A$ and $\hat{r}_B \sim r_B\ |\ n_B, x_B$.
  2. Assign the user variant A if $\hat{r}_A > \hat{r}_B$, otherwise assign them variant B.

Simulating a bandit

In [20]:
class BetaBinomial:
    def __init__(self, a0=1., b0=1.):
        self.a = a0
        self.b = b0
        
    def sample(self):
        return sp.stats.beta.rvs(self.a, self.b)
    
    def update(self, n, x):
        self.a += x
        self.b += n - x
In [21]:
class Bandit:
    def __init__(self, a_post, b_post):
        self.a_post = a_post
        self.b_post = b_post
        
    def assign(self):
        return 1 * (self.a_post.sample() < self.b_post.sample())
    
    def update(self, arm, reward):
        arm_post = self.a_post if arm == 0 else self.b_post
        arm_post.update(1, reward)
In [23]:
A_RATE, B_RATE = 0.05, 0.1
N = 1000

rewards_gen = generate_rewards(A_RATE, B_RATE, N)
In [24]:
bandit = Bandit(BetaBinomial(), BetaBinomial())
arms = np.empty(N, dtype=np.int64)
rewards = np.empty(N)

for t, arm_rewards in tqdm(enumerate(rewards_gen), total=N):
    arms[t] = bandit.assign()
    rewards[t] = arm_rewards[arms[t]]

    bandit.update(arms[t], rewards[t])
100%|██████████| 1000/1000 [00:00<00:00, 4504.52it/s]
In [26]:
fig
Out[26]:

Simulating many bandits

In [28]:
N_BANDIT = 100

arms = np.empty((N_BANDIT, N), dtype=np.int64)
rewards = np.empty((N_BANDIT, N))

for i in trange(N_BANDIT):
    arms[i], rewards[i] = simulate_bandit(
        generate_rewards(A_RATE, B_RATE, N), N
    )
100%|██████████| 100/100 [00:13<00:00,  7.93it/s]
In [30]:
fig
Out[30]:

A/A testing

A/A bandits
In [32]:
N = 2000

arms, rewards = simulate_bandits(
    lambda: generate_rewards(A_RATE, A_RATE, N), 
    N, N_BANDIT
)
100%|██████████| 100/100 [00:26<00:00,  4.76it/s]
In [34]:
fig
Out[34]:
In [36]:
fig
Out[36]:

Why Bayesian?

  • Marketers are natural Bayesians!

Bandit Bias

In [39]:
fig
Out[39]:
In [41]:
fig
Out[41]:
In [43]:
fig
Out[43]:

Inverse propensity weighting

$$ \hat{r}^{\textrm{IPS}}_A = \frac{1}{n} \sum_{t = 1}^n \frac{Y_i \cdot \mathbb{I}(A_t = 1)}{P(A_t = 1\ |\ \mathcal{D}_{1, t - 1})} $$

where

$$ A_t = \begin{cases} 1 & \textrm{session } t \textrm{ saw variant A} \\ 0 & \textrm{session } t \textrm{ saw variant B} \end{cases}. $$

Calculating $P(A_t = 1\ |\ \mathcal{D}_{1, t - 1})$

In [44]:
ts_fig
Out[44]:
In [46]:
%%time
N_MC = 500

a_samples = sp.stats.beta.rvs(
    a_post_alpha[..., np.newaxis],
    a_post_beta[..., np.newaxis],
    size=(N_BANDIT, N, N_MC)
)
b_samples = sp.stats.beta.rvs(
    b_post_alpha[..., np.newaxis],
    b_post_beta[..., np.newaxis],
    size=(N_BANDIT, N, N_MC)
)

a_prob = np.empty((N_BANDIT, N))
a_prob[:, 0] = 0.5
a_prob[:, 1:] = (a_samples > b_samples).mean(axis=2)[:, :-1]
CPU times: user 14.6 s, sys: 5.24 s, total: 19.8 s
Wall time: 19.9 s
In [48]:
a_ips_est = 1. / plot_t * (rewards * (arms == 0) / a_prob_).cumsum(axis=1)
b_ips_est = 1. / plot_t * (rewards * (arms == 1) / (1 - a_prob_)).cumsum(axis=1)
In [50]:
fig
Out[50]:

Bias-variance tradeoff

In [52]:
fig
Out[52]:

Intuition: the variant with the highest reward rate should receive the most traffic.

In [55]:
fig
Out[55]:

Learn more