Solving Contextual Bandits with Greediness

Mar 2, 2020

In this post, I will explain and implement Epsilon-Greedy, a simple algorithm that solves the contextual bandits problem. Despite its simplicity, this algorithm performs considerably well [1].


* Note: In this article, I use the words arm and action, and the words step and round, interchangeably.


The intuition of the algorithm is to choose random actions at sometimes. In the rest of the time, it chooses the action that it thinks is the best one. This simple balance between exploration (random actions) and exploitation (greediness) yields good results.



This pseudocode is from [3].


We can see that the algorithm receives three arguments, a probability p, a decay rate d and the oracles $\hat{f}_{1: k}$. Let’s explain each one of them.

  • p: Dictates what will be the probability of a random action in each round.
  • d: This variable controls how fast p decreases during training.
  • $\hat{f}_{1: k}$: Oracles that are classifiers or regressors. They learn the rewards that each action will return, given the context x.

Line-by-Line Analysis


This is the standard loop of contextual bandits, in which the algorithm receives a context x in each round.


In this block, the algorithm chooses a random action with probability p. Otherwise, it chooses the action that gives the maximum reward according to the oracles.


The probability rate is decreased according to d.


The algorithm receives a reward $r_a^t$ and stores it together with the observed context $x^t$. Note that the data stored is exclusive to the performed action/arm a.


Then, the oracles learn the reward of each (context, action) pair given the data history. This training does not need to happen in every round. It is possible, for example, to train the oracles every 50 rounds.


The programming language used in this implementation is Python and the full implementation is available here. For simplicity, the decay rate d will be discarded and it will be assumed that the probability of a random action will remain constant. Only binary rewards will be used. For the oracles, logistic regressors of the Sklearn library have been chosen.

Contextual Environment

I made an interface of the contextual bandits environment so that I can easily request contexts and get rewards by acting on the environment.

class ContextualEnv(ABC):
    def get_context(self) -> np.ndarray:

    def act(self, action: int) -> float:

    def get_num_arms(self) -> int:

    def get_context_dim(self) -> int:

Epsilon Greedy Constructor

class EpsilonGreedy:

    def __init__(self, c_env: ContextualEnv, 
                 epsilon: float, num_steps: int,
                 training_freq: int):
        self.num_arms = c_env.get_num_arms()
        self.num_steps = num_steps
        self.epsilon = epsilon
        self.training_freq = training_freq
        self.classifiers = [LogisticRegression() for _ in 
        context_dim = c_env.get_context_dim()
        self.context_data = np.zeros((num_steps,
        self.rewards_data = np.full((num_steps, 
                                     self.num_arms), -1, 
        self.c_env = c_env

Here, epsilon is the probability of a random action p of the pseudocode. This class also receives a contextual environment object, the number of steps/rounds, and a parameter that dictates the training frequency of the oracles.

One logistic regressor oracle is initialized for each arm/action of the environment. The context and reward history are also initialized. Since binary rewards are being used, the reward history is filled with -1, so it is possible to differentiate what arm received a reward in each round (a value of -1 says that the respective arm was not used in a given round).

Save Step/Round

def save_step(self, context: np.ndarray, action: int,
              reward: float, step: int) -> None:
    self.context_data[step] = context
    self.rewards_data[step, action] = reward

This method saves the context and reward received in each round/step. Note that the reward is saved on the history of the action/arm used.

Action Policy

def action_policy(self, context: np.ndarray) -> int:
    coin = random.uniform(0, 1)
    if coin > self.epsilon:
        action = self.greedy_action(context)
        action = random.randint(0, self.num_arms-1)
    return action

This method determines which strategy will be chosen, exploration or exploitation. The random action is chosen with a probability equal to epsilon, the greedy action is chosen otherwise.

Greedy Action

def greedy_action(self, context: np.ndarray) -> int:
    rewards = np.zeros(len(self.classifiers))
    for index, classifier in enumerate(self.classifiers):
            context = context.reshape(1, -1)
            action_score = classifier.predict(context)
        except NotFittedError as e:
            a = 3.0/self.num_arms
            action_score = np.random.beta(a, 4)
        rewards[index] = action_score

     max_rewards = max(rewards)
     best_actions = np.argwhere(rewards == max_rewards)
     best_actions = best_actions.flatten()
     return np.random.choice(best_actions)

On the greedy action method, each classifier is evaluated based on the context. If the classifier has not yet been trained, the score is estimated by running a beta distribution. This trick is done on [3]. Afterwards, the action with the maximum estimated reward is chosen.


def fit(self, step: int) -> None:
    step += 1
    contexts_so_far = self.context_data[:step]
    rewards_so_far = self.rewards_data[:step]
    for arm_index in range(self.num_arms):
                            rewards_so_far, arm_index)

In this method, each classifier is trained, only the contexts and rewards seen so far are used.

def fit_classifier(self, contexts: np.ndarray,
                   rewards: np.ndarray,
                   arm: int) -> None:
    arm_rewards = rewards[:, arm]
    # get the index of the rewards that the arm saw
    index = np.argwhere(arm_rewards != -1)
    index = index.flatten()
    arm_rewards = arm_rewards[index]
    # test if the arm saw at least one example of 
    # each class
    if len(np.unique(arm_rewards)) == 2:
        arm_contexts = contexts[index]

Here, only the rewards that the each arm saw are used (that is why the reward history was initialized with -1, note the argwhere). It is verified if the arm saw at least one example of each reward (values of 0 and 1), so it can be trained.


def simulate(self) -> np.ndarray:
    """Returns rewards per step"""

    rewards_history = np.zeros(self.num_steps)
    for step in range(self.num_steps):
        context = self.c_env.get_context()
        action = self.action_policy(context)
        reward = self.c_env.act(action)
        rewards_history[step] = reward
        self.save_step(context, action, reward, step)
        if step % self.training_freq == 0:
    return rewards_history

The simulate method does the training itself. A context is observed, an action is estimated based on the context, the (context, reward) is stored on the respective arm history, and every training_freq steps, the oracles are trained.


Let’s evaluate the algorithm on three environments, taken from here. I will give a brief description of each one:

  1. Binary context and deterministic rewards (easiest).
  2. Binary context and stochastic rewards.
  3. Continuous context and stochastic rewards.

The results were obtained using epsilon equal to 0.2, training the oracles every 50 rounds and averaging the results of 100 runs. This plot considers the mean reward of all history until each round.


We can see that the algorithm learns the deterministic binary environment really fast. The performance on the stochastic binary and stochastic continuous environments are about the same, but the algorithm has a little more trouble to learn the continuous environment.


Previous Article: Brainstorming Benchmarks for Contextual Bandits

Brainstorming Benchmarks for Contextual Bandits

Feb 25, 2020

Since contextual bandits are relatively new, they haven’t been studied and tested so heavily as standard multi-armed bandits algorithms [1]. This being the case, it’s useful to design benchmarks for contextual bandits so these algorithms can be reliably tested, compared and improved.


Simplest benchmark

Many people in the AI community recommend the use of toy problems, these low complexity examples provide better and faster debugging cycles, a good property of early-stage research and experimentation. So now, let’s try to imagine the simplest benchmark for the contextual bandits problem. First, let’s use binary everywhere. Second, let’s make the environment deterministic (an action k with context x always returns the same reward r). Third and last, let’s make the reward function f(x, k) as simple as possible. Remember that this reward function is calculated given the observed context and the chosen action.

So, this is what I came about:

  • Context: 1 binary feature
  • Number of actions: 2
  • Reward function pseudocode:
if context == 0
    action 0 gives a reward of 1, other actions give 0 
    action 1 gives a reward of 1, other actions give 0 

In this case, the best policy is to choose the $0^{th}$ action if the context is 0 and choose the $1^{st}$ action otherwise, easy.

How to complicate

Now, let’s try to gradually add complexity to this problem.

Stochastic rewards

Stochasticity always makes things harder and since this is what we are looking for, let’s add it to our environment.

  • Context: 1 binary feature
  • Number of actions: 2
  • Reward function pseudocode:
if context == 0
    action 0 returns a reward of 2 with 50% chance, other 
    actions give 0 rewards
    action 1 returns a reward of 2 with 50% chance, other 
    actions give 0 rewards

Here, the best policy is still the same as in the previous environment, but now, it should be harder for an algorithm to discover it. In this example, the reward is 2 because of the stochasticity, so the expected reward of each context can still be 1.

Continuous context

Instead of the context being made of only 1 binary feature, it makes sense to use continuous ones, in case of the age of someone, for example, is being represented as a feature. So, let’s change the context possibilities to infinity!

  • Context: 1 continuous feature on the interval [0, 1]
  • Number of actions: 2
  • Reward function pseudocode:
if context > 0.5
    action 0 gives a reward of 2 with 50% chance, other 
    actions give 0 rewards
    action 1 gives a reward of 2 with 50% chance, other 
    actions give 0 rewards

Note that we have to change the first if case of our reward function to deal with the continuous feature. Instead of testing equality, we test if the context is greater than a threshold, in this case, 0.5.

Other ways

There are other ways to increase the complexity of the problem and I will list some below.


While it is good to manually tweak a contextual bandit environment for research or some experiments, this is not a straight forward task and requires dedicated human thought. A good alternative is to test contextual bandit algorithms on supervised-learning datasets. This can be done by representing the context and action by each (X, y) feature vector pair of the dataset, both regression and classification datasets can be used. With this approach, the environment can randomly sample one instance of the dataset, and the reward is based on the $y^k$ raw value in the case of regression. With classification datasets, we can use accuracy as the reward.

Some possibilities for benchmarks

Here, I try to imagine more benchmarks for this problem, trying to add complexity and deal with corner cases.

  • Standard multi-armed bandits scenario: the context doesn’t influence the reward function.
  • Imbalanced Dataset: There is an action that is better more often.
  • Type of reward function function instead of just context > 0.5:
    • linear function
    • non-linear function (neural network)
  • Many degrees of randomness: This can be made by artificially adding noise to the dataset.
  • Vary the number of actions and the number of context features: It would be good to test algorithms on datasets that have a high and low number of actions, this can also be said about the dimensionality of the context.
  • Vary the number of samples: If the context repeats more often, then it should be easier for the algorithm to learn.


  • [1] Cortes, David. “Adapting Multi-Armed Bandits Policies to Contextual Bandits Scenarios.” ArXiv:1811.04383 [Cs, Stat], Nov. 2019.,
  • [2] Bietti, Alberto, et al. “A Contextual Bandit Bake-Off.” ArXiv:1802.04064 [Cs, Stat], Jan. 2020.,

Next Article: Solving Contextual Bandits with Greediness Previous Article: A Very Short Intro to Contextual Bandits

A Very Short Intro to Contextual Bandits

Feb 24, 2020

In the contextual bandits problem, we have an environment with K possible actions. The environment returns some context x, taken from a distribution X. The environment also has a function f(x, k), that calculates the reward based on the context and the chosen action k.

The goal then is to find the policy $\pi$ that maximizes the rewards obtained in the long-term. A policy is a function that maps contexts to actions.

The diagram below illustrates the problem:


Explaining the steps:

  • The environment displays a context x.
  • The agent chooses an action k based on the observed context.
  • The environment returns a reward based on the most recent context and action.

This process continues without a determined time-limit.

Additional considerations can be taken into account, like statistical efficiency (if the algorithm learns fast considering the number of examples) and computational complexity.


  • Agent: A doctor.
  • Environment: Every day, a random patient comes into the hospital with a disease X (the disease is the same to all patients).
    • Context: Features of the patient like age, sex, etc.
    • Actions: K-number of medicines that the doctor can prescribe.
    • Reward: 1 if the patient was cured, 0 otherwise.


Slivkins, Aleksandrs. “Introduction to Multi-Armed Bandits.” ArXiv:1904.07272 [Cs, Stat], Sept. 2019.,

Next Article: Brainstorming Benchmarks for Contextual Bandits

About Me

Jan 1, 0001

Hi, I am Renan Cunha, a Computer Scientist from Pará, a state in northern Brazil. I made this blog to learn and write more, two things I really enjoy doing. Most of the stuff I write here is work-related, but sometimes I can end up on weird and unexpected topics.

My main interests that intersect with my work (and between themselves) are AI and Data Science. My interest in AI started because I became fascinated by the intelligent behavior that was starting to be performed by machines, and that motivated me to build software that had this kind of complexity. My interest in Data Science emerged due to the enormous curiosity I have for the world, seeking to understand it and make justified decisions based on the available evidence.

Outside of work, I am interested a lot in Philosophy. Even though it is considered a field without much progress (this is debatable), I find myself amused by many of the questions that philosophers ask:

I think all kids ask themselves these questions, and I don’t understand why most people stop asking that at some point in life.

Also, I am sympathetic with the Effective Altruism movement, which aims to do the most good given the resources we have. The EA movement ties together empathy with rationality. The goal is to use evidence and reason to figure out which ways to do good are the most effective. You’d be surprised that, through this lens, working on AI might be more effective at doing good than, say, trying to solve global warming.

Having all these interests and cool things to work with, I am grateful for the simple fact that I am alive. It’s fascinating to stop for a few minutes and realize that you’re experiencing something, that your senses are picking up information, and you’re actually perceiving in the first person. Because of that, I find it easy to see meaning in the most mundane tasks, and I think it’s a sin to live without giving it your all.

Besides all that, I like bicycles. It is an effective way to move from point A to point B while exercising and getting sunlight. I love my cat and playing with her, although she can leave some scars :( and I think animes and games are the ultimate art. Blame them for so beautifully combining stunning visuals, immersive soundtracks, and heart-warming stories.

Ways you can contact me and see my work:



Jan 1, 0001