A Very Short Intro to Contextual Bandits

In the contextual bandits problem, an environment offers K possible actions. In each round, the environment presents a context x, drawn from a distribution X. The environment also has a reward function f(x, k) that computes the reward from the context x and the chosen action k.

The goal is to find the policy $\pi$ that maximizes the total reward obtained in the long term. A policy is a function that maps contexts to actions.
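As a sketch of what this goal can mean formally: if contexts are drawn independently from X in every round (an assumption the text does not state explicitly), maximizing long-term reward amounts to maximizing the expected reward per round. The symbol $\pi^*$ for the best policy is notation introduced here for illustration.

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{x \sim X}\big[ f(x, \pi(x)) \big]$$

If the reward is itself random, f(x, k) can be read as the expected reward of action k in context x.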

The diagram below illustrates the problem:


Explaining the steps:

  • The environment displays a context x.
  • The agent chooses an action k based on the observed context.
  • The environment returns a reward based on the most recent context and action.

This process repeats round after round, with no fixed time limit.
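The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the uniform context distribution, the reward function `f`, the value `K = 3`, and both policies are hypothetical choices made only so that the example runs.

```python
import random

K = 3  # number of actions (hypothetical value)

def sample_context(rng):
    """Draw a context x from X. Assumption: X is uniform on [0, 1)."""
    return rng.random()

def f(x, k):
    """Hypothetical reward function: 1 if action k matches the context's bucket, else 0."""
    best = min(int(x * K), K - 1)
    return 1.0 if k == best else 0.0

def random_policy(x, rng):
    """Placeholder policy that ignores the context and picks an action uniformly."""
    return rng.randrange(K)

def oracle_policy(x, rng):
    """Policy that knows f and always picks the rewarded action (for comparison only)."""
    return min(int(x * K), K - 1)

def run(policy, rounds, seed=0):
    """Play the given number of rounds and return the total reward collected."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(rounds):
        x = sample_context(rng)  # the environment presents a context
        k = policy(x, rng)       # the agent chooses an action
        total += f(x, k)         # the environment returns a reward
    return total
```

Calling `run(random_policy, 1000)` and `run(oracle_policy, 1000)` gives a rough sense of the gap between a policy that ignores the context and one that uses it. The loop is cut off after `rounds` iterations only so that the function returns.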

Other considerations include statistical efficiency (how quickly the algorithm learns, measured in the number of examples it needs) and computational complexity.


As an example, consider a doctor prescribing medicine:

  • Agent: A doctor.
  • Environment: Every day, a random patient comes into the hospital with a disease (the same disease for every patient).
    • Context: Features of the patient, such as age and sex.
    • Actions: The K medicines that the doctor can prescribe.
    • Reward: 1 if the patient was cured, 0 otherwise.
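The doctor scenario can be simulated as a small Python sketch. Everything concrete here is an assumption made for illustration: the context is reduced to a single age group, the cure probabilities in `CURE_PROB` are invented, and the agent uses epsilon-greedy with a separate estimate per context, which is one simple baseline and not a method taken from the text.

```python
import random
from collections import defaultdict

K = 3  # number of medicines (hypothetical value)

# Invented cure probabilities per age group and medicine, for illustration only.
CURE_PROB = {
    "young": [0.8, 0.3, 0.2],
    "old": [0.2, 0.4, 0.9],
}

def new_patient(rng):
    """Context: the age group of today's patient."""
    return rng.choice(["young", "old"])

def treat(age_group, medicine, rng):
    """Reward: 1 if the patient is cured, 0 otherwise."""
    return 1 if rng.random() < CURE_PROB[age_group][medicine] else 0

class EpsilonGreedyDoctor:
    """Keeps a running mean reward for every (context, medicine) pair."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = defaultdict(lambda: [0] * K)
        self.means = defaultdict(lambda: [0.0] * K)

    def prescribe(self, context, rng):
        if rng.random() < self.epsilon:
            return rng.randrange(K)  # explore: try a random medicine
        means = self.means[context]
        return max(range(K), key=lambda k: means[k])  # exploit: best estimate so far

    def update(self, context, medicine, reward):
        self.counts[context][medicine] += 1
        n = self.counts[context][medicine]
        # Incremental update of the running mean.
        self.means[context][medicine] += (reward - self.means[context][medicine]) / n

def simulate(days=5000, seed=0):
    rng = random.Random(seed)
    doctor = EpsilonGreedyDoctor()
    cured = 0
    for _ in range(days):
        patient = new_patient(rng)
        medicine = doctor.prescribe(patient, rng)
        reward = treat(patient, medicine, rng)
        doctor.update(patient, medicine, reward)
        cured += reward
    return doctor, cured
```

After `doctor, cured = simulate()`, the lists in `doctor.means` hold the estimated cure rate of each medicine for each age group. With real patient features such as age and sex combined, a table per context grows quickly, which is where the statistical efficiency concern mentioned earlier comes in.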


Slivkins, Aleksandrs. “Introduction to Multi-Armed Bandits.” arXiv:1904.07272 [cs, stat], Sept. 2019. http://arxiv.org/abs/1904.07272.