# A Very Short Intro to Contextual Bandits

In the contextual bandits problem, we have an
environment with *K* possible actions. The environment returns some context *x*, taken from a distribution *X*.
The environment also has a function *f(x, k)*, that calculates the reward
based on the context and the chosen action *k*.

The goal then is to find the policy $\pi$ that maximizes the rewards obtained in the long-term. A policy is a function that maps contexts to actions.

The diagram below illustrates the problem:

Explaining the steps:

- The environment displays a context
*x*. - The agent chooses an action
*k*based on the observed context. - The environment returns a reward based on the most recent context and action.

This process continues without a determined time-limit.

Additional considerations can be taken into account, like statistical efficiency (if the algorithm learns fast considering the number of examples) and computational complexity.

## Example

- Agent: A doctor.
- Environment: Every day, a random patient comes into the hospital with a
disease X (the disease is the same to all patients).
- Context: Features of the patient like age, sex, etc.
- Actions: K-number of medicines that the doctor can prescribe.
- Reward: 1 if the patient was cured, 0 otherwise.

## References

Slivkins, Aleksandrs. “Introduction to Multi-Armed Bandits.” ArXiv:1904.07272 [Cs, Stat], Sept. 2019. arXiv.org, http://arxiv.org/abs/1904.07272.