Let's say you go to a casino and you want to maximize your cash return playing on a series of slot machines. You need to divide your time between trying to explore multiple machines to see which one gets the highest payout and exploit the one that is paying the most so far. This problem captures the Multi-Armed Bandit scenario, a type of reinforcement learning algorithm where an agent wants to maximize the total cumulative reward for actions taken over a sequence of iterations, but where the feedback is incomplete, as you can only pull one slot-machine at a time. At any given iteration, the agent can only observe the reward of the chosen action but doesn't know the reward for the other actions.
The problem of incomplete feedback is what researchers call the exploration-exploitation dilemma. The agent needs to exploit the most promising known set of actions or policy to get reward, but at the same time, it needs to explore other actions that could provide a higher future reward. Intuitively, if we have a news recommendation system, it can't learn that a particular article about science generates more clicks than one about football if we only show the science article to the users. The challenge is to find the right balance between exploration and exploitation.
The multi-armed bandit setting, however, doesn't use information about the environment, which in many cases can be beneficial for the learning algorithm. In the news recommendation example before, it would useful to incorporate into the model information about the articles visited by the user in the past or the user demographics. Contextual bandits solve this problem.
The term contextual bandits was coined by John Langford in 2007 to integrate side information (context) into multi-armed bandits .
The contextual bandit problem is stated in the following way: in each iteration, an agent is presented with an n-dimensional feature vector which encodes the context. The agent uses the current context vector, the past context vectors and the rewards of the actions taken in the past to choose the action to take in the current iteration. After a number of iterations, the agent is able to learn the intrinsic relationships between context vectors and reward to predict the sequence of actions that maximize the total cumulative reward. The contextual information that the agent can use can be static, like user demographics or item information, or dynamic, like location or time of the day.
As many reinforcement learning algorithms, contextual bandits work better in a real-time setting. Real-time execution needs a different set of optimizations and requirements in comparison with traditional machine learning execution.
The most common setup for machine learning training is batch training. The learning process is typically divided into two separate phases: training and test. In the training phase, the model takes a mini-batch of data (a slice of data) and computes stochastic gradient descent to update the model weights. This operation is repeated until the model has seen all the mini-batches. A complete round is usually called an epoch, and the model is trained for a number of epochs. In the test phase, the model makes a prediction over a data partition that wasn't shown in the previous phase to evaluate its performance.
Online learning algorithms take just one data point and recalibrate the model weights before making the prediction. Typically, online learning makes use of simple algorithms over complex ones like neural networks, they are less vulnerable to concept drift in comparison with batch learning methods, works well in dynamic environments, uses a smaller memory footprint and are computationally faster. Based on these features they are used in real-time settings. They have some disadvantages: they are highly vulnerable to DDoS attacks, which can break the training process. They can become skewed towards a particular class if there it receives too many examples of that class.
A common setup is to have a hybrid architecture of online and offline models. The online model is able to adapt to a dynamic environment and capture real-time trends. The offline model can be used to experiment with different hyperparameters or algorithms, to make sure that the model is not overfitting or underfitting and for other tasks.
A good way to optimize memory consumption and to improve scalability is feature hashing. For this, one of the most popular online learning libraries, Vowpal Wabbit (VW), uses the hashing trick. The idea is to encode the features using a hash instead of loading the training data into memory like traditional machine learning algorithms. This trick produces an important characteristic, the memory footprint of the system is always constant independently of the data size.
Personalized news recommendation: From , a user requests a homepage and the machine learning system must decide how to order the articles on the page. If a user is logged in, there is context: demographics (e.g., age, location) and the topics of news stories they have clicked on in the past; otherwise, only the location is available. The action choices are a set of selected articles which are featurized and added to the model. The reward signal is clicks.
Ad placement on a website: From  a number of ads (arms) are available to be placed on a number of web-pages (context information), the user profile can be also added as context information if available. Each visit can be regarded as a random sample of the context information from an unknown distribution that the algorithm has to learn. The reward is the revenue generated when the visitor clicks on an ad and the goal is to maximize the cumulative expected revenue by placing the most relevant ad for each user.
Website component optimization: From this example, the objective is to find the optimal label text and color of a button on a website to increase the click-through rate. There are two bandits, one for the color and one for the text, the different color and text options are the arms. The reward signal is clicks. In situations like these, the text and colors variants are generally small to avoid hurting the user experience.
Tech support assistance: From , the system optimizes the response to the user question, trying to minimize human intervention (negative reward). The actions are pointers to potential solutions and the context is user information and previous dialogue elements.
A typical contextual bandits architecture is described in . Consider an application APP that interacts with its environment in real-time where the goal is to find a policy that maximizes the average reward over a sequence of interactions.
Looking at the diagram, the architecture has the following components:
EXPLORE: It takes as input context features x and an event key k from APP, outputs an action a and observes the reward r only for that action. The explore component supports multi-threaded predictions over a locally-stored model, which is critical for front-end applications that serve many concurrent requests. The algorithms available in VW are:
LOG: Generates data by joining each (x, a, p) tuple with its reward r, where p is the probability of the chosen action according to the exploration policy.
LEARN: This component performs policy training online. Typically, the policy is defined by a light machine learning model, such as linear vectors or decision trees. The component takes a stream (x, a, r, p) of exploration datapoints and train the model online.
DEPLOY: This component deploys the new model. Typically, a new model is deployed every 10-15 minutes. The deployment time depends on how rich your model is, how changing is the user behavior, and other factors.
FEATURE GENERATOR: auto-generate features for certain types of content. Any publicly accessible content (e.g., articles, videos, etc.) can be passed to the Feature Generator ahead of time to be automatically featurized and cached—e.g., text sentiment analysis, video content indexing, etc. These features are inserted on-the-fly by the Explore component before making a prediction involving this content
OFFLINE LEARNER: it is used for offline experimentation, such as hyperparameter tuning, evaluating learning algorithms or policy classes, changing the reward metric, etc. Some of the algorithms in the Explore component perform policy evaluation. It is possible to evaluate any policy π (i.e., estimate its average reward) regardless of how the data was collected and without actually running them, in real-time. In VW 3 approaches can be used: inverse propensity score (ips), direct method (dm), and doubly robust (dr). These are all detailed in .
 Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, Siddhartha Sen, Alex Slivkins, "Making Contextual Decisions with Low Technical Debt", 2016.
 Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert E. Schapire, "Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits", 2014.
 Miroslav Dudik, John Langford and Lihong Li, "Doubly Robust Policy Evaluation and Learning", ICML, 2011.
 John Langford and Tong Zhang, "The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits", NIPS, 2007.