An Overview of Contextual Bandits | by Ugur Yildirim | Feb, 2024

A dynamic approach to treatment personalization

  1. Introduction
  2. When To Use Contextual Bandits
    2.1. Contextual Bandit vs Multi-Armed Bandit vs A/B Testing
    2.2. Contextual Bandit vs Multiple MABs
    2.3. Contextual Bandit vs Multi-Step Reinforcement Learning
    2.4. Contextual Bandit vs Uplift Modeling
  3. Exploration and Exploitation in Contextual Bandits
    3.1. ε-greedy
    3.2. Upper Confidence Bound (UCB)
    3.3. Thompson Sampling
  4. Contextual Bandit Algorithm Steps
  5. Offline Policy Evaluation in Contextual Bandits
    5.1. OPE Using Causal Inference Methods
    5.2. OPE Using Sampling Methods
  6. Contextual Bandit in Action
  7. Conclusion
  8. Acknowledgements
  9. References

1. Introduction

Imagine a scenario where you just started an A/B test that will be running for the next two weeks. However, after only a day or two, it is becoming increasingly clear that version A is working better for certain types of users, while version B is working better for another set of users. You think to yourself: Perhaps I should re-route the traffic so that users get more of the version that is benefiting them and less of the other version. Is there a principled way to achieve this?

Contextual bandits are a class of one-step reinforcement learning algorithms specifically designed for treatment personalization problems like this, where we want to dynamically adjust traffic based on which treatment is working for whom. Despite being extremely powerful, they are one of the lesser-known methods in data science, and I hope this post gives you a comprehensive introduction to this wonderful topic. Without further ado, let's dive right in!

2. When To Use Contextual Bandits

If you are just getting started with contextual bandits, it can be confusing how they relate to other, more widely known methods such as A/B testing, and why you might want to use contextual bandits instead of those methods. Therefore, we start our journey by discussing the similarities and differences between contextual bandits and related methods.

2.1. Contextual Bandit vs Multi-Armed Bandit vs A/B Testing

Let us start with the most basic A/B testing setup, which allocates traffic to treatment and control in a static fashion. For example, a data scientist might decide to run an A/B test for two weeks with 50% of traffic going to treatment and 50% going to control. This means that regardless of whether we are on the first day of the test or the last, users will be assigned to control or treatment with 50% probability.

Alternatively, if the data scientist were to use a multi-armed bandit (MAB) instead of an A/B test, traffic would be allocated to treatment and control in a dynamic fashion. In other words, traffic allocations in a MAB change as days go by. For example, if the algorithm decides that treatment is doing better than control on the first day, the traffic allocation can change from 50% treatment and 50% control to 60% treatment vs 40% control on the second day, and so on.

Despite allocating traffic dynamically, MAB ignores an important fact: not all users are the same. This means a treatment that works for one type of user might not work for another. For example, it may be the case that while treatment works better for core users, control is actually better for casual users. In this case, even if treatment is better overall, we can get more value from our application by assigning more core users to treatment and more casual users to control.

This is exactly where contextual bandits (CB) come in. While MAB simply looks at whether treatment or control is doing better overall, CB focuses on whether treatment or control is doing better for a user with a given set of characteristics. The "context" in contextual bandits refers precisely to these user characteristics and is what differentiates CB from MAB. For example, CB might decide to increase the treatment allocation to 60% for core users but decrease it to 40% for casual users after observing the first day's data. In other words, CB dynamically updates the traffic allocation taking user characteristics (core vs casual in this example) into account.

The following table summarizes the key differences between A/B testing, MAB, and CB, and the figure that follows visualizes these ideas.

Table 1: Differences Between A/B Testing, MAB, and CB

Figure 1: Traffic Allocations in A/B Testing, MAB, and CB

2.2. Contextual Bandit vs Multiple MABs

At this point, you may be tempted to think that CB is nothing more than a collection of multiple MABs running together. In fact, when the context we are interested in is small (e.g., we only care whether a user is a core user or a casual user), we can simply run one MAB for core users and another MAB for casual users. However, as the context grows (core vs casual, age, country, time since last active, etc.), it becomes impractical to run a separate MAB for each unique context value.

The real value of CB emerges in this case through the use of models that describe the relationship between the experimental conditions in different contexts and our outcome of interest (e.g., conversion). As opposed to enumerating each context value and treating them independently, using a model allows us to share information across different contexts and makes it possible to handle large context spaces. This idea of a model will come up at several points in this post, so keep reading to learn more.

2.3. Contextual Bandit vs Multi-Step Reinforcement Learning

The introduction referred to CB as a class of one-step reinforcement learning (RL) algorithms. So, what exactly is the difference between one-step and multi-step RL? And what makes CB one-step? The fundamental difference between CB and multi-step RL is that in CB we assume the actions the algorithm takes (e.g., serve treatment or control to a specific user) do not affect the future states of the overall system. In other words, the state (or "context", as it is more appropriately called in CB) affects what action we take, but the action we took does not in turn affect or change the state. The following figure summarizes this difference.

Figure 2: Contextual Bandit vs Multi-Step RL

Image by author, inspired by source

A few examples should make this distinction clearer. Let's say we are building a system to decide which ads to show users based on their age. We would expect users from different age groups to find different ads more relevant, which means a user's age should affect which ads we show them. However, the ad we show them does not in turn affect their age, so the one-step assumption of CB seems to hold. On the other hand, if we go one step further and find that serving expensive ads depletes our inventory (and limits which ads we can serve in the future), or that the ad we show today affects whether the user will visit our site again, then the one-step assumption no longer holds, and we may want to consider building a full-blown RL system instead.

A note of caution, though: while multi-step reinforcement learning is more flexible than contextual bandits, it is also more complicated to implement. So, if the problem at hand can be accurately framed as a one-step problem (even if it looks like a multi-step problem at first glance), contextual bandits may well be the more practical approach.

2.4. Contextual Bandit vs Uplift Modeling

Before moving on to different CB algorithms, I would also like to briefly touch upon the relationship between CB and uplift modeling. An uplift model is typically built on top of A/B test data to discover the relationship between the treatment effect (uplift) and user characteristics. The results from such a model can then be used to personalize treatments in the future. For example, if the uplift model discovers that certain users are more likely to benefit from a treatment, then only those types of users would be given the treatment going forward.

Given this description of uplift modeling, it should be clear that both CB and uplift modeling are solutions to the personalization problem. The key difference between them is that CB approaches the problem in a more dynamic way, in the sense that personalization happens on-the-fly instead of waiting for results from an A/B test. At a conceptual level, CB can very loosely be thought of as A/B testing and uplift modeling happening simultaneously instead of sequentially. Given the focus of this post, I won't discuss uplift modeling further, but there are several great resources to learn more about it, such as [1].

3. Exploration and Exploitation in Contextual Bandits

Above we discussed how CB dynamically allocates traffic depending on whether treatment or control is doing better for a given group of users at a given point in time. This raises an important question: How aggressive do we want to be when making these traffic allocation changes? For example, if after just one day of data we decide that treatment is working better for users from the US, should we completely stop serving control to US users?

I'm sure most of you would agree that this would be a bad idea, and you would be correct. The main problem with changing traffic allocations this aggressively is that making inferences based on insufficient amounts of data can lead to inaccurate conclusions. For example, it may be that the first day of data we gathered is not actually representative of these users and that in reality control is better for them. If we stop serving control to US users after the first day, we will never be able to learn this correct relationship.

A better approach to dynamically updating traffic allocations is striking the right balance between exploitation (serving the best experimental condition based on the data so far) and exploration (continuing to serve the other experimental conditions as well). Continuing with the previous example, if data from the first day indicate that treatment is better for US users, we can serve treatment to these users with an increased probability the next day while still allocating a reduced but non-zero fraction of traffic to control.

There are numerous exploration strategies used in CB (and MAB), as well as several variations of them, that try to strike this balance between exploration and exploitation. Three popular strategies are ε-greedy, upper confidence bound, and Thompson sampling.

3.1. ε-greedy

In this strategy, we first determine which experimental condition is doing better for a given group of users at a given point in time. The simplest way to do this is by comparing the average target values (y) for each experimental condition for those users. More formally, we can determine the "winning" condition for a group of users by finding the condition d that has the higher value for

(1 / n_dx) * Σ_i y_idx

where n_dx is the number of samples we have so far from users in condition d with context x, and y_idx is the target value for a given sample i in condition d with context x.

After deciding which experimental condition is currently "best" for these users, we serve them that condition with probability 1-ε (where ε is typically a small number such as 0.05) and serve a random experimental condition with probability ε. In practice, we might want to dynamically update ε so that it is large at the beginning of the experiment (when more exploration is needed) and gradually shrinks as we collect more and more data.

Furthermore, the context X can be high-dimensional (country, gender, platform, tenure, etc.), so we might want to use a model to obtain these y estimates and deal with the curse of dimensionality. Formally, the condition to serve can then be decided by finding the condition d that has the higher value for

x^T θ_d

where x^T is an m-dimensional row vector of context values and θ_d is an m-dimensional column vector of learnable parameters associated with condition d.
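To make the ε-greedy rule concrete, here is a minimal sketch assuming one linear model per condition, made-up θ estimates, and a simple decaying-ε schedule; none of these names or values come from a specific library or from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy_action(x, thetas, epsilon):
    """Pick a condition for context vector x using the ε-greedy rule."""
    conditions = list(thetas.keys())
    if rng.random() < epsilon:
        # Explore: serve a random condition with probability ε.
        return rng.choice(conditions)
    # Exploit: serve the condition with the highest predicted y, i.e., x^T θ_d.
    scores = {d: x @ theta for d, theta in thetas.items()}
    return max(scores, key=scores.get)

# Illustrative (not fitted) parameter vectors for the two conditions.
thetas = {"control": np.array([0.20, 0.05, -0.10]),
          "treatment": np.array([0.10, 0.15, 0.30])}

x = np.array([1.0, 0.0, 1.0])        # e.g., [intercept, is_casual, is_core]
for t in range(1, 6):
    eps = max(0.05, 1.0 / t)         # ε starts large and decays as data accumulate
    print(t, epsilon_greedy_action(x, thetas, eps))
```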

3.2. Upper Confidence Bound (UCB)

This strategy decides the next condition to serve based not only on which condition has a higher y estimate but also on our precision of (or confidence in) that estimate. In a simple MAB setting, precision can be thought of as a function of how many times a given condition has already been served so far. Specifically, a condition that (i) has a high average y (so it makes sense to exploit it) or (ii) has not yet been served many times (so it needs more exploration) is more likely to be served next.

We can generalize this idea to the CB setting by keeping track of how many times different conditions are served in different contexts. Assuming a simple setting with a low-dimensional context X, such that CB can be thought of as just multiple MABs running together, we can pick the next condition to serve based on which condition d has the higher value for

ȳ_dx + c * sqrt(ln(n_x) / n_dx)

where c is some constant (to be chosen based on how much emphasis we want to put on the precision of our estimate when exploring), n_x is the number of times context x has been seen so far, and ȳ_dx and n_dx are as defined in the ε-greedy section.

In general, however, the context X will be high-dimensional, which means that, just like in the ε-greedy case, we need to make use of a model. In this setting, a condition d will be served next if it has the higher value for

x^T θ_d + c * SE(x^T θ_d)

where SE(.) is the standard error of our estimate (or, more generally, a metric that quantifies our current level of confidence in that estimate).

Note that there are several variations of UCB, so you will likely come across different formulations. A popular UCB method is LinUCB, which formalizes the problem in a linear model framework (e.g., [2]).
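As a rough illustration of this model-plus-uncertainty idea (a sketch in the spirit of linear UCB variants such as LinUCB, not the exact algorithm from [2]), each condition below keeps ridge-style sufficient statistics, and the exploration bonus shrinks as data for that condition accumulate:

```python
import numpy as np

def ucb_scores(x, models, c=1.0):
    """Score each condition as predicted y plus an exploration bonus.

    models maps condition -> (A, b), where A = X^T X + I and b = X^T y
    are accumulated from the samples served to that condition so far.
    """
    scores = {}
    for d, (A, b) in models.items():
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b                       # ridge-style estimate of θ_d
        mean = x @ theta                        # predicted y for this context
        bonus = c * np.sqrt(x @ A_inv @ x)      # uncertainty term, playing the role of SE(.)
        scores[d] = mean + bonus
    return scores

m = 3
models = {d: (np.eye(m), np.zeros(m)) for d in ["control", "treatment"]}

# Toy update with one observed sample for the treatment condition.
x_seen, y_seen = np.array([1.0, 0.0, 1.0]), 1.0
A, b = models["treatment"]
models["treatment"] = (A + np.outer(x_seen, x_seen), b + y_seen * x_seen)

x_new = np.array([1.0, 1.0, 0.0])
scores = ucb_scores(x_new, models, c=0.5)
print(max(scores, key=scores.get), scores)
```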

3.3. Thompson Sampling

The third and final exploration strategy to be discussed is Thompson sampling, which is a Bayesian approach to the exploration-exploitation dilemma. Here, we have a model f(D, X; Θ) that returns predicted y values given the experimental condition D, the context X, and a set of learnable parameters Θ. This model gives us access to posterior distributions of expected y values for any condition-context pair, allowing us to choose the next condition to serve according to the probability that it yields the highest expected y given the context. Thompson sampling naturally balances exploration and exploitation because we sample from the posterior and update our model based on the observations. To make these ideas more concrete, here are the steps involved in Thompson sampling:

  1. A new data point arrives with context X.
  2. Sample a set of parameters Θ from the current posterior distribution.
  3. Using the sampled parameters, compute the predicted y for each experimental condition given X and serve the condition with the highest predicted y.
  4. Observe the outcome y and update the posterior with this new observation.
  5. Repeat.

In practice, instead of having a single function, we can also use a separate function for each experimental condition (e.g., evaluate both f_c(X; Θ_c) and f_t(X; Θ_t) and then pick the condition with the higher value). Furthermore, the update step usually takes place not after each sample but rather after seeing a batch of samples. For more details on Thompson sampling, you can refer to [3] [4].
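Here is a minimal Thompson sampling sketch under the assumption of one Bayesian linear regression per condition with a Gaussian posterior over Θ_d; the conjugate update and the noise scale are illustrative choices, not the author's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_choice(x, posteriors, noise_var=1.0):
    """Sample θ_d from each condition's posterior and serve the argmax of x^T θ_d."""
    sampled = {}
    for d, (A, b) in posteriors.items():
        A_inv = np.linalg.inv(A)
        mean = A_inv @ b                        # posterior mean of θ_d
        cov = noise_var * A_inv                 # posterior covariance of θ_d
        theta_sample = rng.multivariate_normal(mean, cov)
        sampled[d] = x @ theta_sample
    return max(sampled, key=sampled.get)

def update(posteriors, d, x, y):
    """Conjugate update for the served condition d (done per batch in practice)."""
    A, b = posteriors[d]
    posteriors[d] = (A + np.outer(x, x), b + y * x)

m = 3
posteriors = {d: (np.eye(m), np.zeros(m)) for d in ["control", "treatment"]}
x = np.array([1.0, 1.0, 0.0])
d = thompson_choice(x, posteriors)
update(posteriors, d, x, y=1.0)                 # y=1.0 is a toy observed outcome
print("served:", d)
```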

4. Contextual Bandit Algorithm Steps

The previous section (especially the part on Thompson sampling) should already give you a pretty good sense of the steps involved in a CB algorithm. However, for the sake of completeness, here is a step-by-step description of a generic CB algorithm (a minimal code sketch follows the list):

  1. A new data point arrives with context X (e.g., a core user with an iOS device in the US).
  2. Given this data point and the chosen exploration strategy (e.g., ε-greedy), the algorithm decides on a condition to serve this user (e.g., treatment or control).
  3. After the condition is served, we observe the outcome y (e.g., whether the user made a purchase or not).
  4. Update (or fully retrain) the model used in Step 2 after seeing the new data. (As mentioned previously, we usually make an update not after every sample but after seeing a batch of samples so that updates are less noisy.)
  5. Repeat.
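Putting these steps together, a serving loop might look like the following sketch. The environment function observe_outcome and the simple linear ε-greedy policy are assumptions for illustration, not the notebook code discussed later in this post.

```python
import numpy as np

rng = np.random.default_rng(1)

class LinearEpsilonGreedy:
    """Toy CB policy: one ridge-style linear model per condition, ε-greedy exploration."""
    def __init__(self, conditions, m, epsilon=0.1):
        self.epsilon = epsilon
        self.params = {d: (np.eye(m), np.zeros(m)) for d in conditions}  # (A, b) per condition

    def choose_condition(self, x):
        if rng.random() < self.epsilon:
            return rng.choice(list(self.params))                  # explore
        preds = {d: x @ np.linalg.solve(A, b) for d, (A, b) in self.params.items()}
        return max(preds, key=preds.get)                          # exploit: highest predicted y

    def update(self, batch):
        for x, d, y in batch:
            A, b = self.params[d]
            self.params[d] = (A + np.outer(x, x), b + y * x)

def observe_outcome(x, d):
    # Hypothetical environment: treatment works better when x[1] == 1, control otherwise.
    p = 0.7 if (d == "treatment") == (x[1] == 1.0) else 0.3
    return float(rng.random() < p)

policy = LinearEpsilonGreedy(["control", "treatment"], m=2)
batch = []
for _ in range(2000):
    x = np.array([1.0, float(rng.random() < 0.5)])   # Step 1: a new user arrives with context X
    d = policy.choose_condition(x)                   # Step 2: pick a condition via the exploration strategy
    y = observe_outcome(x, d)                        # Step 3: serve the condition and observe y
    batch.append((x, d, y))
    if len(batch) == 100:                            # Step 4: update the model on a batch of samples
        policy.update(batch)
        batch = []
```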

5. Offline Policy Evaluation in Contextual Bandits

So far we have only discussed how to implement a CB algorithm as new data come in. An equally important topic is how to evaluate a CB algorithm using past (or logged) data. This is known as offline evaluation or offline policy evaluation (OPE).

5.1. OPE Using Causal Inference Methods

One way to do OPE is to use well-known causal inference methods such as Inverse Propensity Scoring (IPS) or the Doubly Robust (DR) method. Causal inference is appropriate here because we are essentially trying to estimate the counterfactual of what would have happened if a different policy had served a different condition to a user. There is already a great Medium article on this topic [5], so here I will only briefly summarize the main idea from that piece and adapt it to our discussion.

Taking IPS as an example, doing OPE usually requires us to know not only (i) the probability of assigning a given condition to a sample under our new CB algorithm but also (ii) the probability with which a given condition was assigned to a sample in the logged data. Consider the following hypothetical logged data with X_1-X_3 being the context, D being the experimental condition, P_O(D) being the probability of assigning D to that user, and y being the outcome.

Table 2: Example Logged Data From An A/B Test

As you can see, in this example P_O(D) is always 0.6 for D=1 and 0.4 for D=0 regardless of the context, so the logged data can be assumed to come from an A/B test that assigns treatment with probability 0.6. Now, if we want to test how a CB algorithm would have performed had we assigned conditions using it rather than a simple A/B test, we can use the following formula to get the IPS estimate of the cumulative y for CB

Σ_{i=1}^{n} y_i * P_N(D_i) / P_O(D_i)

where n is the number of samples in the logged data (which is 5 here) and P_N(D_i) is the probability of serving the logged D to user_i had we used the new CB algorithm instead (this probability will depend on the specific algorithm being evaluated).

Once we have this estimate, we can compare it to the observed cumulative y from the past A/B test (which is 1+0+0+1+1=3 here) to decide whether the CB would have yielded a higher cumulative y.
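As a small worked sketch of the IPS estimate: the outcomes below match the 1+0+0+1+1 example above, but the logging probabilities per row and the new-policy probabilities P_N(D) are made up for illustration, since Table 2's exact values are not reproduced here.

```python
# Each row is (y, P_O(D), P_N(D)): the logged outcome, the logging policy's
# probability of the condition that was actually served, and the probability
# that the new CB policy would have served that same condition.
logged = [
    (1, 0.6, 0.8),
    (0, 0.4, 0.3),
    (0, 0.6, 0.5),
    (1, 0.6, 0.9),
    (1, 0.4, 0.7),
]

# IPS estimate of the cumulative y under the new policy: Σ_i y_i * P_N(D_i) / P_O(D_i)
ips_estimate = sum(y * p_new / p_old for y, p_old, p_new in logged)

observed_cumulative_y = sum(y for y, _, _ in logged)   # 3 in this example
print(ips_estimate, observed_cumulative_y)
```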

For more information on OPE using causal inference methods, please refer to the article linked at the beginning of this section. The article also links to a nice GitHub repo with many OPE implementations.

A side note here is that this section discussed causal inference methods only as a tool for OPE. In reality, one can also apply them while the CB algorithm is running in order to "debias" the training data that the algorithm collects along the way. The reason we might want to apply methods such as IPS to our training data is that the CB policy generating this data is a non-uniform random policy by definition, so estimating causal effects from it to decide what action to take benefits from causal inference methods. If you would like to learn more about debiasing, please refer to [6].

5.2. OPE Using Sampling Methods

Another way to do OPE is through the use of sampling methods. In particular, a very simple replay method [7] can be used to evaluate a CB algorithm (or any other algorithm, for that matter) using logged data from a randomized policy such as an A/B test. In its simplest form (where we assume a uniform random logging policy), the method works as follows (a small code sketch follows the list):

  1. Sample the next user with context X from the logged data.
  2. Decide which condition to assign to that user using the new CB algorithm.
  3. If the chosen condition matches the actual condition in the logged data, add the observed y to the cumulative y counter. If it doesn't match, ignore the sample.
  4. Repeat until all samples are considered.
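Below is a minimal sketch of this replay procedure under the uniform-logging assumption; choose_condition stands in for whatever new CB algorithm is being evaluated.

```python
def replay_evaluate(logged_data, choose_condition):
    """Replay method of [7], assuming the logging policy assigned conditions uniformly at random.

    logged_data      : iterable of (x, logged_d, y) tuples
    choose_condition : function mapping a context x to the condition the new policy would serve
    """
    cumulative_y, matched = 0.0, 0
    for x, logged_d, y in logged_data:       # Step 1: take the next logged user with context x
        chosen_d = choose_condition(x)       # Step 2: what would the new policy serve?
        if chosen_d == logged_d:             # Step 3: keep the observed y only if the conditions match
            cumulative_y += y
            matched += 1
    return cumulative_y, matched             # Step 4: done once all samples are considered

# Toy usage with a hypothetical policy that serves treatment to core users only.
logged_data = [({"core": 1}, "treatment", 1),
               ({"core": 0}, "treatment", 0),
               ({"core": 1}, "control", 1)]
policy = lambda x: "treatment" if x["core"] else "control"
print(replay_evaluate(logged_data, policy))  # -> (1.0, 1)
```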

If the logging policy does not assign treatments uniformly at random, the method needs to be slightly modified. One modification the authors themselves mention is to use rejection sampling (e.g., [8]), whereby we accept samples from the majority treatment less often than samples from the minority treatment in Step 3. Alternatively, we could divide the observed y by the propensity in Step 3 to similarly "down-weight" the more common treatment and "up-weight" the less common one.

In the next section, I use an even simpler method in my evaluation: up- and down-sampling with bootstrap to transform the original non-uniform data into a uniform one, after which the method is applied as-is.

6. Contextual Bandit in Action

To demonstrate contextual bandits in action, I put together a notebook that generates a simulated dataset and compares the cumulative y (or "reward") estimates for new A/B, MAB, and CB policies evaluated on this dataset. Many parts of the code in this notebook are taken from the Contextual Bandits chapter of an amazing book on Reinforcement Learning [9] (highly recommended if you want to dig deeper into Reinforcement Learning using Python) and from two great posts by James LeDoux [10] [11], adapted to the setting discussed here.

The setup is very simple: The original data come from an A/B test that assigned treatment to users with probability 0.75 (so not uniformly at random). Using this randomized logged data, we want to evaluate and compare the following three policies based on their cumulative y:

  1. A new A/B policy that randomly assigns treatment to users with probability 0.4.
  2. A MAB policy that decides which treatment to assign next using an ε-greedy policy that does not take context X into account.
  3. A CB policy that decides which treatment to assign next using an ε-greedy policy that takes context X into account.

I modified the original method described in the Li et al. paper so that, instead of directly sampling from the simulated data (which is 75% treatment and only 25% control in my example), I first down-sample treatment cases and up-sample control cases (both with replacement) to get a new dataset that is exactly 50% treatment and 50% control.

The reason I start with a dataset that is not 50% treatment and 50% control is to show that even when the original data do not come from a policy that assigns treatment and control uniformly at random, we can still use that data for offline evaluation after up- and/or down-sampling it into a 50/50 dataset. As mentioned in the previous section, the logic behind up- and down-sampling is similar to rejection sampling and the related idea of dividing the observed y by the propensity.
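Here is one way such a bootstrap rebalancing step might look (a pandas sketch of the general idea, not the author's notebook code): the majority arm is down-sampled and the minority arm up-sampled, both with replacement, to a common size.

```python
import pandas as pd

def rebalance_to_50_50(df, condition_col="d", seed=0):
    """Bootstrap-resample logged data so treatment and control each make up ~50% of rows."""
    treated = df[df[condition_col] == 1]
    control = df[df[condition_col] == 0]
    target_n = len(df) // 2
    # Down-sample the majority arm and up-sample the minority arm, both with replacement.
    treated_rs = treated.sample(n=target_n, replace=True, random_state=seed)
    control_rs = control.sample(n=target_n, replace=True, random_state=seed)
    return pd.concat([treated_rs, control_rs]).sample(frac=1, random_state=seed)  # shuffle rows

# Toy logged data that is 75% treatment and 25% control, as in the example above.
df = pd.DataFrame({"d": [1, 1, 1, 0] * 25, "y": [1, 0, 1, 0] * 25})
balanced = rebalance_to_50_50(df)
print(balanced["d"].value_counts(normalize=True))   # roughly 0.5 / 0.5
```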

The following figure compares the three policies described above (A/B vs MAB vs CB) in terms of their cumulative y values.

Figure 3: Cumulative Reward Comparison

As can be seen in this figure, cumulative y increases fastest for CB and slowest for A/B, with MAB somewhere in between. While this result is based on a simulated dataset, the patterns observed here still generalize. The reason A/B testing is unable to achieve a high cumulative y is that it never changes its static 60/40% allocation, even after seeing sufficient evidence that treatment is better than control overall. On the other hand, while MAB is able to dynamically update the traffic allocation, it still performs worse than CB because it does not personalize the treatment vs control assignment based on the observed context X. Finally, CB both dynamically changes the traffic allocation and personalizes the treatment, hence the superior performance.

7. Conclusion

Congratulations on making it to the end of this fairly long post! We covered a lot of ground related to contextual bandits, and I hope you leave this page with an appreciation of the usefulness of this fascinating method for online experimentation, especially when treatments need to be personalized.

If you are interested in learning more about contextual bandits (or want to go a step further into multi-step reinforcement learning), I highly recommend the book Mastering Reinforcement Learning with Python by E. Bilgin. The Contextual Bandit chapter of this book is what finally gave me the "aha!" moment in understanding this topic, and I kept reading to learn more about RL in general. As far as offline policy evaluation is concerned, I highly recommend the posts by E. Conti and J. LeDoux, both of which provide great explanations of the methods involved and offer code examples. Regarding debiasing in contextual bandits, the paper by A. Bietti, A. Agarwal, and J. Langford provides a great overview of the methods involved. Finally, while this post focused exclusively on using regression models when building contextual bandits, there is an alternative approach called cost-sensitive classification, which you can start learning about by checking out these lecture notes by A. Agarwal and S. Kakade [12].

Have fun building contextual bandits!

8. Acknowledgements

I would like to thank Colin Dickens for introducing me to contextual bandits as well as providing valuable feedback on this post, Xinyi Zhang for all her helpful feedback throughout the writing process, Jiaqi Gu for a fruitful conversation on sampling methods, and Dennis Feehan for encouraging me to take the time to write this piece.

Unless otherwise noted, all images are by the author.

9. References

[1] Z. Zhao and T. Harinen, Uplift Modeling for Multiple Treatments with Cost Optimization (2019), DSAA

[2] Y. Narang, Recommender systems using LinUCB: A contextual multi-armed bandit approach (2020), Medium

[3] D. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen, A Tutorial on Thompson Sampling (2018), Foundations and Trends in Machine Learning

[4] B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. de Freitas, Taking the Human Out of the Loop: A Review of Bayesian Optimization (2015), IEEE

[5] E. Conti, Offline Policy Evaluation: Run fewer, better A/B tests (2021), Medium

[6] A. Bietti, A. Agarwal, and J. Langford, A Contextual Bandit Bake-off (2021), ArXiv

[7] L. Li, W. Chu, J. Langford, and X. Wang, Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms (2011), WSDM

[8] T. Mandel, Y. Liu, E. Brunskill, and Z. Popovic, Offline Evaluation of Online Reinforcement Learning Algorithms (2016), AAAI

[9] E. Bilgin, Mastering Reinforcement Learning with Python (2020), Packt Publishing

[10] J. LeDoux, Offline Evaluation of Multi-Armed Bandit Algorithms in Python using Replay (2020), LeDoux's personal website

[11] J. LeDoux, Multi-Armed Bandits in Python: Epsilon Greedy, UCB1, Bayesian UCB, and EXP3 (2020), LeDoux's personal website

[12] A. Agarwal and S. Kakade, Off-policy Evaluation and Learning (2019), University of Washington Computer Science Department
