Multi-Armed Bandit Outreach

Abstract

This is the Multi-Armed Bandit Outreach project. It was built to provide outreach to students who are interested in machine learning and are between the 11th grade and college sophomore education levels.

The application uses a real-world scenario of selecting a restaurant to eat at. Each restaurant is given a reward distribution, and the goal of the system is to find which restaurant is the optimal choice at each iteration. The participant is shown a simulation of the problem both without context and with it. This helps them build an understanding of the purpose of adding context to multi-armed bandits, as well as see how context affects everyday scenarios. The participant can also change the scale of the system by changing the number of restaurants or the number of iterations, or by selecting a different bandit model.
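As a rough, assumed illustration of why context matters (the feature names and numbers below are invented for this example and are not taken from the project), a restaurant's reward can depend on something like the day of the week, so the optimal restaurant changes from one iteration to the next:

    import random

    # Hypothetical mean rewards per restaurant, shifted by a simple context
    # ("weekday" vs. "weekend"); all values are invented for illustration.
    means = {
        "weekday": [4.0, 6.0, 3.0],
        "weekend": [7.0, 2.0, 5.0],
    }
    context = random.choice(["weekday", "weekend"])
    rewards = [random.gauss(m, 1.0) for m in means[context]]
    # A bandit that ignores the context can only learn which restaurant is best
    # on average; a contextual bandit can learn which is best for today.
    print(context, rewards)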

The user can select from Epsilon Greedy, Thompson Sampling, Upper Confidence Bound, and Random Selection models. The participant may also choose to run the application in a comparison mode that runs each bandit model and shows its results. This functionality helps the participant understand that not all multi-armed bandits operate in exactly the same way and that there are different solutions to the same problem: even though they all fall under the category of multi-armed bandits, each model approaches the problem differently.
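For intuition, here is a minimal, self-contained sketch of the epsilon-greedy idea in Python. It is an assumed illustration, not the project's implementation; the variable names and reward values are invented.

    import random

    def epsilon_greedy_pick(avg_rewards, epsilon=0.1):
        """Explore with probability epsilon, otherwise pick the best estimate so far."""
        if random.random() < epsilon:
            return random.randrange(len(avg_rewards))
        return max(range(len(avg_rewards)), key=lambda arm: avg_rewards[arm])

    # Running average rewards and pull counts for three hypothetical restaurants.
    true_means = [2.0, 5.0, 3.5]
    avg_rewards = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for _ in range(200):
        arm = epsilon_greedy_pick(avg_rewards)
        reward = random.gauss(true_means[arm], 1.0)   # sample a reward from that restaurant
        counts[arm] += 1
        avg_rewards[arm] += (reward - avg_rewards[arm]) / counts[arm]  # incremental mean update

Thompson Sampling and Upper Confidence Bound replace the explicit epsilon with posterior sampling and confidence-based exploration, respectively, while Random Selection ignores the reward estimates entirely.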


Download

This project can be accessed at the OutreachMAB GitHub repository: https://github.com/iCMAB/OutreachMAB

Setup

Python 3.10 is required for this application. Clone the repository to access the project, install the dependencies, and run it as shown below.

Repository Cloning

  1. Clone
    git clone https://github.com/iCMAB/OutreachMAB.git
  2. Install dependencies
    pip3 install -r requirements.txt
  3. Run
    python3 main.py

Application Usage

When running the program, there are 3 important screens to pay attention to:

1. Settings Selection

A screen prompting the user to select a multi-armed bandit model, the number of arms and the number of iterations.

Here you can change the bandit model, the number of arms (restaurants in the context of the problem), and the number of iterations.
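These three settings correspond roughly to something like the following sketch; the key names and values are hypothetical and not taken from the project's source.

    # Hypothetical sketch of the three settings the screen exposes;
    # names and values are illustrative only.
    settings = {
        "model": "epsilon_greedy",   # or "thompson_sampling", "ucb", "random"
        "num_arms": 5,               # number of restaurants
        "num_iterations": 100,
    }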

2. Simulation

The simulation screen, including graphs for cumulative reward and regret, a distribution of rewards received from each restaurant, and a control center for stepping through the iterations.

The simulation consists of three major parts. The first is the control center in the top left, where the participant steps through the iterations of the simulation. The second is the pair of reward and regret graphs, one cumulative and one per iteration. The last part is the set of graphs along the right side of the screen, which show the current distribution of rewards that the bandit has collected from each restaurant.
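As a self-contained sketch of what those graphs track (assumed for illustration, not the project's code; the restaurant means and the use of a purely random policy are invented here), reward is whatever the chosen restaurant returns, and regret is the gap between the optimal restaurant and the chosen one, both accumulated over the iterations:

    import random

    true_means = [3.0, 5.0, 4.0, 6.5, 2.0]   # hypothetical restaurant mean rewards
    best_mean = max(true_means)
    cumulative_reward = 0.0
    cumulative_regret = 0.0
    for t in range(100):
        arm = random.randrange(len(true_means))           # Random Selection, for simplicity
        reward = random.gauss(true_means[arm], 1.0)       # reward drawn from that restaurant
        cumulative_reward += reward
        cumulative_regret += best_mean - true_means[arm]  # expected gap to the optimal choice
    print(cumulative_reward, cumulative_regret)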

3. Results

The final results of the simulation. In this example run, the total reward is 627.8 and the total regret is 194.6; the restaurants were selected 11, 18, 6, 59, and 6 times, respectively.

At the conclusion of the simulation, the final graphs are shown.

The two larger graphs show reward and regret in two different ways: the first graph shows cumulative reward and regret, while the second shows the average reward and regret per iteration. Each of these graphs has a description beneath it explaining what it represents. On the right, the final reward distribution the bandit has learned for each restaurant is shown.
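One plausible reading of the relationship between the two graphs is that the average at iteration t is the cumulative total divided by t; the numbers below are made up purely for illustration.

    # Running average computed from a list of cumulative totals; values are illustrative.
    cumulative_rewards = [3.1, 8.4, 12.0, 19.7]
    average_rewards = [total / (i + 1) for i, total in enumerate(cumulative_rewards)]
    print(average_rewards)   # -> [3.1, 4.2, 4.0, 4.925]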


The Team

Carter Vail
Fourth Year Software Engineering student. Interested in software development.
LinkedIn

Dante Falardeau
Fifth year Software Engineer interested in integrating automation into existing workflows.
LinkedIn

Devroop Kar
Incoming PhD Student in Computing and Information Sciences. Data Engineer and AI Enthusiast.
LinkedIn

Dr. Daniel Krutz
Director of the AWARE LAB and assistant professor. Interested in Self-Adaptive Systems, Strategic Reasoning, and Computing Education.