Merge pull request #208 from huggingface/ThomasSimonini/MultiAgents

Add Multi Agents Unit
This commit is contained in:
Thomas Simonini
2023-02-01 16:19:40 +01:00
committed by GitHub
8 changed files with 646 additions and 0 deletions


@@ -162,6 +162,22 @@
    title: Conclusion
  - local: unit6/additional-readings
    title: Additional Readings
- title: Unit 7. Introduction to Multi-Agents and AI vs. AI
  sections:
  - local: unit7/introduction
    title: Introduction
  - local: unit7/introduction-to-marl
    title: An Introduction to Multi-Agent Reinforcement Learning (MARL)
  - local: unit7/multi-agent-setting
    title: Designing Multi-Agent Systems
  - local: unit7/self-play
    title: Self-Play
  - local: unit7/hands-on
    title: Let's train our soccer team to beat your classmates' teams (AI vs. AI)
  - local: unit7/conclusion
    title: Conclusion
  - local: unit7/additional-readings
    title: Additional Readings
- title: What's next? New Units Publishing Schedule
  sections:
  - local: communication/publishing-schedule


@@ -0,0 +1,17 @@
# Additional Readings [[additional-readings]]
## An introduction to multi-agent reinforcement learning
- [Multi-agent reinforcement learning: An overview](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf)
- [Multiagent Reinforcement Learning, Marc Lanctot](https://rlss.inria.fr/files/2019/07/RLSS_Multiagent.pdf)
- [Example of a multi-agent environment](https://www.mathworks.com/help/reinforcement-learning/ug/train-3-agents-for-area-coverage.html?s_eid=PSM_15028)
- [A list of different multi-agent environments](https://agents.inf.ed.ac.uk/blog/multiagent-learning-environments/)
- [Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents](https://bit.ly/3nVK7My)
- [Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning](https://bit.ly/3v7LxaT)
## Self-Play and MA-POCA
- [Self-play: training intelligent adversaries with ML-Agents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
- [Training complex cooperative behaviors with ML-Agents](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors)
- [ML-Agents plays DodgeBall](https://blog.unity.com/technology/ml-agents-plays-dodgeball)
- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)


@@ -0,0 +1,11 @@
# Conclusion
That's all for today. Congrats on finishing this unit and the tutorial!
The best way to learn is to practice and try stuff. **Why not train another agent with a different configuration?**
And don't hesitate to check the [leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos) from time to time
See you in Unit 8 🔥,
## Keep Learning, Stay awesome 🤗

units/en/unit7/hands-on.mdx

@@ -0,0 +1,317 @@
# Hands-on
Now that you've learned the basics of multi-agent reinforcement learning, you're ready to train your first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
And you're going to participate in AI vs. AI challenges, where your trained agent will compete against your classmates' **agents every day and be ranked on a new leaderboard.**
To validate this hands-on for the certification process, you just need to push a trained model. There **are no minimal results to attain to validate it.**
For more information about the certification process, check this section 👉 [https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process)
This hands-on is different, since to get good results **you need to train your agents for 4 to 8 hours**. Given the risk of timeout on Colab, we advise you to train on your computer. You don't need a supercomputer: a simple laptop is good enough for this exercise.
Let's get started! 🔥
## What is AI vs. AI?
AI vs. AI is an open-source tool we developed at Hugging Face to pit agents on the Hub against one another in a multi-agent setting. These models are then ranked on a leaderboard.
The idea of this tool is to provide a robust evaluation: **by evaluating your agent against many others, you'll get a good idea of the quality of your policy.**
More precisely, AI vs. AI consists of three tools:
- A *matchmaking process* that defines the matches (which model against which) and runs them using a background task in the Space.
- A *leaderboard* that gets the match history and displays the models' ELO ratings: [https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos)
- A *Space demo* to visualize your agents playing against others: [https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos](https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos)
We're going to write a blog post to explain this AI vs. AI tool in detail, but to give you the big picture it works this way:
- Every four hours, our algorithm **fetches all the available models for a given environment (in our case ML-Agents-SoccerTwos).**
- It creates a **queue of matches with the matchmaking algorithm.**
- We simulate each match in a Unity headless process and **gather the match result** (1 if the first model won, 0.5 if it's a draw, 0 if the second model won) in a Dataset.
- Then, when all the matches in the queue are done, **we update the ELO score for each model and update the leaderboard.**
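To make the flow concrete, here is a hypothetical Python sketch of that loop. The actual Space code may differ, and `simulate_match` is a stand-in for the Unity headless run:
```python
import itertools
import random

def simulate_match(first: str, second: str) -> float:
    """Stand-in for the Unity headless run: 1.0 if `first` wins, 0.5 draw, 0.0 loss."""
    return random.choice([1.0, 0.5, 0.0])

# 1. Fetch all available models for the environment (hardcoded here).
models = ["user-a/poca-SoccerTwos", "user-b/poca-SoccerTwos", "user-c/poca-SoccerTwos"]

# 2. Build the match queue: every pairing, in random order.
queue = list(itertools.combinations(models, 2))
random.shuffle(queue)

# 3. Run the matches and store the results (in the real pipeline, a Dataset).
results = [(a, b, simulate_match(a, b)) for a, b in queue]

# 4. Once the queue is done, ELO scores are recomputed from `results`.
```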
### Competition Rules
This first AI vs. AI competition **is an experiment**: the goal is to improve the tool in the future with your feedback. So some **outages may happen during the challenge**. But don't worry:
**all the results are saved in a dataset, so we can always restart the calculations correctly without losing information**.
For your model to be correctly evaluated against the others, you need to follow these rules:
1. **You can't change the observation space or action space of the agent.** If you do, your model will not work during evaluation.
2. You **can't use a custom trainer for now**; you need to use the Unity ML-Agents trainers.
3. We provide executables to train your agents. You can also use the Unity Editor if you prefer, **but to avoid bugs, we advise you to use our executables**.
What will make the difference during this challenge are **the hyperparameters you choose**.
The AI vs. AI algorithm will run until April 30th, 2023.
We're constantly trying to improve our tutorials, so **if you find issues in this hands-on**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
### Exchange with your classmates, share advice and ask questions on Discord
- We created a new channel called `ai-vs-ai-challenge` to exchange advice and ask questions.
- If you haven't joined the Discord server yet, you can [join here](https://discord.gg/ydHrjt3WP5)
## Step 0: Install ML-Agents and download the correct executable
⚠ We're going to use an experimental version of ML-Agents which allows you to push and load your models to/from the Hub. **You need to install the same version.**
⚠ ⚠ ⚠ We're not going to use the same version as in Unit 5: Introduction to ML-Agents ⚠ ⚠ ⚠
We advise you to use [conda](https://docs.conda.io/en/latest/) as a package manager and to create a new environment.
With conda, we create a new environment called `rl`:
```bash
conda create --name rl python=3.8
conda activate rl
```
To train our agents correctly and push to the Hub, we need to install an experimental version of ML-Agents (the `aivsai` branch of the Hugging Face ML-Agents fork):
```bash
git clone --branch aivsai https://github.com/huggingface/ml-agents
```
When the cloning is done (it's about 2.5 GB), we go inside the repository and install the package:
```bash
cd ml-agents
pip install -e ./ml-agents-envs
pip install -e ./ml-agents
```
We also need to install PyTorch:
```bash
pip install torch
```
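You can check that the installation worked by running `mlagents-learn --help`.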
Now that it's installed, we need to add the environment training executable. Based on your operating system, you need to download one of the executables below, unzip it, and place it in a new folder inside `ml-agents` called `training-envs-executables`.
At the end, your executable should be at `ml-agents/training-envs-executables/SoccerTwos`.
Windows: Download [this executable](https://drive.google.com/file/d/1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP/view?usp=sharing)
Linux (Ubuntu): Download [this executable](https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing)
Mac: Download [this executable](https://drive.google.com/file/d/14D8w6XYLRlXCSurdZxe70hwYULcuWxWZ/view?usp=share_link)
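Depending on your OS, you may also need to make the downloaded file executable (for example, on Linux or Mac, with `chmod -R 755 ./training-envs-executables/SoccerTwos`).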
## Step 1: Understand the environment
The environment is called `SoccerTwos`, and it was made by the Unity ML-Agents team. You can find its documentation [here](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos)
The goal in this environment **is to get the ball into the opponent's goal while preventing the ball from entering your own goal.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
<figcaption>This environment was made by the <a href="https://github.com/Unity-Technologies/ml-agents"> Unity MLAgents Team</a></figcaption>
</figure>
### The reward function
The reward function is:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccerreward.png" alt="SoccerTwos Reward"/>
### The observation space
The observation space is a vector of size 336:
- 11 ray-casts forward distributed over 120 degrees (264 state dimensions)
- 3 ray-casts backward distributed over 90 degrees (72 state dimensions)
- Each of these ray-casts can detect 6 object types:
- Ball
- Blue Goal
- Purple Goal
- Wall
- Blue Agent
- Purple Agent
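As a sanity check, here's how those numbers add up. This assumes the standard ML-Agents ray observation layout: one value per detectable tag, plus two extra values per ray (hit flag and normalized distance), stacked over 3 consecutive observations:
```python
# Rough accounting of the 336-dimensional observation vector (assumed layout).
detectable_tags = 6
values_per_ray = detectable_tags + 2  # + hit flag and normalized distance (assumed)
stacked_observations = 3              # recent observations stacked together (assumed)

forward = 11 * values_per_ray * stacked_observations   # 264 state dimensions
backward = 3 * values_per_ray * stacked_observations   # 72 state dimensions

print(forward, backward, forward + backward)  # 264 72 336
```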
### The action space
The action space consists of three discrete branches:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/socceraction.png" alt="SoccerTwos Action"/>
## Step 2: Understand MA-POCA
We know how to train agents to play against others: **we can use self-play.** This is a perfect technique for a 1vs1 game.
But in our case, each team has 2 agents playing 2vs2. How can we **train cooperative behavior for groups of agents?**
As explained in the [Unity Blog](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors), agents typically receive a reward as a group (+1, minus a penalty) when the team scores a goal. This implies that **every agent on the team is rewarded even if each agent didn't contribute equally to the win**, which makes it difficult to learn what to do independently.
The Unity ML-Agents team developed a solution: a new multi-agent trainer called *MA-POCA (Multi-Agent POsthumous Credit Assignment)*.
The idea is simple but powerful: a centralized critic **processes the states of all agents on the team to estimate how well each agent is doing**. Think of this critic as a coach.
This allows each agent to **make decisions based only on what it perceives locally**, while **evaluating how good its behavior is in the context of the whole group**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/mapoca.png" alt="MA POCA"/>
<figcaption>This illustrates MA-POCA's centralized learning and decentralized execution. Source: <a href="https://blog.unity.com/technology/ml-agents-plays-dodgeball">ML-Agents Plays Dodgeball</a>
</figcaption>
</figure>
The solution, then, is to use self-play with the MA-POCA trainer (called `poca`): the `poca` trainer helps us train cooperative behavior, and self-play provides the opponent team.
If you want to dive deeper into the MA-POCA algorithm, you should read the paper they published [here](https://arxiv.org/pdf/2111.05992.pdf) and the sources we put in the additional readings section.
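To make this idea concrete, here is a toy numpy sketch of the general pattern (centralized training, decentralized execution). It's an illustration only, not the actual MA-POCA trainer, which among other things also handles agents removed mid-episode:
```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_size, n_actions = 2, 336, 3

# Decentralized execution: each agent acts from its *local* observation only.
w_actor = rng.normal(size=(obs_size, n_actions))
def act(local_obs):
    return int(np.argmax(local_obs @ w_actor))  # toy policy head

# Centralized training: the critic ("the coach") sees *every* teammate's
# observation at once and estimates a single team value, used to credit
# each agent's contribution.
w_critic = rng.normal(size=(n_agents * obs_size,))
def team_value(all_obs):
    return float(np.concatenate(all_obs) @ w_critic)  # toy value estimate

observations = [rng.normal(size=obs_size) for _ in range(n_agents)]
actions = [act(obs) for obs in observations]  # local decisions
value = team_value(observations)              # global evaluation
```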
## Step 3: Define the config file
We already learned in [Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction) that in ML-Agents, you define **the training hyperparameters in `config.yaml` files.**
There are multiple hyperparameters. To know them better, you should check the explanation of each one in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)**
The config file we're going to use here is `./config/poca/SoccerTwos.yaml`. It looks like this:
```yaml
behaviors:
  SoccerTwos:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 50000000
    time_horizon: 1000
    summary_freq: 10000
    self_play:
      save_steps: 50000
      team_change: 200000
      swap_steps: 2000
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
```
Compared to Pyramids or SnowballTarget, this config has new hyperparameters under a `self_play` section. How you modify them can be critical to getting good results.
The advice I can give you here is to check the explanation and recommended value for each parameter (especially the self-play ones) in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md).**
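If you prefer to inspect or tweak the file programmatically, here is a quick sketch (assuming PyYAML is installed, `pip install pyyaml`):
```python
import yaml

with open("./config/poca/SoccerTwos.yaml") as f:
    config = yaml.safe_load(f)

behavior = config["behaviors"]["SoccerTwos"]
print(behavior["trainer_type"])  # poca
print(behavior["self_play"])     # the self-play hyperparameters to tune

# Example: save a new opponent snapshot more often.
behavior["self_play"]["save_steps"] = 20000
with open("./config/poca/SoccerTwos.yaml", "w") as f:
    yaml.safe_dump(config, f)
```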
Now that you've modified the config file, you're ready to train your agents.
## Step 4: Start the training
To train the agents, we need to **launch mlagents-learn and select the executable containing the environment.**
We define four parameters:
1. `mlagents-learn <config>`: the path to the hyperparameter config file.
2. `--env`: the path to the environment executable.
3. `--run-id`: the name you want to give to your training run id.
4. `--no-graphics`: to not launch the visualization during training.
Depending on your hardware, 5M timesteps (the recommended value) will take 5 to 8 hours of training. You can continue using your computer in the meantime, but I advise you to deactivate your computer's standby mode to prevent the training from being stopped.
Depending on the executable you use (Windows, Ubuntu, Mac), the training command will look like this (your executable path can be different, so don't hesitate to check before running):
```bash
mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos.exe --run-id="SoccerTwos" --no-graphics
```
The executable contains 8 copies of SoccerTwos.
⚠️ It's normal if you don't see a big increase in the ELO score (or even see a decrease below 1200) before 2M timesteps, since your agents will spend most of their time moving randomly on the field before being able to score.
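💡 If the training gets interrupted, you can resume it from the last checkpoint by re-running the same `mlagents-learn` command with the `--resume` flag added.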
## Step 5: Push the agent to the Hugging Face Hub
Now that we've trained our agents, we're **ready to push them to the Hub so they can participate in the AI vs. AI challenge, and so you can visualize them playing in your browser 🔥.**
To be able to share your model with the community, there are three more steps to follow:
1⃣ (If it's not already done) create an account on Hugging Face ➡ [https://huggingface.co/join](https://huggingface.co/join)
2⃣ Sign in and get your authentication token from the Hugging Face website.
Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
Copy the token, run the following command, and paste the token when prompted:
```bash
huggingface-cli login
```
Then, we need to run `mlagents-push-to-hf`.
We define four parameters:
1. `--run-id`: the name of the training run id.
2. `--local-dir`: where the agent was saved. It's `results/<run_id name>`, so in my case `results/SoccerTwos`.
3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It's always `<your huggingface username>/<the repo name>`.
If the repo does not exist, **it will be created automatically**.
4. `--commit-message`: since HF repos are git repositories, you need to define a commit message.
In my case:
```bash
mlagents-push-to-hf --run-id="SoccerTwos" --local-dir="./results/SoccerTwos" --repo-id="ThomasSimonini/poca-SoccerTwos" --commit-message="First Push"
```
And here's the template for your own command:
```bash
mlagents-push-to-hf --run-id=<your-run-id> --local-dir=<your-local-dir> --repo-id=<your-repo-id> --commit-message="<your-commit-message>"
```
If everything worked, you should see this at the end of the process (but with a different URL 😆):
```
Your model is pushed to the Hub. You can view your model here: https://huggingface.co/ThomasSimonini/poca-SoccerTwos
```
It's the link to your model. It contains a model card that explains how to use it, your TensorBoard logs, and your config file. **What's awesome is that it's a git repository, which means you can have different commits, update your repository with a new push, etc.**
## Step 6: Verify that your model is ready for the AI vs. AI challenge
Now that your model is pushed to the Hub, **it's going to be added automatically to the AI vs. AI challenge model pool.** It can take a little while before your model is added to the leaderboard, given that we run matches every four hours.
But to make sure that everything works perfectly, you need to check:
1. That your model has the `ML-Agents-SoccerTwos` tag. This is the tag we use to select models to be added to the challenge pool. To check it, go to your model page and look at the tags
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/verify1.png" alt="Verify"/>
If it's not there, you just need to modify the README and add the tag
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/verify2.png" alt="Verify"/>
2. That you have a `SoccerTwos.onnx` file
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/verify3.png" alt="Verify"/>
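You can also run both checks programmatically; here's a small sketch using the `huggingface_hub` library (replace the repo id with yours):
```python
from huggingface_hub import HfApi

repo_id = "ThomasSimonini/poca-SoccerTwos"  # replace with your repo id
info = HfApi().model_info(repo_id)

# Check the tag used to select models for the challenge pool.
assert "ML-Agents-SoccerTwos" in info.tags, "Missing the ML-Agents-SoccerTwos tag"
# Check that the ONNX file is present in the repo.
assert any(s.rfilename == "SoccerTwos.onnx" for s in info.siblings), "Missing SoccerTwos.onnx"
print("Model looks ready for the AI vs. AI challenge ✅")
```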
If you want to train your agent again or train a new version, we strongly suggest you create a new model repo when pushing to the Hub.
## Step 7: Visualize some matches in our demo
Now that your model is part of the AI vs. AI challenge, **you can visualize how good it is compared to others**: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos
To do that, go to the demo and:
- Select your model as the blue team (or the purple team if you prefer) and another model as the opponent. Good opponents to compare your model against are either the one at the top of the leaderboard or the [baseline model](https://huggingface.co/unity/MLAgents-SoccerTwos).
The matches you watch live are not used in the calculation of your results, **but they are a good way to visualize how good your agent is**.
And don't hesitate to share the best score your agent gets on Discord in the `#rl-i-made-this` channel 🔥


@@ -0,0 +1,55 @@
# An Introduction to Multi-Agent Reinforcement Learning (MARL)
## From single agent to multiple agents
In the first unit, we learned to train agents in a single-agent system, where our agent was alone in its environment: **it was not cooperating or collaborating with other agents**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/patchwork.jpg" alt="Patchwork"/>
<figcaption>
A patchwork of all the environments you've trained your agents on since the beginning of the course
</figcaption>
</figure>
When we do multi-agent reinforcement learning (MARL), we are in a situation where we have multiple agents **that share and interact in a common environment**.
For instance, you can think of a warehouse where **multiple robots need to navigate to load and unload packages**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/warehouse.jpg" alt="Warehouse"/>
<figcaption> [Image by upklyak](https://www.freepik.com/free-vector/robots-warehouse-interior-automated-machines_32117680.htm#query=warehouse robot&position=17&from_view=keyword) on Freepik </figcaption>
</figure>
Or a road with **several autonomous vehicles**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/selfdrivingcar.jpg" alt="Self driving cars"/>
<figcaption>
[Image by jcomp](https://www.freepik.com/free-vector/autonomous-smart-car-automatic-wireless-sensor-driving-road-around-car-autonomous-smart-car-goes-scans-roads-observe-distance-automatic-braking-system_26413332.htm#query=self driving cars highway&position=34&from_view=search&track=ais) on Freepik
</figcaption>
</figure>
In these examples, we have **multiple agents interacting in the environment and with the other agents**. This implies defining a multi-agent system. But first, let's understand the different types of multi-agent environments.
## Different types of multi-agent environments
Given that in a multi-agent system, agents interact with other agents, we can have different types of environments:
- *Cooperative environments*: where your agents need **to maximize the common benefit**.
For instance, in a warehouse, **robots must collaborate to load and unload packages as efficiently (i.e., as fast) as possible**.
- *Competitive/Adversarial environments*: in this case, each agent **wants to maximize its own benefit by minimizing the opponent's**.
For example, in a game of tennis, **each agent wants to beat the other agent**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/tennis.png" alt="Tennis"/>
- *Mixed adversarial and cooperative environments*: like in our SoccerTwos environment, two agents are part of a team (blue or purple): they need to cooperate with each other and beat the opponent team.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
<figcaption>This environment was made by the <a href="https://github.com/Unity-Technologies/ml-agents">Unity MLAgents Team</a></figcaption>
</figure>
So now we might wonder: how can we design these multi-agent systems? Said differently, **how can we train agents in a multi-agent setting**?


@@ -0,0 +1,36 @@
# Introduction [[introduction]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/thumbnail.png" alt="Thumbnail"/>
Since the beginning of this course, we learned to train agents in a *single-agent system* where our agent was alone in its environment: it was **not cooperating or collaborating with other agents**.
This worked great, and the single-agent system is useful for many applications.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/patchwork.jpg" alt="Patchwork"/>
<figcaption>
A patchwork of all the environments youve trained your agents on since the beginning of the course
</figcaption>
</figure>
But, as humans, **we live in a multi-agent world**. Our intelligence comes from interaction with other agents. And so, our **goal is to create agents that can interact with other humans and other agents**.
Consequently, we must study how to train deep reinforcement learning agents in a *multi-agent system* to build robust agents that can adapt, collaborate, or compete.
So today we're going to **learn the basics of this fascinating topic: multi-agent reinforcement learning (MARL)**.
And the most exciting part is that during this unit, you're going to train your first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
And you're going to participate in the **AI vs. AI challenge**, where your trained agent will compete against your classmates' agents every day and be ranked on a [new leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos).
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
<figcaption>This environment was made by the <a href="https://github.com/Unity-Technologies/ml-agents">Unity MLAgents Team</a></figcaption>
</figure>
So let's get started!


@@ -0,0 +1,57 @@
# Designing Multi-Agent Systems
For this section, you're going to watch this excellent introduction to multi-agent systems made by <a href="https://www.youtube.com/channel/UCq0imsn84ShAe9PBOFnoIrg"> Brian Douglas </a>.
<Youtube id="qgb0gyrpiGk" />
In this video, Brian talks about how to design multi-agent systems. He specifically takes a setting with multiple vacuum-cleaner agents and asks: how can they **cooperate with each other**?
We have two possible approaches to designing this multi-agent reinforcement learning (MARL) system.
## Decentralized system
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/decentralized.png" alt="Decentralized"/>
<figcaption>
Source: <a href="https://www.youtube.com/watch?v=qgb0gyrpiGk"> Introduction to Multi-Agent Reinforcement Learning </a>
</figcaption>
</figure>
In decentralized learning, **each agent is trained independently from others**. In the example given, each vacuum learns to clean as many places as it can **without caring about what other vacuums (agents) are doing**.
The benefit is that **since no information is shared between agents, these vacuums can be designed and trained like we train single agents**.
The idea here is that **our training agent will consider other agents as part of the environment dynamics**, not as agents.
However, the big drawback of this technique is that it will **make the environment non-stationary** since the underlying Markov decision process changes over time as other agents are also interacting in the environment.
And this is problematic for many Reinforcement Learning algorithms **that can't reach a global optimum with a non-stationary environment**.
## Centralized approach
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/centralized.png" alt="Centralized"/>
<figcaption>
Source: <a href="https://www.youtube.com/watch?v=qgb0gyrpiGk"> Introduction to Multi-Agent Reinforcement Learning </a>
</figcaption>
</figure>
In this architecture, **we have a high-level process that collects agents' experiences**: the experience buffer. We'll use these experiences **to learn a common policy**.
For instance, in the vacuum cleaner example, the observation will be:
- The coverage map of the vacuums.
- The position of all the vacuums.
We use that collective experience **to train a policy that will move all three robots in the most beneficial way as a whole**. So each robot is learning from the common experience.
And we have a stationary environment, since all the agents are treated as a larger entity and they know what the other agents' policies are (since they're the same as their own).
If we recap:
- In the *decentralized approach*, we **treat all agents independently, without considering the existence of the other agents.**
  - In this case, all agents **consider other agents as part of the environment**.
  - **Since the environment is non-stationary, there's no guarantee of convergence**.
- In the *centralized approach*:
  - A **single policy is learned from all the agents**.
  - It takes the present state of the environment as input, and the policy outputs joint actions.
  - The reward is global.
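Here is a schematic Python sketch contrasting the two approaches, with toy stand-ins for the real components (none of this is ML-Agents code):
```python
import random

N_AGENTS = 3

class Policy:
    """Toy stand-in for a learned policy."""
    def act(self, observation):
        return random.choice(["up", "down", "left", "right"])

# Decentralized: one independent policy per agent, local observations only;
# every other agent is just part of the (non-stationary) environment dynamics.
local_policies = [Policy() for _ in range(N_AGENTS)]
local_obs = [f"what vacuum {i} sees" for i in range(N_AGENTS)]
actions = [policy.act(obs) for policy, obs in zip(local_policies, local_obs)]

# Centralized: a single shared policy receives the global state (e.g. the
# coverage map and all positions) and produces one action per agent; a single
# global reward trains it.
shared_policy = Policy()
global_state = " | ".join(local_obs)
joint_action = [shared_policy.act(global_state) for _ in range(N_AGENTS)]
```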


@@ -0,0 +1,137 @@
# Self-Play: a classic technique to train competitive agents in adversarial games
Now that we've studied the basics of multi-agent RL, we're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial game with SoccerTwos, a 2vs2 game**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
<figcaption>This environment was made by the <a href="https://github.com/Unity-Technologies/ml-agents">Unity MLAgents Team</a></figcaption>
</figure>
## What is Self-Play?
Training agents correctly in an adversarial game can be **quite complex**.
On the one hand, we need to find a well-trained opponent to play against our training agent. On the other hand, even if we have a very well-trained opponent, how is our agent going to improve its policy when the opponent is far too strong?
Think of a child who has just started to learn soccer. Playing against a very good player would be useless, since it would be too hard to win or even to get the ball from time to time. The child would lose continuously without ever having time to learn a good policy.
The best solution is **an opponent that is at the same level as the agent and upgrades its level as the agent upgrades its own**: if the opponent is too strong, we learn nothing; if it is too weak, we overlearn behavior that is useless against a stronger opponent.
This solution is called *self-play*. In self-play, **the agent uses former copies of itself (of its policy) as an opponent**. This way, the agent plays against an agent of the same level (challenging but not too much), has opportunities to gradually improve its policy, and then updates its opponent as it becomes better. It's a way to bootstrap an opponent and progressively increase the opponent's complexity.
It's the same way humans learn in competition:
- We start by training against an opponent of a similar level
- Then we learn from them, and when we've acquired some skills, we can move further with stronger opponents.
We do the same with self-play:
- We **start with a copy of our agent as an opponent**; this way, the opponent is at a similar level.
- We **learn from it**, and when we acquire some skills, we **update our opponent with a more recent copy of our training policy**.
The theory behind self-play is not new. It was already used by Arthur Samuel's checkers player system in the fifties and by Gerald Tesauro's TD-Gammon in 1992. If you want to learn more about the history of self-play, [check this very good blog post by Andrew Cohen](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
## Self-Play in ML-Agents
Self-play is integrated into the ML-Agents library and is managed by multiple hyperparameters that we're going to study. As explained in the documentation, the main tradeoff is **between the skill level and generality of the final policy and the stability of learning**.
Training against a set of slowly changing or unchanging adversaries with low diversity **results in more stable training, but risks overfitting if opponents change too slowly.**
We therefore need to control:
- How **often we change opponents**, with the `swap_steps` and `team_change` parameters.
- The **number of opponents saved**, with the `window` parameter. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors, since it will contain policies from earlier in the training run.
- The **probability of playing against the current self vs. an opponent** sampled from the pool, with `play_against_latest_model_ratio`. A larger value of `play_against_latest_model_ratio` indicates that an agent will play against the current opponent more often.
- The **number of training steps before saving a new opponent**, with the `save_steps` parameter. A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles, since the policy receives more training between snapshots.
To get more details about these hyperparameters, you definitely need [to check this part of the documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play)
## The ELO Score to evaluate our agent
### What is ELO Score?
In adversarial games, the **cumulative reward is not always a meaningful metric for tracking learning progress**, because this metric **depends on the skill of the opponent.**
Instead, we're using an ***ELO rating system*** (named after Arpad Elo) that calculates the **relative skill level** between two players from a given population in a zero-sum game.
In a zero-sum game, one agent wins and the other agent loses. It's a mathematical representation of a situation in which each participant's gain or loss of utility **is exactly balanced by the gain or loss of utility of the other participants.** We talk about zero-sum games because the sum of utilities is equal to zero.
This ELO (starting at a specific score: frequently 1200) can decrease initially but should increase progressively during the training.
The ELO rating is **inferred from wins, losses, and draws against other players.** This means that a player's rating depends **on the ratings of their opponents and the results scored against them.**
The Elo score represents the **relative skill** of a player in a zero-sum game. **We say relative because it depends on the performance of opponents.**
The central idea is to think of the performance of a player **as a random variable that is normally distributed.**
The difference in ratings between two players serves as **the predictor of the outcome of a match.** If the favored player wins, only a few points are taken from the opponent, since the outcome was already expected.
After every game:
- The winning player takes **points from the losing one.**
- The number of points is determined **by the difference in the two players' ratings (hence relative).**
- If the higher-rated player wins → few points will be taken from the lower-rated player.
- If the lower-rated player wins → a lot of points will be taken from the high-rated player.
- If its a draw → the lower-rated player gains a few points from the higher.
So if players A and B have ratings \\(R_{A}\\) and \\(R_{B}\\), the **expected scores are** given by:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/elo1.png" alt="ELO Score"/>
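Written out: \\(E_{A} = \frac{1}{1+10^{(R_{B}-R_{A})/400}}\\) and \\(E_{B} = \frac{1}{1+10^{(R_{A}-R_{B})/400}}\\) (these are the values used in the worked example below).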
Then, at the end of the game, we need to update the players' actual Elo scores. We use a linear adjustment **proportional to the amount by which the player over-performed or under-performed.**
We also define a maximum adjustment per game: the K-factor.
- K=16 for masters.
- K=32 for weaker players.
If player A was expected to score \\(E_{A}\\) points but actually scored \\(S_{A}\\) points, then the player's rating is updated using the formula:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/elo2.png" alt="ELO Score"/>
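In other words: \\(R'_{A} = R_{A} + K(S_{A} - E_{A})\\), where \\(S_{A}\\) is 1 for a win, 0.5 for a draw, and 0 for a loss.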
### Example
If we take an example:
Player A has a rating of 2600
Player B has a rating of 2300
- We first calculate the expected score:
\\(E_{A} = \frac{1}{1+10^{(2300-2600)/400}} = 0.849 \\)
\\(E_{B} = \frac{1}{1+10^{(2600-2300)/400}} = 0.151 \\)
- If the organizers determined that K=16 and A wins, the new rating would be:
\\(ELO_A = 2600 + 16*(1-0.849) = 2602 \\)
\\(ELO_B = 2300 + 16*(0-0.151) = 2298 \\)
- If the organizers determined that K=16 and B wins, the new rating would be:
\\(ELO_A = 2600 + 16*(0-0.849) = 2586 \\)
\\(ELO_B = 2300 + 16 *(1-0.151) = 2314 \\)
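These updates are easy to script. Here is a small Python helper that reproduces the numbers above (shown before rounding):
```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 16):
    """score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if B wins."""
    e_a, e_b = expected_score(r_a, r_b), expected_score(r_b, r_a)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - e_b)

print(elo_update(2600, 2300, 1.0))  # A wins -> (~2602.4, ~2297.6)
print(elo_update(2600, 2300, 0.0))  # B wins -> (~2586.4, ~2313.6)
```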
### The Advantages
Using the ELO score has multiple advantages:
- Points are **always balanced** (more points are exchanged when there is an unexpected outcome, but the sum is always the same).
- It's a **self-correcting system**: if a player wins against a weaker player, they will only gain a few points.
- It **works with team games**: we calculate the average rating for each team and use it in the Elo formula.
### The Disadvantages
- ELO **does not take into account the individual contribution** of each player on the team.
- *Rating deflation*: **over time, more skill is required to obtain the same rating**.
- **Ratings can't be compared across different eras**.