From 8b2aec1474e15b58c6b39c88c0867d58a6981d58 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Mon, 30 Jan 2023 21:27:15 +0100
Subject: [PATCH 01/29] Add first part of the unit

---
 units/en/_toctree.yml                   |  8 ++++
 units/en/unit7/introduction-to-marl.mdx | 57 +++++++++++++++++++++++++
 units/en/unit7/introduction.mdx         | 26 +++++++++++
 units/en/unit7/multi-agent-setting.mdx  |  1 +
 4 files changed, 92 insertions(+)
 create mode 100644 units/en/unit7/introduction-to-marl.mdx
 create mode 100644 units/en/unit7/introduction.mdx
 create mode 100644 units/en/unit7/multi-agent-setting.mdx

diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 3b6f440..4f34685 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -162,6 +162,14 @@
     title: Conclusion
   - local: unit6/additional-readings
     title: Additional Readings
+- title: Unit 7. Introduction to Multi-Agents and AI vs AI
+  sections:
+  - local: unit7/introduction
+    title: Introduction
+  - local: unit7/introduction-to-marl
+    title: An introduction to Multi-Agents Reinforcement Learning (MARL)
+  - local: unit7/multi-agent-setting
+    title: Designing Multi-Agents systems
 - title: What's next? New Units Publishing Schedule
   sections:
   - local: communication/publishing-schedule

diff --git a/units/en/unit7/introduction-to-marl.mdx b/units/en/unit7/introduction-to-marl.mdx
new file mode 100644
index 0000000..f0a3734
--- /dev/null
+++ b/units/en/unit7/introduction-to-marl.mdx
@@ -0,0 +1,57 @@
+# An introduction to Multi-Agents Reinforcement Learning (MARL)
+
+## From single agent to multiple agents
+
+From the first unit, we learned to train agents in a single-agent system, where our agent was alone in its environment: **it was not cooperating or collaborating with other agents**.
+
+<figure>
+”Patchwork”/ +
+A patchwork of all the environments you've trained your agents on since the beginning of the course +
+
+

When we do multi-agent reinforcement learning (MARL), we are in a situation where we have multiple agents **that share and interact in a common environment**.

For instance, you can think of a warehouse where **multiple robots need to navigate to load and unload packages**.

<figure>
+”Warehouse”/ +
Image by upklyak on Freepik
+ +Or a road with **several autonomous vehicles**. + +
+”Self +
+Image by jcomp on Freepik +
+
+

In these examples, we have multiple agents interacting in the environment and with the other agents. This implies defining a multi-agent system. But first, let's understand the different types of multi-agent environments.

## Different types of multi-agent environments

Given that, in a multi-agent system, agents interact with other agents, we can have different types of environments:

- *Cooperative environments*: where your agents need **to maximize the common benefits**.

For instance, in a warehouse, **robots must collaborate to load and unload the packages as efficiently as possible**.

- *Competitive/Adversarial environments*: in this case, your agent **wants to maximize its benefits by minimizing its opponent's**.

For example, in a game of tennis, **each agent wants to beat the other agent**.

”Tennis”/

- *Mixed adversarial and cooperative environments*: like in our SoccerTwos environment, two agents are part of a team (blue or purple): they need to cooperate with each other and beat the opponent team.

<figure>
+”SoccerTwos”/ + +
+ +
This environment was made by the Unity MLAgents Team
+

So now we can ask how we design these multi-agent systems. Said differently, how can we train agents in a multi-agent setting?

diff --git a/units/en/unit7/introduction.mdx b/units/en/unit7/introduction.mdx
new file mode 100644
index 0000000..56ed60f
--- /dev/null
+++ b/units/en/unit7/introduction.mdx
@@ -0,0 +1,26 @@
+# Introduction [[introduction]]
+
+”Thumbnail”/
+
+Since the beginning of this course, we learned to train agents in a single-agent system, where our agent was alone in its environment: it was not cooperating or collaborating with other agents.
+
+Our different agents worked great, and the single-agent system is useful for many applications.
+
+But, as humans, we live in a multi-agent world. Our intelligence comes from interaction with other agents. And so, our goal is to create agents that can interact with other humans and other agents.
+
+Consequently, we must study how to train deep reinforcement learning agents in a multi-agent system to build robust agents that can adapt, collaborate, or compete.
+
+So today, we’re going to learn the basics of this fascinating topic of multi-agent reinforcement learning (MARL).
+
+And the most exciting part is that during this unit, you’re going to train your first agents in a multi-agent system: a 2vs2 soccer team that needs to beat the opponent team.
+
+And you’re going to participate in AI vs. AI challenges where your trained agent will compete against other classmates’ agents every day and be ranked on a new leaderboard.
+
+<figure>
+”SoccerTwos”/ + +
+ +
This environment was made by the Unity MLAgents Team
+

So let’s get started!

diff --git a/units/en/unit7/multi-agent-setting.mdx b/units/en/unit7/multi-agent-setting.mdx
new file mode 100644
index 0000000..bcb5304
--- /dev/null
+++ b/units/en/unit7/multi-agent-setting.mdx
@@ -0,0 +1 @@
+# Designing Multi-Agents systems

From 1e1f44244af9dd9ba0d354db22f22fd580a45ecf Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Mon, 30 Jan 2023 22:13:16 +0100
Subject: [PATCH 02/29] Add multi agent setting part

---
 units/en/_toctree.yml                  |  2 ++
 units/en/unit7/multi-agent-setting.mdx | 45 ++++++++++++++++++++++++++
 units/en/unit7/self-play               |  1 +
 3 files changed, 48 insertions(+)
 create mode 100644 units/en/unit7/self-play

diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 4f34685..63950e8 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -170,6 +170,8 @@
     title: An introduction to Multi-Agents Reinforcement Learning (MARL)
   - local: unit7/multi-agent-setting
     title: Designing Multi-Agents systems
+  - local: unit7/self-play
+    title: Designing Multi-Agents systems
 - title: What's next? New Units Publishing Schedule
   sections:
   - local: communication/publishing-schedule

diff --git a/units/en/unit7/multi-agent-setting.mdx b/units/en/unit7/multi-agent-setting.mdx
index bcb5304..4bcfa2b 100644
--- a/units/en/unit7/multi-agent-setting.mdx
+++ b/units/en/unit7/multi-agent-setting.mdx
@@ -1 +1,46 @@
 # Designing Multi-Agents systems

For this section, you're going to watch this excellent introduction to multi-agents made by Brian Douglas.

In this video, Brian talked about how to design multi-agent systems. In particular, he took the example of vacuum cleaners and asked: how can they cooperate with each other?

To design this Multi-Agent Reinforcement Learning (MARL) system, we have two solutions.

## Decentralized system

[ADD illustration decentralized approach]

In decentralized learning, each agent is trained independently from the others.
In the given example, each vacuum learns to clean as much area as it can, without caring about what the other vacuums (agents) are doing.

The benefit is that, since no information is shared between agents, these vacuums can be designed and trained like we train single agents.

The idea here is that our training agent will consider the other agents as part of the environment dynamics, not as agents.

However, the big drawback of this technique is that it makes the environment non-stationary, since the underlying Markov decision process changes over time as the other agents are also interacting in the environment.
And this is problematic for many Reinforcement Learning algorithms, which can't reach a global optimum in a non-stationary environment.

## Centralized approach

[ADD illustration centralized approach]

In this architecture, we have a high-level process that collects the agents' experiences: the experience buffer. And we'll use these experiences to learn a common policy.

For instance, in the vacuum cleaner example, the observation will be:
- The coverage map of the vacuums.
- The position of all the vacuums.

We use that collective experience to train a policy that will move all three robots in the most beneficial way as a whole. So each robot is learning from the common experience.
And we have a stationary environment, since all the agents are treated as a larger entity and they know the changes of the other agents' policy (since it's the same as theirs).

If we recap:
- In the decentralized approach, we **treat all agents independently without considering the existence of the other agents.**
    1. In this case, all agents consider the other agents as part of the environment.
    2. **It's a non-stationary environment condition** ⇒ so there is no guarantee of convergence.

- In the centralized approach:
    1. A single policy is learned from all the agents.
    2. It takes as input the present state of the environment, and the policy outputs joint actions.
    3. The reward is global.
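To make the contrast concrete, here is a minimal Python sketch of the two designs with the vacuum example. Everything in it (the `Policy` class, the step functions, the dummy update) is an illustration we made up for this section, not MLAgents code; it only shows where the observations and updates live in each approach.

```python
import random

class Policy:
    """Dummy policy: picks a random action and stores experience."""
    def __init__(self, n_actions=4):
        self.n_actions = n_actions
        self.buffer = []                     # experience storage

    def act(self, obs):
        return random.randrange(self.n_actions)

    def update(self, experience):
        self.buffer.append(experience)       # a real agent would do a gradient step here

def decentralized_step(policies, observations):
    # One policy per agent: each agent learns alone and treats the
    # other vacuums as part of the (non-stationary) environment.
    actions = [p.act(o) for p, o in zip(policies, observations)]
    for p, o, a in zip(policies, observations, actions):
        p.update((o, a))                     # independent update per agent
    return actions

def centralized_step(policy, observations):
    # One shared policy: it sees the joint observation (e.g. coverage map
    # + all positions) and makes a single update from the common experience.
    joint_obs = tuple(observations)
    joint_action = [policy.act(joint_obs) for _ in observations]
    policy.update((joint_obs, joint_action))
    return joint_action

vacuums = [Policy() for _ in range(3)]
decentralized_step(vacuums, [0, 1, 2])       # 3 independent actions and updates
centralized_step(Policy(), [0, 1, 2])        # 1 joint action, 1 shared update
```

Note how, in the centralized case, the single shared policy receives one experience tuple per step, whereas in the decentralized case each agent fills its own buffer.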
diff --git a/units/en/unit7/self-play b/units/en/unit7/self-play new file mode 100644 index 0000000..0ab938a --- /dev/null +++ b/units/en/unit7/self-play @@ -0,0 +1 @@ +# Self-Play From f81f005c72995dc5783ffb4f54f6c5a8db939fff Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Mon, 30 Jan 2023 23:36:13 +0100 Subject: [PATCH 03/29] Add self play --- units/en/unit7/additional-readings.mdx | 3 +++ units/en/unit7/introduction.mdx | 4 +-- units/en/unit7/self-play | 35 +++++++++++++++++++++++++- 3 files changed, 39 insertions(+), 3 deletions(-) create mode 100644 units/en/unit7/additional-readings.mdx diff --git a/units/en/unit7/additional-readings.mdx b/units/en/unit7/additional-readings.mdx new file mode 100644 index 0000000..71aba31 --- /dev/null +++ b/units/en/unit7/additional-readings.mdx @@ -0,0 +1,3 @@ +# Additional Readings [[additional-readings]] + +## Self-Play diff --git a/units/en/unit7/introduction.mdx b/units/en/unit7/introduction.mdx index 56ed60f..677409c 100644 --- a/units/en/unit7/introduction.mdx +++ b/units/en/unit7/introduction.mdx @@ -19,8 +19,8 @@ And you’re going to participate in AI vs. AI challenges where your trained age
”SoccerTwos”/ -
-
This environment was made by the Unity MLAgents Team
+
+

So let’s get started!

diff --git a/units/en/unit7/self-play b/units/en/unit7/self-play
index 0ab938a..3d302ba 100644
--- a/units/en/unit7/self-play
+++ b/units/en/unit7/self-play
@@ -1 +1,34 @@
-# Self-Play
+# Self-Play: a classic technique to train competitive agents in adversarial games
+
+Now that we've studied the basics of multi-agents, we're ready to go deeper. As mentioned in the introduction, we're going to train agents in an adversarial game: a 2vs2 soccer game.
+
+<figure>
+”SoccerTwos”/ + +
This environment was made by the Unity MLAgents Team
+ +
+

## What is Self-Play?

Correctly training agents in an adversarial game can be **quite complex**.

On the one hand, we need to find a well-trained opponent to play against your training agent. On the other hand, even if you have a very well-trained opponent, it's not a good solution: how is your agent going to improve its policy when the opponent is far too strong?

Think of a child that has just started to learn soccer: playing against a very good soccer player will be useless, since it will be too hard to win or even to get the ball from time to time. So the child will continuously lose without having time to learn a good policy.

The best solution would be to have an opponent that is on the same level as the agent and that will upgrade its level as the agent upgrades its own. Because if the opponent is too strong, we'll learn nothing; and if it is too weak, we'll overlearn useless behaviors that won't work against a stronger opponent later.

This solution is called *self-play*. In self-play, the agent uses former copies of itself (of its policy) as an opponent. This way, the agent will play against an agent of the same level (challenging but not too much), have opportunities to gradually improve its policy, and then, as it becomes better, update its opponent. It's a way to bootstrap an opponent and have a gradual increase in opponent complexity.

It's the same way humans learn in competition:

- We start training against an opponent of a similar level.
- Then we learn from it, and when we have acquired some skills, we can move further with stronger opponents.

We do the same with self-play:

- We start with a copy of our agent as an opponent; this way, the opponent is on a similar level.
- We learn from it, and when we have acquired some skills, we update our opponent with a more recent copy of our training policy.
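This loop can be sketched in a few lines of Python. It's an illustrative toy (the `Agent` class, the skill number, and the match simulation are all made up for this section, not how MLAgents implements it), but it shows the two key moves: start against a copy of ourselves, and periodically refresh the opponent with a more recent copy.

```python
import copy
import random

class Agent:
    def __init__(self):
        self.skill = 0.0

    def play_match(self, opponent):
        # Toy match: the higher-skilled agent wins more often.
        return random.random() < 0.5 + 0.1 * (self.skill - opponent.skill)

    def learn(self):
        self.skill += 0.01                    # stand-in for a real training update

learner = Agent()
opponent = copy.deepcopy(learner)             # start against a copy of ourselves

for step in range(1, 1001):
    learner.play_match(opponent)              # collect experience vs. a same-level foe
    learner.learn()
    if step % 200 == 0:                       # periodically refresh the opponent with
        opponent = copy.deepcopy(learner)     # a more recent copy of our training policy
```

The refresh period here plays the same role as the opponent-swap hyperparameters of MLAgents' self-play implementation: too small and the opponent is always as strong as us; too large and we overfit to an outdated opponent.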
The theory behind self-play is not new. It was already used by Arthur Samuel’s checkers player system in the fifties, and by Gerald Tesauro’s TD-Gammon in 1995. If you want to learn more about the history of self-play, check out this very good blog post by Andrew Cohen: [https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)

From c6e3739d206a6e9da09a40ed802c56e836981498 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Mon, 30 Jan 2023 23:58:37 +0100
Subject: [PATCH 04/29] Add the end of self play

---
 units/en/unit7/self-play | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/units/en/unit7/self-play b/units/en/unit7/self-play
index 3d302ba..2285400 100644
--- a/units/en/unit7/self-play
+++ b/units/en/unit7/self-play
@@ -32,3 +32,21 @@ We do the same with self-play:
 - We learn from it, and when we have acquired some skills, we update our opponent with a more recent copy of our training policy.
 
 The theory behind self-play is not new. It was already used by Arthur Samuel’s checkers player system in the fifties, and by Gerald Tesauro’s TD-Gammon in 1995. If you want to learn more about the history of self-play, check out this very good blog post by Andrew Cohen: [https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
+
+## Self-Play in MLAgents
+
+Self-Play is integrated into the MLAgents library and is managed by multiple hyperparameters that we’re going to study. But the main focus, as explained in the documentation, is the tradeoff between the skill level and generality of the final policy, and the stability of learning.
+
+Training against a set of slowly changing or unchanging adversaries with low diversity **results in more stable training.
But there is a risk of overfitting if the change is too slow.**

We then need to control:

- How often we change opponents, with the `swap_steps` and `team_change` parameters.
- The number of opponents saved, with the `window` parameter. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors, since it will contain policies from earlier in the training run.
- The probability of playing against the current self vs. an opponent sampled from the pool, with `play_against_latest_model_ratio`. A larger value of `play_against_latest_model_ratio` indicates that an agent will play against its current self more often.
- The number of training steps before saving a new opponent, with the `save_steps` parameter. A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles, since the policy receives more training.

To get more details about these hyperparameters, you definitely need to check out this part of the documentation: [https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play)

From 95880600632689371c58a8ea321b76aff5e601dc Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 31 Jan 2023 07:29:06 +0100
Subject: [PATCH 05/29] Update _toctree.yml

---
 units/en/_toctree.yml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 63950e8..a979b50 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -172,6 +172,8 @@
     title: Designing Multi-Agents systems
   - local: unit7/self-play
     title: Designing Multi-Agents systems
+  - local: unit7/additional-readings
+    title: Additional Readings
 - title: What's next?
New Units Publishing Schedule
   sections:
   - local: communication/publishing-schedule

From 8c3395fb0402d31f8dd9644fd8152d7f8013c1f0 Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 31 Jan 2023 07:29:23 +0100
Subject: [PATCH 06/29] Rename self-play to self-play.mdx

---
 units/en/unit7/{self-play => self-play.mdx} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename units/en/unit7/{self-play => self-play.mdx} (100%)

diff --git a/units/en/unit7/self-play b/units/en/unit7/self-play.mdx
similarity index 100%
rename from units/en/unit7/self-play
rename to units/en/unit7/self-play.mdx

From 9f557d08d0b6cc3d91ced8c5653478080a6e74d9 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Tue, 31 Jan 2023 08:20:11 +0100
Subject: [PATCH 07/29] Add ELO

---
 units/en/unit7/self-play.mdx | 80 ++++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/units/en/unit7/self-play.mdx b/units/en/unit7/self-play.mdx
index 2285400..307354e 100644
--- a/units/en/unit7/self-play.mdx
+++ b/units/en/unit7/self-play.mdx
@@ -50,3 +50,83 @@ We then need to control:
 will yield a set of opponents that cover a wider range of skill levels and possibly play styles, since the policy receives more training.
 
 To get more details about these hyperparameters, you definitely need to check out this part of the documentation: [https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play)
+
+
+## The ELO Score to evaluate our agent
+### What is the ELO Score?
+In adversarial games, tracking the **cumulative reward is not always a meaningful metric to track the learning progress**, because this metric **depends only on the skill of the opponent.**
+
+Instead, we’re using an ***ELO rating system*** (named after Arpad Elo) that calculates the **relative skill level** between 2 players from a given population in a zero-sum game.
+

In a zero-sum game, one agent’s win is the other agent’s loss. It’s a mathematical representation of a situation in which each participant’s gain or loss of utility **is exactly balanced by the gain or loss of the utility of the other participants.** We talk about zero-sum games because the sum of utility is equal to zero.

During training, this ELO score (starting at a certain value: in general, 1200) can decrease at the beginning, but should then increase progressively.

The Elo system is **inferred from the wins, losses, and draws against other players.** It means that player ratings depend **on the ratings of their opponents and the results scored against them.**

Elo defines a score that represents the relative skill of a player in a zero-sum game. **We say relative because it depends on the performance of opponents.**

The central idea is to think of the performance of a player **as a random variable that is normally distributed.**

The difference in rating between 2 players serves as **the predictor of the outcome of a match.** If a player wins, but its probability of winning was already high, it will only win a few points from its opponent, since this means it was expected to be much stronger.

After every game:

- The winning player takes **points from the losing one.**
- The number of points is determined **by the difference in the 2 players’ ratings (hence relative).**
    - If the higher-rated player wins → few points will be taken from the lower-rated player.
    - If the lower-rated player wins → a lot of points will be taken from the higher-rated player.
    - If it’s a draw → the lower-rated player gains a few points from the higher-rated one.
So if A and B have ratings Ra and Rb, then the **expected scores are** given by:

”ELO Score”/

Then, at the end of the game, we need to update the player’s actual Elo score. We use a linear adjustment **proportional to the amount by which the player over-performed or under-performed.**

We also define a maximum adjustment rating per game: the K-factor.

- K=16 for masters.
- K=32 for weaker players.

If Player A has an expected score of Ea but actually scored Sa points, then the player’s rating is updated using the formula:

”ELO Score”/

### Example

If we take an example:

Player A has a rating of 2600

Player B has a rating of 2300

- We first calculate the expected score:

\\(E_{A} = \frac{1}{1+10^{(2300-2600)/400}} = 0.849 \\)
\\(E_{B} = \frac{1}{1+10^{(2600-2300)/400}} = 0.151 \\)

- If the organizers determined that K=16 and A wins, the new rating would be:

\\(ELO_A = 2600 + 16*(1-0.849) = 2602 \\)
\\(ELO_B = 2300 + 16*(0-0.151) = 2298 \\)

- If the organizers determined that K=16 and B wins, the new rating would be:

\\(ELO_A = 2600 + 16*(0-0.849) = 2586 \\)
\\(ELO_B = 2300 + 16*(1-0.151) = 2314 \\)


### The Advantages

Using the ELO score has multiple advantages:

- Points are **always balanced** (more points are exchanged when there is an unexpected outcome, but the sum is always the same).
- It is a **self-correcting system**: if a player wins against a much weaker player, they will not win a lot of points.
- It **works with team games**: we calculate the average rating for each team and use it in Elo.

### The Disadvantages

- ELO **does not take into account the individual contribution** of each player in a team.
- Rating deflation: **a good rating requires skill over time to keep the same rating**.
- **Ratings can’t be compared across different eras**.
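The whole update fits in a few lines of code. Here is a sketch that reproduces the example above (the function names are ours, for illustration; this is not taken from any library):

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_rating(rating, expected, scored, k=16):
    """Linear adjustment, proportional to over/under-performance."""
    return rating + k * (scored - expected)

e_a = expected_score(2600, 2300)           # ≈ 0.849
e_b = expected_score(2300, 2600)           # ≈ 0.151

# A wins (scores 1), B loses (scores 0), K=16:
print(round(update_rating(2600, e_a, 1)))  # 2602
print(round(update_rating(2300, e_b, 0)))  # 2298
```

Note that `e_a + e_b == 1`: what one player is expected to gain, the other is expected to lose, which is exactly the zero-sum property discussed earlier.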
From d1a7841c3022872fd2e68bfaefbde728bb97d3b8 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Tue, 31 Jan 2023 08:58:39 +0100
Subject: [PATCH 08/29] Add first elements of hands-on

---
 units/en/unit7/hands-on.mdx | 74 +++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100644 units/en/unit7/hands-on.mdx

diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx
new file mode 100644
index 0000000..27b3255
--- /dev/null
+++ b/units/en/unit7/hands-on.mdx
@@ -0,0 +1,74 @@
+# Hands-on
+
+Now that you've learned the basics of multi-agents, you're ready to train your first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
+
+And you’re going to participate in AI vs. AI challenges where your trained agent will compete against other classmates’ agents every day and be ranked on a new leaderboard.
+
+To validate this hands-on for the certification process, you just need to push your trained model. There is no minimal result to attain to validate it.
+
+For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
+
+This hands-on will be different, since to get correct results you need to train your agents for 4h to 5h. And given the risk of timeout in Colab, we advise you to train on your own computer.
+
+Let's get started!
+
+## What is AI vs. AI?
+
+AI vs. AI is a tool we developed at Hugging Face.
+It's a matchmaking algorithm where your pushed models are ranked by playing against other models.
+
+AI vs. AI is three tools:
+
+- A *matchmaking process* defining which model plays against which, and running the matches using a background task in the Space.
- A *leaderboard* getting the match history results and displaying the models' ELO ratings: [ADD LEADERBOARD]
- A *Space demo* to visualize your agents playing against others: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos

We're going to write a blog post to explain this AI vs. AI tool in detail, but to give you the big picture it works this way:
- Every 4h, our algorithm fetches all the available models for a given environment.
- It creates a queue of matches with the matchmaking algorithm.
- It simulates each match in a Unity headless process and gathers the match result (1 if the first model won, 0.5 if it’s a draw, 0 if the second model won) in a Dataset.
- Then, when all matches from the match queue are done, we update the ELO score for each model and update the leaderboard.

### Competition Rules

This first AI vs. AI competition is an experiment: the goal is to improve the tool in the future with your feedback. So some **interruptions can happen during the challenge**. But don't worry:
**all the results are saved in a dataset, so we can always restart the calculation correctly without losing information**.

In order for your model to be correctly evaluated against others, you need to follow these rules:

1. You can't change the observation space or action space. By doing that, your model will not work in our evaluation.
2. You can't use a custom trainer for now; you need to use Unity MLAgents.
3. We provide executables to train your agents. You can also use the Unity Editor if you prefer, **but in order to avoid bugs, we advise you to use our executables**.

What will make the difference during this challenge are **the hyperparameters you choose**.
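As a rough sketch of that matchmaking-and-rating loop, here is what one round could look like in Python. The naive all-pairs pairing, the rating numbers, and the function names are our own illustration (not the actual Hugging Face backend), but the ELO bookkeeping is the same idea:

```python
import itertools
import random

def expected(ra, rb):
    """Expected score of a player rated ra against a player rated rb."""
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def run_round(ratings, play_match, k=16):
    """ratings: model name -> ELO. play_match(a, b) returns 1, 0.5 or 0 for a."""
    for a, b in itertools.combinations(ratings, 2):   # naive match queue
        result = play_match(a, b)
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += k * (result - e_a)              # winner takes points...
        ratings[b] += k * ((1 - result) - (1 - e_a))  # ...from the loser
    return ratings

ratings = {"model_a": 1200, "model_b": 1200, "model_c": 1200}
run_round(ratings, lambda a, b: random.choice([0, 0.5, 1]))
print(round(sum(ratings.values()), 6))  # 3600.0: points are exchanged, never created
```

Whatever the match results, the total number of points stays constant: that's the "points are always balanced" property of the ELO system described earlier.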
+ +# Step 0: Install MLAgents and download the correct executable + + + +# Step 1: Understand the environment + + + + + + + + + + + + + + +- EXE From f22b8160f3c08b8db2ef1eadf9675df35a8d1510 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 08:59:41 +0100 Subject: [PATCH 09/29] Update toctree --- units/en/_toctree.yml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index a979b50..f24a28d 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -171,7 +171,9 @@ - local: unit7/multi-agent-setting title: Designing Multi-Agents systems - local: unit7/self-play - title: Designing Multi-Agents systems + title: Self-Play + - local: unit7/hands-on + title: Let's train our soccer team to beat your classmates' teams (AI vs. AI) - local: unit7/additional-readings title: Additional Readings - title: What's next? New Units Publishing Schedule From 85f5ba0f59529931fc4ffbe3e9cc081ae600da94 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 09:35:20 +0100 Subject: [PATCH 10/29] Add additional readings --- units/en/unit7/additional-readings.mdx | 16 +++++++- units/en/unit7/hands-on.mdx | 52 ++++++++++++++++++++++++-- units/en/unit7/self-play.mdx | 3 ++ 3 files changed, 67 insertions(+), 4 deletions(-) diff --git a/units/en/unit7/additional-readings.mdx b/units/en/unit7/additional-readings.mdx index 71aba31..6cb3239 100644 --- a/units/en/unit7/additional-readings.mdx +++ b/units/en/unit7/additional-readings.mdx @@ -1,3 +1,17 @@ # Additional Readings [[additional-readings]] -## Self-Play +## An introduction to multi-agents + +- [Multi-agent reinforcement learning: An overview](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf) +- [Multiagent Reinforcement Learning, Marc Lanctot](https://rlss.inria.fr/files/2019/07/RLSS_Multiagent.pdf) +- [Example of a multi-agent 
environment](https://www.mathworks.com/help/reinforcement-learning/ug/train-3-agents-for-area-coverage.html?s_eid=PSM_15028) +- [A list of different multi-agent environments](https://agents.inf.ed.ac.uk/blog/multiagent-learning-environments/) +- [Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents](https://bit.ly/3nVK7My) +- [Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning](https://bit.ly/3v7LxaT) + +## Self-Play and MA-POCA + +- [Self Play Theory and with MLAgents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents) +- [Training complex behavior with MLAgents](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors) +- [MLAgents plays dodgeball](https://blog.unity.com/technology/ml-agents-plays-dodgeball) +- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf) diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index 27b3255..31db82b 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -29,7 +29,7 @@ It's a matchmaking algorithm where your pushed models are ranked by playing aga AI vs. AI is three tools: - A *matchmaking process* defining which model against which model and running the model fights using a background task in the Space. 
- A *leaderboard* getting the match history results and displaying the models' ELO ratings: https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos
- A *Space demo* to visualize your agents playing against others: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos

# Step 0: Install MLAgents and download the correct executable

You need to install a specific version of MLAgents

## Step 1: Understand the environment

The environment is called ``; it was made by the Unity MLAgents Team.

The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.

Pyramids Environment

## The reward function

The reward function is:

Pyramids Environment

In terms of code, it looks like this:
Pyramids Reward

To train this new agent that seeks the button and then the Pyramid to destroy, we’ll use a combination of two types of rewards:

- The *extrinsic one*, given by the environment (illustration above).
- But also an *intrinsic* one called **curiosity**. This second one will **push our agent to be curious or, in other terms, to better explore its environment**.

If you want to know more about curiosity, the next section (optional) will explain the basics.

## The observation space

In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls).
+ +Pyramids obs code + + +## The action space + +The action space is **discrete** with four possible actions: + +Pyramids Environment @@ -67,6 +111,8 @@ What will make the difference during this challenge are **the hyperparameters yo +## MA POCA +https://arxiv.org/pdf/2111.05992.pdf diff --git a/units/en/unit7/self-play.mdx b/units/en/unit7/self-play.mdx index 307354e..f553432 100644 --- a/units/en/unit7/self-play.mdx +++ b/units/en/unit7/self-play.mdx @@ -104,16 +104,19 @@ Player B has a rating of 2300 - We first calculate the expected score: \\(E_{A} = \frac{1}{1+10^{(2300-2600)/400}} = 0.849 \\) + \\(E_{B} = \frac{1}{1+10^{(2600-2300)/400}} = 0.151 \\) - If the organizers determined that K=16 and A wins, the new rating would be: \\(ELO_A = 2600 + 16*(1-0.849) = 2602 \\) + \\(ELO_B = 2300 + 16*(1-0.151) = 2298 \\) - If the organizers determined that K=16 and B wins, the new rating would be: \\(ELO_A = 2600 + 16*(0-0.849) = 2586 \\) + \\(ELO_B = 2300 + 16 *(1-0.151) = 2314 \\) From e4f039d2cc1abfd53c0bceaecb01bec7fffc3b4c Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 15:36:25 +0100 Subject: [PATCH 11/29] Add Hands on --- units/en/unit7/hands-on.mdx | 284 +++++++++++++++++++++++++++++------- 1 file changed, 230 insertions(+), 54 deletions(-) diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index 31db82b..0796a3a 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -1,120 +1,296 @@ # Hands-on - - - Now that you learned the bases of multi-agents. You're ready to train our first agents in a multi-agents system: **a 2vs2 soccer team that needs to beat the opponent team**. -And you’re going to participate in AI vs. AI challenges where your trained agent will compete against other classmates’ agents every day and be ranked on a new leaderboard. +And you’re going to participate in AI vs. 
AI challenges where your trained agent will compete against other classmates’ **agents every day and be ranked on a new leaderboard.**

-To validate this hands-on for the certification process, you just need to push your trained model. There are no minimal result to attain to validate it.
+To validate this hands-on for the certification process, you just need to push a trained model. There **is no minimal result to attain to validate it.**

-For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
+For more information about the certification process, check this section 👉 [https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process)

-This hands-on will be different, since to get correct results you need to train your agents for 4h to 5h. And given the risk of timeout in colab we advise you to train on your own computer.
+This hands-on will be different, since to get correct results you need to train your agents for 4 to 8 hours. And given the risk of timeout in Colab, we advise you to train on your own computer.

You don’t need a super computer, a simple laptop is good enough for this exercise.

Let's get started!

+## What is AI vs. AI?

-## What is AI vs. AI ?

-AI vs. AI is a tool we developed at Hugging Face.
-It's a matchmaking algorithm where your pushed models are ranked by playing against other models.
+AI vs. AI is an open-source tool we developed at Hugging Face to pit agents on the Hub against one another in a multi-agent setting. These models are then ranked on a leaderboard.

+The idea of this tool is to have a powerful evaluation tool: **by evaluating your agent against many others, you’ll get a good idea of the quality of your policy.**

-AI vs. AI is three tools:
+More precisely, AI vs.
AI is three tools: -- A *matchmaking process* defining which model against which model and running the model fights using a background task in the Space. -- A *leaderboard* getting the match history results and displaying the models ELO ratings: https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos -- A *Space demo* to visualize your agents playing against others : https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos +- A *matchmaking process* defining the matches (which model against which) and running the model fights using a background task in the Space. +- A *leaderboard* getting the match history results and displaying the models’ ELO ratings: [https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos) +- A *Space demo* to visualize your agents playing against others: [https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos](https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos) +We're going to write a blog post to explain this AI vs. AI tool in detail, but to give you the big picture it works this way: -We're going to write a blogpost to explain this AI vs. AI tool in detail, but to give you the big picture it works this way: -- Every 4h, our algorithm fetch all the available models for a given environment. -- It creates a queue of matches with the matchmaking algorithm. -- Simulate the match in a Unity headless process and gather the match result (1 if first model won, 0.5 if it’s a draw, 0 if the second model won) in a Dataset. -- Then, when all matches from the matches queue are done, we update the elo score for each model and update the leaderboard. 
+
+- Every 4h, our algorithm **fetches all the available models for a given environment (in our case ML-Agents-SoccerTwos).**
+- It creates a **queue of matches with the matchmaking algorithm.**
+- We simulate the match in a Unity headless process and **gather the match result** (1 if the first model won, 0.5 if it’s a draw, 0 if the second model won) in a Dataset.
+- Then, when all matches from the match queue are done, **we update the ELO score for each model and update the leaderboard.**

### Competition Rules

This first AI vs. AI competition **is an experiment**: the goal is to improve the tool in the future with your feedback. So some **interruptions can happen during the challenge**. But don't worry:
**all the results are saved in a dataset so we can always restart the calculation correctly without losing information**.

-In order that your model get correctly evaluated against others you need to follow these rules:
+In order for your model to get correctly evaluated against others, you need to follow these rules:

-1. You can't change the observation space or action space. By doing that your model will not work in our evaluation.
-2. You can't use a custom trainer for now, you need to use Unity MLAgents.
+1. **You can't change the observation space or action space of the agent.** If you do, your model will not work during evaluation.
+2. You **can't use a custom trainer for now**; you need to use the Unity MLAgents ones.
3. We provide executables to train your agents. You can also use the Unity Editor if you prefer, **but to avoid bugs we advise you to use our executables**.
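The ELO update performed by the matchmaking process described above can be sketched in Python. This is a minimal sketch of the standard ELO formulas covered in the self-play section; K=16 is the example value from that section, not necessarily what the tool uses:

```python
def expected_score(r_a, r_b):
    """Expected score of a player rated r_a against a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=16):
    """Return updated ratings; score_a is 1 (first model won), 0.5 (draw), 0 (lost)."""
    new_a = r_a + k * (score_a - expected_score(r_a, r_b))
    new_b = r_b + k * ((1 - score_a) - expected_score(r_b, r_a))
    return new_a, new_b

# Worked example from the self-play section: A rated 2600 beats B rated 2300
print(update_elo(2600, 2300, 1))  # ≈ (2602.4, 2297.6)
```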
What will make the difference during this challenge are **the hyperparameters you choose**.

+The AI vs. AI algorithm will run until April 30th, 2023.

+We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).

+### Exchange with your classmates, share advice and ask questions on Discord

+- We created a new channel called `ai-vs-ai-challenge` to exchange advice and ask questions.
+- If you haven’t joined the Discord server yet, you can [join here](https://discord.gg/ydHrjt3WP5)

## Step 0: Install MLAgents and download the correct executable

+⚠ We're going to use an experimental version of ML-Agents where you can push Unity ML-Agents models to the Hub and load them from the Hub, so **you need to install this exact version.**

+⚠ ⚠ ⚠ We’re not going to use the same version as in Unit 5: Introduction to ML-Agents ⚠ ⚠ ⚠

+We advise you to use [conda](https://docs.conda.io/en/latest/) as a package manager and create a new environment.

+With conda, we create a new environment called rl:

```
conda create --name rl python=3.8
conda activate rl
```

+To train our agents correctly and push to the Hub, we need to install an experimental version of ML-Agents (the `aivsai` branch of the Hugging Face ML-Agents fork):

```
git clone --branch aivsai https://github.com/huggingface/ml-agents/
```

+When the cloning is done (it takes about 2.5 GB), we go inside the repository and install the package:

```
cd ml-agents
pip install -e ./ml-agents-envs
pip install -e ./ml-agents
```

+We also need to install PyTorch with:

```
pip install torch
```

+Now that it’s installed, we need to add the environment training executable.
Based on your operating system, you need to download one of them, unzip it, and place it in a new folder inside `ml-agents` that you call `training-envs-executables`.

+At the end, your executable should be in `ml-agents/training-envs-executables/SoccerTwos`.

+Windows: [https://drive.google.com/file/d/1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP/view?usp=sharing](https://drive.google.com/file/d/1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP/view?usp=sharing)

+Linux (Ubuntu): [https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing](https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing)

+Mac: [ADD Berangere]

## Step 1: Understand the environment

-The environment is called `` it was made by the Unity MLAgents Team.
+The environment is called `SoccerTwos`, and it was made by the Unity MLAgents Team. You can find its documentation here: [https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos)

-The goal in this
+The goal in this environment **is to get the ball into the opponent's goal while preventing the ball from entering its own goal.**
+
+”SoccerTwos”/ -Pyramids Environment +
This environment was made by the Unity MLAgents Team
+
-## The reward function +### The reward function The reward function is: -Pyramids Environment +”SoccerTwos -In terms of code, it looks like this -Pyramids Reward +### The observation space -To train this new agent that seeks that button and then the Pyramid to destroy, we’ll use a combination of two types of rewards: +The observation space is composed vector size of 336: -- The *extrinsic one* given by the environment (illustration above). -- But also an *intrinsic* one called **curiosity**. This second will **push our agent to be curious, or in other terms, to better explore its environment**. +- 11 ray-casts forward distributed over 120 degrees (264 state dimensions) +- 3 ray-casts backward distributed over 90 degrees (72 state dimensions) +- Both of these ray-casts can detect 6 objects: + - Ball + - Blue Goal + - Purple Goal + - Wall + - Blue Agent + - Purple Agent -If you want to know more about curiosity, the next section (optional) will explain the basics. +### The action space -## The observation space +The action space is three discrete branches: -In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls.) +”SoccerTwos - +## Step 2: Understand MA-POCA -We also use a **boolean variable indicating the switch state** (did we turn on or off the switch to spawn the Pyramid) and a vector that **contains the agent’s speed**. +[https://arxiv.org/pdf/2111.05992.pdf](https://arxiv.org/pdf/2111.05992.pdf) -Pyramids obs code +## Step 3: Define the Config +We already learned in (Unit 5)[https://huggingface.co/deep-rl-course/unit5/introduction] that in ML-Agents, you define **the training hyperparameters into config.yaml files.** -## The action space +There are multiple hyperparameters. 
To know them better, you should check the explanation for each of them in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)**

The config file we’re going to use here is in `./config/poca/SoccerTwos.yaml`, and it looks like this:

```yaml
behaviors:
  SoccerTwos:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 50000000
    time_horizon: 1000
    summary_freq: 10000
    self_play:
      save_steps: 50000
      team_change: 200000
      swap_steps: 2000
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
```

Compared to Pyramids or SnowballTarget, we have new hyperparameters in the self-play part. How you modify them can be critical in getting good results.

The advice I can give you here is to check the explanation and recommended value for each parameter (especially the self-play ones) in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md).**

Now that you’ve modified your config file, you’re ready to train your agents.

## Step 4: Start the training

To train the agents, we need to **launch mlagents-learn and select the executable containing the environment.**

We define four parameters:

1. `mlagents-learn <config>`: the path to the hyperparameter config file.
2. `--env`: where the environment executable is.
3. `--run-id`: the name you want to give to your training run id.
4. `--no-graphics`: to not launch the visualization during the training.
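Before launching a run, it can help to sanity-check the self-play schedule from `SoccerTwos.yaml` with quick arithmetic. The interpretation of each field below follows the ML-Agents documentation and should be treated as an assumption:

```python
# Quick sanity check of the self-play schedule in SoccerTwos.yaml
max_steps = 50_000_000
save_steps = 50_000    # steps between opponent snapshot saves
team_change = 200_000  # steps between learning-team switches
swap_steps = 2_000     # steps between opponent swaps

print(team_change // swap_steps)  # opponent swaps per team change: 100
print(team_change // save_steps)  # snapshots saved per team change: 4
print(max_steps // team_change)   # team changes over a full run: 250
```

If you tweak `swap_steps` or `team_change`, re-running this arithmetic tells you how varied the opponent pool will be between team switches.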
+For 5M timesteps (which is the recommended value), it will take from 5 to 8 hours of training. You can continue to use your computer in the meantime, but my advice is to deactivate your computer's standby mode to prevent the training from being stopped.

Depending on the executable you use (Windows, Ubuntu, Mac), the training command will look like this (your executable path can be different, so don’t hesitate to check before running):

`mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos.exe --run-id="SoccerTwos" --no-graphics`

The executable contains 8 copies of SoccerTwos.

⚠️ It’s normal if you don’t see a big increase of the ELO score (and even a decrease below 1200) before 2M timesteps, since your agents will spend most of their time moving randomly on the field before being able to score goals.

## Step 5: Push the agent to the Hugging Face Hub

Now that we've trained our agents, we’re **ready to push them to the Hub to be able to participate in the AI vs. AI challenge and visualize them playing in your browser 🔥.**

To be able to share your model with the community, there are three more steps to follow:

1️⃣ (If it’s not already done) create an account on HF ➡ **[https://huggingface.co/join](https://huggingface.co/join)**

2️⃣ Sign in and store your authentication token from the Hugging Face website.

Create a new token ([https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)) **with write role**

Create HF Token

Copy the token, run this, and paste the token:

```
huggingface-cli login
```

Then, we need to run `mlagents-push-to-hf`.

And we define four parameters:

1. `--run-id`: the name of the training run id.
2. `--local-dir`: where the agent was saved; it’s `results/<run_id>`, so in my case `results/SoccerTwos`.
3. `--repo-id`: the name of the Hugging Face repo you want to create or update.
It’s always `<your huggingface username>/<the repo name>`.
If the repo does not exist, **it will be created automatically.**
4. `--commit-message`: since HF repos are git repositories, you need to define a commit message.

In my case:

`mlagents-push-to-hf --run-id="SoccerTwos" --local-dir="./results/SoccerTwos" --repo-id="ThomasSimonini/poca-SoccerTwos" --commit-message="First Push"`

```
mlagents-push-to-hf --run-id= # Add your run id
                    --local-dir= # Your local dir
                    --repo-id= # Your repo id
                    --commit-message="First Push"
```

If everything worked, you should have this at the end of the process (but with a different URL 😆):

```
Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/poca-SoccerTwos
```

It's the link to your model. It contains a model card that explains how to use it, your Tensorboard, and your config file. **What's awesome is that it's a git repository, which means you can have different commits, update your repository with a new push, etc.**

## Step 6: Verify that your model is ready for the AI vs AI Challenge

Now that your model is pushed to the Hub, **it’s going to be added automatically to the AI vs AI Challenge model pool.** It can take a little bit of time before your model is added to the leaderboard, given we do a run of matches every 4h.

But in order for everything to work perfectly, you need to check:

1. That you have this tag in your model: ML-Agents-SoccerTwos. This is the tag we use to select models to be added to the challenge pool. To do that, go to your model and check the tags.

verify1.png

If it’s not the case, you just need to modify the README and add it.

verify2.png

2. That you have a SoccerTwos onnx file.

verify3.png

We strongly advise you to create a new model when you push to the Hub if you want to train it again or train a new version.

## Step 7: Visualize some matches in our demo

Now that your model is part of the AI vs AI Challenge, you can visualize how good it is compared to others.
+
+In order to do that, you just need to go to this demo:

Select your model as team blue (or team purple if you prefer) and another model as the opponent. The best way to compare your model is either against the one at the top of the leaderboard, or against the baseline model: [https://huggingface.co/unity/MLAgents-SoccerTwos](https://huggingface.co/unity/MLAgents-SoccerTwos)

The matches you see live are not used in the calculation of your results, but they are a good way to visualize how good your agent is.

And don't hesitate to share the best score your agent gets on Discord in the #rl-i-made-this channel 🔥

### Conclusion

That’s all for today. Congrats on finishing this tutorial!

The best way to learn is to practice and try stuff. Why not train another agent with a different configuration?

And don’t hesitate to check the [leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos) from time to time.

See you in Unit 8 🔥

## Keep Learning, Stay awesome 🤗
From bc7238c1bf107b2cd84396a661bf42d3a3857b99 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Tue, 31 Jan 2023 15:57:01 +0100
Subject: [PATCH 12/29] Finalize introduction and marl
---
 units/en/unit7/introduction-to-marl.mdx | 18 ++++++------
 units/en/unit7/introduction.mdx         | 39 ++++++++++++++++---------
 2 files changed, 34 insertions(+), 23 deletions(-)
diff --git a/units/en/unit7/introduction-to-marl.mdx b/units/en/unit7/introduction-to-marl.mdx
index f0a3734..423c3a0 100644
--- a/units/en/unit7/introduction-to-marl.mdx
+++ b/units/en/unit7/introduction-to-marl.mdx
@@ -2,10 +2,10 @@
 ## From single agent to multiple agents
-From the first unit, we learned to train agents in a single-agent system. Where our agent was alone in its environment: **it was not cooperating or collaborating with other agents**.
+In the first unit, we learned to train agents in a single-agent system, where our agent was alone in its environment: **it was not cooperating or collaborating with other agents**.
-”Patchwork”/ +Patchwork
A patchwork of all the environments you've trained your agents on since the beginning of the course
@@ -16,19 +16,19 @@ When we do Multi agents reinforcement learning (MARL), we are in a situation whe For instance, you can think of a warehouse where **multiple robots need to navigate to load and unload packages**.
-”Warehouse”/ +Warehouse
Image by upklyak on Freepik
Or a road with **several autonomous vehicles**.
-”Self +Self driving cars
Image by jcomp on Freepik
-In these examples, we have multiple agents interacting in the environment and with the other agents. This implies defining a multi-agent system. But first, let's understand the different types of multi-agent environments. +In these examples, we have **multiple agents interacting in the environment and with the other agents**. This implies defining a multi-agents system. But first, let's understand the different types of multi-agent environments. ## Different types of multi-agent environments @@ -42,16 +42,16 @@ For instance, in a warehouse, **robots must collaborate to load and unload the p For example, in a game of tennis, **each agent wants to beat the other agent**. -”Tennis”/ + - *Mixed of both adversarial and cooperative*: like in our SoccerTwos environment, two agents are part of a team (blue or purple): they need to cooperate with each other and beat the opponent team.
-”SoccerTwos”/ +SoccerTwos
-
This environment was made by the Unity MLAgents Team
+
This environment was made by the Unity MLAgents Team
-So now we can ask how we design these multi-agent systems. Said differently, how can we train agents in a multi-agent setting? +So now we can ask how we design these multi-agent systems. Said differently, **how can we train agents in a multi-agents setting** ? diff --git a/units/en/unit7/introduction.mdx b/units/en/unit7/introduction.mdx index 677409c..86c18bd 100644 --- a/units/en/unit7/introduction.mdx +++ b/units/en/unit7/introduction.mdx @@ -1,25 +1,36 @@ # Introduction [[introduction]] -”Thumbnail”/ +Thumbnail -Since the beginning of this course, we learned to train agents in a single-agent system. Where our agent was alone in its environment: it was not cooperating or collaborating with other agents. +Since the beginning of this course, we learned to train agents in a *single-agent system* where our agent was alone in its environment: it was **not cooperating or collaborating with other agents**. -Our different agents worked great, and the single-agent system is useful for many applications. +This worked great, and the single-agent system is useful for many applications. -But, as humans, we live in a multi-agent world. Our intelligence comes from interaction with other agents. And so, our goal is to create agents that can interact with other humans and other agents. - -Consequently, we must study how to train deep reinforcement learning agents in a multi-agent system to build robust agents that can adapt, collaborate, or compete. - -So today, we’re going to learn the basics of this fascinating topic of multi-agents reinforcement learning (MARL). - -And the most exciting part is that during this unit, you’re going to train your first agents in a multi-agents system: a 2vs2 soccer team that needs to beat the opponent team. - -And you’re going to participate in AI vs. AI challenges where your trained agent will compete against other classmates’ agents every day and be ranked on a new leaderboard.
-”SoccerTwos”/ -
This environment was made by the Unity MLAgents Team
+Patchwork + +
+ +A patchwork of all the environments you’ve trained your agents on since the beginning of the course + +
+ +But, as humans, **we live in a multi-agent world**. Our intelligence comes from interaction with other agents. And so, our **goal is to create agents that can interact with other humans and other agents**. + +Consequently, we must study how to train deep reinforcement learning agents in a *multi-agents system* to build robust agents that can adapt, collaborate, or compete. + +So today, we’re going to **learn the basics of this fascinating topic of multi-agents reinforcement learning (MARL)**. + +And the most exciting part is that during this unit, you’re going to train your first agents in a multi-agents system: **a 2vs2 soccer team that needs to beat the opponent team**. + +And you’re going to participate in **AI vs. AI challenge** where your trained agent will compete against other classmates’ agents every day and be ranked on a [new leaderboard](). + +
+”SoccerTwos”/ + +
This environment was made by the Unity MLAgents Team
From 917b2adabe35a10c1c6bf48a6bb4d3e8b7a5bdeb Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 16:02:26 +0100 Subject: [PATCH 13/29] Update bug --- units/en/unit7/introduction-to-marl.mdx | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/units/en/unit7/introduction-to-marl.mdx b/units/en/unit7/introduction-to-marl.mdx index 423c3a0..de74679 100644 --- a/units/en/unit7/introduction-to-marl.mdx +++ b/units/en/unit7/introduction-to-marl.mdx @@ -18,6 +18,7 @@ For instance, you can think of a warehouse where **multiple robots need to navig
Warehouse
Image by upklyak on Freepik
+
Or a road with **several autonomous vehicles**. @@ -48,10 +49,7 @@ For example, in a game of tennis, **each agent wants to beat the other agent**.
SoccerTwos - +
This environment was made by the Unity MLAgents Team
-
This environment was made by the Unity MLAgents Team
- - So now we can ask how we design these multi-agent systems. Said differently, **how can we train agents in a multi-agents setting** ? From 64766e07dd1e32a896a0026e0d4d7c0b4e98619d Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 16:07:12 +0100 Subject: [PATCH 14/29] Bug update 2 --- units/en/unit7/introduction-to-marl.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit7/introduction-to-marl.mdx b/units/en/unit7/introduction-to-marl.mdx index de74679..3aa7f3c 100644 --- a/units/en/unit7/introduction-to-marl.mdx +++ b/units/en/unit7/introduction-to-marl.mdx @@ -43,7 +43,7 @@ For instance, in a warehouse, **robots must collaborate to load and unload the p For example, in a game of tennis, **each agent wants to beat the other agent**. - +Tennis - *Mixed of both adversarial and cooperative*: like in our SoccerTwos environment, two agents are part of a team (blue or purple): they need to cooperate with each other and beat the opponent team. From 00974cc6b383384243d4dec284f3942cbb59ba09 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 16:11:32 +0100 Subject: [PATCH 15/29] Update --- units/en/unit7/introduction.mdx | 2 +- units/en/unit7/multi-agent-setting.mdx | 31 +++++++++++++------------- 2 files changed, 17 insertions(+), 16 deletions(-) diff --git a/units/en/unit7/introduction.mdx b/units/en/unit7/introduction.mdx index 86c18bd..ead04df 100644 --- a/units/en/unit7/introduction.mdx +++ b/units/en/unit7/introduction.mdx @@ -28,7 +28,7 @@ And the most exciting part is that during this unit, you’re going to train you And you’re going to participate in **AI vs. AI challenge** where your trained agent will compete against other classmates’ agents every day and be ranked on a [new leaderboard]().
-”SoccerTwos”/ +SoccerTwos
This environment was made by the Unity MLAgents Team
diff --git a/units/en/unit7/multi-agent-setting.mdx b/units/en/unit7/multi-agent-setting.mdx index 4bcfa2b..83f670d 100644 --- a/units/en/unit7/multi-agent-setting.mdx +++ b/units/en/unit7/multi-agent-setting.mdx @@ -5,42 +5,43 @@ For this section, you're going to watch this excellent introduction to multi-age -In this video, Brian talked about how to design multi-agents systems? Especially he took a vacuum cleaner example and asked how they can cooperate each other? +In this video, Brian talked about how to design multi-agents systems. Especially he took a vacuum cleaner multi-agents setting example and asked how they **can cooperate each other**? -To design this Multi-Agent Reinforcement Learning system (MARL), we have two solutions. +To design this multi-agents reinforcement learning system (MARL), we have two solutions. ## Decentralized system [ADD illustration decentralized approach] -In decentralized learning, each agent is trained independently from others. In the example given each vacuum learns to clean as much place it can without caring about what other vacuums (agents) are doing. +In decentralized learning, **each agent is trained independently from others**. In the example given each vacuum learns to clean as much place it can **without caring about what other vacuums (agents) are doing**. -The benefit is since no information is shared between agents, these vacuum can be designed and trained like we train single agents. +The benefit is **since no information is shared between agents, these vacuum can be designed and trained like we train single agents**. -The idea, here is that our training agent will consider other agents as part of the environment dynamics. Not as agents. +The idea, here is that **our training agent will consider other agents as part of the environment dynamics**. Not as agents. 
-However the big drawback of this technique, is that it will make the environment non-stationary since the underlying markov decision process changes over time since other agents are also interacting in the environment. -And this is problematic for many Reinforcement Learning algorithms that can't reach a global optimum. +However the big drawback of this technique, is that it will **make the environment non-stationary** since the underlying markov decision process changes over time since other agents are also interacting in the environment. +And this is problematic for many reinforcement Learning algorithms **that can't reach a global optimum with a non-stationary environment**. ## Centralized approach [ADD illustration centralized approach] -In this architecture, we have a high level process that collect agents experiences: experience buffer. And we'll use these experience to learn a common policy. +In this architecture, **we have a high level process that collect agents experiences**: experience buffer. And we'll use these experience **to learn a common policy**. For instance, in the vacuum cleaner, the observation will be: - The coverage map of the vacuums. - The position of all the vacuums. -We use that collective experience to train a policy that will move all three robots in a most beneficial way as a whole. So each robots is learning from the common experience. +We use that collective experience **to train a policy that will move all three robots in a most beneficial way as a whole**. So each robots is learning from the common experience. And we have a stationary environment since all the agents are treated as a larger entity and they know the change of other agents policy (since it’s the same than their). If we recap: -- In decentralized approach. We **treat all agents independently without considering the existence of the other agents.** - 1. In this case all agents considers others agents as part of the environment. - 2. 
**It’s a non-stationarity environment condition** ⇒ so non-guaranty of convergence.
- In centralized approach:
 1. A single policy is learned from all the agents.
 2. Takes as input the present state of an environment and the policy output a jointed actions.
 3. The reward is global.
+ - In this case, each agent **considers the other agents as part of the environment**.
+ - **It’s a non-stationary environment**, so there is no guarantee of convergence.
- In centralized approach:
+ - A **single policy is learned from all the agents**.
+ - It takes as input the present state of the environment, and the policy outputs a joint action.
+ - The reward is global.
From d5bedcee2f735fe1eac2e6819f4664687af3aa70 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Tue, 31 Jan 2023 16:17:50 +0100
Subject: [PATCH 16/29] Add illustration links
---
 units/en/unit7/introduction.mdx        |  2 +-
 units/en/unit7/multi-agent-setting.mdx | 14 ++++++++++--
 units/en/unit7/self-play.mdx           | 10 +++++-----
 3 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/units/en/unit7/introduction.mdx b/units/en/unit7/introduction.mdx
index 86c18bd..92be024 100644
--- a/units/en/unit7/introduction.mdx
+++ b/units/en/unit7/introduction.mdx
@@ -14,7 +14,7 @@ This worked great, and the single-agent system is useful for many applications.
A patchwork of all the environments you’ve trained your agents on since the beginning of the course - +
But, as humans, **we live in a multi-agent world**. Our intelligence comes from interaction with other agents. And so, our **goal is to create agents that can interact with other humans and other agents**. diff --git a/units/en/unit7/multi-agent-setting.mdx b/units/en/unit7/multi-agent-setting.mdx index 83f670d..6185df4 100644 --- a/units/en/unit7/multi-agent-setting.mdx +++ b/units/en/unit7/multi-agent-setting.mdx @@ -11,7 +11,12 @@ To design this multi-agents reinforcement learning system (MARL), we have two so ## Decentralized system -[ADD illustration decentralized approach] +
+Decentralized +
+Source: Introduction to Multi-Agent Reinforcement Learning +
+
In decentralized learning, **each agent is trained independently from the others**. In the given example, each vacuum learns to clean as much of the place as it can **without caring about what the other vacuums (agents) are doing**.

@@ -24,7 +29,12 @@ And this is problematic for many reinforcement Learning algorithms **that can't

## Centralized approach

-[ADD illustration centralized approach]
+
+Centralized +
+Source: Introduction to Multi-Agent Reinforcement Learning +
+
In this architecture, **we have a high level process that collect agents experiences**: experience buffer. And we'll use these experience **to learn a common policy**. diff --git a/units/en/unit7/self-play.mdx b/units/en/unit7/self-play.mdx index f553432..4d5ac48 100644 --- a/units/en/unit7/self-play.mdx +++ b/units/en/unit7/self-play.mdx @@ -1,11 +1,11 @@ # Self-Play: a classic technique to train competitive agents in adversarial games -Now that we studied the basics of multi-agents. We're ready to go deeper. As mentioned in the introduction, we're going to train agents in an adversarial games a Soccer 2vs2 game. +Now that we studied the basics of multi-agents. We're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial games with SoccerTwos a 2vs2 game.
-”SoccerTwos”/ +SoccerTwos -
This environment was made by the Unity MLAgents Team
+
This environment was made by the Unity MLAgents Team
@@ -80,7 +80,7 @@ After every game:

So if A and B have ratings Ra and Rb, then the **expected scores are** given by:

-”ELO
+ELO Score

Then, at the end of the game, we need to update the player’s actual Elo score. We use a linear adjustment **proportional to the amount by which the player over-performed or under-performed.**

@@ -91,7 +91,7 @@ We also define a maximum adjustment rating per game: K-factor.

If Player A has Ea points but scored Sa points, then the player’s rating is updated using the formula:

-”ELO
+ELO Score

### Example

From bd35700e905232a4d7337448ddbeece9661f8a0a Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Tue, 31 Jan 2023 16:18:09 +0100
Subject: [PATCH 17/29] Bug

---
 units/en/unit7/introduction.mdx | 1 -
 1 file changed, 1 deletion(-)

diff --git a/units/en/unit7/introduction.mdx b/units/en/unit7/introduction.mdx
index 92be024..2f0cef2 100644
--- a/units/en/unit7/introduction.mdx
+++ b/units/en/unit7/introduction.mdx
@@ -12,7 +12,6 @@ This worked great, and the single-agent system is useful for many applications.

Patchwork
<figcaption>
- A patchwork of all the environments you’ve trained your agents on since the beginning of the course
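The Elo formulas described in the self-play section can be made concrete with a small sketch. This is our own illustration in Python — the function names and the default K value of 32 are our choices, not ML-Agents code (ML-Agents computes Elo internally):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B (the Elo formula)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_rating(r_a: float, r_b: float, score_a: float, k: float = 32) -> float:
    """New rating for A after scoring `score_a` (1 win, 0.5 draw, 0 loss).

    The adjustment is linear in (actual - expected) and capped by the K-factor.
    """
    return r_a + k * (score_a - expected_score(r_a, r_b))

# Two equally rated players: each is expected to score 0.5
print(expected_score(1200, 1200))  # 0.5
# If A (rated 1200) beats B (rated 1400), A over-performed, so A gains
# more than K/2 points
print(update_rating(1200, 1400, 1.0))
```

Note that the expected scores of the two players always sum to 1, which is why the system is zero-sum: whatever A gains, B loses.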
From 02a2b0de3f743a2c802790041db0422ec3b96514 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 16:43:38 +0100 Subject: [PATCH 18/29] Add hands on and conclusion --- units/en/unit7/conclusion.mdx | 11 ++++++ units/en/unit7/hands-on.mdx | 71 +++++++++++++++++------------------ units/en/unit7/self-play.mdx | 26 +++++++------ 3 files changed, 60 insertions(+), 48 deletions(-) create mode 100644 units/en/unit7/conclusion.mdx diff --git a/units/en/unit7/conclusion.mdx b/units/en/unit7/conclusion.mdx new file mode 100644 index 0000000..c8b22a8 --- /dev/null +++ b/units/en/unit7/conclusion.mdx @@ -0,0 +1,11 @@ +# Conclusion + +That’s all for today. Congrats on finishing unit and the tutorial! + +The best way to learn is to practice and try stuff. **Why not training another agent with a different configuration?** + +And don’t hesitate from time to time to check the [leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos) + +See you on Unit 8 🔥, + +## Keep Learning, Stay awesome 🤗 diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index 0796a3a..cd24d07 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -8,7 +8,7 @@ To validate this hands-on for the certification process, you just need to push a For more information about the certification process, check this section 👉 [https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process) -This hands-on will be different, since to get correct results you need to train your agents from 4 hours to 8 hours. And given the risk of timeout in colab we advise you to train on your own computer. You don’t need a super computer, a simple laptop is good enough for this exercise. +This hands-on will be different, since to get correct results **you need to train your agents from 4 hours to 8 hours**. 
And given the risk of timeout in colab we advise you to train on your own computer. You don’t need a super computer, a simple laptop is good enough for this exercise.

Let's get started,

@@ -59,48 +59,55 @@ We're constantly trying to improve our tutorials, so **if you find some issues

⚠ ⚠ ⚠ We’re not going to use the same version as for Unit 5: Introduction to ML-Agents ⚠ ⚠ ⚠

-We advise you to use (conda)[[https://docs.conda.io/en/latest/](https://docs.conda.io/en/latest/)] as a package manager and create a new environment.
+We advise you to use [conda](https://docs.conda.io/en/latest/) as a package manager and create a new environment.

With conda, we create a new environment called rl:

+```bash
conda create --name rl python=3.8
-
conda activate rl
+```

To be able to train our agents correctly and push to the Hub, we need to install an experimental version of ML-Agents (the branch aivsai from the Hugging Face ML-Agents fork)

+```bash
git clone --branch aivsai [https://github.com/huggingface/ml-agents/](https://github.com/huggingface/ml-agents/)
+```

When the cloning is done (it takes 2.5GB), we go inside the repository and install the package

+```bash
cd ml-agents
pip install -e ./ml-agents-envs
pip install -e ./ml-agents
+```

We also need to install pytorch with:

+```bash
pip install torch
+```

Now that’s installed we need to add the environment training executable.
Based on your operating system you need to download one of them, unzip it and place it in a new folder inside `ml-agents`that you call `training-envs-executables`

At the end, your executable should be in `mlagents/training-envs-executables/SoccerTwos`

-Windows: [https://drive.google.com/file/d/1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP/view?usp=sharing](https://drive.google.com/file/d/1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP/view?usp=sharing)
+Windows: Download [this executable](https://drive.google.com/file/d/1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP/view?usp=sharing)

-Linux (Ubuntu): [https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing](https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing)
+Linux (Ubuntu): Download [this executable](https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing)

Mac: [ADD Berangere]

## Step 1: Understand the environment

-The environment is called `SoccerTwos` it was made by the Unity MLAgents Team. You can find its documentation here: [https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos)
+The environment is called `SoccerTwos`. It was made by the Unity MLAgents Team. You can find its documentation [here](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos)

The goal in this environment **is to get the ball into the opponent's goal while preventing the ball from entering its own goal.**
-”SoccerTwos”/ +SoccerTwos -
This environment was made by the Unity MLAgents Team
+
This environment was made by the Unity MLAgents Team
@@ -108,7 +115,7 @@ The goal in this environment **is to get the ball into the opponent's goal while

The reward function is:

-”SoccerTwos
+SoccerTwos Reward

### The observation space

@@ -128,15 +135,15 @@ The observation space is composed of a vector of size 336:

The action space is three discrete branches:

-”SoccerTwos
+SoccerTwos Action

## Step 2: Understand MA-POCA

[https://arxiv.org/pdf/2111.05992.pdf](https://arxiv.org/pdf/2111.05992.pdf)

-## Step 3: Define the Config
+## Step 3: Define the config file

-We already learned in (Unit 5)[https://huggingface.co/deep-rl-course/unit5/introduction] that in ML-Agents, you define **the training hyperparameters into config.yaml files.**
+We already learned in [Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction) that in ML-Agents, you define **the training hyperparameters into `config.yaml` files.**

There are multiple hyperparameters. To know them better, you should check the explanation of each of them in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)**

@@ -198,7 +205,9 @@ For 5M timesteps (which is the recommended value) it will take from 5 to 8 hours

Depending on the executable you use (windows, ubuntu, mac) the training command will look like this (your executable path can be different, so don’t hesitate to check before running).

-`mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos.exe --run-id="SoccerTwos" --no-graphics`
+```bash
+mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos.exe --run-id="SoccerTwos" --no-graphics
+```

The executable contains 8 copies of SoccerTwos.

@@ -220,9 +229,8 @@ Create a new token (**[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)**) **with write role**

Copy the token, run this, and paste the token

-```
+```bash
huggingface-cli login
-
```

Then, we need to run `mlagents-push-to-hf`.
@@ -237,9 +245,11 @@ If the repo does not exist **it will be created automatically**

In my case

-`mlagents-push-to-hf --run-id="SoccerTwos" --local-dir="./results/SoccerTwos" --repo-id="ThomasSimonini/poca-SoccerTwos" --commit-message="First Push"`
-
+```bash
+mlagents-push-to-hf --run-id="SoccerTwos" --local-dir="./results/SoccerTwos" --repo-id="ThomasSimonini/poca-SoccerTwos" --commit-message="First Push"
+```

+```bash
mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message="First Push"
```

But in order for everything to work perfectly, you need to check:

1. That you have this tag in your model: ML-Agents-SoccerTwos. This is the tag we use to select models to be added to the challenge pool. To do that go to your model and check the tags

-verify1.png
+Verify

If it’s not the case you just need to modify the readme and add it

-verify2.png
+Verify

-1. That you have a SoccerTwos onnx file
+2. That you have a `SoccerTwos.onnx` file

-verify3.png
+Verify

We strongly advise you to create a new model when you push to the Hub if you want to train it again, or train a new version.

## Step 7: Visualize some matches in our demo

-Now that your model is part of AI vs AI Challenge, you can visualize how good it is compared to others.
+Now that your model is part of the AI vs AI Challenge, **you can visualize how good it is compared to others**.

In order to do that, you just need to go on this demo:

-Select your model as team blue (or team purple if you prefer) and another. The best to compare your model is either with the one who’s on top of the leaderboard. Or use the baseline model as opponent [https://huggingface.co/unity/MLAgents-SoccerTwos](https://huggingface.co/unity/MLAgents-SoccerTwos)
+- Select your model as team blue (or team purple if you prefer) and another. The best way to compare your model is either with the one on top of the leaderboard.
Or use the [baseline model as opponent](https://huggingface.co/unity/MLAgents-SoccerTwos)

-This matches you see live are not used to the calculation of your result but are good way to visualize how good your agent is.
+These matches you see live are not used in the calculation of your result **but are a good way to visualize how good your agent is**.

And don't hesitate to share the best score your agent gets on discord in #rl-i-made-this channel 🔥
-
-### Conclusion
-
-That’s all for today. Congrats on finishing this tutorial!
-
-The best way to learn is to practice and try stuff. Why not training another agent with a different configuration?
-
-And don’t hesitate from time to time to check the [leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos)
-
-See you on Unit 8 🔥,
-
-## Keep Learning, Stay awesome 🤗
diff --git a/units/en/unit7/self-play.mdx b/units/en/unit7/self-play.mdx
index 4d5ac48..cbabb9a 100644
--- a/units/en/unit7/self-play.mdx
+++ b/units/en/unit7/self-play.mdx
@@ -1,6 +1,6 @@
# Self-Play: a classic technique to train competitive agents in adversarial games

-Now that we studied the basics of multi-agents. We're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial games with SoccerTwos a 2vs2 game.
+Now that we studied the basics of multi-agents. We're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial games with SoccerTwos a 2vs2 game**.
SoccerTwos @@ -17,9 +17,9 @@ On the one hand, we need to find how to get a well-trained opponent to play agai Think of a child that just started to learn soccer, playing against a very good soccer player will be useless since it will be too hard to win or at least get the ball from time to time. So the child will continuously lose without having time to learn a good policy. -The best solution would be to have an opponent that is on the same level as the agent and will upgrade its level as the agent upgrade its own. Because if the opponent is too strong we’ll learn nothing and if it is too weak, we’re going to overlearn useless behavior against a stronger opponent then. +The best solution would be **to have an opponent that is on the same level as the agent and will upgrade its level as the agent upgrade its own**. Because if the opponent is too strong we’ll learn nothing and if it is too weak, we’re going to overlearn useless behavior against a stronger opponent then. -This solution is called *self-play*. In self-play, the agent uses former copies of itself (of its policy) as an opponent. This way, the agent will play against an agent of the same level (challenging but not too much), have opportunities to improve gradually its policy, and then, as it becomes better update its opponent. It’s a way to bootstrap an opponent and have a gradual increase of opponent complexity. +This solution is called *self-play*. In self-play, **the agent uses former copies of itself (of its policy) as an opponent**. This way, the agent will play against an agent of the same level (challenging but not too much), have opportunities to improve gradually its policy, and then, as it becomes better update its opponent. It’s a way to bootstrap an opponent and have a gradual increase of opponent complexity. 
It’s the same way humans learn in competition:

@@ -28,32 +28,34 @@ It’s the same way humans learn in competition:

We do the same with self-play:

-- We start with a copy of our agent as an opponent this way this opponent is on a similar level.
-- We learn from it, and when we acquired some skills, we update our opponent with a more recent copy of our training policy.
+- We **start with a copy of our agent as an opponent**; this way, this opponent is on a similar level.
+- We **learn from it**, and when we have acquired some skills, we **update our opponent with a more recent copy of our training policy**.

-The theory behind self-play is not something new, it was already used by Arthur Samuel’s checker player system in the fifties, and by Gerald Tesauro’s TD-Gammon in 1955. If you want to learn more about the history of self-play check this very good blogpost by Andrew Cohen: [https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
+The theory behind self-play is not something new, it was already used by Arthur Samuel’s checker player system in the fifties, and by Gerald Tesauro’s TD-Gammon in 1995. If you want to learn more about the history of self-play, [check this very good blogpost by Andrew Cohen](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)

## Self-Play in MLAgents

-Self-Play is integrated into the MLAgents library and is managed by multiple hyperparameters that we’re going to study. But the main focus as explained in the documentation is the tradeoff between the skill level and generality of the final policy and the stability of learning.
+Self-Play is integrated into the MLAgents library and is managed by multiple hyperparameters that we’re going to study.
But the main focus as explained in the documentation is the **tradeoff between the skill level and generality of the final policy and the stability of learning**.

Training against a set of slowly changing or unchanging adversaries with low diversity **results in more stable training. But there is a risk of overfitting if the change is too slow.**

We then need to control:

- How **often do we change opponents** with the `swap_steps` and `team_change` parameters.
- The **number of opponents saved** with the `window` parameter. A larger value of `window`
 means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run.
- The **probability of playing against the current self vs. an opponent** sampled from the pool with `play_against_latest_model_ratio`. A larger value of `play_against_latest_model_ratio`
 indicates that an agent will be playing against the current opponent more often.
- The **number of training steps before saving a new opponent** with the `save_steps` parameter. A larger value of `save_steps`
 will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training.
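To illustrate how the hyperparameters above interact, here is a toy sketch (our own simplification in Python, not the actual ML-Agents trainer code) of how an opponent could be chosen from the pool of saved policy snapshots:

```python
import random

def pick_opponent(snapshot_pool, current_policy,
                  window=10, play_against_latest_model_ratio=0.5):
    """Toy sketch: choose the opponent policy for the next period of play.

    - `snapshot_pool`: past copies of our policy, oldest first (in ML-Agents,
      a new copy is saved every `save_steps` training steps).
    - `window`: only the `window` most recent snapshots are eligible, so a
      larger window keeps older, more diverse behaviors in play.
    - With probability `play_against_latest_model_ratio`, play against the
      current policy itself instead of a sampled snapshot.
    """
    if not snapshot_pool or random.random() < play_against_latest_model_ratio:
        return current_policy
    eligible = snapshot_pool[-window:]
    return random.choice(eligible)

# Example: a pool of 15 snapshots labelled by the order they were saved in
pool = [f"snapshot_{i}" for i in range(15)]
opponent = pick_opponent(pool, "current", window=10,
                         play_against_latest_model_ratio=0.5)
print(opponent)  # either "current" or one of the 10 most recent snapshots
```

The actual swap would then happen every `swap_steps`/`team_change` steps; this sketch only shows how the three sampling-related parameters trade diversity against playing the latest self.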
-To get more details about these hyperparameters you definitely need to check this part of the documentation: [https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play) +To get more details about these hyperparameters you definitely need [to check this part of the documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play) ## The ELO Score to evaluate our agent + ### What is ELO Score? + In adversarial games, tracking the **cumulative reward is not always a meaningful metric to track the learning progress:** because this metric is **dependent only on the skill of the opponent.** Instead, we’re using an ***ELO rating system*** (named after Arpad Elo) that calculates the **relative skill level** between 2 players from a given population in a zero-sum game. From b50ff0b72de48d99f5a768bae5ad9aceafa0450c Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 16:44:12 +0100 Subject: [PATCH 19/29] Add conclusion --- units/en/_toctree.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index f24a28d..9994167 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -174,6 +174,8 @@ title: Self-Play - local: unit7/hands-on title: Let's train our soccer team to beat your classmates' teams (AI vs. AI) + - local: unit7/conclusion + title: Conclusion - local: unit7/additional-readings title: Additional Readings - title: What's next? 
New Units Publishing Schedule From 70cb5e6e2c41ee524f5d60caa2e3c6586acfceb2 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 16:49:44 +0100 Subject: [PATCH 20/29] Small update --- units/en/unit7/hands-on.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index cd24d07..96746fd 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -105,7 +105,7 @@ The environment is called `SoccerTwos` it was made by the Unity MLAgents Team. Y The goal in this environment **is to get the ball into the opponent's goal while preventing the ball from entering its own goal.**
-SoccerTwos +SoccerTwos
This environment was made by the Unity MLAgents Team
From b22cb68f8b4bf93d2d2fb1ea5d2183953d3600d4 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 31 Jan 2023 16:54:54 +0100 Subject: [PATCH 21/29] Update --- units/en/unit7/hands-on.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index 96746fd..4d7a29a 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -107,7 +107,7 @@ The goal in this environment **is to get the ball into the opponent's goal while
SoccerTwos -
This environment was made by the Unity MLAgents Team
+
This environment was made by the Unity MLAgents Team
From 7980d983c2ebda4b7a899aaf0c09d4dba44a0669 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Tue, 31 Jan 2023 17:04:33 +0100
Subject: [PATCH 22/29] Update hands on

---
 units/en/unit7/hands-on.mdx | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx
index 4d7a29a..47ed9e0 100644
--- a/units/en/unit7/hands-on.mdx
+++ b/units/en/unit7/hands-on.mdx
@@ -219,13 +219,13 @@ Now that we trained our agents, we’re **ready to push them to the Hub to be a

To be able to share your model with the community, there are three more steps to follow:

-1️⃣ (If it’s not already done) create an account to HF ➡ **[https://huggingface.co/join](https://huggingface.co/join)**
+1️⃣ (If it’s not already done) create an account on HF ➡ https://huggingface.co/join

2️⃣ Sign in and store your authentication token from the Hugging Face website.

-Create a new token (**[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)**) **with write role**
+Create a new token (https://huggingface.co/settings/tokens) **with write role**

-Create HF Token
+Create HF Token

Copy the token, run this, and paste the token

From 5098c07a27377f9a5ce6c69ba4674fbc48eabd19 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Tue, 31 Jan 2023 17:59:08 +0100
Subject: [PATCH 23/29] Add MA-POCA section

---
 units/en/unit7/hands-on.mdx | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx
index 47ed9e0..23cc6a4 100644
--- a/units/en/unit7/hands-on.mdx
+++ b/units/en/unit7/hands-on.mdx
@@ -139,7 +139,31 @@ The action space is three discrete branches:

## Step 2: Understand MA-POCA

-[https://arxiv.org/pdf/2111.05992.pdf](https://arxiv.org/pdf/2111.05992.pdf)
+We know how to train agents to play against others: **we can use self-play.** This is a perfect technique for a 1vs1.
+But in our case we’re 2vs2, and each team has 2 agents. How then can we **train cooperative behavior for groups of agents?**
+
+As explained in the [Unity Blog](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors), agents typically receive a reward as a group (+1 - penalty) when the team scores a goal. This implies that **every agent on the team is rewarded even if each agent didn’t contribute the same to the win**, which makes it difficult to learn what to do independently.
+
+The solution was developed by the Unity MLAgents team in a new multi-agent trainer called *MA-POCA (Multi-Agent POsthumous Credit Assignment)*.
+
+The idea is simple but powerful: a centralized critic **processes the states of all agents in the team to estimate how well each agent is doing**. This of this critic like a coach.
+
+This allows each agent to **make decisions based only on what it perceives locally**, and **simultaneously evaluate how good its behavior is in the context of the whole group**.
+
+MA POCA + +
This illustrates MA-POCA’s centralized learning and decentralized execution. Source: MLAgents Plays Dodgeball +
+ +
+ + +The solution then is to use Self-Play with an MA-POCA trainer (called poca). The poca trainer will help us to train cooperative behavior and self-play to get an opponent team. + +If you want to dive deeper into this MA-POCA algorithm, you need to read the paper they published [here](https://arxiv.org/pdf/2111.05992.pdf) and the sources we put on the additional readings section. ## Step 3: Define the config file From 51c12aac4d0b6b7e302e89a715984b13d339c0e0 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Wed, 1 Feb 2023 09:30:13 +0100 Subject: [PATCH 24/29] Update hands-on.mdx --- units/en/unit7/hands-on.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index 23cc6a4..83279d3 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -96,7 +96,7 @@ Windows: Download [this executable](https://drive.google.com/file/d/1sqFxbEdTMub Linux (Ubuntu): Download [this executable](https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing) -Mac: [ADD Berangere] +Mac: Download [this executable](https://drive.google.com/file/d/14D8w6XYLRlXCSurdZxe70hwYULcuWxWZ/view?usp=share_link) ## Step 1: Understand the environment From 08af6b246e50b39f43da7a47798b34acdb744126 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Wed, 1 Feb 2023 10:45:48 +0100 Subject: [PATCH 25/29] Update introduction.mdx --- units/en/unit7/introduction.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit7/introduction.mdx b/units/en/unit7/introduction.mdx index 2f0cef2..bd7384f 100644 --- a/units/en/unit7/introduction.mdx +++ b/units/en/unit7/introduction.mdx @@ -24,7 +24,7 @@ So today, we’re going to **learn the basics of this fascinating topic of multi And the most exciting part is that during this unit, you’re going to train your first agents in a multi-agents system: **a 2vs2 soccer team that needs to beat the opponent team**. 
-And you’re going to participate in **AI vs. AI challenge** where your trained agent will compete against other classmates’ agents every day and be ranked on a [new leaderboard](). +And you’re going to participate in **AI vs. AI challenge** where your trained agent will compete against other classmates’ agents every day and be ranked on a [new leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos).
SoccerTwos From b441834b6ec8ed2ab735770c970c506821bd3d1e Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Wed, 1 Feb 2023 10:52:57 +0100 Subject: [PATCH 26/29] Update conclusion.mdx --- units/en/unit7/conclusion.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit7/conclusion.mdx b/units/en/unit7/conclusion.mdx index c8b22a8..743d9ac 100644 --- a/units/en/unit7/conclusion.mdx +++ b/units/en/unit7/conclusion.mdx @@ -1,6 +1,6 @@ # Conclusion -That’s all for today. Congrats on finishing unit and the tutorial! +That’s all for today. Congrats on finishing this unit and the tutorial! The best way to learn is to practice and try stuff. **Why not training another agent with a different configuration?** From 4492a087d45d5f3451ab6d24bacec1f5f8f97c18 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Wed, 1 Feb 2023 12:45:43 +0100 Subject: [PATCH 27/29] Apply suggestions from code review Co-authored-by: Omar Sanseviero --- units/en/unit7/hands-on.mdx | 40 ++++++++++++------------- units/en/unit7/introduction-to-marl.mdx | 10 +++---- units/en/unit7/multi-agent-setting.mdx | 24 +++++++-------- units/en/unit7/self-play.mdx | 32 ++++++++++---------- 4 files changed, 52 insertions(+), 54 deletions(-) diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index 83279d3..2992472 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -1,22 +1,22 @@ # Hands-on -Now that you learned the bases of multi-agents. You're ready to train our first agents in a multi-agents system: **a 2vs2 soccer team that needs to beat the opponent team**. +Now that you learned the bases of multi-agents. You're ready to train our first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**. And you’re going to participate in AI vs. 
AI challenges where your trained agent will compete against other classmates’ **agents every day and be ranked on a new leaderboard.** -To validate this hands-on for the certification process, you just need to push a trained model. There **are no minimal result to attain to validate it.** +To validate this hands-on for the certification process, you just need to push a trained model. There **are no minimal results to attain to validate it.** For more information about the certification process, check this section 👉 [https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process) -This hands-on will be different, since to get correct results **you need to train your agents from 4 hours to 8 hours**. And given the risk of timeout in colab we advise you to train on your own computer. You don’t need a super computer, a simple laptop is good enough for this exercise. +This hands-on will be different since to get correct results **you need to train your agents from 4 hours to 8 hours**. And given the risk of timeout in Colab, we advise you to train on your computer. You don’t need a supercomputer: a simple laptop is good enough for this exercise. -Let's get started, +Let's get started! 🔥 ## What is AI vs. AI? AI vs. AI is an open-source tool we developed at Hugging Face to compete agents on the Hub against one another in a multi-agent setting. These models are then ranked in a leaderboard. -The idea of this tool is to have a powerful evaluation tool: **by evaluating your agent with a lot of others you’ll get a good idea of the quality of your policy.** +The idea of this tool is to have a robust evaluation tool: **by evaluating your agent with a lot of others, you’ll get a good idea of the quality of your policy.** More precisely, AI vs. AI is three tools: @@ -26,25 +26,25 @@ More precisely, AI vs. AI is three tools: We're going to write a blog post to explain this AI vs. 
AI tool in detail, but to give you the big picture it works this way:

-- Every 4h, our algorithm **fetch all the available models for a given environment (in our case ML-Agents-SoccerTwos).**
+- Every four hours, our algorithm **fetches all the available models for a given environment (in our case ML-Agents-SoccerTwos).**
 - It creates a **queue of matches with the matchmaking algorithm.**
 - We simulate the match in a Unity headless process and **gather the match result** (1 if the first model won, 0.5 if it’s a draw, 0 if the second model won) in a Dataset.
 - Then, when all matches from the matches queue are done, **we update the ELO score for each model and update the leaderboard.**

### Competition Rules

-This first AI vs. AI competition **is an experiment,** the goal is to improve the tool in the future with your feedback. So some **breakups can happen during the challenge**. But don't worry
+This first AI vs. AI competition **is an experiment**: the goal is to improve the tool in the future with your feedback. So some **breakups can happen during the challenge**. But don't worry:
**all the results are saved in a dataset so we can always restart the calculation correctly without losing information**.

In order for your model to get correctly evaluated against others, you need to follow these rules:

1. **You can't change the observation space or action space of the agent.** By doing that your model will not work during evaluation.
2. You **can't use a custom trainer for now,** you need to use the Unity MLAgents ones.
-3. We provide executables to train your agents, you can also use the Unity Editor if you prefer **but in order to avoid bugs we advise you to use our executables**.
+3. We provide executables to train your agents. You can also use the Unity Editor if you prefer, **but to avoid bugs, we advise you to use our executables**.

What will make the difference during this challenge are **the hyperparameters you choose**.
-The AI vs AI algorithm will run until April the 30th 2023. +The AI vs AI algorithm will run until April the 30th, 2023. We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues). @@ -55,7 +55,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues ## Step 0: Install MLAgents and download the correct executable -⚠ We're going to use an experimental version of ML-Agents were you can push to hub and load from hub Unity ML-Agents Models **you need to install the same version.** +⚠ We're going to use an experimental version of ML-Agents which allows you to push and load your models to/from the Hub. **You need to install the same version.** ⚠ ⚠ ⚠ We’re not going to use the same version than for the Unit 5: Introduction to ML-Agents ⚠ ⚠ ⚠ @@ -71,7 +71,7 @@ conda activate rl To be able to train correctly our agents and push to the Hub, we need to install an experimental version of ML-Agents (the branch aivsai from Hugging Face ML-Agents fork) ```bash -git clone --branch aivsai [https://github.com/huggingface/ml-agents/](https://github.com/huggingface/ml-agents/) +git clone --branch aivsai https://github.com/huggingface/ml-agents ``` When the cloning is done (it takes 2.5Go), we go inside the repository and install the package @@ -88,7 +88,7 @@ We also need to install pytorch with: pip install torch ``` -Now that’s installed we need to add the environment training executable. Based on your operating system you need to download one of them, unzip it and place it in a new folder inside `ml-agents`that you call `training-envs-executables` +Now that it’s installed, we need to add the environment training executable. 
Based on your operating system you need to download one of them, unzip it and place it in a new folder inside `ml-agents` that you call `training-envs-executables` At the end your executable should be in `mlagents/training-envs-executables/SoccerTwos` @@ -145,9 +145,9 @@ But in our case we’re 2vs2, and each team has 2 agents. How then we can **trai As explained in the [Unity Blog](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors), agents typically receive a reward as a group (+1 - penalty) when the team scores a goal. This implies that **every agent on the team is rewarded even if each agent didn’t contribute the same to the win**, which makes it difficult to learn what to do independently. -The solution was developed by the Unity MLAgents team in a new multi-agent trainer called *MA-POCA (Multi-Agent POsthumous Credit Assignment)*. +The Unity MLAgents team developed the solution in a new multi-agent trainer called *MA-POCA (Multi-Agent POsthumous Credit Assignment)*. -The idea is simple but powerful: a centralized critic **processes the states of all agents in the team to estimate how well each agent is doing**. This of this critic like a coach. +The idea is simple but powerful: a centralized critic **processes the states of all agents in the team to estimate how well each agent is doing**. Think of this critic as a coach. This allows each agent to **make decisions based only on what it perceives locally**, and **simultaneously evaluate how good its behavior is in the context of the whole group**. 
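As a toy illustration of that "coach" idea, the sketch below shows only the data flow: actors that decide from local observations, and one centralized critic that scores the joint state. It is not the real MA-POCA trainer, and all dimensions and weights are made-up placeholders.

```python
import random

# Toy sketch of a centralized critic with decentralized actors ("the coach").
# Dimensions and weights are arbitrary placeholders, not the MA-POCA network.

random.seed(0)
n_agents, obs_dim, n_actions = 2, 8, 4

def linear(vec, weights):
    """One output per weight row: a stand-in for a tiny policy/value network."""
    return [sum(v * w for v, w in zip(vec, row)) for row in weights]

# Each agent on the team only sees its LOCAL observation...
local_obs = [[random.gauss(0, 1) for _ in range(obs_dim)] for _ in range(n_agents)]

# ...and picks its action from that local view alone (decentralized execution).
actor_w = [[random.gauss(0, 1) for _ in range(obs_dim)] for _ in range(n_actions)]
actions = []
for obs in local_obs:
    scores = linear(obs, actor_w)          # computed from local info only
    actions.append(scores.index(max(scores)))

# The centralized critic, by contrast, sees the states of ALL agents at once
# and produces one estimate of how well the whole team is doing.
joint_state = [x for obs in local_obs for x in obs]   # length n_agents * obs_dim
critic_w = [[random.gauss(0, 1) for _ in range(len(joint_state))]]
team_value = linear(joint_state, critic_w)[0]         # the "coach's" estimate
```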
@@ -167,7 +167,7 @@ If you want to dive deeper into this MA-POCA algorithm, you need to read the pap ## Step 3: Define the config file -We already learned in (Unit 5)[https://huggingface.co/deep-rl-course/unit5/introduction] that in ML-Agents, you define **the training hyperparameters into `config.yaml` files.** +We already learned in [Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction) that in ML-Agents, you define **the training hyperparameters into `config.yaml` files.** There are multiple hyperparameters. To know them better, you should check for each explanation with **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)** @@ -208,7 +208,7 @@ behaviors: initial_elo: 1200.0 ``` -Compared to Pyramids or SnowballTarget we have new hyperparameters with self-play part. How you modify them can be critical in getting good results. +Compared to Pyramids or SnowballTarget, we have new hyperparameters with a self-play part. How you modify them can be critical in getting good results. The advice I can give you here is to check the explanation and recommended value for each parameters (especially self-play ones) with **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md).** @@ -225,7 +225,7 @@ We define four parameters: 3. `-run_id`: the name you want to give to your training run id. 4. `-no-graphics`: to not launch the visualization during the training. -For 5M timesteps (which is the recommended value) it will take from 5 to 8 hours of training. You can continue to use your computer in the meantime, but my advice is to deactivate the computer standby mode to avoid the training to be stopped. +Depending on your hardware, 5M timesteps (the recommended value) will take 5 to 8 hours of training. 
You can continue using your computer in the meantime, but I advise deactivating the computer standby mode to prevent the training from being stopped. Depending on the executable you use (windows, ubuntu, mac) the training command will look like this (your executable path can be different so don’t hesitate to check before running). @@ -279,9 +279,7 @@ mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir If everything worked you should have this at the end of the process(but with a different url 😆) : -``` -Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/poca-SoccerTwos -``` +Your model is pushed to the Hub. You can view your model here: https://huggingface.co/ThomasSimonini/poca-SoccerTwos It's the link to your model. It contains a model card that explains how to use it, your Tensorboard, and your config file. **What's awesome is that it's a git repository, which means you can have different commits, update your repository with a new push, etc.** @@ -304,7 +302,7 @@ If it’s not the case you just need to modify readme and add it Verify -We strongly advise you to create a new model when you push to hub if you want to train it again, or train a new version. +We strongly suggest that you create a new model when you push to the Hub if you want to train it again or train a new version. ## Step 7: Visualize some match in our demo diff --git a/units/en/unit7/introduction-to-marl.mdx b/units/en/unit7/introduction-to-marl.mdx index 3aa7f3c..340c8ea 100644 --- a/units/en/unit7/introduction-to-marl.mdx +++ b/units/en/unit7/introduction-to-marl.mdx @@ -11,13 +11,13 @@ A patchwork of all the environments you've trained your agents on since the begi

-When we do Multi agents reinforcement learning (MARL), we are in a situation where we have multiple agents **that share and interact in a common environment**.
+When we do multi-agent reinforcement learning (MARL), we are in a situation where we have multiple agents **that share and interact in a common environment**.

For instance, you can think of a warehouse where **multiple robots need to navigate to load and unload packages**.

Warehouse -
Image by upklyak on Freepik
+
[Image by upklyak](https://www.freepik.com/free-vector/robots-warehouse-interior-automated-machines_32117680.htm#query=warehouse robot&position=17&from_view=keyword) on Freepik
Or a road with **several autonomous vehicles**. @@ -25,7 +25,7 @@ Or a road with **several autonomous vehicles**.
Self driving cars
-Image by jcomp on Freepik +[Image by jcomp](https://www.freepik.com/free-vector/autonomous-smart-car-automatic-wireless-sensor-driving-road-around-car-autonomous-smart-car-goes-scans-roads-observe-distance-automatic-braking-system_26413332.htm#query=self driving cars highway&position=34&from_view=search&track=ais) on Freepik
@@ -33,7 +33,7 @@ In these examples, we have **multiple agents interacting in the environment and ## Different types of multi-agent environments -Given that in a multi-agent system, agents interact with other agents we can have different types of environments: +Given that in a multi-agent system, agents interact with other agents, we can have different types of environments: - *Cooperative environments*: where your agents needs **to maximize the common benefits**. @@ -52,4 +52,4 @@ For example, in a game of tennis, **each agent wants to beat the other agent**.
This environment was made by the Unity MLAgents Team
-So now we can ask how we design these multi-agent systems. Said differently, **how can we train agents in a multi-agents setting** ? +So now we might wonder: how can we design these multi-agent systems? Said differently, **how can we train agents in a multi-agent setting** ? diff --git a/units/en/unit7/multi-agent-setting.mdx b/units/en/unit7/multi-agent-setting.mdx index 6185df4..c90fb7f 100644 --- a/units/en/unit7/multi-agent-setting.mdx +++ b/units/en/unit7/multi-agent-setting.mdx @@ -5,9 +5,9 @@ For this section, you're going to watch this excellent introduction to multi-age -In this video, Brian talked about how to design multi-agents systems. Especially he took a vacuum cleaner multi-agents setting example and asked how they **can cooperate each other**? +In this video, Brian talked about how to design multi-agent systems. He specifically took a vacuum cleaner multi-agents setting and asked how they **can cooperate with each other**? -To design this multi-agents reinforcement learning system (MARL), we have two solutions. +We have two solutions to design this multi-agent reinforcement learning system (MARL). ## Decentralized system @@ -18,14 +18,14 @@ Source: Introduction to M
-In decentralized learning, **each agent is trained independently from others**. In the example given each vacuum learns to clean as much place it can **without caring about what other vacuums (agents) are doing**. +In decentralized learning, **each agent is trained independently from others**. In the example given, each vacuum learns to clean as many places as it can **without caring about what other vacuums (agents) are doing**. -The benefit is **since no information is shared between agents, these vacuum can be designed and trained like we train single agents**. +The benefit is that **since no information is shared between agents, these vacuums can be designed and trained like we train single agents**. -The idea, here is that **our training agent will consider other agents as part of the environment dynamics**. Not as agents. +The idea here is that **our training agent will consider other agents as part of the environment dynamics**. Not as agents. -However the big drawback of this technique, is that it will **make the environment non-stationary** since the underlying markov decision process changes over time since other agents are also interacting in the environment. -And this is problematic for many reinforcement Learning algorithms **that can't reach a global optimum with a non-stationary environment**. +However, the big drawback of this technique is that it will **make the environment non-stationary** since the underlying Markov decision process changes over time as other agents are also interacting in the environment. +And this is problematic for many Reinforcement Learning algorithms **that can't reach a global optimum with a non-stationary environment**. ## Centralized approach @@ -36,22 +36,22 @@ Source: Introduction to M

-In this architecture, **we have a high level process that collect agents experiences**: experience buffer. And we'll use these experience **to learn a common policy**.
+In this architecture, **we have a high-level process that collects agents' experiences**: an experience buffer. And we'll use these experiences **to learn a common policy**.

For instance, in the vacuum cleaner, the observation will be:

- The coverage map of the vacuums.
- The position of all the vacuums.

-We use that collective experience **to train a policy that will move all three robots in a most beneficial way as a whole**. So each robots is learning from the common experience.
-And we have a stationary environment since all the agents are treated as a larger entity and they know the change of other agents policy (since it’s the same than their).
+We use that collective experience **to train a policy that will move all three robots in the most beneficial way as a whole**. So each robot is learning from the common experience.
+And we have a stationary environment since all the agents are treated as a larger entity, and they know the change of other agents' policies (since it's the same as theirs).

If we recap:

- In *decentralized approach*, we **treat all agents independently without considering the existence of the other agents.**
-  - In this case all agents **considers others agents as part of the environment**.
+  - In this case, all agents **consider other agents as part of the environment**.
  - **It’s a non-stationarity environment condition**, so non-guaranty of convergence.

- In centralized approach:
  - A **single policy is learned from all the agents**.
-  - Takes as input the present state of an environment and the policy output a jointed actions.
+  - Takes as input the present state of the environment, and the policy outputs joint actions.
  - The reward is global.
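The recap above can be made concrete with a minimal sketch of the vacuum-cleaner example. The "policies" here are placeholder rules, not trained networks; the point is only the difference in what each design observes and rewards.

```python
# Minimal sketch of the two MARL designs recapped above, using the vacuum
# example. The "policies" are placeholder rules, not trained networks.

agents = ["vacuum_1", "vacuum_2", "vacuum_3"]
positions = {agent: i for i, agent in enumerate(agents)}

# Decentralized: each agent acts from its LOCAL observation only, and
# treats the other vacuums as part of the environment dynamics.
def independent_policy(local_obs):
    return local_obs + 1  # placeholder per-agent decision rule

decentralized_actions = {a: independent_policy(positions[a]) for a in agents}

# Centralized: ONE shared policy maps the full joint state (coverage map,
# all positions) to a joint action, and the reward is global.
def shared_policy(joint_state):
    return tuple(p + 1 for p in joint_state)  # placeholder joint decision

joint_state = tuple(positions[a] for a in agents)
joint_action = shared_policy(joint_state)
global_reward = len(set(joint_action))  # one common reward for the whole team
```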
diff --git a/units/en/unit7/self-play.mdx b/units/en/unit7/self-play.mdx
index cbabb9a..3fbd401 100644
--- a/units/en/unit7/self-play.mdx
+++ b/units/en/unit7/self-play.mdx
@@ -1,6 +1,6 @@
# Self-Play: a classic technique to train competitive agents in adversarial games

-Now that we studied the basics of multi-agents. We're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial games with SoccerTwos a 2vs2 game**.
+Now that we've studied the basics of multi-agents, we're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial game with SoccerTwos, a 2vs2 game**.

SoccerTwos @@ -11,27 +11,27 @@ Now that we studied the basics of multi-agents. We're ready to go deeper. As men ## What is Self-Play? -Training correctly agents in an adversarial game can be **quite complex**. +Training agents correctly in an adversarial game can be **quite complex**. On the one hand, we need to find how to get a well-trained opponent to play against your training agent. And on the other hand, even if you have a very good trained opponent, it's not a good solution since how your agent is going to improve its policy when the opponent is too strong? -Think of a child that just started to learn soccer, playing against a very good soccer player will be useless since it will be too hard to win or at least get the ball from time to time. So the child will continuously lose without having time to learn a good policy. +Think of a child that just started to learn soccer. Playing against a very good soccer player will be useless since it will be too hard to win or at least get the ball from time to time. So the child will continuously lose without having time to learn a good policy. -The best solution would be **to have an opponent that is on the same level as the agent and will upgrade its level as the agent upgrade its own**. Because if the opponent is too strong we’ll learn nothing and if it is too weak, we’re going to overlearn useless behavior against a stronger opponent then. +The best solution would be **to have an opponent that is on the same level as the agent and will upgrade its level as the agent upgrades its own**. Because if the opponent is too strong, we’ll learn nothing; if it is too weak, we’ll overlearn useless behavior against a stronger opponent then. -This solution is called *self-play*. In self-play, **the agent uses former copies of itself (of its policy) as an opponent**. 
This way, the agent will play against an agent of the same level (challenging but not too much), have opportunities to improve gradually its policy, and then, as it becomes better update its opponent. It’s a way to bootstrap an opponent and have a gradual increase of opponent complexity. +This solution is called *self-play*. In self-play, **the agent uses former copies of itself (of its policy) as an opponent**. This way, the agent will play against an agent of the same level (challenging but not too much), have opportunities to gradually improve its policy, and then update its opponent as it becomes better. It’s a way to bootstrap an opponent and progressively increase the opponent's complexity. -It’s the same way human learn in competition: +It’s the same way humans learn in competition: - We start to train against an opponent of similar level - Then we learn from it, and when we acquired some skills, we can move further with stronger opponents. We do the same with self-play: -- We **start with a copy of our agent as an opponent** this way this opponent is on a similar level. -- We **learn from it**, and when we acquired some skills, we **update our opponent with a more recent copy of our training policy**. +- We **start with a copy of our agent as an opponent** this way, this opponent is on a similar level. +- We **learn from it**, and when we acquire some skills, we **update our opponent with a more recent copy of our training policy**. -The theory behind self-play is not something new, it was already used by Arthur Samuel’s checker player system in the fifties, and by Gerald Tesauro’s TD-Gammon in 1955. If you want to learn more about the history of self-play [check this very good blogpost by Andrew Cohen](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents) +The theory behind self-play is not something new. 
It was already used by Arthur Samuel’s checker player system in the fifties and by Gerald Tesauro’s TD-Gammon in the nineties. If you want to learn more about the history of self-play [check this very good blogpost by Andrew Cohen](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)

## Self-Play in MLAgents

@@ -49,7 +49,7 @@ We need then to control:

- The **number of training steps before saving a new opponent** with `save_steps` parameters. A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training.

-To get more details about these hyperparameters you definitely need [to check this part of the documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play)
+To get more details about these hyperparameters, you definitely need [to check this part of the documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play)

## The ELO Score to evaluate our agent

@@ -60,9 +60,9 @@

In adversarial games, tracking the **cumulative reward is not always a meaningful

Instead, we’re using an ***ELO rating system*** (named after Arpad Elo) that calculates the **relative skill level** between 2 players from a given population in a zero-sum game.

-In a zero-sum game: one agent wins is the other agent loss. It’s a mathematical representation of a situation in which each participant’s gain or loss of utility **is exactly balanced by the gain or loss of the utility of the other participants.** We talk about zero-sum games because the sum of utility is equal to zero.
+In a zero-sum game: one agent wins, and the other agent loses. 
It’s a mathematical representation of a situation in which each participant’s gain or loss of utility **is exactly balanced by the gain or loss of the utility of the other participants.** We talk about zero-sum games because the sum of utility is equal to zero. -During the training, this ELO (starting at a certain score: in general 1200), can decrease in the beginning but should then increase progressively. +This ELO (starting at a specific score: frequently 1200) can decrease initially but should increase progressively during the training. The Elo system is **inferred from the losses and draws against other players.** It means that player ratings depend **on the ratings of their opponents and the results scored against them.** @@ -70,7 +70,7 @@ Elo defines an Elo score that is the relative skills of a player in a zero-sum g The central idea is to think of the performance of a player **as a random variable that is normally distributed.** -The difference in rating between 2 players serves as **the predictor of the outcomes of a match.** If the player wins but the probability is high it will not win a lot of points from their opponent, since it means that it was much stronger than it. +The difference in rating between 2 players serves as **the predictor of the outcomes of a match.** If the player wins, but the probability of winning is high, it will only win a few points from its opponent since it means that it is much stronger than it. After every game: @@ -84,7 +84,7 @@ So if A and B have rating Ra, and Rb, then the **expected scores are** given by: ELO Score -Then, at the end of the game, we need to update the player’s actual Elo score, we use a linear adjustment **proportional to the amount by which the player over-performed or under-performed.** +Then, at the end of the game, we need to update the player’s actual Elo score. 
We use a linear adjustment **proportional to the amount by which the player over-performed or under-performed.** We also define a maximum adjustment rating per game: K-factor. @@ -126,8 +126,8 @@ Player B has a rating of 2300 Using ELO score has multiple advantages: -- Points are **always balanced** (more points are exchanged when there is an unexpected outcome but the sum is always the same). -- It is a **self-corrected system** since if a player wins against a weak player, you will not win a lot of points. +- Points are **always balanced** (more points are exchanged when there is an unexpected outcome, but the sum is always the same). +- It is a **self-corrected system** since if a player wins against a weak player, you will only win a few points. - If **works with team games**: we calculate the average for each team and use it in Elo. ### The Disadvantages From 5989ec70f0a98cf91d7da4187421e09c4102dc6a Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Wed, 1 Feb 2023 15:33:30 +0100 Subject: [PATCH 28/29] Update units/en/unit7/hands-on.mdx Co-authored-by: Omar Sanseviero --- units/en/unit7/hands-on.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index 2992472..262f985 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -100,7 +100,7 @@ Mac: Download [this executable](https://drive.google.com/file/d/14D8w6XYLRlXCSur ## Step 1: Understand the environment -The environment is called `SoccerTwos` it was made by the Unity MLAgents Team. You can find its documentation [here](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos) +The environment is called `SoccerTwos`. The Unity MLAgents Team made it. 
You can find its documentation [here](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos) The goal in this environment **is to get the ball into the opponent's goal while preventing the ball from entering its own goal.** From e394055facba2150f488aab2ddc9ca07542f33dd Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Wed, 1 Feb 2023 15:36:14 +0100 Subject: [PATCH 29/29] Update hands-on.mdx --- units/en/unit7/hands-on.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx index 262f985..b46a614 100644 --- a/units/en/unit7/hands-on.mdx +++ b/units/en/unit7/hands-on.mdx @@ -306,7 +306,7 @@ We strongly suggest that you create a new model when you push to the Hub if you ## Step 7: Visualize some match in our demo -Now that your model is part of AI vs AI Challenge, **you can visualize how good it is compared to others**. +Now that your model is part of AI vs AI Challenge, **you can visualize how good it is compared to others**: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos In order to do that, you just need to go on this demo: