diff --git a/units/en/unit7/additional-readings.mdx b/units/en/unit7/additional-readings.mdx
index 71aba31..6cb3239 100644
--- a/units/en/unit7/additional-readings.mdx
+++ b/units/en/unit7/additional-readings.mdx
@@ -1,3 +1,17 @@
 # Additional Readings [[additional-readings]]
 
-## Self-Play
+## An introduction to multi-agents
+
+- [Multi-agent reinforcement learning: An overview](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf)
+- [Multiagent Reinforcement Learning, Marc Lanctot](https://rlss.inria.fr/files/2019/07/RLSS_Multiagent.pdf)
+- [Example of a multi-agent environment](https://www.mathworks.com/help/reinforcement-learning/ug/train-3-agents-for-area-coverage.html?s_eid=PSM_15028)
+- [A list of different multi-agent environments](https://agents.inf.ed.ac.uk/blog/multiagent-learning-environments/)
+- [Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents](https://bit.ly/3nVK7My)
+- [Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning](https://bit.ly/3v7LxaT)
+
+## Self-Play and MA-POCA
+
+- [Self-Play theory with ML-Agents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
+- [Training complex behavior with ML-Agents](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors)
+- [ML-Agents plays dodgeball](https://blog.unity.com/technology/ml-agents-plays-dodgeball)
+- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)
diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx
index 27b3255..31db82b 100644
--- a/units/en/unit7/hands-on.mdx
+++ b/units/en/unit7/hands-on.mdx
@@ -29,7 +29,7 @@ It's a matchmaking algorithm where your pushed models are ranked by playing aga
 AI vs. AI is three tools:
 
 - A *matchmaking process* defining which model against which model and running the model fights using a background task in the Space.
-- A *leaderboard* getting the match history results and displaying the models ELO ratings: [ADD LEADERBOARD]
+- A *leaderboard* getting the match history results and displaying the models' ELO ratings: https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos
 - A *Space demo* to visualize your agents playing against others : https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos
 
@@ -54,10 +54,54 @@ What will make the difference during this challenge are **the hyperparameters yo
 # Step 0: Install MLAgents and download the correct executable
 
+You need to install a specific version of MLAgents.
 
-# Step 1: Understand the environment
-
+
+## Step 1: Understand the environment
+
+The environment is called `Pyramids`. It was made by the Unity MLAgents team.
+
+The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.
+
+[Image: Pyramids Environment]
+
+## The reward function
+
+The reward function is:
+
+[Image: Pyramids Environment]
+
+In terms of code, it looks like this:
+
+[Image: Pyramids Reward]
+
+To train this agent, which needs to find the button and then knock over the Pyramid to reach the gold brick, we'll use a combination of two types of rewards (see the sketch after this list):
+
+- The *extrinsic one* given by the environment (illustrated above).
+- An *intrinsic* one called **curiosity**, which **pushes our agent to be curious, or in other words, to explore its environment better**.
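+
+As an illustration only (this is not the ML-Agents implementation), here is a minimal, self-contained Python sketch of how an extrinsic reward and a curiosity bonus can be combined. The `RandomForwardModel`, `beta`, and function names below are our own hypothetical choices:
+
+```python
+import numpy as np
+
+class RandomForwardModel:
+    """Stand-in forward model predicting the next state from (state, action).
+    In a real curiosity module this would be a trained neural network."""
+    def __init__(self, state_dim, action_dim, seed=0):
+        rng = np.random.default_rng(seed)
+        self.W_s = rng.normal(size=(state_dim, state_dim))
+        self.W_a = rng.normal(size=(action_dim, state_dim))
+
+    def predict(self, state, action):
+        return state @ self.W_s + action @ self.W_a
+
+def curiosity_bonus(model, state, action, next_state):
+    # Intrinsic reward = prediction error of the forward model:
+    # transitions the model predicts poorly are "surprising" and get a larger bonus.
+    prediction = model.predict(state, action)
+    return float(np.mean((prediction - next_state) ** 2))
+
+def total_reward(extrinsic, model, state, action, next_state, beta=0.02):
+    # Combine the environment's extrinsic reward with the weighted curiosity bonus.
+    return extrinsic + beta * curiosity_bonus(model, state, action, next_state)
+
+# Tiny usage example with made-up numbers
+model = RandomForwardModel(state_dim=4, action_dim=2)
+s, a, s_next = np.zeros(4), np.ones(2), np.ones(4)
+print(total_reward(extrinsic=0.0, model=model, state=s, action=a, next_state=s_next))
+```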
+
+If you want to know more about curiosity, the next section (optional) will explain the basics.
+
+## The observation space
+
+In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls).
+
+We also use a **boolean variable indicating the switch state** (did we turn the switch on or off to spawn the Pyramid) and a vector that **contains the agent's speed**.
+
+[Image: Pyramids obs code]
+
+## The action space
+
+The action space is **discrete** with four possible actions:
+
+[Image: Pyramids Environment]
@@ -67,6 +111,8 @@ What will make the difference during this challenge are **the hyperparameters yo
 
+## MA-POCA
+https://arxiv.org/pdf/2111.05992.pdf
 
diff --git a/units/en/unit7/self-play.mdx b/units/en/unit7/self-play.mdx
index 307354e..f553432 100644
--- a/units/en/unit7/self-play.mdx
+++ b/units/en/unit7/self-play.mdx
@@ -104,16 +104,19 @@ Player B has a rating of 2300
 - We first calculate the expected score:
 
 \\(E_{A} = \frac{1}{1+10^{(2300-2600)/400}} = 0.849 \\)
+\\(E_{B} = \frac{1}{1+10^{(2600-2300)/400}} = 0.151 \\)
 
 - If the organizers determined that K=16 and A wins, the new rating would be:
 
 \\(ELO_A = 2600 + 16*(1-0.849) = 2602 \\)
+\\(ELO_B = 2300 + 16*(0-0.151) = 2298 \\)
 
 - If the organizers determined that K=16 and B wins, the new rating would be:
 
 \\(ELO_A = 2600 + 16*(0-0.849) = 2586 \\)
+\\(ELO_B = 2300 + 16*(1-0.151) = 2314 \\)
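+
+To make the arithmetic above easy to check, here is a small, self-contained Python sketch of the Elo update; the function names are our own, not from any particular library:
+
+```python
+def expected_score(rating_a, rating_b):
+    # Probability that player A beats player B under the Elo model
+    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
+
+def elo_update(rating, expected, score, k=16):
+    # score is 1 for a win, 0 for a loss (0.5 for a draw)
+    return rating + k * (score - expected)
+
+rating_a, rating_b = 2600, 2300
+e_a = expected_score(rating_a, rating_b)  # ~0.849
+e_b = expected_score(rating_b, rating_a)  # ~0.151
+
+# If A wins:
+print(elo_update(rating_a, e_a, 1), elo_update(rating_b, e_b, 0))  # ~2602, ~2298
+
+# If B wins:
+print(elo_update(rating_a, e_a, 0), elo_update(rating_b, e_b, 1))  # ~2586, ~2314
+```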