mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-02-03 02:14:53 +08:00
Add additional readings
@@ -1,3 +1,17 @@
# Additional Readings [[additional-readings]]

## An introduction to multi-agents

- [Multi-agent reinforcement learning: An overview](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf)
- [Multiagent Reinforcement Learning, Marc Lanctot](https://rlss.inria.fr/files/2019/07/RLSS_Multiagent.pdf)
- [Example of a multi-agent environment](https://www.mathworks.com/help/reinforcement-learning/ug/train-3-agents-for-area-coverage.html?s_eid=PSM_15028)
- [A list of different multi-agent environments](https://agents.inf.ed.ac.uk/blog/multiagent-learning-environments/)
- [Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents](https://bit.ly/3nVK7My)
- [Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning](https://bit.ly/3v7LxaT)

## Self-Play and MA-POCA

- [Self Play Theory and with MLAgents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
- [Training complex behavior with MLAgents](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors)
- [MLAgents plays dodgeball](https://blog.unity.com/technology/ml-agents-plays-dodgeball)
- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)

@@ -29,7 +29,7 @@ It's a matchmaking algorithm where your pushed models are ranked by playing aga

AI vs. AI is made of three tools:

- A *matchmaking process* that defines which model fights against which and runs the matches using a background task in the Space (see the sketch below this list).
- A *leaderboard* that gets the match history results and displays the models' ELO ratings: https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos
- A *Space demo* to visualize your agents playing against others: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos

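To make the matchmaking process more concrete, here is a rough, purely illustrative sketch of one matchmaking round; the `play_match` callback and the model pool are hypothetical, not the actual Space code:

```python
import random

def run_matchmaking_round(model_pool: list, play_match) -> tuple:
    """Pick two different pushed models from the pool, let the background
    task run one match between them, and report the result."""
    model_a, model_b = random.sample(model_pool, 2)
    winner = play_match(model_a, model_b)  # e.g. a SoccerTwos game run in the Space
    return model_a, model_b, winner
```

The match results are then what the leaderboard uses to update the ELO ratings.
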
@@ -54,10 +54,54 @@ What will make the difference during this challenge are **the hyperparameters yo

# Step 0: Install MLAgents and download the correct executable

You need to install a specific version of MLAgents.

## Step 1: Understand the environment

The environment is called `Pyramids`; it was made by the Unity MLAgents Team.

The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.png" alt="Pyramids Environment"/>

## The reward function

The reward function is:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward.png" alt="Pyramids Environment"/>

In terms of code, it looks like this:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward-code.png" alt="Pyramids Reward"/>

To train this new agent that seeks the button and then the Pyramid to destroy, we’ll use a combination of two types of rewards:

- The *extrinsic* one given by the environment (illustrated above).
- But also an *intrinsic* one called **curiosity**. This second reward will **push our agent to be curious, or in other words, to explore its environment better**.

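As a minimal sketch of how these two signals can be combined (an illustration only, not the actual ML-Agents implementation; the `curiosity_strength` coefficient and the prediction-error curiosity are assumptions):

```python
import numpy as np

def curiosity_reward(predicted_next_obs: np.ndarray, next_obs: np.ndarray) -> float:
    """Intrinsic reward: the prediction error of a learned forward model.
    The harder a transition is to predict, the more 'surprising' and rewarding it is."""
    return float(np.mean((predicted_next_obs - next_obs) ** 2))

def training_reward(extrinsic: float,
                    predicted_next_obs: np.ndarray,
                    next_obs: np.ndarray,
                    curiosity_strength: float = 0.02) -> float:
    """Reward the agent is trained on: the extrinsic reward from the environment
    plus a scaled intrinsic curiosity bonus."""
    return extrinsic + curiosity_strength * curiosity_reward(predicted_next_obs, next_obs)

# Even a step with zero extrinsic reward gives a learning signal
# when the outcome was hard to predict.
print(training_reward(0.0, np.zeros(4), np.array([0.3, -0.1, 0.2, 0.0])))
```
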

If you want to know more about curiosity, the next section (optional) will explain the basics.

## The observation space

In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids_raycasts.png"/>

We also use a **boolean variable indicating the switch state** (whether the switch was turned on to spawn the Pyramid) and a vector that **contains the agent’s speed**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-obs-code.png" alt="Pyramids obs code"/>


## The action space

The action space is **discrete** with four possible actions:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-action.png" alt="Pyramids Environment"/>
@@ -67,6 +111,8 @@ What will make the difference during this challenge are **the hyperparameters yo

## MA-POCA

- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)

@@ -104,16 +104,19 @@ Player B has a rating of 2300

- We first calculate the expected score:

\\(E_{A} = \frac{1}{1+10^{(2300-2600)/400}} = 0.849 \\)

\\(E_{B} = \frac{1}{1+10^{(2600-2300)/400}} = 0.151 \\)

- If the organizers determined that K=16 and A wins, the new ratings would be:

\\(ELO_A = 2600 + 16*(1-0.849) = 2602 \\)

\\(ELO_B = 2300 + 16*(0-0.151) = 2298 \\)

- If the organizers determined that K=16 and B wins, the new ratings would be:

\\(ELO_A = 2600 + 16*(0-0.849) = 2586 \\)

\\(ELO_B = 2300 + 16*(1-0.151) = 2314 \\)
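
To double-check the arithmetic above, here is a small sketch of the ELO update (function names are just for illustration):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating: float, expected: float, score: float, k: float = 16) -> float:
    """New rating after a game; score is 1 for a win, 0 for a loss."""
    return rating + k * (score - expected)

elo_a, elo_b = 2600, 2300
e_a = expected_score(elo_a, elo_b)  # ~0.849
e_b = expected_score(elo_b, elo_a)  # ~0.151

# A wins:
print(round(update(elo_a, e_a, 1)), round(update(elo_b, e_b, 0)))  # 2602 2298
# B wins:
print(round(update(elo_a, e_a, 0)), round(update(elo_b, e_b, 1)))  # 2586 2314
```
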