mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-02-03 02:14:53 +08:00
Add additional readings
@@ -1,3 +1,17 @@
# Additional Readings [[additional-readings]]

## An introduction to multi-agents

- [Multi-agent reinforcement learning: An overview](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf)
- [Multiagent Reinforcement Learning, Marc Lanctot](https://rlss.inria.fr/files/2019/07/RLSS_Multiagent.pdf)
- [Example of a multi-agent environment](https://www.mathworks.com/help/reinforcement-learning/ug/train-3-agents-for-area-coverage.html?s_eid=PSM_15028)
- [A list of different multi-agent environments](https://agents.inf.ed.ac.uk/blog/multiagent-learning-environments/)
- [Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents](https://bit.ly/3nVK7My)
- [Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning](https://bit.ly/3v7LxaT)

## Self-Play and MA-POCA

- [Self Play Theory and with MLAgents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
- [Training complex behavior with MLAgents](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors)
- [MLAgents plays dodgeball](https://blog.unity.com/technology/ml-agents-plays-dodgeball)
- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)

@@ -29,7 +29,7 @@ It's a matchmaking algorithm where your pushed models are ranked by playing aga

AI vs. AI is made of three tools:

- A *matchmaking process* that defines which model fights against which and runs the matches using a background task in the Space (see the sketch below this list).
- A *leaderboard* that gets the match history results and displays the models' ELO ratings: https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos
- A *Space demo* to visualize your agents playing against others: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos

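To make the matchmaking process more concrete, here is a rough, purely illustrative sketch of one matchmaking round; the `play_match` callback and the model pool are hypothetical, not the actual Space code:

```python
import random

def run_matchmaking_round(model_pool: list, play_match) -> tuple:
    """Pick two different pushed models from the pool, let the background
    task run one match between them, and report the result."""
    model_a, model_b = random.sample(model_pool, 2)
    winner = play_match(model_a, model_b)  # e.g. a SoccerTwos game run in the Space
    return model_a, model_b, winner
```

The match results are then what the leaderboard uses to update the ELO ratings.
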
@@ -54,10 +54,54 @@ What will make the difference during this challenge are **the hyperparameters yo

# Step 0: Install MLAgents and download the correct executable

You need to install a specific version of MLAgents.

## Step 1: Understand the environment

The environment is called `Pyramids`; it was made by the Unity MLAgents Team.

The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.png" alt="Pyramids Environment"/>

## The reward function

The reward function is:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward.png" alt="Pyramids Environment"/>

In terms of code, it looks like this:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward-code.png" alt="Pyramids Reward"/>

To train this new agent that seeks the button and then the Pyramid to destroy, we’ll use a combination of two types of rewards:

- The *extrinsic* one given by the environment (illustrated above).
- But also an *intrinsic* one called **curiosity**. This second reward will **push our agent to be curious, or in other words, to explore its environment better**.

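As a minimal sketch of how these two signals can be combined (an illustration only, not the actual ML-Agents implementation; the `curiosity_strength` coefficient and the prediction-error curiosity are assumptions):

```python
import numpy as np

def curiosity_reward(predicted_next_obs: np.ndarray, next_obs: np.ndarray) -> float:
    """Intrinsic reward: the prediction error of a learned forward model.
    The harder a transition is to predict, the more 'surprising' and rewarding it is."""
    return float(np.mean((predicted_next_obs - next_obs) ** 2))

def training_reward(extrinsic: float,
                    predicted_next_obs: np.ndarray,
                    next_obs: np.ndarray,
                    curiosity_strength: float = 0.02) -> float:
    """Reward the agent is trained on: the extrinsic reward from the environment
    plus a scaled intrinsic curiosity bonus."""
    return extrinsic + curiosity_strength * curiosity_reward(predicted_next_obs, next_obs)

# Even a step with zero extrinsic reward gives a learning signal
# when the outcome was hard to predict.
print(training_reward(0.0, np.zeros(4), np.array([0.3, -0.1, 0.2, 0.0])))
```
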

If you want to know more about curiosity, the next section (optional) will explain the basics.

## The observation space

In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids_raycasts.png"/>

We also use a **boolean variable indicating the switch state** (whether the switch was turned on to spawn the Pyramid) and a vector that **contains the agent’s speed**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-obs-code.png" alt="Pyramids obs code"/>


## The action space

The action space is **discrete** with four possible actions:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-action.png" alt="Pyramids Environment"/>
@@ -67,6 +111,8 @@ What will make the difference during this challenge are **the hyperparameters yo

## MA-POCA

- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)

@@ -104,16 +104,19 @@ Player B has a rating of 2300

- We first calculate the expected score:

\\(E_{A} = \frac{1}{1+10^{(2300-2600)/400}} = 0.849 \\)

\\(E_{B} = \frac{1}{1+10^{(2600-2300)/400}} = 0.151 \\)

- If the organizers determined that K=16 and A wins, the new ratings would be:

\\(ELO_A = 2600 + 16*(1-0.849) = 2602 \\)

\\(ELO_B = 2300 + 16*(0-0.151) = 2298 \\)

- If the organizers determined that K=16 and B wins, the new ratings would be:

\\(ELO_A = 2600 + 16*(0-0.849) = 2586 \\)

\\(ELO_B = 2300 + 16*(1-0.151) = 2314 \\)
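
To double-check the arithmetic above, here is a small sketch of the ELO update (function names are just for illustration):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating: float, expected: float, score: float, k: float = 16) -> float:
    """New rating after a game; score is 1 for a win, 0 for a loss."""
    return rating + k * (score - expected)

elo_a, elo_b = 2600, 2300
e_a = expected_score(elo_a, elo_b)  # ~0.849
e_b = expected_score(elo_b, elo_a)  # ~0.151

# A wins:
print(round(update(elo_a, e_a, 1)), round(update(elo_b, e_b, 0)))  # 2602 2298
# B wins:
print(round(update(elo_a, e_a, 0)), round(update(elo_b, e_b, 1)))  # 2586 2314
```
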