Add additional readings

This commit is contained in:
simoninithomas
2023-01-31 09:35:20 +01:00
parent f22b8160f3
commit 85f5ba0f59
3 changed files with 67 additions and 4 deletions

View File

@@ -1,3 +1,17 @@
# Additional Readings [[additional-readings]]
## Self-Play
## An introduction to multi-agent reinforcement learning
- [Multi-agent reinforcement learning: An overview](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf)
- [Multiagent Reinforcement Learning, Marc Lanctot](https://rlss.inria.fr/files/2019/07/RLSS_Multiagent.pdf)
- [Example of a multi-agent environment](https://www.mathworks.com/help/reinforcement-learning/ug/train-3-agents-for-area-coverage.html?s_eid=PSM_15028)
- [A list of different multi-agent environments](https://agents.inf.ed.ac.uk/blog/multiagent-learning-environments/)
- [Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents](https://bit.ly/3nVK7My)
- [Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning](https://bit.ly/3v7LxaT)
## Self-Play and MA-POCA
- [Self-Play theory with MLAgents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
- [Training complex behavior with MLAgents](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors)
- [MLAgents plays dodgeball](https://blog.unity.com/technology/ml-agents-plays-dodgeball)
- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)

View File

@@ -29,7 +29,7 @@ It's a matchmaking algorithm where your pushed models are ranked by playing aga
AI vs. AI consists of three tools:
- A *matchmaking process* defining which model against which model and running the model fights using a background task in the Space.
- A *leaderboard* getting the match history results and displaying the models ELO ratings: [ADD LEADERBOARD]
- A *leaderboard* getting the match history results and displaying the models' ELO ratings: https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos
- A *Space demo* to visualize your agents playing against others: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos
@@ -54,10 +54,54 @@ What will make the difference during this challenge are **the hyperparameters yo
# Step 0: Install MLAgents and download the correct executable
You need to install a specific version of MLAgents.
# Step 1: Understand the environment
## Step 1: Understand the environment
The environment is called `Pyramids`. It was made by the Unity MLAgents Team.
The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.png" alt="Pyramids Environment"/>
## The reward function
The reward function is:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward.png" alt="Pyramids Environment"/>
In terms of code, it looks like this:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward-code.png" alt="Pyramids Reward"/>
To train this new agent that seeks the button and then the Pyramid to destroy, we'll use a combination of two types of rewards:
- The *extrinsic one* given by the environment (illustration above).
- But also an *intrinsic* one called **curiosity**. This second one will **push our agent to be curious, or in other terms, to better explore its environment**.
If you want to know more about curiosity, the next section (optional) will explain the basics.
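To make the combination concrete, here is a minimal sketch (not ML-Agents' actual implementation) of how the two rewards can be added together: the intrinsic bonus is the prediction error of a learned forward model, scaled by a coefficient `beta`. The function names and the value of `beta` are assumptions for illustration only.

```python
import numpy as np

def curiosity_bonus(predicted_next_state: np.ndarray, next_state: np.ndarray) -> float:
    """Intrinsic reward: how 'surprised' the forward model is (its prediction error)."""
    return float(np.mean((predicted_next_state - next_state) ** 2))

def total_reward(extrinsic: float, predicted_next_state: np.ndarray,
                 next_state: np.ndarray, beta: float = 0.02) -> float:
    """Environment reward plus a scaled curiosity bonus.

    `beta` is a hypothetical scaling coefficient, not the value used in the course.
    """
    return extrinsic + beta * curiosity_bonus(predicted_next_state, next_state)

# Toy usage: no extrinsic reward, but the forward model is surprised by the new state.
print(total_reward(0.0, np.zeros(4), np.ones(4)))  # 0.02
```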
## The observation space
In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids_raycasts.png"/>
We also use a **boolean variable indicating the switch state** (did we turn the switch on or off to spawn the Pyramid) and a vector that **contains the agent's speed**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-obs-code.png" alt="Pyramids obs code"/>
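As a rough illustration (not the actual ML-Agents sensor code), the observation can be thought of as the concatenation of the raycast results, the switch state, and the agent's speed. The shapes and names below are assumptions.

```python
import numpy as np

# Hypothetical shapes for illustration: 148 raycast values, 1 switch flag, 3D velocity.
raycast_hits = np.zeros(148, dtype=np.float32)    # what each ray detected
switch_is_on = np.array([1.0], dtype=np.float32)  # 1.0 if the button was pressed, else 0.0
agent_speed = np.zeros(3, dtype=np.float32)       # the agent's velocity vector

observation = np.concatenate([raycast_hits, switch_is_on, agent_speed])
print(observation.shape)  # (152,)
```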
## The action space
The action space is **discrete** with four possible actions:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-action.png" alt="Pyramids Environment"/>
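For reference, a discrete space with four actions can be written with a Gym-style `Discrete(4)`. The action meanings in the comment are purely illustrative assumptions; the real mapping is the one shown in the image above.

```python
from gymnasium.spaces import Discrete

action_space = Discrete(4)
# Hypothetical mapping (illustration only, not taken from the environment):
# 0 = move forward, 1 = move backward, 2 = rotate left, 3 = rotate right
print(action_space.n)         # 4
print(action_space.sample())  # a random action in {0, 1, 2, 3}
```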
@@ -67,6 +111,8 @@ What will make the difference during this challenge are **the hyperparameters yo
## MA-POCA
https://arxiv.org/pdf/2111.05992.pdf

View File

@@ -104,16 +104,19 @@ Player B has a rating of 2300
- We first calculate the expected score:
\\(E_{A} = \frac{1}{1+10^{(2300-2600)/400}} = 0.849 \\)
\\(E_{B} = \frac{1}{1+10^{(2600-2300)/400}} = 0.151 \\)
- If the organizers determined that K=16 and A wins, the new rating would be:
\\(ELO_A = 2600 + 16*(1-0.849) = 2602 \\)
\\(ELO_B = 2300 + 16*(0-0.151) = 2298 \\)
- If the organizers determined that K=16 and B wins, the new rating would be:
\\(ELO_A = 2600 + 16*(0-0.849) = 2586 \\)
\\(ELO_B = 2300 + 16 *(1-0.151) = 2314 \\)
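To double-check these numbers, here is a small Python sketch of the two formulas above (expected score and rating update); the function names are ours, not from any library.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating: float, expected: float, score: float, k: float = 16) -> float:
    """New rating after one game (score: 1 = win, 0 = loss, 0.5 = draw)."""
    return rating + k * (score - expected)

elo_a, elo_b, k = 2600, 2300, 16
e_a = expected_score(elo_a, elo_b)  # ~0.849
e_b = expected_score(elo_b, elo_a)  # ~0.151

# If A wins:
print(round(update(elo_a, e_a, 1, k)), round(update(elo_b, e_b, 0, k)))  # 2602 2298
# If B wins:
print(round(update(elo_a, e_a, 0, k)), round(update(elo_b, e_b, 1, k)))  # 2586 2314
```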