From 5098c07a27377f9a5ce6c69ba4674fbc48eabd19 Mon Sep 17 00:00:00 2001
From: simoninithomas
Date: Tue, 31 Jan 2023 17:59:08 +0100
Subject: [PATCH] Add MA-POCA section

---
 units/en/unit7/hands-on.mdx | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/units/en/unit7/hands-on.mdx b/units/en/unit7/hands-on.mdx
index 47ed9e0..23cc6a4 100644
--- a/units/en/unit7/hands-on.mdx
+++ b/units/en/unit7/hands-on.mdx
@@ -139,7 +139,31 @@ The action space is three discrete branches:
 
 ## Step 2: Understand MA-POCA
 
-[https://arxiv.org/pdf/2111.05992.pdf](https://arxiv.org/pdf/2111.05992.pdf)
+We know how to train agents to play against others: **we can use self-play.** This is a perfect technique for a 1vs1 game.
+
+But in our case we’re playing 2vs2, and each team has 2 agents. How then can we **train cooperative behavior for groups of agents?**
+
+As explained in the [Unity Blog](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors), agents typically receive a reward as a group (+1 - penalty) when the team scores a goal. This implies that **every agent on the team is rewarded even if each agent didn’t contribute equally to the win**, which makes it difficult for an agent to learn what to do independently.
+
+The solution was developed by the Unity ML-Agents team in a new multi-agent trainer called *MA-POCA (Multi-Agent POsthumous Credit Assignment)*.
+
+The idea is simple but powerful: a centralized critic **processes the states of all agents in the team to estimate how well each agent is doing**. Think of this critic as a coach.
+
+This allows each agent to **make decisions based only on what it perceives locally** and **simultaneously evaluate how good its behavior is in the context of the whole group**.
+
+
+
+MA POCA + +
This illustrates MA-POCA’s centralized learning and decentralized execution. Source: MLAgents Plays Dodgeball +
+ +
+
+
+The solution, then, is to use self-play with the MA-POCA trainer (called poca): the poca trainer helps us train cooperative behavior, while self-play provides an opponent team.
+
+If you want to dive deeper into the MA-POCA algorithm, read the paper they published [here](https://arxiv.org/pdf/2111.05992.pdf) and the sources listed in the Additional Readings section.
 
 ## Step 3: Define the config file
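To make the combination concrete, an ML-Agents trainer configuration that pairs the `poca` trainer with self-play might look roughly like the sketch below. This is illustrative only: the behavior name `SoccerTwos` and every hyperparameter value here are assumptions for the sake of the example, not necessarily the values this hands-on will use in its actual config file.

```yaml
behaviors:
  SoccerTwos:            # assumed behavior name; must match the environment's behavior
    trainer_type: poca   # use the MA-POCA multi-agent trainer
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 5000000
    self_play:                 # enables self-play on top of poca
      save_steps: 50000        # steps between policy snapshots
      team_change: 200000      # steps before switching the learning team
      swap_steps: 2000         # steps between opponent snapshot swaps
      window: 10               # number of past snapshots kept as opponents
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
```

The `self_play` block is what turns past snapshots of our own team into the opposing team, while `trainer_type: poca` gives us the centralized critic described above.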