Merge branch 'main' into ThomasSimonini/CertificationAndNext

This commit is contained in:
Thomas Simonini
2023-02-04 23:36:40 +01:00
committed by GitHub
53 changed files with 8508 additions and 1594 deletions

View File

@@ -13,7 +13,7 @@ This repository contains the Deep Reinforcement Learning Course mdx files and no
<br>
<br>
# The documentation below is for v1.0 (depreciated)
# The documentation below is for v1.0 (deprecated)
We're launching a **new version (v2.0) of the course starting December the 5th,**
@@ -26,7 +26,7 @@ The syllabus 📚: https://simoninithomas.github.io/deep-rl-course
<br>
<br>
# The documentation below is for v1.0 (depreciated)
# The documentation below is for v1.0 (deprecated)
In this free course, you will:

View File

@@ -509,7 +509,7 @@
"\n",
"This step is the simplest:\n",
"\n",
"- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy\n",
"- Open the game Huggy in your browser: https://singularite.itch.io/huggy\n",
"\n",
"- Click on Play with my Huggy model\n",
"\n",
@@ -569,4 +569,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}

View File

@@ -230,15 +230,6 @@
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"TODO CHANGE LINK OF THE REQUIREMENTS"
],
"metadata": {
"id": "32e3NPYgH5ET"
}
},
{
"cell_type": "code",
"execution_count": null,
@@ -1155,4 +1146,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}

View File

@@ -127,7 +127,7 @@
"source": [
"# Let's train a Deep Q-Learning agent playing Atari' Space Invaders 👾 and upload it to the Hub.\n",
"\n",
"To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.\n",
"To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 200**.\n",
"\n",
"To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**\n",
"\n",
@@ -799,4 +799,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}

View File

@@ -0,0 +1,6 @@
gym
git+https://github.com/ntasfi/PyGame-Learning-Environment.git
git+https://github.com/qlan3/gym-games.git
huggingface_hub
imageio-ffmpeg
pyyaml==6.0

notebooks/unit4/unit4.ipynb (new file, 1614 lines)

File diff suppressed because it is too large

notebooks/unit5/unit5.ipynb (new file, 844 lines)
View File

@@ -0,0 +1,844 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/ThomasSimonini%2FMLAgents/notebooks/unit5/unit5.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2D3NL_e4crQv"
},
"source": [
"# Unit 5: An Introduction to ML-Agents\n",
"\n"
]
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png\" alt=\"Thumbnail\"/>\n",
"\n",
"In this notebook, you'll learn about ML-Agents and train two agents.\n",
"\n",
"- The first one will learn to **shoot snowballs onto spawning targets**.\n",
"- The second need to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, **and move to the gold brick at the top**. To do that, it will need to explore its environment, and we will use a technique called curiosity.\n",
"\n",
"After that, you'll be able **to watch your agents playing directly on your browser**.\n",
"\n",
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
],
"metadata": {
"id": "97ZiytXEgqIz"
}
},
{
"cell_type": "markdown",
"source": [
"⬇️ Here is an example of what **you will achieve at the end of this unit.** ⬇️\n"
],
"metadata": {
"id": "FMYrDriDujzX"
}
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.gif\" alt=\"Pyramids\"/>\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget.gif\" alt=\"SnowballTarget\"/>"
],
"metadata": {
"id": "cBmFlh8suma-"
}
},
{
"cell_type": "markdown",
"source": [
"### 🎮 Environments: \n",
"\n",
"- [Pyramids](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md#pyramids)\n",
"- SnowballTarget\n",
"\n",
"### 📚 RL-Library: \n",
"\n",
"- [ML-Agents (HuggingFace Experimental Version)](https://github.com/huggingface/ml-agents)\n",
"\n",
"⚠ We're going to use an experimental version of ML-Agents were you can push to hub and load from hub Unity ML-Agents Models **you need to install the same version**"
],
"metadata": {
"id": "A-cYE0K5iL-w"
}
},
{
"cell_type": "markdown",
"source": [
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
],
"metadata": {
"id": "qEhtaFh9i31S"
}
},
{
"cell_type": "markdown",
"source": [
"## Objectives of this notebook 🏆\n",
"\n",
"At the end of the notebook, you will:\n",
"\n",
"- Understand how works **ML-Agents**, the environment library.\n",
"- Be able to **train agents in Unity Environments**.\n"
],
"metadata": {
"id": "j7f63r3Yi5vE"
}
},
{
"cell_type": "markdown",
"source": [
"## This notebook is from the Deep Reinforcement Learning Course\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>"
],
"metadata": {
"id": "viNzVbVaYvY3"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "6p5HnEefISCB"
},
"source": [
"In this free course, you will:\n",
"\n",
"- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
"- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
"- 🤖 Train **agents in unique environments** \n",
"\n",
"And more check 📚 the syllabus 👉 https://huggingface.co/deep-rl-course/communication/publishing-schedule\n",
"\n",
"Dont forget to **<a href=\"http://eepurl.com/ic5ZUD\">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**\n",
"\n",
"\n",
"The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y-mo_6rXIjRi"
},
"source": [
"## Prerequisites 🏗️\n",
"Before diving into the notebook, you need to:\n",
"\n",
"🔲 📚 **Study [what is ML-Agents and how it works by reading Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction)** 🤗 "
]
},
{
"cell_type": "markdown",
"source": [
"# Let's train our agents 🚀\n",
"\n",
"The ML-Agents integration on the Hub is **still experimental**, some features will be added in the future. \n",
"\n",
"But for now, **to validate this hands-on for the certification process, you just need to push your trained models to the Hub**. Theres no results to attain to validate this one. But if you want to get nice results you can try to attain:\n",
"\n",
"- For `Pyramids` : Mean Reward = 1.75\n",
"- For `SnowballTarget` : Mean Reward = 15 or 30 targets hit in an episode.\n"
],
"metadata": {
"id": "xYO1uD5Ujgdh"
}
},
{
"cell_type": "markdown",
"source": [
"## Set the GPU 💪\n",
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
],
"metadata": {
"id": "DssdIjk_8vZE"
}
},
{
"cell_type": "markdown",
"source": [
"- `Hardware Accelerator > GPU`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
],
"metadata": {
"id": "sTfCXHy68xBv"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "an3ByrXYQ4iK"
},
"source": [
"## Clone the repository and install the dependencies 🔽\n",
"- We need to clone the repository, that **contains the experimental version of the library that allows you to push your trained agent to the Hub.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6WNoL04M7rTa"
},
"outputs": [],
"source": [
"%%capture\n",
"# Clone the repository\n",
"!git clone --depth 1 https://github.com/huggingface/ml-agents/ "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d8wmVcMk7xKo"
},
"outputs": [],
"source": [
"%%capture\n",
"# Go inside the repository and install the package\n",
"%cd ml-agents\n",
"!pip3 install -e ./ml-agents-envs\n",
"!pip3 install -e ./ml-agents"
]
},
{
"cell_type": "markdown",
"source": [
"## SnowballTarget ⛄\n",
"\n",
"If you need a refresher on how this environments work check this section 👉\n",
"https://huggingface.co/deep-rl-course/unit5/snowball-target"
],
"metadata": {
"id": "R5_7Ptd_kEcG"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "HRY5ufKUKfhI"
},
"source": [
"### Download and move the environment zip file in `./training-envs-executables/linux/`\n",
"- Our environment executable is in a zip file.\n",
"- We need to download it and place it to `./training-envs-executables/linux/`\n",
"- We use a linux executable because we use colab, and colab machines OS is Ubuntu (linux)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "C9Ls6_6eOKiA"
},
"outputs": [],
"source": [
"# Here, we create training-envs-executables and linux\n",
"!mkdir ./training-envs-executables\n",
"!mkdir ./training-envs-executables/linux"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jsoZGxr1MIXY"
},
"source": [
"Download the file SnowballTarget.zip from https://drive.google.com/file/d/1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5 using `wget`. \n",
"\n",
"Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QU6gi8CmWhnA"
},
"outputs": [],
"source": [
"!wget --load-cookies /tmp/cookies.txt \"https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id=1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5\" -O ./training-envs-executables/linux/SnowballTarget.zip && rm -rf /tmp/cookies.txt"
]
},
{
"cell_type": "markdown",
"source": [
"We unzip the executable.zip file"
],
"metadata": {
"id": "_LLVaEEK3ayi"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8FPx0an9IAwO"
},
"outputs": [],
"source": [
"%%capture\n",
"!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/SnowballTarget.zip"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nyumV5XfPKzu"
},
"source": [
"Make sure your file is accessible "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EdFsLJ11JvQf"
},
"outputs": [],
"source": [
"!chmod -R 755 ./training-envs-executables/linux/SnowballTarget"
]
},
{
"cell_type": "markdown",
"source": [
"### Define the SnowballTarget config file\n",
"- In ML-Agents, you define the **training hyperparameters into config.yaml files.**\n",
"\n",
"There are multiple hyperparameters. To know them better, you should check for each explanation with [the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)\n",
"\n",
"\n",
"So you need to create a `SnowballTarget.yaml` config file in ./content/ml-agents/config/ppo/\n",
"\n",
"We'll give you here a first version of this config (to copy and paste into your `SnowballTarget.yaml file`), **but you should modify it**.\n",
"\n",
"```\n",
"behaviors:\n",
" SnowballTarget:\n",
" trainer_type: ppo\n",
" summary_freq: 10000\n",
" keep_checkpoints: 10\n",
" checkpoint_interval: 50000\n",
" max_steps: 200000\n",
" time_horizon: 64\n",
" threaded: true\n",
" hyperparameters:\n",
" learning_rate: 0.0003\n",
" learning_rate_schedule: linear\n",
" batch_size: 128\n",
" buffer_size: 2048\n",
" beta: 0.005\n",
" epsilon: 0.2\n",
" lambd: 0.95\n",
" num_epoch: 3\n",
" network_settings:\n",
" normalize: false\n",
" hidden_units: 256\n",
" num_layers: 2\n",
" vis_encode_type: simple\n",
" reward_signals:\n",
" extrinsic:\n",
" gamma: 0.99\n",
" strength: 1.0\n",
"```"
],
"metadata": {
"id": "NAuEq32Mwvtz"
}
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config1.png\" alt=\"Config SnowballTarget\"/>\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config2.png\" alt=\"Config SnowballTarget\"/>"
],
"metadata": {
"id": "4U3sRH4N4h_l"
}
},
{
"cell_type": "markdown",
"source": [
"As an experimentation, you should also try to modify some other hyperparameters. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).\n",
"\n",
"Now that you've created the config file and understand what most hyperparameters do, we're ready to train our agent 🔥."
],
"metadata": {
"id": "JJJdo_5AyoGo"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "f9fI555bO12v"
},
"source": [
"### Train the agent\n",
"\n",
"To train our agent, we just need to **launch mlagents-learn and select the executable containing the environment.**\n",
"\n",
"We define four parameters:\n",
"\n",
"1. `mlagents-learn <config>`: the path where the hyperparameter config file is.\n",
"2. `--env`: where the environment executable is.\n",
"3. `--run_id`: the name you want to give to your training run id.\n",
"4. `--no-graphics`: to not launch the visualization during the training.\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentslearn.png\" alt=\"MlAgents learn\"/>\n",
"\n",
"Train the model and use the `--resume` flag to continue training in case of interruption. \n",
"\n",
"> It will fail first time if and when you use `--resume`, try running the block again to bypass the error. \n",
"\n"
]
},
{
"cell_type": "markdown",
"source": [
"The training will take 10 to 35min depending on your config, go take a ☕you deserve it 🤗."
],
"metadata": {
"id": "lN32oWF8zPjs"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bS-Yh1UdHfzy"
},
"outputs": [],
"source": [
"!mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id=\"SnowballTarget1\" --no-graphics"
]
},
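{
"cell_type": "markdown",
"source": [
"If your Colab runtime disconnects mid-training, you can pick up from the last checkpoint by re-running the same command with the `--resume` flag mentioned above (a sketch; keep the same config, env and run-id):\n",
"\n",
"```\n",
"!mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id=\"SnowballTarget1\" --resume --no-graphics\n",
"```"
],
"metadata": {}
},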
{
"cell_type": "markdown",
"metadata": {
"id": "5Vue94AzPy1t"
},
"source": [
"### Push the agent to the 🤗 Hub\n",
"\n",
"- Now that we trained our agent, were **ready to push it to the Hub to be able to visualize it playing on your browser🔥.**"
]
},
{
"cell_type": "markdown",
"source": [
"To be able to share your model with the community there are three more steps to follow:\n",
"\n",
"1⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join\n",
"\n",
"2⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.\n",
"- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
"\n",
"- Copy the token \n",
"- Run the cell below and paste the token"
],
"metadata": {
"id": "izT6FpgNzZ6R"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "rKt2vsYoK56o"
},
"outputs": [],
"source": [
"from huggingface_hub import notebook_login\n",
"notebook_login()"
]
},
{
"cell_type": "markdown",
"source": [
"If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
],
"metadata": {
"id": "aSU9qD9_6dem"
}
},
{
"cell_type": "markdown",
"source": [
"Then, we simply need to run `mlagents-push-to-hf`.\n",
"\n",
"And we define 4 parameters:\n",
"\n",
"1. `--run-id`: the name of the training run id.\n",
"2. `--local-dir`: where the agent was saved, its results/<run_id name>, so in my case results/First Training.\n",
"3. `--repo-id`: the name of the Hugging Face repo you want to create or update. Its always <your huggingface username>/<the repo name>\n",
"If the repo does not exist **it will be created automatically**\n",
"4. `--commit-message`: since HF repos are git repository you need to define a commit message.\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentspushtohub.png\" alt=\"Push to Hub\"/>\n",
"\n",
"For instance:\n",
"\n",
"`!mlagents-push-to-hf --run-id=\"SnowballTarget1\" --local-dir=\"./results/SnowballTarget1\" --repo-id=\"ThomasSimonini/ppo-SnowballTarget\" --commit-message=\"First Push\"`"
],
"metadata": {
"id": "KK4fPfnczunT"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dGEFAIboLVc6"
},
"outputs": [],
"source": [
"!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message"
]
},
{
"cell_type": "markdown",
"source": [
"Else, if everything worked you should have this at the end of the process(but with a different url 😆) :\n",
"\n",
"\n",
"\n",
"```\n",
"Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-SnowballTarget\n",
"```\n",
"\n",
"Its the link to your model, it contains a model card that explains how to use it, your Tensorboard and your config file. **Whats awesome is that its a git repository, that means you can have different commits, update your repository with a new push etc.**"
],
"metadata": {
"id": "yborB0850FTM"
}
},
{
"cell_type": "markdown",
"source": [
"But now comes the best: **being able to visualize your agent online 👀.**"
],
"metadata": {
"id": "5Uaon2cg0NrL"
}
},
{
"cell_type": "markdown",
"source": [
"### Watch your agent playing 👀\n",
"\n",
"For this step its simple:\n",
"\n",
"1. Remember your repo-id\n",
"\n",
"2. Go here: https://singularite.itch.io/snowballtarget\n",
"\n",
"3. Launch the game and put it in full screen by clicking on the bottom right button\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_load.png\" alt=\"Snowballtarget load\"/>"
],
"metadata": {
"id": "VMc4oOsE0QiZ"
}
},
{
"cell_type": "markdown",
"source": [
"1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-SnowballTarget).\n",
"\n",
"2. In step 2, **choose what model you want to replay**:\n",
" - I have multiple one, since we saved a model every 500000 timesteps. \n",
" - But if I want the more recent I choose `SnowballTarget.onnx`\n",
"\n",
"👉 Whats nice **is to try with different models step to see the improvement of the agent.**\n",
"\n",
"And don't hesitate to share the best score your agent gets on discord in #rl-i-made-this channel 🔥\n",
"\n",
"Let's now try a harder environment called Pyramids..."
],
"metadata": {
"id": "Djs8c5rR0Z8a"
}
},
{
"cell_type": "markdown",
"source": [
"## Pyramids 🏆\n",
"\n",
"### Download and move the environment zip file in `./training-envs-executables/linux/`\n",
"- Our environment executable is in a zip file.\n",
"- We need to download it and place it to `./training-envs-executables/linux/`\n",
"- We use a linux executable because we use colab, and colab machines OS is Ubuntu (linux)"
],
"metadata": {
"id": "rVMwRi4y_tmx"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "NyqYYkLyAVMK"
},
"source": [
"Download the file Pyramids.zip from https://drive.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AxojCsSVAVMP"
},
"outputs": [],
"source": [
"!wget --load-cookies /tmp/cookies.txt \"https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H\" -O ./training-envs-executables/linux/Pyramids.zip && rm -rf /tmp/cookies.txt"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bfs6CTJ1AVMP"
},
"source": [
"**OR** Download directly to local machine and then drag and drop the file from local machine to `./training-envs-executables/linux`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H7JmgOwcSSmF"
},
"source": [
"Wait for the upload to finish and then run the command below. \n",
"\n",
"![image.png](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASYAAAAfCAYAAABKxmALAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAmZSURBVHhe7d0NTNTnHQfwL+rxcigHFHUH9LCCVuaKCWgB4029Gq7NcBuLPdPKEhh6iRhEmauLY2V0ptEUSZGwTJTA5jSBOnGDmpyxEMUqxEH0rOIQfDmtt/oCnsof9BT2PPAcx8nbeTQi9vcxF56X+///h8n98vs9/xfcYmJiukEIIS+RCeInIYS8NChjIoS4LCwsDMHBwfDx8cEc76u4dNcT5y63orm5WbzDNRSYCCHPhQejuLg4xMbGwtvbW4wCymNLRAuQFAtwpW0Sio+34+uzN8So8ygwEUKc4uXlheTkZCxbtkyMOOofmGy63P1R37EAm4u+QUdHhxgdGa0xEUJGxLOknJycQYPSvXv3cPnyZTx0n4Fut4litNeEx61YMNGAQxun9uzDWZQxEUKGxQNKVlaWQ9nGg9Hhw4dx8uRJ3Lx5U4wCQYouvD3bB+qfTEWsT70Ytfv1l287tf5EgYkQMiRevvFMSalUihHAYDCgpKQEjx49EiODWzwL+MN73Zji1iZGgO/8l+ODbQ0jlnXjtJQLheZ9HTQzRXcAHbKK8pEeK7rfi5GOScirh68p9Q9K+/fvx65du0YMStyxS0DcTjecuh8lRoDprRXY8pvFoje0IQJTNNLzilBU5PjK35qO+Dly8Z6xFAF1nBZaTbjovwhjcUxCxg4v4fqvKfFM6cCBA6LnvIziK7jdFSh6wBKfE4iaGyJ6gxs2Y5LOFiIlJYW90pC5rQT1CEfCulRoxzw2lSN7TQo272kU/RdhLI5JyNjhlwTY8DUlXr65avMBq2gBkyQTUjUeoje4IdaYeMakR1hLIdJ21okxJigJ2z+JgeVQBpoj86GVVSEzcx/M/ecqMvCVajv0Ac1okEcg0s8EQ0o2KiNX4XeJaqgUMqDLCkuLAQXbytHCt41NR/7qqWhmhwpboISchUvJXIeyf9yGWh+PUAV7T5cFLV8W4NNDfAteqmmBIynILuU7CEXC79chfhZ7Y5cE87lrkM0Lwc09acg7xablaui36BCt7I2okrkB5TsKUGUvfXutzEJRnEp07Ew9x3E8ZvT6/J7fsW5CWO9+2XFNJ4rx2d8aIIntCBnP9u7d27fgzUs4V7Kl/nKT3+hbEO+Y8hY0W7/taQ/GxTUmCZUXTYAyDGqRPckXhSDAakLjEfG1DAqHsqkMBTuKYUA8Nug1UJjKkJmWgrQdNbDMiEdyor12ZTtDiKIWxR9nILOYfbmV0Uj6rRqPq/OQsSkbZedZ+Hl3JdvTQBF6PeJnSKgpzkTGx8Wo9QhEgJjj4tevQrS8Gfsy05DGAmmzPBK61AQMSPxKs0WGyF+ZMFxlUd5iRFWFmH9WUAh86oqRuSkTJSctUP40GalLxRwh4xgv4/qfheNn30ar5pvbogV4Si3DXj7gfGDyi0D8h1Es+JjR3MBCU4URLVYVwpfzr7cc2jAlrNcbYbClC7dqUVBoQMMFEyzyepTu/BSfFVbBzOali/vQyNIs5Rtq8WbOjPqCSjSYLTCzzMPIg6m5Hn9hx7G0sayr7hokmRKhAxa0o7F0bgAsZ8tQcsIMC8uGKgvqe7M4Qe4hg9V8EbXs4JK5CmUHK1Fz8Q54IjYU5ft6aFiwMx7cjZqhUqBv+edrgLnNzIJiGc61yRE2n2VVhIxz/DYTG17G9b8kwFXHL9wXLVaqPZUQPWuy6A00bGCSz9PbF79z0pGgsqCh9HOU8aAhGdB43cqSJi0LSxqEv85KnvMGexnTyUoq0WTRAHemxEC/9a99+9Pyisnh6FZY+zaWYH3Kfjy1OlEWhcB3shV3bhhFn5HYvkSTq2lohHW2Dvm5W7GFZU8Rlirs+2eNQ/ByEKSD/h0VpLPl2H1imE/g8PmMMN6QIJMPF+4IGR/4vW82ra2tojU6d9snoHPSNNFjFZB/p2gNNGxgsi9+i1daJgqqLbZZGM6zcu71cGji5kCFfmXcs9gXfUMyL+XKkb2pd18GtumLYq7IQdpHeSg7ZcLjaVHQbdiO7aujxeyzQqFbrYFKMqJ8T40TgdFOLpOJFiFkMN39Qk433ERrIOdLuUFIh4wsHKmgficE6F/GPUulRIDMhNrPDTD1LDjLIXO8cn0UruHeQxkCgiNEn5HLYA8REdB+uAraYCMMXxQiJzMDef+REPBWDCsCBwpdmQiNSkLDF3lDl3CDUkLpz0pGyRa4CRm/7t+3l13+/v6iNTp+8i54Pfmf6AHNrZ6iNdCoAhNQCeN1ICBADnNzvzLuWZYONqdE1Go1VDNYIEveAnWQmBu1OlSfvwPFPB2SFimhUEZCtzGGHc0uZJ4GCSuTEO3HOn4RiORn0To7wP/r5ZEJSErUsDyJmalD4lJWwjWUooCfzRuJKgablkdAwf5FJOoRM01C4wmDmCRk/Lpxw/5EAF9fXwQG2q9DctWicPsyR/ckb5y+9ED0BhplYGKhiZdzvIyrGCa9uFCIwgoTZNFJyPpjFnSht3GOn/HylDsEEFcZCwtReVXOAt5W5H6SjIgH13BHzPG1n8JdZWhEFPQ5vWtlUR7NqNxVyMZY0IqMgXqxmuVVzIJwqFiqpYjst7bGXvnrhyj7bt2ENXoNcotykb44AOavirHbmYBGyEuO38/W3t4uesDChQtFy3VLIuznyjvks4e9Z27U98r1XM/jV4O07LLnWo8Z72y/dwr7vQl5FaWmpvZd+c3PzK1du9apW1EGMz/EDfm/vCt6LFfpikFKfpPoDeRyxiRXhiJ8/iq8+2MZWs5W/qCCEiE/BEeOHBGt3nIuKSlJ9J7fn39hX+jukilQMsICrsuBac6v1mHTWg0CrhpQeojCEiGvGl5qHT16VPQArVaLFStWiJ7z8lOC4etmX1w50zkfNWeGf6olPfaEEDKk0Tz2JCRgEnI/8EbghCtiBHjo9SZ+nvdwxMeeTAwODv6TaBNCiIMnT56gqampZ/Hb3d29Z8z21AGZTNaz9vTggf3s2mSPbqjn+kH/Xgg2xH4HRbf98oAnMn+sORiIW7duiZGhUcZECBkRD0YbN250yJxseHDiV4er7v8LP5K+hsxqL9tszK+twEd/Nzn911MoYyKEjIgHnurqaigUCsyc6fi0RE9PT/j5+UHx6L/wanW82bfL/TUcf/wzrN95yqlMyYYCEyHEKbysO336NOrr6+Hm5obp06f3lXech+UM3O+dwVOPqZB8ImFsfxMFx4A9/zb2bPs8qJQjhLiMl3i2P3ip9j2DpjZv1F7qxLmL9gVvV1BgIoS8dEZ9SwohhHzfKDARQl4ywP8B/eN9dc0U7ocAAAAASUVORK5CYII=)"
]
},
{
"cell_type": "markdown",
"source": [
"Unzip it"
],
"metadata": {
"id": "iWUUcs0_794U"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "i2E3K4V2AVMP"
},
"outputs": [],
"source": [
"%%capture\n",
"!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/Pyramids.zip"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KmKYBgHTAVMP"
},
"source": [
"Make sure your file is accessible "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Im-nwvLPAVMP"
},
"outputs": [],
"source": [
"!chmod -R 755 ./training-envs-executables/linux/Pyramids/Pyramids"
]
},
{
"cell_type": "markdown",
"source": [
"### Modify the PyramidsRND config file\n",
"- Contrary to the first environment which was a custom one, **Pyramids was made by the Unity team**.\n",
"- So the PyramidsRND config file already exists and is in ./content/ml-agents/config/ppo/PyramidsRND.yaml\n",
"- You might asked why \"RND\" in PyramidsRND. RND stands for *random network distillation* it's a way to generate curiosity rewards. If you want to know more on that we wrote an article explaning this technique: https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938\n",
"\n",
"For this training, well modify one thing:\n",
"- The total training steps hyperparameter is too high since we can hit the benchmark (mean reward = 1.75) in only 1M training steps.\n",
"👉 To do that, we go to config/ppo/PyramidsRND.yaml,**and modify these to max_steps to 1000000.**\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-config.png\" alt=\"Pyramids config\"/>"
],
"metadata": {
"id": "fqceIATXAgih"
}
},
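{
"cell_type": "markdown",
"source": [
"To build some intuition about what RND computes, here is a minimal, self-contained sketch of the idea (an illustration, not ML-Agents' actual implementation): a fixed, randomly initialized *target* network and a trained *predictor* network, where the predictor's error on a state is used as the curiosity bonus. The network sizes, `obs_dim` and the `curiosity_bonus` helper are illustrative assumptions.\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"obs_dim = 172  # illustrative observation size\n",
"\n",
"# Fixed, randomly initialized target network (never trained)\n",
"target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 32))\n",
"for p in target.parameters():\n",
"    p.requires_grad = False\n",
"\n",
"# Predictor network, trained to imitate the target's outputs\n",
"predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 32))\n",
"optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)\n",
"\n",
"def curiosity_bonus(obs):\n",
"    # High prediction error = rarely visited state = high intrinsic reward\n",
"    error = (predictor(obs) - target(obs)).pow(2).mean()\n",
"    optimizer.zero_grad()\n",
"    error.backward()  # training the predictor shrinks the bonus for familiar states\n",
"    optimizer.step()\n",
"    return error.item()\n",
"\n",
"print(curiosity_bonus(torch.randn(obs_dim)))  # a novel state gets a relatively high bonus\n",
"```"
],
"metadata": {}
},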
{
"cell_type": "markdown",
"source": [
"As an experimentation, you should also try to modify some other hyperparameters, Unity provides a very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).\n",
"\n",
"Were now ready to train our agent 🔥."
],
"metadata": {
"id": "RI-5aPL7BWVk"
}
},
{
"cell_type": "markdown",
"source": [
"### Train the agent\n",
"\n",
"The training will take 30 to 45min depending on your machine, go take a ☕you deserve it 🤗."
],
"metadata": {
"id": "s5hr1rvIBdZH"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fXi4-IaHBhqD"
},
"outputs": [],
"source": [
"!mlagents-learn ./config/ppo/PyramidsRND.yaml --env=./training-envs-executables/linux/Pyramids/Pyramids --run-id=\"Pyramids Training\" --no-graphics"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "txonKxuSByut"
},
"source": [
"### Push the agent to the 🤗 Hub\n",
"\n",
"- Now that we trained our agent, were **ready to push it to the Hub to be able to visualize it playing on your browser🔥.**"
]
},
{
"cell_type": "code",
"source": [
"!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message"
],
"metadata": {
"id": "yiEQbv7rB4mU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Watch your agent playing 👀\n",
"\n",
"The temporary link for Pyramids demo is: https://singularite.itch.io/pyramids"
],
"metadata": {
"id": "7aZfgxo-CDeQ"
}
},
{
"cell_type": "markdown",
"source": [
"### 🎁 Bonus: Why not train on another environment?\n",
"Now that you know how to train an agent using MLAgents, **why not try another environment?** \n",
"\n",
"MLAgents provides 18 different and were building some custom ones. The best way to learn is to try things of your own, have fun.\n",
"\n"
],
"metadata": {
"id": "hGG_oq2n0wjB"
}
},
{
"cell_type": "markdown",
"source": [
"![cover](https://miro.medium.com/max/1400/0*xERdThTRRM2k_U9f.png)"
],
"metadata": {
"id": "KSAkJxSr0z6-"
}
},
{
"cell_type": "markdown",
"source": [
"You have the full list of the one currently available on Hugging Face here 👉 https://github.com/huggingface/ml-agents#the-environments\n",
"\n",
"For the demos to visualize your agent, the temporary link is: https://singularite.itch.io (temporary because we'll also put the demos on Hugging Face Space)\n",
"\n",
"For now we have integrated: \n",
"- [Worm](https://singularite.itch.io/worm) demo where you teach a **worm to crawl**.\n",
"- [Walker](https://singularite.itch.io/walker) demo where you teach an agent **to walk towards a goal**.\n",
"\n",
"If you want new demos to be added, please open an issue: https://github.com/huggingface/deep-rl-class 🤗"
],
"metadata": {
"id": "YiyF4FX-04JB"
}
},
{
"cell_type": "markdown",
"source": [
"Thats all for today. Congrats on finishing this tutorial!\n",
"\n",
"The best way to learn is to practice and try stuff. Why not try another environment? ML-Agents has 18 different environments, but you can also create your own? Check the documentation and have fun!\n",
"\n",
"See you on Unit 6 🔥,\n",
"\n",
"## Keep Learning, Stay awesome 🤗"
],
"metadata": {
"id": "PI6dPWmh064H"
}
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"provenance": [],
"private_outputs": true,
"include_colab_link": true
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@@ -0,0 +1,4 @@
stable-baselines3[extra]
huggingface_sb3
panda_gym==2.0.0
pyglet==1.5.1

notebooks/unit6/unit6.ipynb (new file, 918 lines)
View File

@@ -0,0 +1,918 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"private_outputs": true,
"authorship_tag": "ABX9TyPDFLK3trc6MCLJLqUUuAbl",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU",
"gpuClass": "standard"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit6/unit6.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png\" alt=\"Thumbnail\"/>\n",
"\n",
"In this notebook, you'll learn to use A2C with PyBullet and Panda-Gym, two set of robotics environments. \n",
"\n",
"With [PyBullet](https://github.com/bulletphysics/bullet3), you're going to **train a robot to move**:\n",
"- `AntBulletEnv-v0` 🕸️ More precisely, a spider (they say Ant but come on... it's a spider 😆) 🕸️\n",
"\n",
"Then, with [Panda-Gym](https://github.com/qgallouedec/panda-gym), you're going **to train a robotic arm** (Franka Emika Panda robot) to perform a task:\n",
"- `Reach`: the robot must place its end-effector at a target position.\n",
"\n",
"After that, you'll be able **to train in other robotics environments**.\n"
],
"metadata": {
"id": "-PTReiOw-RAN"
}
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif\" alt=\"Robotics environments\"/>"
],
"metadata": {
"id": "2VGL_0ncoAJI"
}
},
{
"cell_type": "markdown",
"source": [
"### 🎮 Environments: \n",
"\n",
"- [PyBullet](https://github.com/bulletphysics/bullet3)\n",
"- [Panda-Gym](https://github.com/qgallouedec/panda-gym)\n",
"\n",
"###📚 RL-Library: \n",
"\n",
"- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)"
],
"metadata": {
"id": "QInFitfWno1Q"
}
},
{
"cell_type": "markdown",
"source": [
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
],
"metadata": {
"id": "2CcdX4g3oFlp"
}
},
{
"cell_type": "markdown",
"source": [
"## Objectives of this notebook 🏆\n",
"\n",
"At the end of the notebook, you will:\n",
"\n",
"- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries.\n",
"- Be able to **train robots using A2C**.\n",
"- Understand why **we need to normalize the input**.\n",
"- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
"\n",
"\n"
],
"metadata": {
"id": "MoubJX20oKaQ"
}
},
{
"cell_type": "markdown",
"source": [
"## This notebook is from the Deep Reinforcement Learning Course\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>\n",
"\n",
"In this free course, you will:\n",
"\n",
"- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
"- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
"- 🤖 Train **agents in unique environments** \n",
"\n",
"And more check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course\n",
"\n",
"Dont forget to **<a href=\"http://eepurl.com/ic5ZUD\">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**\n",
"\n",
"\n",
"The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
],
"metadata": {
"id": "DoUNkTExoUED"
}
},
{
"cell_type": "markdown",
"source": [
"## Prerequisites 🏗️\n",
"Before diving into the notebook, you need to:\n",
"\n",
"🔲 📚 Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) 🤗 "
],
"metadata": {
"id": "BTuQAUAPoa5E"
}
},
{
"cell_type": "markdown",
"source": [
"# Let's train our first robots 🤖"
],
"metadata": {
"id": "iajHvVDWoo01"
}
},
{
"cell_type": "markdown",
"source": [
"To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your two trained models to the Hub and get the following results:\n",
"\n",
"- `AntBulletEnv-v0` get a result of >= 650.\n",
"- `PandaReachDense-v2` get a result of >= -3.5.\n",
"\n",
"To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
"\n",
"If you don't find your model, **go to the bottom of the page and click on the refresh button**\n",
"\n",
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
],
"metadata": {
"id": "zbOENTE2os_D"
}
},
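{
"cell_type": "markdown",
"source": [
"As a quick sanity check before pushing, you can compute the leaderboard score yourself from the two values `evaluate_policy` returns (a sketch; it assumes `model` and `eval_env` are defined as in the evaluation cells later in this notebook):\n",
"\n",
"```python\n",
"from stable_baselines3.common.evaluation import evaluate_policy\n",
"\n",
"mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
"result = mean_reward - std_reward  # the value shown on the leaderboard\n",
"print(f\"Leaderboard result: {result:.2f}\")\n",
"```"
],
"metadata": {}
},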
{
"cell_type": "markdown",
"source": [
"## Set the GPU 💪\n",
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
],
"metadata": {
"id": "PU4FVzaoM6fC"
}
},
{
"cell_type": "markdown",
"source": [
"- `Hardware Accelerator > GPU`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
],
"metadata": {
"id": "KV0NyFdQM9ZG"
}
},
{
"cell_type": "markdown",
"source": [
"## Create a virtual display 🔽\n",
"\n",
"During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). \n",
"\n",
"Hence the following cell will install the librairies and create and run a virtual screen 🖥"
],
"metadata": {
"id": "bTpYcVZVMzUI"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jV6wjQ7Be7p5"
},
"outputs": [],
"source": [
"%%capture\n",
"!apt install python-opengl\n",
"!apt install ffmpeg\n",
"!apt install xvfb\n",
"!pip3 install pyvirtualdisplay"
]
},
{
"cell_type": "code",
"source": [
"# Virtual display\n",
"from pyvirtualdisplay import Display\n",
"\n",
"virtual_display = Display(visible=0, size=(1400, 900))\n",
"virtual_display.start()"
],
"metadata": {
"id": "ww5PQH1gNLI4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Install dependencies 🔽\n",
"The first step is to install the dependencies, well install multiple ones:\n",
"\n",
"- `pybullet`: Contains the walking robots environments.\n",
"- `panda-gym`: Contains the robotics arm environments.\n",
"- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.\n",
"- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.\n",
"- `huggingface_hub`: Library allowing anyone to work with the Hub repositories."
],
"metadata": {
"id": "e1obkbdJ_KnG"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2yZRi_0bQGPM"
},
"outputs": [],
"source": [
"!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt"
]
},
{
"cell_type": "markdown",
"source": [
"## Import the packages 📦"
],
"metadata": {
"id": "QTep3PQQABLr"
}
},
{
"cell_type": "code",
"source": [
"import pybullet_envs\n",
"import panda_gym\n",
"import gym\n",
"\n",
"import os\n",
"\n",
"from huggingface_sb3 import load_from_hub, package_to_hub\n",
"\n",
"from stable_baselines3 import A2C\n",
"from stable_baselines3.common.evaluation import evaluate_policy\n",
"from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
"from stable_baselines3.common.env_util import make_vec_env\n",
"\n",
"from huggingface_hub import notebook_login"
],
"metadata": {
"id": "HpiB8VdnQ7Bk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Environment 1: AntBulletEnv-v0 🕸\n",
"\n"
],
"metadata": {
"id": "lfBwIS_oAVXI"
}
},
{
"cell_type": "markdown",
"source": [
"### Create the AntBulletEnv-v0\n",
"#### The environment 🎮\n",
"In this environment, the agent needs to use correctly its different joints to walk correctly.\n",
"You can find a detailled explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet"
],
"metadata": {
"id": "frVXOrnlBerQ"
}
},
{
"cell_type": "code",
"source": [
"env_id = \"AntBulletEnv-v0\"\n",
"# Create the env\n",
"env = gym.make(env_id)\n",
"\n",
"# Get the state space and action space\n",
"s_size = env.observation_space.shape[0]\n",
"a_size = env.action_space"
],
"metadata": {
"id": "JpU-JCDQYYax"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(\"_____OBSERVATION SPACE_____ \\n\")\n",
"print(\"The State Space is: \", s_size)\n",
"print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
],
"metadata": {
"id": "2ZfvcCqEYgrg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):\n",
"\n",
"The difference is that our observation space is 28 not 29.\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/obs_space.png\" alt=\"PyBullet Ant Obs space\"/>\n"
],
"metadata": {
"id": "QzMmsdMJS7jh"
}
},
{
"cell_type": "code",
"source": [
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
"print(\"The Action Space is: \", a_size)\n",
"print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
],
"metadata": {
"id": "Tc89eLTYYkK2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/action_space.png\" alt=\"PyBullet Ant Obs space\"/>\n"
],
"metadata": {
"id": "3RfsHhzZS9Pw"
}
},
{
"cell_type": "markdown",
"source": [
"### Normalize observation and rewards"
],
"metadata": {
"id": "S5sXcg469ysB"
}
},
{
"cell_type": "markdown",
"source": [
"A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html). \n",
"\n",
"For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.\n",
"\n",
"We also normalize rewards with this same wrapper by adding `norm_reward = True`\n",
"\n",
"[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)"
],
"metadata": {
"id": "1ZyX6qf3Zva9"
}
},
{
"cell_type": "code",
"source": [
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"# Adding this wrapper to normalize the observation and the reward\n",
"env = # TODO: Add the wrapper"
],
"metadata": {
"id": "1RsDtHHAQ9Ie"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Solution"
],
"metadata": {
"id": "tF42HvI7-gs5"
}
},
{
"cell_type": "code",
"source": [
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)"
],
"metadata": {
"id": "2O67mqgC-hol"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Create the A2C Model 🤖\n",
"\n",
"In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy.\n",
"\n",
"For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes\n",
"\n",
"To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3)."
],
"metadata": {
"id": "4JmEVU6z1ZA-"
}
},
{
"cell_type": "code",
"source": [
"model = # Create the A2C model and try to find the best parameters"
],
"metadata": {
"id": "vR3T4qFt164I"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Solution"
],
"metadata": {
"id": "nWAuOOLh-oQf"
}
},
{
"cell_type": "code",
"source": [
"model = A2C(policy = \"MlpPolicy\",\n",
" env = env,\n",
" gae_lambda = 0.9,\n",
" gamma = 0.99,\n",
" learning_rate = 0.00096,\n",
" max_grad_norm = 0.5,\n",
" n_steps = 8,\n",
" vf_coef = 0.4,\n",
" ent_coef = 0.0,\n",
" policy_kwargs=dict(\n",
" log_std_init=-2, ortho_init=False),\n",
" normalize_advantage=False,\n",
" use_rms_prop= True,\n",
" use_sde= True,\n",
" verbose=1)"
],
"metadata": {
"id": "FKFLY54T-pU1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Train the A2C agent 🏃\n",
"- Let's train our agent for 2,000,000 timesteps, don't forget to use GPU on Colab. It will take approximately ~25-40min"
],
"metadata": {
"id": "opyK3mpJ1-m9"
}
},
{
"cell_type": "code",
"source": [
"model.learn(2_000_000)"
],
"metadata": {
"id": "4TuGHZD7RF1G"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Save the model and VecNormalize statistics when saving the agent\n",
"model.save(\"a2c-AntBulletEnv-v0\")\n",
"env.save(\"vec_normalize.pkl\")"
],
"metadata": {
"id": "MfYtjj19cKFr"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Evaluate the agent 📈\n",
"- Now that's our agent is trained, we need to **check its performance**.\n",
"- Stable-Baselines3 provides a method to do that: `evaluate_policy`\n",
"- In my case, I got a mean reward of `2371.90 +/- 16.50`"
],
"metadata": {
"id": "01M9GCd32Ig-"
}
},
{
"cell_type": "code",
"source": [
"from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
"\n",
"# Load the saved statistics\n",
"eval_env = DummyVecEnv([lambda: gym.make(\"AntBulletEnv-v0\")])\n",
"eval_env = VecNormalize.load(\"vec_normalize.pkl\", eval_env)\n",
"\n",
"# do not update them at test time\n",
"eval_env.training = False\n",
"# reward normalization is not needed at test time\n",
"eval_env.norm_reward = False\n",
"\n",
"# Load the agent\n",
"model = A2C.load(\"a2c-AntBulletEnv-v0\")\n",
"\n",
"mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
"\n",
"print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")"
],
"metadata": {
"id": "liirTVoDkHq3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Publish your trained model on the Hub 🔥\n",
"Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.\n",
"\n",
"📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n",
"\n",
"Here's an example of a Model Card (with a PyBullet environment):\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/modelcardpybullet.png\" alt=\"Model Card Pybullet\"/>"
],
"metadata": {
"id": "44L9LVQaavR8"
}
},
{
"cell_type": "markdown",
"source": [
"By using `package_to_hub`, as we already mentionned in the former units, **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n",
"\n",
"This way:\n",
"- You can **showcase our work** 🔥\n",
"- You can **visualize your agent playing** 👀\n",
"- You can **share with the community an agent that others can use** 💾\n",
"- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
],
"metadata": {
"id": "MkMk99m8bgaQ"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "JquRrWytA6eo"
},
"source": [
"To be able to share your model with the community there are three more steps to follow:\n",
"\n",
"1⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join\n",
"\n",
"2⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.\n",
"- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
"\n",
"- Copy the token \n",
"- Run the cell below and paste the token"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GZiFBBlzxzxY"
},
"outputs": [],
"source": [
"notebook_login()\n",
"!git config --global credential.helper store"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_tsf2uv0g_4p"
},
"source": [
"If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FGNh9VsZok0i"
},
"source": [
"3⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function"
]
},
{
"cell_type": "code",
"source": [
"package_to_hub(\n",
" model=model,\n",
" model_name=f\"a2c-{env_id}\",\n",
" model_architecture=\"A2C\",\n",
" env_id=env_id,\n",
" eval_env=eval_env,\n",
" repo_id=f\"ThomasSimonini/a2c-{env_id}\", # Change the username\n",
" commit_message=\"Initial commit\",\n",
")"
],
"metadata": {
"id": "ueuzWVCUTkfS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Take a coffee break ☕\n",
"- You already trained your first robot that learned to move congratutlations 🥳!\n",
"- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.\n"
],
"metadata": {
"id": "Qk9ykOk9D6Qh"
}
},
{
"cell_type": "markdown",
"source": [
"## Environment 2: PandaReachDense-v2 🦾\n",
"\n",
"The agent we're going to train is a robotic arm that needs to do controls (moving the arm and using the end-effector).\n",
"\n",
"In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.\n",
"\n",
"In `PandaReach`, the robot must place its end-effector at a target position (green ball).\n",
"\n",
"We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). Contrary to a *sparse reward function* where the environment **return a reward if and only if the task is completed**.\n",
"\n",
"Also, we're going to use the *End-effector displacement control*, it means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg\" alt=\"Robotics\"/>\n",
"\n",
"\n",
"This way **the training will be easier**.\n",
"\n"
],
"metadata": {
"id": "5VWfwAA7EJg7"
}
},
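{
"cell_type": "markdown",
"source": [
"To make the dense vs. sparse distinction concrete, here is a minimal sketch of the two reward styles for a reach task (an illustration, not panda-gym's exact reward code; the 0.05 m success threshold is an assumption):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def dense_reward(ee_pos, goal_pos):\n",
"    # A signal at every timestep: the closer the end-effector, the higher (less negative)\n",
"    return -float(np.linalg.norm(ee_pos - goal_pos))\n",
"\n",
"def sparse_reward(ee_pos, goal_pos, threshold=0.05):\n",
"    # A signal only on success: 0 within the threshold, -1 everywhere else\n",
"    return 0.0 if np.linalg.norm(ee_pos - goal_pos) < threshold else -1.0\n",
"\n",
"print(dense_reward(np.array([0.1, 0.0, 0.2]), np.array([0.0, 0.0, 0.2])))  # -0.1\n",
"```\n",
"\n",
"With the dense version, every action changes the reward a little, which gives the agent a learning signal from the very first episodes."
],
"metadata": {}
},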
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"In `PandaReachDense-v2` the robotic arm must place its end-effector at a target position (green ball).\n",
"\n"
],
"metadata": {
"id": "oZ7FyDEi7G3T"
}
},
{
"cell_type": "code",
"source": [
"import gym\n",
"\n",
"env_id = \"PandaReachDense-v2\"\n",
"\n",
"# Create the env\n",
"env = gym.make(env_id)\n",
"\n",
"# Get the state space and action space\n",
"s_size = env.observation_space.shape\n",
"a_size = env.action_space"
],
"metadata": {
"id": "zXzAu3HYF1WD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(\"_____OBSERVATION SPACE_____ \\n\")\n",
"print(\"The State Space is: \", s_size)\n",
"print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
],
"metadata": {
"id": "E-U9dexcF-FB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The observation space **is a dictionary with 3 different elements**:\n",
"- `achieved_goal`: (x,y,z) position of the goal.\n",
"- `desired_goal`: (x,y,z) distance between the goal position and the current object position.\n",
"- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).\n",
"\n",
"Given it's a dictionary as observation, **we will need to use a MultiInputPolicy policy instead of MlpPolicy**."
],
"metadata": {
"id": "g_JClfElGFnF"
}
},
{
"cell_type": "code",
"source": [
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
"print(\"The Action Space is: \", a_size)\n",
"print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
],
"metadata": {
"id": "ib1Kxy4AF-FC"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The action space is a vector with 3 values:\n",
"- Control x, y, z movement"
],
"metadata": {
"id": "5MHTHEHZS4yp"
}
},
{
"cell_type": "markdown",
"source": [
"Now it's your turn:\n",
"\n",
"1. Define the environment called \"PandaReachDense-v2\"\n",
"2. Make a vectorized environment\n",
"3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)\n",
"4. Create the A2C Model (don't forget verbose=1 to print the training logs).\n",
"5. Train it for 1M Timesteps\n",
"6. Save the model and VecNormalize statistics when saving the agent\n",
"7. Evaluate your agent\n",
"8. Publish your trained model on the Hub 🔥 with `package_to_hub`"
],
"metadata": {
"id": "nIhPoc5t9HjG"
}
},
{
"cell_type": "markdown",
"source": [
"### Solution (fill the todo)"
],
"metadata": {
"id": "sKGbFXZq9ikN"
}
},
{
"cell_type": "code",
"source": [
"# 1 - 2\n",
"env_id = \"PandaReachDense-v2\"\n",
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"# 3\n",
"env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)\n",
"\n",
"# 4\n",
"model = A2C(policy = \"MultiInputPolicy\",\n",
" env = env,\n",
" verbose=1)\n",
"# 5\n",
"model.learn(1_000_000)"
],
"metadata": {
"id": "J-cC-Feg9iMm"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 6\n",
"model_name = \"a2c-PandaReachDense-v2\"; \n",
"model.save(model_name)\n",
"env.save(\"vec_normalize.pkl\")\n",
"\n",
"# 7\n",
"from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
"\n",
"# Load the saved statistics\n",
"eval_env = DummyVecEnv([lambda: gym.make(\"PandaReachDense-v2\")])\n",
"eval_env = VecNormalize.load(\"vec_normalize.pkl\", eval_env)\n",
"\n",
"# do not update them at test time\n",
"eval_env.training = False\n",
"# reward normalization is not needed at test time\n",
"eval_env.norm_reward = False\n",
"\n",
"# Load the agent\n",
"model = A2C.load(model_name)\n",
"\n",
"mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
"\n",
"print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")\n",
"\n",
"# 8\n",
"package_to_hub(\n",
" model=model,\n",
" model_name=f\"a2c-{env_id}\",\n",
" model_architecture=\"A2C\",\n",
" env_id=env_id,\n",
" eval_env=eval_env,\n",
" repo_id=f\"ThomasSimonini/a2c-{env_id}\", # TODO: Change the username\n",
" commit_message=\"Initial commit\",\n",
")"
],
"metadata": {
"id": "-UnlKLmpg80p"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Some additional challenges 🏆\n",
"The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?\n",
"\n",
"If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**.\n",
"\n",
"PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1\n",
"\n",
"And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html\n",
"\n",
"Here are some ideas to achieve so:\n",
"* Train more steps\n",
"* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=https://huggingface.co/models?other=AntBulletEnv-v0\n",
"* **Push your new trained model** on the Hub 🔥\n"
],
"metadata": {
"id": "G3xy3Nf3c2O1"
}
},
{
"cell_type": "markdown",
"source": [
"See you on Unit 7! 🔥\n",
"## Keep learning, stay awesome 🤗"
],
"metadata": {
"id": "usatLaZ8dM4P"
}
}
]
}

View File

@@ -365,8 +365,6 @@
},
"outputs": [],
"source": [
"import gym\n",
"\n",
"# First, we create our environment called LunarLander-v2\n",
"env = gym.make(\"LunarLander-v2\")\n",
"\n",

File diff suppressed because one or more lines are too long

View File

@@ -110,6 +110,74 @@
title: Optuna
- local: unitbonus2/hands-on
title: Hands-on
- title: Unit 4. Policy Gradient with PyTorch
sections:
- local: unit4/introduction
title: Introduction
- local: unit4/what-are-policy-based-methods
title: What are the policy-based methods?
- local: unit4/advantages-disadvantages
title: The advantages and disadvantages of policy-gradient methods
- local: unit4/policy-gradient
title: Diving deeper into policy-gradient
- local: unit4/pg-theorem
title: (Optional) the Policy Gradient Theorem
- local: unit4/hands-on
title: Hands-on
- local: unit4/quiz
title: Quiz
- local: unit4/conclusion
title: Conclusion
- local: unit4/additional-readings
title: Additional Readings
- title: Unit 5. Introduction to Unity ML-Agents
sections:
- local: unit5/introduction
title: Introduction
- local: unit5/how-mlagents-works
title: How ML-Agents works
- local: unit5/snowball-target
title: The SnowballTarget environment
- local: unit5/pyramids
title: The Pyramids environment
- local: unit5/curiosity
title: (Optional) What is curiosity in Deep Reinforcement Learning?
- local: unit5/hands-on
title: Hands-on
- local: unit5/bonus
title: Bonus. Learn to create your own environments with Unity and MLAgents
- local: unit5/conclusion
title: Conclusion
- title: Unit 6. Actor Critic methods with Robotics environments
sections:
- local: unit6/introduction
title: Introduction
- local: unit6/variance-problem
title: The Problem of Variance in Reinforce
- local: unit6/advantage-actor-critic
title: Advantage Actor Critic (A2C)
- local: unit6/hands-on
title: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖
- local: unit6/conclusion
title: Conclusion
- local: unit6/additional-readings
title: Additional Readings
- title: Unit 7. Introduction to Multi-Agents and AI vs AI
sections:
- local: unit7/introduction
title: Introduction
- local: unit7/introduction-to-marl
title: An introduction to Multi-Agent Reinforcement Learning (MARL)
- local: unit7/multi-agent-setting
title: Designing Multi-Agents systems
- local: unit7/self-play
title: Self-Play
- local: unit7/hands-on
title: Let's train our soccer team to beat your classmates' teams (AI vs. AI)
- local: unit7/conclusion
title: Conclusion
- local: unit7/additional-readings
title: Additional Readings
- title: Certification and congratulations
sections:
- local: communication/conclusion

View File

@@ -23,7 +23,7 @@ In this course, you will:
- 📖 Study Deep Reinforcement Learning in **theory and practice.**
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as [Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/), [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), [Sample Factory](https://samplefactory.dev/) and [CleanRL](https://github.com/vwxyzjn/cleanrl).
- 🤖 **Train agents in unique environments** such as [SnowballFight](https://huggingface.co/spaces/ThomasSimonini/SnowballFight), [Huggy the Doggo 🐶](https://huggingface.co/spaces/ThomasSimonini/Huggy), [VizDoom (Doom)](https://vizdoom.cs.put.edu.pl/) and classical ones such as [Space Invaders](https://www.gymlibrary.dev/environments/atari/), [PyBullet](https://pybullet.org/wordpress/) and more.
- 🤖 **Train agents in unique environments** such as [SnowballFight](https://huggingface.co/spaces/ThomasSimonini/SnowballFight), [Huggy the Doggo 🐶](https://singularite.itch.io/huggy), [VizDoom (Doom)](https://vizdoom.cs.put.edu.pl/) and classical ones such as [Space Invaders](https://www.gymlibrary.dev/environments/atari/), [PyBullet](https://pybullet.org/wordpress/) and more.
- 💾 Share your **trained agents with one line of code to the Hub** and also download powerful agents from the community.
- 🏆 Participate in challenges where you will **evaluate your agents against other teams. You'll also get to play against the agents you'll train.**
@@ -52,8 +52,8 @@ The course is composed of:
You can choose to follow this course either:
- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of March 2023.
- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of April 2023.
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of April 2023.
- *As a simple audit*: you can participate in all challenges and do assignments if you want, but you have no deadlines.
Both paths **are completely free**.
@@ -65,8 +65,8 @@ You don't need to tell us which path you choose. At the end of March, when we wi
The certification process is **completely free**:
- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of March 2023.
- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of April 2023.
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of April 2023.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/certification.jpg" alt="Course certification" width="100%"/>

View File

@@ -18,7 +18,7 @@ You can now sign up for our Discord Server. This is the place where you **can ex
When you join, remember to introduce yourself in #introduce-yourself and sign-up for reinforcement channels in #role-assignments.
We have multiple RL-related channels:
- `rl-announcements`: where we give the last information about the course.
- `rl-announcements`: where we give the latest information about the course.
- `rl-discussions`: where you can exchange about RL and share information.
- `rl-study-group`: where you can create and join study groups.
- `rl-i-made-this`: where you can share your projects and models.

View File

@@ -24,6 +24,8 @@ To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggi
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
So let's get started! 🚀
**To start the hands-on click on Open In Colab button** 👇 :

View File

@@ -14,12 +14,14 @@ This is a community-created glossary. Contributions are welcomed!
- **The action-value function.** In contrast to the state-value function, the action-value function calculates, for each state-action pair, the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.
### Epsilon-greedy strategy:
- Common exploration strategy used in reinforcement learning that involves balancing exploration and exploitation.
- Chooses the action with the highest expected reward with a probability of 1-epsilon.
- Chooses a random action with a probability of epsilon.
- Epsilon is typically decreased over time to shift focus towards exploitation.
### Greedy strategy:
- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (only exploitation)
- Always chooses the action with the highest expected reward.
- Does not include any exploration.
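A minimal sketch of these two strategies (an illustrative snippet, not from the glossary source; `q_values` stands for the current per-action value estimates):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    # With probability epsilon: explore (random action)
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise: exploit, i.e. the greedy strategy below
    return greedy_action(q_values)

def greedy_action(q_values):
    # Always the action with the highest estimated value (pure exploitation)
    return max(range(len(q_values)), key=lambda a: q_values[a])
```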

View File

@@ -22,6 +22,8 @@ To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggi
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
**To start the hands-on click on Open In Colab button** 👇 :

View File

@@ -18,7 +18,7 @@ Q-Learning worked well with small state space environments like:
But think of what we're going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255 so that gives us \\(256^{210x160x3} = 256^{100800}\\) (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255 so that gives us \\(256^{210 \times 160 \times 3} = 256^{100800}\\) (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).
* A single frame in Atari is composed of an image of 210x160 pixels. Given the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.
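For intuition, here is a quick back-of-the-envelope check of that number (an illustrative snippet, not from the course):

```python
import math

n_entries = 210 * 160 * 3              # pixels × RGB channels = 100800
digits = n_entries * math.log10(256)   # decimal digits of 256**100800
print(f"256**100800 has ≈ {digits:,.0f} digits")  # ≈ 242,751
```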

View File

@@ -18,12 +18,15 @@ We're using the [RL-Baselines-3 Zoo integration](https://github.com/DLR-RM/rl-ba
Also, **if you want to learn to implement Deep Q-Learning by yourself after this hands-on**, you definitely should look at CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 200**.
To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
**To start the hands-on click on Open In Colab button** 👇 :
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit3/unit3.ipynb)
@@ -68,13 +71,6 @@ Before diving into the notebook, you need to:
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).
# Let's train a Deep Q-Learning agent playing Atari' Space Invaders 👾 and upload it to the Hub.
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.
To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
@@ -141,7 +137,7 @@ To train an agent with RL-Baselines3-Zoo, we just need to do two things:
Here we see that:
- We use the `Atari Wrapper` that does the pre-processing (Frame reduction, grayscale, stack four frames frames),
- We use the `Atari Wrapper` that does the pre-processing (Frame reduction, grayscale, stack four frames),
- We use `CnnPolicy`, since we use Convolutional layers to process the frames.
- We train the model for 10 million `n_timesteps`.
- Memory (Experience Replay) size is 100000, i.e. the number of experience steps you saved to train again your agent with.

View File

@@ -76,8 +76,8 @@ For instance, in pong, our agent **will be unable to know the ball direction if
**1. Make more efficient use of the experiences during the training**
Usually, in online reinforcement learning, the agent interacts in the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient
But with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.
Usually, in online reinforcement learning, the agent interacts in the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.
But, with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.
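A minimal replay buffer sketch (illustrative; real implementations such as SB3's add array storage, dtypes, and batching):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling reduces correlation between consecutive steps
        return random.sample(self.memory, batch_size)
```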
**2. Avoid forgetting previous experiences and reduce the correlation between experiences**

View File

@@ -0,0 +1,20 @@
# Additional Readings
These are **optional readings** if you want to go deeper.
## Introduction to Policy Optimization
- [Part 3: Intro to Policy Optimization - Spinning Up documentation](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)
## Policy Gradient
- [https://johnwlambert.github.io/policy-gradients/](https://johnwlambert.github.io/policy-gradients/)
- [RL - Policy Gradient Explained](https://jonathan-hui.medium.com/rl-policy-gradients-explained-9b13b688b146)
- [Chapter 13, Policy Gradient Methods; Reinforcement Learning, an introduction by Richard Sutton and Andrew G. Barto](http://incompleteideas.net/book/RLbook2020.pdf)
## Implementation
- [PyTorch Reinforce implementation](https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py)
- [Implementations from DDPG to PPO](https://github.com/MrSyee/pg-is-all-you-need)

View File

@@ -0,0 +1,74 @@
# The advantages and disadvantages of policy-gradient methods
At this point, you might ask, "but Deep Q-Learning is excellent! Why use policy-gradient methods?". To answer this question, let's study the **advantages and disadvantages of policy-gradient methods**.
## Advantages
There are multiple advantages over value-based methods. Let's see some of them:
### The simplicity of integration
We can estimate the policy directly without storing additional data (action values).
### Policy-gradient methods can learn a stochastic policy
Policy-gradient methods can **learn a stochastic policy while value functions can't**.
This has two consequences:
1. We **don't need to implement an exploration/exploitation trade-off by hand**. Since we output a probability distribution over actions, the agent explores **the state space without always taking the same trajectory.**
2. We also get rid of the problem of **perceptual aliasing**. Perceptual aliasing is when two states seem (or are) the same but need different actions.
Let's take an example: we have an intelligent vacuum cleaner whose goal is to suck the dust and avoid killing the hamsters.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster1.jpg" alt="Hamster 1"/>
</figure>
Our vacuum cleaner can only perceive where the walls are.
The problem is that the **two red cases are aliased states, because the agent perceives an upper and a lower wall for each**.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster2.jpg" alt="Hamster 1"/>
</figure>
Under a deterministic policy, the policy will either always move right or always move left when in a red state. **Either case will cause our agent to get stuck and never suck the dust**.
Under a value-based Reinforcement Learning algorithm, we learn a **quasi-deterministic policy** ("epsilon-greedy strategy"). Consequently, our agent can **spend a lot of time before finding the dust**.
On the other hand, an optimal stochastic policy **will randomly move left or right in red states**. Consequently, **it will not get stuck and will reach the goal state with a high probability**.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster3.jpg" alt="Hamster 1"/>
</figure>
### Policy-gradient methods are more effective in high-dimensional action spaces and continuous actions spaces
The problem with Deep Q-learning is that it **predicts a score (the maximum expected future reward) for each possible action**, at each time step, given the current state.
But what if we have an infinite number of possible actions?
For instance, with a self-driving car, at each state you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). **We would need to output a Q-value for each possible action**! And **taking the max over a continuous output is an optimization problem in itself**!
Instead, with policy-gradient methods, we output a **probability distribution over actions.**
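A minimal sketch of such a stochastic policy head in PyTorch (illustrative; the 4-dim input and 2 actions mimic CartPole):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

state = torch.rand(4)                                   # a fake observation
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                                  # sample, don't argmax
print(action.item(), dist.probs.tolist())
```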
### Policy-gradient methods have better convergence properties
In value-based methods, we use an aggressive operator to **change the value function: we take the maximum over Q-estimates**.
Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
For instance, if during training the best action was left (with a Q-value of 0.22), and one training step later it's right (since the right Q-value becomes 0.23), we have dramatically changed the policy: it will now go right most of the time instead of left.
On the other hand, in policy-gradient methods, stochastic policy action preferences (probability of taking action) **change smoothly over time**.
## Disadvantages
Naturally, policy-gradient methods also have some disadvantages:
- **Frequently, policy-gradient converges on a local maximum instead of a global optimum.**
- Policy-gradient methods go slower, **step by step: they can take longer to train (inefficient).**
- Policy-gradient can have high variance. We'll see in the Actor-Critic unit why, and how we can solve this problem.
👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).

View File

@@ -0,0 +1,17 @@
# Conclusion
**Congrats on finishing this unit**! There was a lot of information.
And congrats on finishing the tutorial. You've just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub 🥳.
Don't hesitate to iterate on this unit **by improving the implementation for more complex environments** (for instance, what about changing the network to a Convolutional Neural Network to handle frames as observations?).
In the next unit, **we're going to learn more about Unity MLAgents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents
to compete against other agents in a snowball fight and a soccer game.**
Sounds fun? See you next time!
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
### Keep Learning, stay awesome 🤗

1015
units/en/unit4/hands-on.mdx Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,24 @@
# Introduction [[introduction]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/thumbnail.png" alt="thumbnail"/>
In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**
Since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
In value-based methods, the policy **\\(π\\) only exists because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that will select the action with the highest value given a state.
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
You'll then be able to iterate and improve this implementation for more advanced environments.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
</figure>
Let's get started,

View File

@@ -0,0 +1,83 @@
# (Optional) the Policy Gradient Theorem
In this optional section, we're **going to study how we differentiate the objective function that we use to approximate the policy gradient**.
Let's first recap our different formulas:
1. The Objective function
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
2. The probability of a trajectory (given that actions come from \\(\pi_\theta\\)):
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
So we have:
\\(\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)\\)
We can rewrite the gradient of the sum as the sum of the gradients:
\\( = \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau) \\)
We then multiply every term in the sum by \\(\frac{P(\tau;\theta)}{P(\tau;\theta)}\\) (which is allowed since it equals 1):
\\( = \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau) \\)
We can simplify this further, since \\( \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\)
\\(= \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \\)
We can then use the *derivative log trick* (also called *likelihood ratio trick* or *REINFORCE trick*), a simple rule in calculus that implies that \\( \nabla_x log f(x) = \frac{\nabla_x f(x)}{f(x)} \\)
So given we have \\(\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\), we can rewrite it as \\(\nabla_\theta log P(\tau|\theta) \\)
So this is our likelihood policy gradient:
\\( \nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta log P(\tau;\theta) R(\tau) \\)
Thanks to this new formula, we can estimate the gradient using trajectory samples (we approximate the likelihood-ratio policy gradient with a sample-based estimate, if you prefer).
\\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\) where each \\(\tau^{(i)}\\) is a sampled trajectory.
But we still have some mathematical work to do: we need to simplify \\( \nabla_\theta log P(\tau|\theta) \\)
We know that:
\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})]\\)
Where \\(\mu(s_0)\\) is the initial state distribution and \\( P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \\) is the state transition dynamics of the MDP.
We know that the log of a product is equal to the sum of the logs:
\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta \left[log \mu(s_0) + \sum\limits_{t=0}^{H}log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) + \sum\limits_{t=0}^{H}log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right] \\)
We also know that the gradient of a sum is equal to the sum of the gradients:
\\( \nabla_\theta log P(\tau^{(i)};\theta)=\nabla_\theta log\mu(s_0) + \nabla_\theta \sum\limits_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) + \nabla_\theta \sum\limits_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)
Since neither the initial state distribution nor the state transition dynamics of the MDP depend on \\(\theta\\), the derivative of both terms is 0. So we can remove them:
Since:
\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and \\( \nabla_\theta \mu(s_0) = 0\\)
\\(\nabla_\theta log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\\)
We can rewrite the gradient of the sum as the sum of the gradients:
\\( \nabla_\theta log P(\tau^{(i)};\theta)= \sum_{t=0}^{H} \nabla_\theta log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)
So, the final formula for estimating the policy gradient is:
\\( \nabla_{\theta} J(\theta) = \hat{g} = \frac{1}{m} \sum^{m}_{i=1} \sum^{H}_{t=0} \nabla_\theta \log \pi_\theta(a^{(i)}_{t} | s_{t}^{(i)})R(\tau^{(i)}) \\)
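As a quick numeric sanity check of the derivative log trick used above (an illustrative snippet using PyTorch autograd, not part of the original derivation):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
f = x ** 3 + 1.0                          # f(2) = 9, f'(2) = 12

(g_log,) = torch.autograd.grad(torch.log(f), x, retain_graph=True)
(g_f,) = torch.autograd.grad(f, x)
print(g_log.item(), (g_f / f).item())     # both ≈ 1.3333 (= 12/9)
```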

View File

@@ -0,0 +1,120 @@
# Diving deeper into policy-gradient methods
## Getting the big picture
We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
If we take the example of CartPole-v1:
- As input, we have a state.
- As output, we have a probability distribution over actions at that state.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
Our goal with policy-gradient is to **control the probability distribution of actions** by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.**
Each time the agent interacts with the environment, we tweak the parameters such that good actions are more likely to be sampled in the future.
But **how are we going to optimize the weights using the expected return**?
The idea is that we're going to **let the agent interact during an episode**. And if we win the episode, we consider that each action taken was good and must be sampled more in the future, since it led to the win.
So for each state-action pair, we want to increase the \\(P(a|s)\\): the probability of taking that action at that state. Or decrease if we lost.
The Policy-gradient algorithm (simplified) looks like this:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg_bigpicture.jpg" alt="Policy Gradient Big Picture"/>
</figure>
Now that we have the big picture, let's dive deeper into policy-gradient methods.
## Diving deeper into policy-gradient methods
We have our stochastic policy \\(\pi\\) which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution of actions**.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="Policy"/>
</figure>
Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\) given our policy.
**But how do we know if our policy is good?** We need to have a way to measure it. To know that, we define a score/objective function called \\(J(\theta)\\).
### The objective function
The *objective function* gives us the **performance of the agent**, given a trajectory (a state-action sequence, without considering the reward, contrary to an episode), and it outputs the *expected cumulative reward*.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>
Let's look at this formula in more detail:
- The *expected return* (also called the expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (this probability depends on \\( \theta\\), since it defines the policy used to select the actions of the trajectory, which has an impact on the states visited).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
- \\(J(\theta)\\) : Expected return; we calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta \\) times the return of that trajectory.
Our objective then is to maximize the expected cumulative reward by finding \\(\theta \\) that will output the best action probability distributions:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/max_objective.png" alt="Max objective"/>
## Gradient Ascent and the Policy-gradient Theorem
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent*, since it gives the direction of the steepest increase of \\(J(\theta)\\).
(If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).
Our update step for gradient-ascent is:
\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
We can repeatedly apply this update step in the hope that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
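A minimal gradient-ascent sketch on a toy objective (illustrative, with a made-up objective):

```python
# Toy objective J(θ) = -(θ - 3)², maximized at θ = 3
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(100):
    theta += alpha * grad_J(theta)   # step in the direction of steepest increase
print(theta)                          # ≈ 3.0
```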
However, we have two problems to obtain the derivative of \\(J(\theta)\\):
1. We can't calculate the true gradient of the objective function since it would imply calculating the probability of each possible trajectory which is computationally super expensive.
We want then to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
2. We have another problem, which I detail in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called the Markov Decision Process dynamics. This is attached to the environment: it gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it, because we might not know it.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
Fortunately, we're going to use a solution called the Policy Gradient Theorem, which will help us reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
If you want to understand how we derive this formula that we will use to approximate the gradient, check the next (optional) section.
## The Reinforce algorithm (Monte Carlo Reinforce)
The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
In a loop:
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_one.png" alt="Policy Gradient"/>
</figure>
- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
We can interpret this update as follows:
- \\(\nabla_\theta log \pi_\theta(a_t|s_t)\\) is the direction of **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
This tells us **how we should change the weights of policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
- \\(R(\tau)\\): is the scoring function:
- If the return is high, it will **push up the probabilities** of the (state, action) combinations.
- Else, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
We can also **collect multiple episodes (trajectories)** to estimate the gradient:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_multiple.png" alt="Policy Gradient"/>
</figure>
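A condensed Reinforce update in PyTorch (an illustrative sketch, not the notebook's full implementation; it assumes `log_probs` — the \\(log \pi_\theta(a_t|s_t)\\) tensors — and `returns` were collected while playing the episode(s)):

```python
import torch

def reinforce_update(optimizer, log_probs, returns):
    # Gradient *ascent* on J(θ) is gradient descent on -J(θ)
    loss = -torch.stack([lp * R for lp, R in zip(log_probs, returns)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```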

82
units/en/unit4/quiz.mdx Normal file
View File

@@ -0,0 +1,82 @@
# Quiz
The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
### Q1: What are the advantages of policy-gradient over value-based methods? (Check all that apply)
<Question
choices={[
{
text: "Policy-gradient methods can learn a stochastic policy",
explain: "",
correct: true,
},
{
text: "Policy-gradient methods are more effective in high-dimensional action spaces and continuous actions spaces",
explain: "",
correct: true,
},
{
text: "Policy-gradient converges most of the time on a global maximum.",
explain: "No, frequently, policy-gradient converges on a local maximum instead of a global optimum.",
},
]}
/>
### Q2: What is the Policy Gradient Theorem?
<details>
<summary>Solution</summary>
*The Policy Gradient Theorem* is a formula that will help us to reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
</details>
### Q3: What's the difference between policy-based methods and policy-gradient methods? (Check all that apply)
<Question
choices={[
{
text: "Policy-based methods are a subset of policy-gradient methods.",
explain: "",
},
{
text: "Policy-gradient methods are a subset of policy-based methods.",
explain: "",
correct: true,
},
{
text: "In Policy-based methods, we can optimize the parameter θ **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.",
explain: "",
correct: true,
},
{
text: "In Policy-gradient methods, we optimize the parameter θ **directly** by performing the gradient ascent on the performance of the objective function.",
explain: "",
correct: true,
},
]}
/>
### Q4: Why do we use gradient ascent instead of gradient descent to optimize J(θ)?
<Question
choices={[
{
text: "We want to minimize J(θ) and gradient ascent gives us the gives the direction of the steepest increase of J(θ)",
explain: "",
},
{
text: "We want to maximize J(θ) and gradient ascent gives us the gives the direction of the steepest increase of J(θ)",
explain: "",
correct: true
},
]}
/>
Congrats on finishing this Quiz 🥳, if you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.

View File

@@ -0,0 +1,42 @@
# What are the policy-based methods?
The main goal of Reinforcement learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**.
Because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
For instance, in a soccer game (where you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
**maximizing the number of goals scored** (when the ball crosses the goal line) in your opponent's goal, and **minimizing the number of goals scored in your own goal**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/soccer.jpg" alt="Soccer" />
## Value-based, Policy-based, and Actor-critic methods
We saw in the first unit that there are two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
- In *value-based methods*, we learn a value function.
- The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
- Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
- We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
- The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
- <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
- Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
- To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
- Finally, next time we'll study *actor-critic* methods, which are a combination of value-based and policy-based methods.
Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.
## The difference between policy-based and policy-gradient methods
Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy*, since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
The difference between these two methods **lies in how we optimize the parameter** \\(\theta\\):
- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
- In *policy-gradient methods*, because they are a subclass of policy-based methods, we also search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly**, by performing gradient ascent on the performance of the objective function \\(J(\theta)\\).
Before diving more into how policy-gradient methods work (the objective function, policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.

19
units/en/unit5/bonus.mdx Normal file
View File

@@ -0,0 +1,19 @@
# Bonus: Learn to create your own environments with Unity and MLAgents
**You can create your own reinforcement learning environments with Unity and MLAgents**. Using a game engine such as Unity can be intimidating at first, so here are the steps you can follow to learn smoothly.
## Step 1: Know how to use Unity
- The best way to learn Unity is to do ["Create with Code" course](https://learn.unity.com/course/create-with-code): it's a series of videos for beginners where **you will create 5 small games with Unity**.
## Step 2: Create the simplest environment with this tutorial
- Then, when you know how to use Unity, you can create your [first basic RL environment using this tutorial](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Learning-Environment-Create-New.md).
## Step 3: Iterate and create nice environments
- Now that you've created a first simple environment, you can iterate towards more complex ones using the [MLAgents documentation (especially the Designing Agents and Agent parts)](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/)
- In addition, you can follow this free course ["Create a hummingbird environment"](https://learn.unity.com/course/ml-agents-hummingbirds) by [Adam Kelly](https://twitter.com/aktwelve)
Have fun! And if you create custom environments, don't hesitate to share them in the `#rl-i-made-this` Discord channel.

View File

@@ -0,0 +1,22 @@
# Conclusion
Congrats on finishing this unit! You've just trained your first ML-Agents and shared them on the Hub 🥳.
The best way to learn is to **practice and try stuff**. Why not try another environment? [ML-Agents has 18 different environments](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md).
For instance:
- [Worm](https://singularite.itch.io/worm), where you teach a worm to crawl.
- [Walker](https://singularite.itch.io/walker): teach an agent to walk towards a goal.
Check the documentation to find how to train them and the list of already integrated MLAgents environments on the Hub: https://github.com/huggingface/ml-agents#getting-started
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/envs-unity.jpeg" alt="Example envs"/>
In the next unit, we're going to learn about multi-agents, and you're going to train your first multi-agents to compete in Soccer and Snowball Fight against your classmates' agents.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight.gif" alt="Snownball fight"/>
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
### Keep Learning, stay awesome 🤗

View File

@@ -0,0 +1,50 @@
# (Optional) What is Curiosity in Deep Reinforcement Learning?
This is an (optional) introduction to Curiosity. If you want to learn more, you can read two additional articles where we dive into the mathematical details:
- [Curiosity-Driven Learning through Next State Prediction](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-next-state-prediction-f7f4e2f592fa)
- [Random Network Distillation: a new take on Curiosity-Driven Learning](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938)
## Two Major Problems in Modern RL
To understand what Curiosity is, we first need to understand the two major problems in RL:
First, the *sparse rewards problem:* that is, **most rewards do not contain information, and hence are set to zero**.
Remember that RL is based on the *reward hypothesis*, which is the idea that each goal can be described as the maximization of the rewards. Therefore, rewards act as feedback for RL agents; **if they don't receive any, their knowledge of which action is appropriate (or not) cannot change**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity1.png" alt="Curiosity"/>
<figcaption>Source: Thanks to the reward, our agent knows that this action at that state was good</figcaption>
</figure>
For instance, in [Vizdoom](https://vizdoom.cs.put.edu.pl/), a set of environments based on the game Doom, in the scenario "DoomMyWayHome" your agent is only rewarded **if it finds the vest**.
However, the vest is far away from your starting point, so most of your rewards will be zero. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy, and **it can spend time turning around without finding the goal**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity2.png" alt="Curiosity"/>
The second big problem is that **the extrinsic reward function is handmade; in each environment, a human has to implement a reward function**. But how can we scale that to big and complex environments?
## So what is Curiosity?
A solution to these problems is **to develop a reward function intrinsic to the agent, i.e., generated by the agent itself**. The agent will act as a self-learner, since it will be the student but also its own feedback master.
**This intrinsic reward mechanism is known as Curiosity** because this reward pushes the agent to explore states that are novel/unfamiliar. To achieve that, our agent will receive a high reward when exploring new trajectories.
This reward is inspired by how humans act: **we naturally have an intrinsic desire to explore environments and discover new things**.
There are different ways to calculate this intrinsic reward. The classical approach (Curiosity through next-state prediction) is to calculate Curiosity **as the error of our agent in predicting the next state, given the current state and action taken**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity3.png" alt="Curiosity"/>
Because the idea of Curiosity is to **encourage our agent to perform actions that reduce the uncertainty in the agents ability to predict the consequences of its actions** (uncertainty will be higher in areas where the agent has spent less time or in areas with complex dynamics).
If the agent spends a lot of time in these states, it will be good at predicting the next state (low Curiosity). On the other hand, if a state is new and unexplored, it will be hard to predict the following state (high Curiosity).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity4.png" alt="Curiosity"/>
Using Curiosity will push our agent to favor transitions with high prediction error (which will be higher in areas where the agent has spent less time, or in areas with complex dynamics) and **consequently better explore our environment**.
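A minimal sketch of curiosity as next-state prediction error (illustrative only; ML-Agents itself uses more elaborate modules, and the dimensions below are made up):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2                 # made-up dimensions
forward_model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, state_dim)
)

def intrinsic_reward(state, action, next_state):
    pred = forward_model(torch.cat([state, action]))
    # High prediction error == unfamiliar transition == high curiosity bonus
    return (pred - next_state).pow(2).mean().item()
```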
There are also **other curiosity calculation methods**. ML-Agents uses a more advanced one called Curiosity through random network distillation. This is out of the scope of this tutorial, but if you're interested, [I wrote an article explaining it in detail](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938).

370
units/en/unit5/hands-on.mdx Normal file
View File

@@ -0,0 +1,370 @@
# Hands-on
<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit5/unit5.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />
We learned what ML-Agents is and how it works. We also studied the two environments we're going to use. Now we're ready to train our agents!
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/envs.png" alt="Environments" />
The ML-Agents integration on the Hub **is still experimental**. Some features will be added in the future. But for now, to validate this hands-on for the certification process, you just need to push your trained models to the Hub.
There are no minimum results to attain to validate this Hands On. But if you want to get nice results, you can try to reach the following:
- For [Pyramids](https://singularite.itch.io/pyramids): Mean Reward = 1.75
- For [SnowballTarget](https://singularite.itch.io/snowballtarget): Mean Reward = 15, or 30 targets shot in an episode.
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
**To start the hands-on, click on Open In Colab button** 👇 :
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit5/unit5.ipynb)
# Unit 5: An Introduction to ML-Agents
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png" alt="Thumbnail"/>
In this notebook, you'll learn about ML-Agents and train two agents.
- The first one will learn to **shoot snowballs onto spawning targets**.
- The second needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, **and move to the gold brick at the top**. To do that, it will need to explore its environment, and we will use a technique called curiosity.
After that, you'll be able **to watch your agents playing directly on your browser**.
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
⬇️ Here is an example of what **you will achieve at the end of this unit.** ⬇️
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.gif" alt="Pyramids"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget.gif" alt="SnowballTarget"/>
### 🎮 Environments:
- [Pyramids](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md#pyramids)
- SnowballTarget
### 📚 RL-Library:
- [ML-Agents (HuggingFace Experimental Version)](https://github.com/huggingface/ml-agents)
⚠ We're going to use an experimental version of ML-Agents, where you can push Unity ML-Agents models to the Hub and load them from the Hub; **you need to install this same version**.
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
## Objectives of this notebook 🏆
At the end of the notebook, you will:
- Understand how **ML-Agents**, the environment library, works.
- Be able to **train agents in Unity Environments**.
## Prerequisites 🏗️
Before diving into the notebook, you need to:
🔲 📚 **Study [what is ML-Agents and how it works by reading Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction)** 🤗
# Let's train our agents 🚀
## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">
- `Hardware Accelerator > GPU`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
## Clone the repository and install the dependencies 🔽
- We need to clone the repository that **contains the experimental version of the library that allows you to push your trained agent to the Hub.**
```python
%%capture
# Clone the repository
!git clone --depth 1 https://github.com/huggingface/ml-agents/
```
```python
%%capture
# Go inside the repository and install the package
%cd ml-agents
!pip3 install -e ./ml-agents-envs
!pip3 install -e ./ml-agents
```
## SnowballTarget ⛄
If you need a refresher on how this environment works check this section 👉
https://huggingface.co/deep-rl-course/unit5/snowball-target
### Download and move the environment zip file into `./training-envs-executables/linux/`
- Our environment executable is in a zip file.
- We need to download it and place it in `./training-envs-executables/linux/`
- We use a Linux executable because we use Colab, and Colab machines' OS is Ubuntu (Linux)
```python
# Here, we create training-envs-executables and linux
!mkdir ./training-envs-executables
!mkdir ./training-envs-executables/linux
```
Download the file SnowballTarget.zip from https://drive.google.com/file/d/1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5 using `wget`.
Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)
```python
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5" -O ./training-envs-executables/linux/SnowballTarget.zip && rm -rf /tmp/cookies.txt
```
We unzip the SnowballTarget.zip file.
```python
%%capture
!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/SnowballTarget.zip
```
Make sure your file is accessible
```python
!chmod -R 755 ./training-envs-executables/linux/SnowballTarget
```
### Define the SnowballTarget config file
- In ML-Agents, you define the **training hyperparameters into config.yaml files.**
There are multiple hyperparameters. To understand them better, you should read the explanation of each of them in [the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)
You need to create a `SnowballTarget.yaml` config file in ./content/ml-agents/config/ppo/
We'll give you a first version of this config here (to copy and paste into your `SnowballTarget.yaml` file), **but you should modify it**.
```yaml
behaviors:
SnowballTarget:
trainer_type: ppo
summary_freq: 10000
keep_checkpoints: 10
checkpoint_interval: 50000
max_steps: 200000
time_horizon: 64
threaded: true
hyperparameters:
learning_rate: 0.0003
learning_rate_schedule: linear
batch_size: 128
buffer_size: 2048
beta: 0.005
epsilon: 0.2
lambd: 0.95
num_epoch: 3
network_settings:
normalize: false
hidden_units: 256
num_layers: 2
vis_encode_type: simple
reward_signals:
extrinsic:
gamma: 0.99
strength: 1.0
```
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config1.png" alt="Config SnowballTarget"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config2.png" alt="Config SnowballTarget"/>
As an experiment, try to modify some other hyperparameters. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
Now that you've created the config file and understand what most hyperparameters do, we're ready to train our agent 🔥.
### Train the agent
To train our agent, we need to **launch mlagents-learn and select the executable containing the environment.**
We define four parameters:
1. `mlagents-learn <config>`: the path where the hyperparameter config file is.
2. `--env`: where the environment executable is.
3. `--run-id`: the name you want to give to your training run id.
4. `--no-graphics`: to not launch the visualization during the training.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentslearn.png" alt="MlAgents learn"/>
Train the model, and use the `--resume` flag to continue training in case of interruption.
> The first time you use `--resume`, it may fail. Try rerunning the block to bypass the error.
The training will take 10 to 35 minutes depending on your config. Go take a ☕, you deserve it 🤗.
```bash
!mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id="SnowballTarget1" --no-graphics
```
### Push the agent to the Hugging Face Hub
- Now that we've trained our agent, we're **ready to push it to the Hub to be able to visualize it playing in your browser 🔥.**
To be able to share your model with the community, there are three more steps to follow:
1⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
2⃣ Sign in and store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
- Copy the token
- Run the cell below and paste the token
```python
from huggingface_hub import notebook_login
notebook_login()
```
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
Then, we need to run `mlagents-push-to-hf`.
And we define four parameters:
1. `--run-id`: the name of the training run id.
2. `--local-dir`: where the agent was saved; it's `results/<run_id name>`, so in my case `results/SnowballTarget1`.
3. `--repo-id`: the name of the Hugging Face repo you want to create or update. Its always <your huggingface username>/<the repo name>
If the repo does not exist **it will be created automatically**
4. `--commit-message`: since HF repos are git repository you need to define a commit message.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentspushtohub.png" alt="Push to Hub"/>
For instance:
`!mlagents-push-to-hf --run-id="SnowballTarget1" --local-dir="./results/SnowballTarget1" --repo-id="ThomasSimonini/ppo-SnowballTarget" --commit-message="First Push"`
```python
# Fill in the four values below before running the cell
!mlagents-push-to-hf \
    --run-id="YourRunID" \
    --local-dir="./results/YourRunID" \
    --repo-id="YourUsername/YourRepoName" \
    --commit-message="Your commit message"
```
If everything worked, you should see this at the end of the process (but with a different url 😆):
```
Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-SnowballTarget
```
It's the link to your model. It contains a model card that explains how to use it, your Tensorboard, and your config file. **What's awesome is that it's a git repository, which means you can have different commits, update your repository with a new push, etc.**
But now comes the best: **being able to visualize your agent online 👀.**
### Watch your agent playing 👀
This step is simple:
1. Remember your repo-id
2. Go here: https://singularite.itch.io/snowballtarget
3. Launch the game and put it in full screen by clicking on the bottom right button
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_load.png" alt="Snowballtarget load"/>
4. In step 1, choose your model repository, which is the model id (in my case `ThomasSimonini/ppo-SnowballTarget`).
5. In step 2, **choose which model you want to replay**:
- I have multiple ones, since we saved a model every 50000 timesteps.
- If I want the most recent one, I choose `SnowballTarget.onnx`.
👉 What's nice **is to try different model checkpoints to see the improvement of the agent.**
And don't hesitate to share the best score your agent gets on Discord, in the #rl-i-made-this channel 🔥
Let's now try a more challenging environment called Pyramids.
## Pyramids 🏆
### Download and move the environment zip file in `./training-envs-executables/linux/`
- Our environment executable is in a zip file.
- We need to download it and place it in `./training-envs-executables/linux/`.
- We use a Linux executable because we use Colab, and the Colab machines' OS is Ubuntu (Linux).
Download the file Pyramids.zip from https://drive.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H using `wget`. Check out the full explanation of how to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)
```python
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H" -O ./training-envs-executables/linux/Pyramids.zip && rm -rf /tmp/cookies.txt
```
Unzip it
```python
%%capture
!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/Pyramids.zip
```
Make sure your file is accessible
```python
!chmod -R 755 ./training-envs-executables/linux/Pyramids/Pyramids
```
### Modify the PyramidsRND config file
- Contrary to the first environment, which was a custom one, **Pyramids was made by the Unity team**.
- So the PyramidsRND config file already exists and is in `./content/ml-agents/config/ppo/PyramidsRND.yaml`
- You might ask why "RND" is in PyramidsRND. RND stands for *random network distillation*; it's a way to generate curiosity rewards. If you want to know more about that, we wrote an article explaining this technique: https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
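In the config file, this curiosity signal shows up as an extra `rnd` entry under `reward_signals`, alongside the extrinsic reward. Its shape is roughly the following (illustrative values; check the actual file on your machine):

```yaml
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
  rnd:
    gamma: 0.99
    strength: 0.01
    network_settings:
      hidden_units: 64
    learning_rate: 0.0001
```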
For this training, we'll modify one thing:
- The total training steps hyperparameter is too high, since we can hit the benchmark (mean reward = 1.75) in only 1M training steps.
👉 To do that, we go to `config/ppo/PyramidsRND.yaml` **and change `max_steps` to 1000000.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-config.png" alt="Pyramids config"/>
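If you prefer editing the file from a notebook cell instead of opening it by hand, a small PyYAML snippet can make the change (a sketch, assuming the behavior in the file is named `Pyramids` and the path matches your setup):

```python
import yaml

config_path = "./config/ppo/PyramidsRND.yaml"  # adjust if your clone lives elsewhere

with open(config_path) as f:
    config = yaml.safe_load(f)

# 1M steps is enough to hit the benchmark (mean reward = 1.75)
config["behaviors"]["Pyramids"]["max_steps"] = 1000000

with open(config_path, "w") as f:
    yaml.dump(config, f, default_flow_style=False)
```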
As an experiment, you should also try to modify some other hyperparameters. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
We're now ready to train our agent 🔥.
### Train the agent
The training will take 30 to 45 minutes depending on your machine. Go take a ☕, you deserve it 🤗.
```python
!mlagents-learn ./config/ppo/PyramidsRND.yaml --env=./training-envs-executables/linux/Pyramids/Pyramids --run-id="Pyramids Training" --no-graphics
```
### Push the agent to the Hugging Face Hub
- Now that we've trained our agent, we're **ready to push it to the Hub so you can visualize it playing in your browser 🔥.**
```bash
# Fill in the four values below before running the cell
!mlagents-push-to-hf \
    --run-id="YourRunID" \
    --local-dir="./results/YourRunID" \
    --repo-id="YourUsername/YourRepoName" \
    --commit-message="Your commit message"
```
### Watch your agent playing 👀
The temporary link for the Pyramids demo is: https://singularite.itch.io/pyramids
### 🎁 Bonus: Why not train on another environment?
Now that you know how to train an agent using MLAgents, **why not try another environment?**
ML-Agents provides 18 different environments, and we're building some custom ones. The best way to learn is to try things on your own. Have fun!
![cover](https://miro.medium.com/max/1400/0*xERdThTRRM2k_U9f.png)
You can find the full list of the ones currently available on Hugging Face here 👉 https://github.com/huggingface/ml-agents#the-environments
For the demos to visualize your agent, the temporary link is: https://singularite.itch.io (temporary because we'll also put the demos on Hugging Face Space)
For now we have integrated:
- [Worm](https://singularite.itch.io/worm) demo where you teach a **worm to crawl**.
- [Walker](https://singularite.itch.io/walker) demo where you teach an agent **to walk towards a goal**.
If you want new demos to be added, please open an issue: https://github.com/huggingface/deep-rl-class 🤗
That's all for today. Congrats on finishing this tutorial!
The best way to learn is to practice and try stuff. Why not try another environment? ML-Agents has 18 different environments, but you can also create your own. Check the documentation and have fun!
See you on Unit 6 🔥,
## Keep Learning, Stay awesome 🤗
View File
@@ -0,0 +1,68 @@
# How do Unity ML-Agents work? [[how-mlagents-works]]
Before training our agent, we need to understand **what ML-Agents is and how it works**.
## What is Unity ML-Agents? [[what-is-mlagents]]
[Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents) is a toolkit for the game engine Unity that **allows us to create environments using Unity or use pre-made environments to train our agents**.
It's developed by [Unity Technologies](https://unity.com/), the developers of Unity, one of the most famous Game Engines, used by the creators of Firewatch, Cuphead, and Cities: Skylines.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/firewatch.jpeg" alt="Firewatch"/>
<figcaption>Firewatch was made with Unity</figcaption>
</figure>
## The six components [[six-components]]
With Unity ML-Agents, you have six essential components:
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/mlagents-1.png" alt="MLAgents"/>
<figcaption>Source: <a href="https://unity-technologies.github.io/ml-agents/">Unity ML-Agents Documentation</a> </figcaption>
</figure>
- The first is the *Learning Environment*, which contains **the Unity scene (the environment) and the environment elements** (game characters).
- The second is the *Python Low-level API*, which contains **the low-level Python interface for interacting and manipulating the environment**. Its the API we use to launch the training.
- Then, we have the *External Communicator* that **connects the Learning Environment (made with C#) with the low-level Python API (Python)**.
- The *Python trainers*: the **Reinforcement algorithms made with PyTorch (PPO, SAC…)**.
- The *Gym wrapper*: encapsulates the RL environment in a gym wrapper.
- The *PettingZoo wrapper*: PettingZoo is the multi-agent version of the gym wrapper.
## Inside the Learning Component [[inside-learning-component]]
Inside the Learning Component, we have **two important elements**:
- The first is the *agent component*, the actor of the scene. We'll **train the agent by optimizing its policy** (which will tell us what action to take in each state). The policy is called the *Brain*.
- Finally, there is the *Academy*. This component **orchestrates agents and their decision-making processes**. Think of this Academy as a teacher who handles Python API requests.
To better understand its role, let's remember the RL process. This can be modeled as a loop that works like this:
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%">
<figcaption>The RL Process: a loop of state, action, reward and next state</figcaption>
<figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto</a></figcaption>
</figure>
Now, let's imagine an agent learning to play a platform game. The RL process looks like this:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
- Environment goes to a **new** **state \\(S_1\\)** — new frame.
- The environment gives some **reward \\(R_1\\)** to the Agent — we're not dead *(Positive Reward +1)*.
This RL loop outputs a sequence of **state, action, reward and next state.** The goal of the agent is to **maximize the expected cumulative reward**.
The Academy will be the one that will **send the order to our Agents and ensure that agents are in sync**:
- Collect Observations
- Select your action using your policy
- Take the Action
- Reset if you reached the max step or if you're done.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/academy.png" alt="The MLAgents Academy" width="100%">
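To make this loop concrete, here is a minimal sketch of the interaction written against the Python low-level API mentioned above (a sketch with a placeholder executable path and random actions, not the trainer we'll actually use):

```python
# Minimal interaction loop with the ML-Agents Python low-level API.
# The executable path is a placeholder; actions are random, just for illustration.
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="./path/to/your/executable")  # placeholder path
env.reset()

behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for episode in range(3):
    env.reset()
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    while len(terminal_steps) == 0:
        # One random action per agent currently requesting a decision
        actions = spec.action_spec.random_action(len(decision_steps))
        env.set_actions(behavior_name, actions)
        env.step()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

env.close()
```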
Now that we understand how ML-Agents works, **we're ready to train our agents.**
View File
@@ -0,0 +1,31 @@
# An Introduction to Unity ML-Agents [[introduction-to-ml-agents]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png" alt="thumbnail"/>
One of the challenges in Reinforcement Learning is **creating environments**. Fortunately, we can use game engines to do so.
These engines, such as [Unity](https://unity.com/), [Godot](https://godotengine.org/) or [Unreal Engine](https://www.unrealengine.com/), are programs made to create video games. They are perfectly suited
for creating environments: they provide physics systems, 2D/3D rendering, and more.
One of them, [Unity](https://unity.com/), created the [Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents), a plugin based on the game engine Unity that allows us **to use the Unity Game Engine as an environment builder to train agents**. In the first bonus unit, this is what we used to train Huggy to catch a stick!
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/example-envs.png" alt="MLAgents environments"/>
<figcaption>Source: <a href="https://github.com/Unity-Technologies/ml-agents">ML-Agents documentation</a></figcaption>
</figure>
The Unity ML-Agents Toolkit provides many exceptional pre-made environments, from playing football (soccer) to learning to walk and jumping over big walls.
In this Unit, we'll learn to use ML-Agents, but **don't worry if you don't know how to use the Unity Game Engine**: you don't need to use it to train your agents.
So, today, we're going to train two agents:
- The first one will learn to **shoot snowballs onto spawning targets**.
- The second needs to **press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top**. To do that, it will need to explore its environment, which will be achieved using a technique called curiosity.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/envs.png" alt="Environments" />
Then, after training, **you'll push the trained agents to the Hugging Face Hub**, and you'll be able to **visualize them playing directly in your browser without having to use the Unity Editor**.
Doing this Unit will **prepare you for the next challenge: AI vs. AI, where you will train agents in multi-agent environments and compete against your classmates' agents**.
Sounds exciting? Let's get started!
View File
@@ -0,0 +1,39 @@
# The Pyramid environment
The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.png" alt="Pyramids Environment"/>
## The reward function
The reward function is:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward.png" alt="Pyramids Environment"/>
In terms of code, it looks like this
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward-code.png" alt="Pyramids Reward"/>
To train this new agent that seeks the button and then the Pyramid to destroy, we'll use a combination of two types of rewards:
- The *extrinsic one* given by the environment (illustration above).
- But also an *intrinsic* one called **curiosity**. The second will **push our agent to be curious, or in other terms, to better explore its environment**.
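Put simply, the signal the agent maximizes can be thought of as a weighted sum of the two rewards, \\( r_t = r_t^{extrinsic} + \beta \, r_t^{intrinsic} \\), where \\( \beta \\) (the `strength` parameter of each reward signal in the config) scales the curiosity bonus relative to the environment reward.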
If you want to know more about curiosity, the next section (optional) will explain the basics.
## The observation space
In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids_raycasts.png"/>
We also use a **boolean variable indicating the switch state** (did we turn on or off the switch to spawn the Pyramid) and a vector that **contains the agent's speed**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-obs-code.png" alt="Pyramids obs code"/>
## The action space
The action space is **discrete** with four possible actions:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-action.png" alt="Pyramids Environment"/>
View File
@@ -0,0 +1,57 @@
# The SnowballTarget Environment
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget.gif" alt="SnowballTarget"/>
SnowballTarget is an environment we created at Hugging Face using assets from [Kay Lousberg](https://kaylousberg.com/). We have an optional section at the end of this Unit **if you want to learn to use Unity and create your environments**.
## The agent's Goal
The first agent you're going to train is called Julien the bear 🐻. Julien is trained **to hit targets with snowballs**.
The goal in this environment is for Julien to **hit as many targets as possible in the limited time** (1000 timesteps). To do that, it will need **to position itself correctly relative to the target and shoot**.
In addition, to avoid "snowball spamming" (aka shooting a snowball every timestep), **Julien has a "cool off" system** (it needs to wait 0.5 seconds after shooting to be able to shoot again).
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/cooloffsystem.gif" alt="Cool Off System"/>
<figcaption>The agent needs to wait 0.5s before being able to shoot a snowball again</figcaption>
</figure>
## The reward function and the reward engineering problem
The reward function is simple. **The environment gives a +1 reward every time the agent's snowball hits a target**. Because the agent's Goal is to maximize the expected cumulative reward, **it will try to hit as many targets as possible**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_reward.png" alt="Reward system"/>
We could have a more complex reward function (with a penalty to push the agent to go faster, for example). But when you design an environment, you need to avoid the *reward engineering problem*: having an overly complex reward function that forces your agent to behave as you want it to.
Why? Because by doing that, **you might miss interesting strategies that the agent will find with a simpler reward function**.
In terms of code, it looks like this:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget-reward-code.png" alt="Reward"/>
## The observation space
Regarding observations, we don't use normal vision (frame), but **we use raycasts**.
Think of raycasts as lasers that will detect if they pass through an object.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/raycasts.png" alt="Raycasts"/>
<figcaption>Source: <a href="https://github.com/Unity-Technologies/ml-agents">ML-Agents documentation</a></figcaption>
</figure>
In this environment, our agent has multiple sets of raycasts:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowball_target_raycasts.png" alt="Raycasts"/>
In addition to raycasts, the agent gets a "can I shoot" bool as observation.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget-obs-code.png" alt="Obs"/>
## The action space
The action space is discrete:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_action_space.png" alt="Action Space"/>
View File
@@ -0,0 +1,17 @@
# Additional Readings [[additional-readings]]
## Bias-variance tradeoff in Reinforcement Learning
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
## Advantage Functions
- [Advantage Functions, SpinningUp RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html?highlight=advantage%20functio#advantage-functions)
## Actor Critic
- [Foundations of Deep RL Series, L3 Policy Gradients and Advantage Estimation by Pieter Abbeel](https://www.youtube.com/watch?v=AKbX1Zvo7r8)
- [A2C Paper: Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783v2)
View File
@@ -0,0 +1,70 @@
# Advantage Actor-Critic (A2C) [[advantage-actor-critic]]
## Reducing variance with Actor-Critic methods
The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*.
To understand the Actor-Critic, imagine you're playing a video game. You can play with a friend who will give you some feedback. You're the Actor and your friend is the Critic.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/ac.jpg" alt="Actor Critic"/>
You don't know how to play at the beginning, **so you try some actions randomly**. The Critic observes your action and **provides feedback**.
Learning from this feedback, **you'll update your policy and be better at playing that game.**
On the other hand, your friend (Critic) will also update their way to provide feedback so it can be better next time.
This is the idea behind Actor-Critic. We learn two function approximations:
- *A policy* that **controls how our agent acts**: \\( \pi_{\theta}(s,a) \\)
- *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\)
## The Actor-Critic Process
Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how Actor and Critic improve together during the training.
As we saw, with Actor-Critic methods, there are two function approximations (two neural networks):
- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s,a) \\)
- *Critic*, a **value function** parameterized by w: \\( \hat{q}_{w}(s,a) \\)
Let's see the training process to understand how Actor and Critic are optimized:
- At each timestep, t, we get the current state \\( S_t\\) from the environment and **pass it as input through our Actor and Critic**.
- Our Policy takes the state and **outputs an action** \\( A_t \\).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step1.jpg" alt="Step 1 Actor Critic"/>
- The Critic also takes that action as input and, using \\( S_t\\) and \\( A_t \\), **computes the value of taking that action at that state: the Q-value**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step2.jpg" alt="Step 2 Actor Critic"/>
- The action \\( A_t\\) performed in the environment outputs a new state \\( S_{t+1}\\) and a reward \\( R_{t+1} \\) .
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step3.jpg" alt="Step 3 Actor Critic"/>
- The Actor updates its policy parameters using the Q value.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step4.jpg" alt="Step 4 Actor Critic"/>
- Thanks to its updated parameters, the Actor produces the next action to take at \\( A_{t+1} \\) given the new state \\( S_{t+1} \\).
- The Critic then updates its value parameters.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step5.jpg" alt="Step 5 Actor Critic"/>
## Adding Advantage in Actor-Critic (A2C)
We can stabilize learning further by **using the Advantage function as Critic instead of the Action value function**.
The idea is that the Advantage function calculates the relative advantage of an action compared to the others possible at a state: **how taking that action at a state is better compared to the average value of the state**. It subtracts the mean value of the state from the state-action pair value:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/advantage1.jpg" alt="Advantage Function"/>
In other words, this function calculates **the extra reward we get if we take this action at that state compared to the mean reward we get at that state**.
The extra reward is what's beyond the expected value of that state.
- If A(s,a) > 0: our gradient is **pushed in that direction**.
- If A(s,a) < 0 (our action does worse than the average value of that state), **our gradient is pushed in the opposite direction**.
The problem with implementing this advantage function is that it requires two value functions — \\( Q(s,a)\\) and \\( V(s)\\). Fortunately, **we can use the TD error as a good estimator of the advantage function.**
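Written out, this gives \\( A(s,a) = Q(s,a) - V(s) \approx r + \gamma V(s') - V(s) \\): the TD-error form only requires learning a single value function \\( V(s) \\).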
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/advantage2.jpg" alt="Advantage Function"/>
View File
@@ -0,0 +1,11 @@
# Conclusion [[conclusion]]
Congrats on finishing this unit and the tutorial. You've just trained your first virtual robots 🥳.
**Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section.
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
See you in next unit,
### Keep learning, stay awesome 🤗,
464
units/en/unit6/hands-on.mdx Normal file
View File
@@ -0,0 +1,464 @@
# Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖 [[hands-on]]
<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit6/unit6.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />
Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. You'll train two robots:
- A spider 🕷️ to learn to move.
- A robotic arm 🦾 to move in the correct position.
We're going to use two Robotics environments:
- [PyBullet](https://github.com/bulletphysics/bullet3)
- [panda-gym](https://github.com/qgallouedec/panda-gym)
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif" alt="Environments"/>
To validate this hands-on for the certification process, you need to push your two trained models to the Hub and get the following results:
- `AntBulletEnv-v0` get a result of >= 650.
- `PandaReachDense-v2` get a result of >= -3.5.
To find your result, [go to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
**To start the hands-on click on Open In Colab button** 👇 :
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit6/unit6.ipynb)
# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖
### 🎮 Environments:
- [PyBullet](https://github.com/bulletphysics/bullet3)
- [Panda-Gym](https://github.com/qgallouedec/panda-gym)
### 📚 RL-Library:
- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
## Objectives of this notebook 🏆
At the end of the notebook, you will:
- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries.
- Be able to **train robots using A2C**.
- Understand why **we need to normalize the input**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
## Prerequisites 🏗️
Before diving into the notebook, you need to:
🔲 📚 Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) 🤗
# Let's train our first robots 🤖
## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">
- `Hardware Accelerator > GPU`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
## Create a virtual display 🔽
During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
Hence, the following cell will install the libraries and create and run a virtual screen 🖥
```python
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay
```
```python
# Virtual display
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```
### Install dependencies 🔽
The first step is to install the dependencies. We'll install multiple ones:
- `pybullet`: Contains the walking robots environments.
- `panda-gym`: Contains the robotics arm environments.
- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.
- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.
```bash
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt
```
## Import the packages 📦
```python
import pybullet_envs
import panda_gym
import gym
import os
from huggingface_sb3 import load_from_hub, package_to_hub
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.env_util import make_vec_env
from huggingface_hub import notebook_login
```
## Environment 1: AntBulletEnv-v0 🕸
### Create the AntBulletEnv-v0
#### The environment 🎮
In this environment, the agent needs to use its different joints correctly in order to walk.
You can find a detailed explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet
```python
env_id = "AntBulletEnv-v0"
# Create the env
env = gym.make(env_id)
# Get the state space and action space
s_size = env.observation_space.shape[0]
a_size = env.action_space
```
```python
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation
```
The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
The difference is that our observation space is 28 not 29.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/obs_space.png" alt="PyBullet Ant Obs space"/>
```python
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action
```
The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/action_space.png" alt="PyBullet Ant Obs space"/>
### Normalize observation and rewards
A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).
For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.
We also normalize rewards with this same wrapper by adding `norm_reward = True`
[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
```python
env = make_vec_env(env_id, n_envs=4)
# Adding this wrapper to normalize the observation and the reward
env = # TODO: Add the wrapper
```
#### Solution
```python
env = make_vec_env(env_id, n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
```
### Create the A2C Model 🤖
In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy.
For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes
To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3).
```python
model = # Create the A2C model and try to find the best parameters
```
#### Solution
```python
model = A2C(
policy="MlpPolicy",
env=env,
gae_lambda=0.9,
gamma=0.99,
learning_rate=0.00096,
max_grad_norm=0.5,
n_steps=8,
vf_coef=0.4,
ent_coef=0.0,
policy_kwargs=dict(log_std_init=-2, ortho_init=False),
normalize_advantage=False,
use_rms_prop=True,
use_sde=True,
verbose=1,
)
```
### Train the A2C agent 🏃
- Let's train our agent for 2,000,000 timesteps. Don't forget to use the GPU on Colab. It will take approximately 25-40 minutes.
```python
model.learn(2_000_000)
```
```python
# Save the model and VecNormalize statistics when saving the agent
model.save("a2c-AntBulletEnv-v0")
env.save("vec_normalize.pkl")
```
### Evaluate the agent 📈
- Now that our agent is trained, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`
- In my case, I got a mean reward of `2371.90 +/- 16.50`
```python
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("AntBulletEnv-v0")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
# do not update them at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False
# Load the agent
model = A2C.load("a2c-AntBulletEnv-v0")
mean_reward, std_reward = evaluate_policy(model, eval_env)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```
### Publish your trained model on the Hub 🔥
Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.
📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20
Here's an example of a Model Card (with a PyBullet environment):
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/modelcardpybullet.png" alt="Model Card Pybullet"/>
By using `package_to_hub`, as we already mentioned in the former units, **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.
This way:
- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share with the community an agent that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
To be able to share your model with the community, there are three more steps to follow:
1⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
2⃣ Sign in and store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
- Copy the token
- Run the cell below and paste the token
```python
notebook_login()
!git config --global credential.helper store
```
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
3⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function
```python
package_to_hub(
model=model,
model_name=f"a2c-{env_id}",
model_architecture="A2C",
env_id=env_id,
eval_env=eval_env,
repo_id=f"ThomasSimonini/a2c-{env_id}", # Change the username
commit_message="Initial commit",
)
```
## Take a coffee break ☕
- You already trained your first robot that learned to move, congratulations 🥳!
- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.
## Environment 2: PandaReachDense-v2 🦾
The agent we're going to train is a robotic arm that needs to perform controls (moving the arm and using the end-effector).
In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.
In `PandaReach`, the robot must place its end-effector at a target position (green ball).
We're going to use the dense version of this environment. This means we'll get a *dense reward function* that **provides a reward at each timestep** (the closer the agent is to completing the task, the higher the reward), contrary to a *sparse reward function* where the environment **returns a reward if and only if the task is completed**.
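To illustrate the difference, here is a small sketch in the spirit of panda-gym's reward functions (hypothetical helper functions, not the library's actual code):

```python
import numpy as np

# Dense: a signal at every timestep, growing as the end-effector gets closer
def dense_reward(ee_position, target_position):
    return -np.linalg.norm(ee_position - target_position)

# Sparse: no signal until the task is completed
def sparse_reward(ee_position, target_position, threshold=0.05):
    distance = np.linalg.norm(ee_position - target_position)
    return 0.0 if distance < threshold else -1.0
```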
Also, we're going to use *end-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg" alt="Robotics"/>
This way **the training will be easier**.
In `PandaReachDense-v2`, the robotic arm must place its end-effector at a target position (green ball).
```python
import gym
env_id = "PandaReachDense-v2"
# Create the env
env = gym.make(env_id)
# Get the state space and action space
s_size = env.observation_space  # the observation space is a Dict, so it has no single shape
a_size = env.action_space
```
```python
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation
```
The observation space **is a dictionary with 3 different elements**:
- `achieved_goal`: the current (x,y,z) position of the end-effector.
- `desired_goal`: the (x,y,z) position the end-effector has to reach (the goal).
- `observation`: the position (x,y,z) and velocity (vx, vy, vz) of the end-effector.
Since the observation is a dictionary, **we will need to use a MultiInputPolicy instead of an MlpPolicy**.
```python
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action
```
The action space is a vector with 3 values:
- Control x, y, z movement
Now it's your turn:
1. Define the environment called "PandaReachDense-v2"
2. Make a vectorized environment
3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
4. Create the A2C Model (don't forget verbose=1 to print the training logs).
5. Train it for 1M Timesteps
6. Save the model and VecNormalize statistics when saving the agent
7. Evaluate your agent
8. Publish your trained model on the Hub 🔥 with `package_to_hub`
### Solution (fill the todo)
```python
# 1 - 2
env_id = "PandaReachDense-v2"
env = make_vec_env(env_id, n_envs=4)
# 3
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
# 4
model = A2C(policy="MultiInputPolicy", env=env, verbose=1)
# 5
model.learn(1_000_000)
```
```python
# 6
model_name = "a2c-PandaReachDense-v2"
model.save(model_name)
env.save("vec_normalize.pkl")
# 7
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v2")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
# do not update them at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False
# Load the agent
model = A2C.load(model_name)
mean_reward, std_reward = evaluate_policy(model, eval_env)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
# 8
package_to_hub(
model=model,
model_name=f"a2c-{env_id}",
model_architecture="A2C",
env_id=env_id,
eval_env=eval_env,
repo_id=f"ThomasSimonini/a2c-{env_id}", # TODO: Change the username
commit_message="Initial commit",
)
```
## Some additional challenges 🏆
The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?
If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**.
PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1
And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html
Here are some ideas to achieve so:
* Train more steps
* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=AntBulletEnv-v0
* **Push your new trained model** on the Hub 🔥
See you on Unit 7! 🔥
## Keep learning, stay awesome 🤗
View File
@@ -0,0 +1,25 @@
# Introduction [[introduction]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png" alt="Thumbnail"/>
In unit 4, we learned about our first Policy-Based algorithm called **Reinforce**.
In Policy-Based methods, **we aim to optimize the policy directly without using a value function**. More precisely, Reinforce is part of a subclass of *Policy-Based Methods* called *Policy-Gradient methods*. This subclass optimizes the policy directly by **estimating the weights of the optimal policy using Gradient Ascent**.
We saw that Reinforce worked well. However, because we use Monte-Carlo sampling to estimate return (we use an entire episode to calculate the return), **we have significant variance in policy gradient estimation**.
Remember that the policy gradient estimation is **the direction of the steepest increase in return**. In other words, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, **leads to slower training since we need a lot of samples to mitigate it**.
So, today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and Policy-Based methods that help to stabilize the training by reducing the variance:
- *An Actor* that controls **how our agent behaves** (Policy-Based method)
- *A Critic* that measures **how good the taken action is** (Value-Based method)
We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train two robots:
- A spider 🕷️ to learn to move.
- A robotic arm 🦾 to move in the correct position.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif" alt="Environments"/>
Sounds exciting? Let's get started!
View File
@@ -0,0 +1,30 @@
# The Problem of Variance in Reinforce [[the-problem-of-variance-in-reinforce]]
In Reinforce, we want to **increase the probability of actions in a trajectory proportional to how high the return is**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/pg.jpg" alt="Reinforce"/>
- If the **return is high**, we will **push up** the probabilities of the (state, action) combinations.
- Else, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations.
This return \\(R(\tau)\\) is calculated using a *Monte-Carlo sampling*. We collect a trajectory and calculate the discounted return, **and use this score to increase or decrease the probability of every action taken in that trajectory**. If the return is good, all actions will be “reinforced” by increasing their likelihood of being taken.
\\(R(\tau) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...\\)
The advantage of this method is that **it's unbiased. Since we're not estimating the return**, we use only the true return we obtain.
Given the stochasticity of the environment (random events during an episode) and stochasticity of the policy, **trajectories can lead to different returns, which can lead to high variance**. Consequently, the same starting state can lead to very different returns.
Because of this, **the return starting at the same state can vary significantly across episodes**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/variance.jpg" alt="variance"/>
The solution is to mitigate the variance by **using a large number of trajectories, hoping that the variance introduced in any one trajectory will be reduced in aggregate and provide a "true" estimation of the return.**
However, increasing the batch size significantly **reduces sample efficiency**. So we need to find additional mechanisms to reduce the variance.
---
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
---
View File
@@ -0,0 +1,17 @@
# Additional Readings [[additional-readings]]
## An introduction to multi-agents
- [Multi-agent reinforcement learning: An overview](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf)
- [Multiagent Reinforcement Learning, Marc Lanctot](https://rlss.inria.fr/files/2019/07/RLSS_Multiagent.pdf)
- [Example of a multi-agent environment](https://www.mathworks.com/help/reinforcement-learning/ug/train-3-agents-for-area-coverage.html?s_eid=PSM_15028)
- [A list of different multi-agent environments](https://agents.inf.ed.ac.uk/blog/multiagent-learning-environments/)
- [Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents](https://bit.ly/3nVK7My)
- [Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning](https://bit.ly/3v7LxaT)
## Self-Play and MA-POCA
- [Self Play Theory and with MLAgents](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
- [Training complex behavior with MLAgents](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors)
- [MLAgents plays dodgeball](https://blog.unity.com/technology/ml-agents-plays-dodgeball)
- [On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning (MA-POCA)](https://arxiv.org/pdf/2111.05992.pdf)
View File
@@ -0,0 +1,11 @@
# Conclusion
That's all for today. Congrats on finishing this unit and the tutorial!
The best way to learn is to practice and try stuff. **Why not train another agent with a different configuration?**
And don't hesitate to check the [leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos) from time to time
See you on Unit 8 🔥,
## Keep Learning, Stay awesome 🤗
322
units/en/unit7/hands-on.mdx Normal file
View File
@@ -0,0 +1,322 @@
# Hands-on
Now that you've learned the basics of multi-agents, you're ready to train your first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
And you're going to participate in AI vs. AI challenges, where your trained agent will compete against your classmates' **agents every day and be ranked on a new leaderboard.**
To validate this hands-on for the certification process, you just need to push a trained model. There **are no minimal results to attain to validate it.**
For more information about the certification process, check this section 👉 [https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process)
This hands-on will be different, since to get correct results **you need to train your agents from 4 hours to 8 hours**. And given the risk of timeout in Colab, we advise you to train on your computer. You don't need a supercomputer: a simple laptop is good enough for this exercise.
Let's get started! 🔥
## What is AI vs. AI?
AI vs. AI is an open-source tool we developed at Hugging Face to pit agents on the Hub against one another in a multi-agent setting. These models are then ranked in a leaderboard.
The idea of this tool is to have a robust evaluation tool: **by evaluating your agent against a lot of others, you'll get a good idea of the quality of your policy.**
More precisely, AI vs. AI is three tools:
- A *matchmaking process* defining the matches (which model against which) and running the model fights using a background task in the Space.
- A *leaderboard* getting the match history results and displaying the models ELO ratings: [https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos)
- A *Space demo* to visualize your agents playing against others: [https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos](https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos)
We're going to write a blog post to explain this AI vs. AI tool in detail, but to give you the big picture it works this way:
- Every four hours, our algorithm **fetches all the available models for a given environment (in our case ML-Agents-SoccerTwos).**
- It creates a **queue of matches with the matchmaking algorithm.**
- We simulate the match in a Unity headless process and **gather the match result** (1 if the first model won, 0.5 if it's a draw, 0 if the second model won) in a Dataset.
- Then, when all matches from the matches queue are done, **we update the ELO score for each model and update the leaderboard.**
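To give an intuition of the ranking step, here is a sketch of a standard ELO update after a single match (the classic formula; the exact constants and bookkeeping used by the tool may differ):

```python
def update_elo(rating_a, rating_b, score_a, k=32):
    """score_a: 1 if model A won, 0.5 for a draw, 0 if model B won."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - expected_b)
    return new_a, new_b

print(update_elo(1200, 1200, 1))  # (1216.0, 1184.0): the winner gains what the loser gives up
```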
### Competition Rules
This first AI vs. AI competition **is an experiment**: the goal is to improve the tool in the future with your feedback. So some **breakdowns can happen during the challenge**. But don't worry:
**all the results are saved in a dataset, so we can always restart the calculation correctly without losing information**.
In order for your model to be correctly evaluated against others, you need to follow these rules:
1. **You can't change the observation space or action space of the agent.** By doing that your model will not work during evaluation.
2. You **can't use a custom trainer for now,** you need to use Unity MLAgents ones.
3. We provide executables to train your agents. You can also use the Unity Editor if you prefer**, but to avoid bugs, we advise you to use our executables**.
What will make the difference during this challenge are **the hyperparameters you choose**.
The AI vs AI algorithm will run until April the 30th, 2023.
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
### Exchange with your classmates, share advice and ask questions on Discord
- We created a new channel called `ai-vs-ai-challenge` to exchange advice and ask questions.
- If you haven't joined the discord server yet, you can [join here](https://discord.gg/ydHrjt3WP5)
## Step 0: Install MLAgents and download the correct executable
⚠ We're going to use an experimental version of ML-Agents which allows you to push and load your models to/from the Hub. **You need to install the same version.**
⚠ ⚠ ⚠ We're not going to use the same version as in Unit 5: Introduction to ML-Agents ⚠ ⚠ ⚠
We advise you to use [conda](https://docs.conda.io/en/latest/) as a package manager and create a new environment.
With conda, we create a new environment called rl with **Python 3.9**:
```bash
conda create --name rl python=3.9
conda activate rl
```
To be able to train our agents correctly and push to the Hub, we need to install an experimental version of ML-Agents (the `aivsai` branch of the Hugging Face ML-Agents fork):
```bash
git clone --branch aivsai https://github.com/huggingface/ml-agents
```
When the cloning is done (it takes 2.63 GB), we go inside the repository and install the package
```bash
cd ml-agents
pip install -e ./ml-agents-envs
pip install -e ./ml-agents
```
We also need to install pytorch with:
```bash
pip install torch
```
Finally, you need to install git-lfs: https://git-lfs.com/
Now that it's installed, we need to add the environment training executable. Based on your operating system, you need to download one of them, unzip it, and place it in a new folder inside `ml-agents` that you call `training-envs-executables`.
In the end, your executable should be in `ml-agents/training-envs-executables/SoccerTwos`.
Windows: Download [this executable](https://drive.google.com/file/d/1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP/view?usp=sharing)
Linux (Ubuntu): Download [this executable](https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing)
Mac: Download [this executable](https://drive.google.com/drive/folders/1h7YB0qwjoxxghApQdEUQmk95ZwIDxrPG?usp=share_link)
⚠ For Mac, you also need to run `xattr -cr training-envs-executables/SoccerTwos/SoccerTwos.app` to be able to run SoccerTwos.
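On Linux, for instance, the whole placement step could look like this (a sketch, assuming the archive was downloaded as `SoccerTwos.zip` into your `ml-agents` folder):

```bash
mkdir -p training-envs-executables
unzip SoccerTwos.zip -d training-envs-executables/
chmod -R 755 training-envs-executables/SoccerTwos
```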
## Step 1: Understand the environment
The environment is called `SoccerTwos`. The Unity MLAgents Team made it. You can find its documentation [here](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos)
The goal in this environment **is to get the ball into the opponent's goal while preventing the ball from entering its own goal.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
<figcaption>This environment was made by the <a href="https://github.com/Unity-Technologies/ml-agents"> Unity MLAgents Team</a></figcaption>
</figure>
### The reward function
The reward function is:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccerreward.png" alt="SoccerTwos Reward"/>
### The observation space
The observation space is composed of a vector of size 336:
- 11 ray-casts forward distributed over 120 degrees (264 state dimensions)
- 3 ray-casts backward distributed over 90 degrees (72 state dimensions)
- Both of these ray-casts can detect 6 objects:
- Ball
- Blue Goal
- Purple Goal
- Wall
- Blue Agent
- Purple Agent
### The action space
The action space is three discrete branches:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/socceraction.png" alt="SoccerTwos Action"/>
## Step 2: Understand MA-POCA
We know how to train agents to play against others: **we can use self-play.** This is a perfect technique for a 1vs1.
But in our case, we're playing 2vs2, and each team has 2 agents. How then can we **train cooperative behavior for groups of agents?**
As explained in the [Unity Blog](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors), agents typically receive a reward as a group (+1 - penalty) when the team scores a goal. This implies that **every agent on the team is rewarded even if each agent didn't contribute the same to the win**, which makes it difficult to learn what to do independently.
The Unity MLAgents team developed the solution in a new multi-agent trainer called *MA-POCA (Multi-Agent POsthumous Credit Assignment)*.
The idea is simple but powerful: a centralized critic **processes the states of all agents in the team to estimate how well each agent is doing**. Think of this critic as a coach.
This allows each agent to **make decisions based only on what it perceives locally**, and **simultaneously evaluate how good its behavior is in the context of the whole group**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/mapoca.png" alt="MA POCA"/>
<figcaption>This illustrates MA-POCAs centralized learning and decentralized execution. Source: <a href="https://blog.unity.com/technology/ml-agents-plays-dodgeball">MLAgents Plays Dodgeball</a>
</figcaption>
</figure>
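To make the centralized-critic idea concrete, here is a minimal PyTorch-shaped sketch. It is illustrative only: the layer sizes and wiring are assumptions, not the actual MA-POCA implementation.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions = 2, 336, 3   # one team's agents; shapes are assumptions

# Each agent acts from its own local observation (decentralized execution)
actor = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(), nn.Linear(512, n_actions))

# The critic (the "coach") sees all teammates' observations (centralized training)
critic = nn.Sequential(nn.Linear(n_agents * obs_dim, 512), nn.ReLU(), nn.Linear(512, 1))

local_obs = torch.randn(n_agents, obs_dim)
action_logits = [actor(obs) for obs in local_obs]  # decisions from local perception only
team_value = critic(local_obs.flatten())           # one value estimate for the whole team
```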
The solution, then, is to use self-play with an MA-POCA trainer (called *poca*): the poca trainer will help us train cooperative behavior, while self-play will provide an opponent team.
If you want to dive deeper into the MA-POCA algorithm, you should read the paper they published [here](https://arxiv.org/pdf/2111.05992.pdf) and the sources we put in the additional readings section.
## Step 3: Define the config file
We already learned in [Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction) that in ML-Agents, you define **the training hyperparameters in `config.yaml` files.**
There are multiple hyperparameters. To know them better, you should check the explanation of each of them in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)**.
The config file we're going to use here is `./config/poca/SoccerTwos.yaml`. It looks like this:
```yaml
behaviors:
SoccerTwos:
trainer_type: poca
hyperparameters:
batch_size: 2048
buffer_size: 20480
learning_rate: 0.0003
beta: 0.005
epsilon: 0.2
lambd: 0.95
num_epoch: 3
learning_rate_schedule: constant
network_settings:
normalize: false
hidden_units: 512
num_layers: 2
vis_encode_type: simple
reward_signals:
extrinsic:
gamma: 0.99
strength: 1.0
keep_checkpoints: 5
max_steps: 5000000
time_horizon: 1000
summary_freq: 10000
self_play:
save_steps: 50000
team_change: 200000
swap_steps: 2000
window: 10
play_against_latest_model_ratio: 0.5
initial_elo: 1200.0
```
Compared to Pyramids or SnowballTarget, we have new hyperparameters in the self-play part. How you modify them can be critical to getting good results.
The advice I can give you here is to check the explanation and recommended value for each parameter (especially the self-play ones) in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md).**
Now that you've modified the config file, you're ready to train your agents.
## Step 4: Start the training
To train the agents, we need to **launch mlagents-learn and select the executable containing the environment.**
We define four parameters:
1. `mlagents-learn <config>`: the path to the hyperparameter config file.
2. `--env`: the path to the environment executable.
3. `--run-id`: the name you want to give to your training run.
4. `--no-graphics`: to not launch the visualization during training.
Depending on your hardware, 5M timesteps (the recommended value, but you can also try 10M) will take 5 to 8 hours of training. You can continue using your computer in the meantime, but I advise deactivating standby mode to prevent the training from being stopped.
Depending on the executable you use (Windows, Ubuntu, Mac), the training command will look like this (your executable path can be different, so don't hesitate to check before running):
```bash
mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos.exe --run-id="SoccerTwos" --no-graphics
```
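If you're on Linux or Mac, the executable path will likely differ. Assuming the unzip layout from the setup step above (do check your actual paths), the variants would look like this:

```bash
# Linux (assumed path)
mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos/SoccerTwos --run-id="SoccerTwos" --no-graphics

# Mac (assumed path)
mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos/SoccerTwos.app --run-id="SoccerTwos" --no-graphics
```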
The executable contains 8 copies of SoccerTwos.
⚠️ It's normal if you don't see a big increase in ELO score (it can even drop below 1200) before 2M timesteps, since your agents will spend most of their time moving randomly on the field before being able to score.
⚠️ You can stop the training with Ctrl + C, but beware of typing this command only once, since ML-Agents needs to generate a final .onnx file before closing the run.
## Step 5: **Push the agent to the Hugging Face Hub**
Now that we've trained our agents, we're **ready to push them to the Hub so you can participate in the AI vs. AI challenge and visualize them playing in your browser 🔥.**
To be able to share your model with the community, there are three more steps to follow:
1⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
2⃣ Sign in and store your authentication token from the Hugging Face website.
Create a new token (https://huggingface.co/settings/tokens) **with write role**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
Copy the token, run the following, and paste the token when prompted:
```bash
huggingface-cli login
```
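If you're working from a notebook rather than a terminal, the equivalent helper from `huggingface_hub` does the same thing:

```python
from huggingface_hub import notebook_login

notebook_login()  # opens a prompt where you can paste your token
```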
Then, we need to run `mlagents-push-to-hf`.
We define four parameters:
1. `--run-id`: the name of the training run id.
2. `--local-dir`: where the agent was saved; it's `results/<run_id name>`, so in my case `results/SoccerTwos`.
3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It's always `<your huggingface username>/<the repo name>`.
If the repo does not exist, **it will be created automatically**.
4. `--commit-message`: since HF repos are git repositories, you need to define a commit message.
In my case:
```bash
mlagents-push-to-hf --run-id="SoccerTwos" --local-dir="./results/SoccerTwos" --repo-id="ThomasSimonini/poca-SoccerTwos" --commit-message="First Push"`
```
And here is the template to fill in:
```bash
mlagents-push-to-hf \
  --run-id="<your run id>" \
  --local-dir="<your local dir>" \
  --repo-id="<your repo id>" \
  --commit-message="First Push"
```
If everything worked, you should see this at the end of the process (but with a different URL 😆):
Your model is pushed to the Hub. You can view your model here: https://huggingface.co/ThomasSimonini/poca-SoccerTwos
It's the link to your model. It contains a model card that explains how to use it, your Tensorboard, and your config file. **What's awesome is that it's a git repository, which means you can have different commits, update your repository with a new push, etc.**
## Step 6: Verify that your model is ready for AI vs AI Challenge
Now that your model is pushed to the Hub, **it's going to be added automatically to the AI vs. AI Challenge model pool.** It can take a little bit of time before your model is added to the leaderboard, given that we run matches every 4 hours.
But to make sure that everything works perfectly, you need to check that:
1. You have the `ML-Agents-SoccerTwos` tag in your model. This is the tag we use to select models to be added to the challenge pool. To check it, go to your model page and look at the tags.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/verify1.png" alt="Verify"/>
If it's not there, you just need to modify the README and add it.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/verify2.png" alt="Verify"/>
2. You have a `SoccerTwos.onnx` file.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/verify3.png" alt="Verify"/>
We strongly suggest that you create a new model when you push to the Hub if you want to train your agent again or train a new version.
## Step 7: Visualize some matches in our demo
Now that your model is part of AI vs AI Challenge, **you can visualize how good it is compared to others**: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos
To do that, you just need to go to this demo:
- Select your model as the blue team (or the purple team if you prefer) and another model as the opponent. The best models to compare yours against are either the one at the top of the leaderboard or the [baseline model](https://huggingface.co/unity/MLAgents-SoccerTwos).
The matches you see live are not used in the calculation of your results, **but they are a good way to visualize how good your agent is**.
And don't hesitate to share the best score your agent gets on Discord in the #rl-i-made-this channel 🔥
View File
@@ -0,0 +1,55 @@
# An introduction to Multi-Agent Reinforcement Learning (MARL)
## From single agent to multiple agents
In the first unit, we learned to train agents in a single-agent system, where our agent was alone in its environment: **it was not cooperating or collaborating with other agents**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/patchwork.jpg" alt="Patchwork"/>
<figcaption>
A patchwork of all the environments you've trained your agents on since the beginning of the course
</figcaption>
</figure>
When we do multi-agent reinforcement learning (MARL), we are in a situation where we have multiple agents **that share and interact in a common environment**.
For instance, you can think of a warehouse where **multiple robots need to navigate to load and unload packages**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/warehouse.jpg" alt="Warehouse"/>
<figcaption> <a href="https://www.freepik.com/free-vector/robots-warehouse-interior-automated-machines_32117680.htm#query=warehouse%20robot&position=17&from_view=keyword">Image by upklyak</a> on Freepik </figcaption>
</figure>
Or a road with **several autonomous vehicles**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/selfdrivingcar.jpg" alt="Self driving cars"/>
<figcaption>
<a href="https://www.freepik.com/free-vector/autonomous-smart-car-automatic-wireless-sensor-driving-road-around-car-autonomous-smart-car-goes-scans-roads-observe-distance-automatic-braking-system_26413332.htm#query=self%20driving%20cars%20highway&position=34&from_view=search&track=ais">Image by jcomp</a> on Freepik
</figcaption>
</figure>
In these examples, we have **multiple agents interacting in the environment and with the other agents**. This implies defining a multi-agent system. But first, let's understand the different types of multi-agent environments.
## Different types of multi-agent environments
Given that in a multi-agent system, agents interact with other agents, we can have different types of environments:
- *Cooperative environments*: where your agents need **to maximize the common benefit**.
For instance, in a warehouse, **robots must collaborate to load and unload the packages as efficiently as possible**.
- *Competitive/Adversarial environments*: in this case, your agents **want to maximize their benefit by minimizing the opponent's**.
For example, in a game of tennis, **each agent wants to beat the other agent**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/tennis.png" alt="Tennis"/>
- *Mixed adversarial and cooperative environments*: like in our SoccerTwos environment, where two agents are part of a team (blue or purple): they need to cooperate with each other and beat the opponent team.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
<figcaption>This environment was made by the <a href="https://github.com/Unity-Technologies/ml-agents">Unity MLAgents Team</a></figcaption>
</figure>
So now we might wonder: how can we design these multi-agent systems? Said differently, **how can we train agents in a multi-agent setting**?
View File
@@ -0,0 +1,36 @@
# Introduction [[introduction]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/thumbnail.png" alt="Thumbnail"/>
Since the beginning of this course, we learned to train agents in a *single-agent system* where our agent was alone in its environment: it was **not cooperating or collaborating with other agents**.
This worked great, and the single-agent system is useful for many applications.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/patchwork.jpg" alt="Patchwork"/>
<figcaption>
A patchwork of all the environments youve trained your agents on since the beginning of the course
</figcaption>
</figure>
But, as humans, **we live in a multi-agent world**. Our intelligence comes from interaction with other agents. And so, our **goal is to create agents that can interact with other humans and other agents**.
Consequently, we must study how to train deep reinforcement learning agents in a *multi-agent system* to build robust agents that can adapt, collaborate, or compete.
So today, we're going to **learn the basics of this fascinating topic of multi-agent reinforcement learning (MARL)**.
And the most exciting part is that during this unit, you're going to train your first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
And you're going to participate in the **AI vs. AI challenge**, where your trained agent will compete against other classmates' agents every day and be ranked on a [new leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos).
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
<figcaption>This environment was made by the <a href="https://github.com/Unity-Technologies/ml-agents">Unity MLAgents Team</a></figcaption>
</figure>
So let's get started!
View File
@@ -0,0 +1,57 @@
# Designing Multi-Agent Systems
For this section, you're going to watch this excellent introduction to multi-agent systems made by <a href="https://www.youtube.com/channel/UCq0imsn84ShAe9PBOFnoIrg"> Brian Douglas </a>.
<Youtube id="qgb0gyrpiGk" />
In this video, Brian talked about how to design multi-agent systems. He specifically took a multi-agent setting of vacuum cleaners and asked: how can they **cooperate with each other**?
We have two possible approaches to designing this multi-agent reinforcement learning (MARL) system.
## Decentralized system
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/decentralized.png" alt="Decentralized"/>
<figcaption>
Source: <a href="https://www.youtube.com/watch?v=qgb0gyrpiGk"> Introduction to Multi-Agent Reinforcement Learning </a>
</figcaption>
</figure>
In decentralized learning, **each agent is trained independently from others**. In the example given, each vacuum learns to clean as many places as it can **without caring about what other vacuums (agents) are doing**.
The benefit is that **since no information is shared between agents, these vacuums can be designed and trained like we train single agents**.
The idea here is that **our training agent will consider other agents as part of the environment dynamics**, not as agents.
However, the big drawback of this technique is that it will **make the environment non-stationary** since the underlying Markov decision process changes over time as other agents are also interacting in the environment.
And this is problematic for many Reinforcement Learning algorithms **that can't reach a global optimum with a non-stationary environment**.
## Centralized approach
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/centralized.png" alt="Centralized"/>
<figcaption>
Source: <a href="https://www.youtube.com/watch?v=qgb0gyrpiGk"> Introduction to Multi-Agent Reinforcement Learning </a>
</figcaption>
</figure>
In this architecture, **we have a high-level process that collects agents' experiences**: the experience buffer. And we'll use these experiences **to learn a common policy**.
For instance, in the vacuum cleaner, the observation will be:
- The coverage map of the vacuums.
- The position of all the vacuums.
We use that collective experience **to train a policy that will move all three robots in the most beneficial way as a whole**. So each robot is learning from the common experience.
And we have a stationary environment, since all the agents are treated as a larger entity and they know the changes in the other agents' policies (since those policies are the same as theirs).
If we recap:
- In the *decentralized approach*, we **treat all agents independently without considering the existence of the other agents.**
  - In this case, all agents **consider other agents as part of the environment**.
  - **It's a non-stationary environment condition**, so there is no guarantee of convergence.
- In the *centralized approach*:
  - A **single policy is learned from all the agents**.
  - It takes the present state of the environment as input, and the policy outputs joint actions.
  - The reward is global.
View File
@@ -0,0 +1,137 @@
# Self-Play: a classic technique to train competitive agents in adversarial games
Now that we've studied the basics of multi-agent systems, we're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial game with SoccerTwos, a 2vs2 game**.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
<figcaption>This environment was made by the <a href="https://github.com/Unity-Technologies/ml-agents">Unity MLAgents Team</a></figcaption>
</figure>
## What is Self-Play?
Training agents correctly in an adversarial game can be **quite complex**.
On the one hand, we need to find a way to get a well-trained opponent to play against our training agent. And on the other hand, even if we have one, it's not a good solution: how is our agent going to improve its policy when the opponent is far too strong?
Think of a child who has just started to learn soccer. Playing against a very good player would be useless, since it would be too hard to win or even get the ball from time to time. The child would continuously lose without having time to learn a good policy.
The best solution would be **to have an opponent that is on the same level as the agent and that upgrades its level as the agent upgrades its own**. Because if the opponent is too strong, we'll learn nothing; if it is too weak, we'll overlearn behaviors that are useless against stronger opponents.
This solution is called *self-play*. In self-play, **the agent uses former copies of itself (of its policy) as an opponent**. This way, the agent will play against an agent of the same level (challenging but not too much), have opportunities to gradually improve its policy, and then update its opponent as it becomes better. It's a way to bootstrap an opponent and progressively increase the opponent's complexity.
It's the same way humans learn in competition:
- We start by training against an opponent of a similar level
- Then we learn from it, and when we've acquired some skills, we can move on to stronger opponents.
We do the same with self-play:
- We **start with a copy of our agent as an opponent**; this way, this opponent is on a similar level.
- We **learn from it**, and when we acquire some skills, we **update our opponent with a more recent copy of our training policy**.
The theory behind self-play is not new. It was already used by Arthur Samuel's checkers player system in the fifties and by Gerald Tesauro's TD-Gammon in 1992. If you want to learn more about the history of self-play, [check this very good blogpost by Andrew Cohen](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents).
## Self-Play in MLAgents
Self-Play is integrated into the MLAgents library and is managed by multiple hyperparameters that we're going to study. The main focus, as explained in the documentation, is the **tradeoff between the skill level and generality of the final policy and the stability of learning**.
Training against a set of slowly changing or unchanging adversaries with low diversity **results in more stable training, but risks overfitting if opponents change too slowly.**
We therefore need to control (a toy sketch of how these parameters interact follows below):
- How **often we swap opponents**, with the `swap_steps` and `team_change` parameters.
- The **number of opponents saved**, with the `window` parameter. A larger value of `window` means that an agent's pool of opponents will contain a larger diversity of behaviors, since it will contain policies from earlier in the training run.
- The **probability of playing against the current self vs. an opponent sampled from the pool**, with `play_against_latest_model_ratio`. A larger value of `play_against_latest_model_ratio` indicates that an agent will play against the current opponent more often.
- The **number of training steps before saving a new opponent**, with the `save_steps` parameter. A larger value of `save_steps` will yield a set of opponents that cover a wider range of skill levels and possibly play styles, since the policy receives more training between snapshots.
To get more details about these hyperparameters, you definitely need [to check this part of the documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play)
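To build intuition for how these parameters interact, here's a toy sketch of an opponent-pool loop (purely illustrative: a stand-in for ML-Agents' actual trainer, with the match and gradient update replaced by dummies):

```python
import copy
import random

# Values taken from the SoccerTwos config above
save_steps = 50_000
window = 10
play_against_latest_model_ratio = 0.5

policy = {"step": 0}                      # stand-in for the learning policy's weights
opponent_pool = [copy.deepcopy(policy)]   # past snapshots of ourselves

for step in range(1, 200_001):
    # Either play against our current self, or sample a past snapshot from the pool
    if random.random() < play_against_latest_model_ratio:
        opponent = policy
    else:
        opponent = random.choice(opponent_pool)

    policy["step"] = step                 # stand-in for one training update after a match

    # Periodically snapshot the current policy as a future opponent,
    # keeping only the `window` most recent snapshots
    if step % save_steps == 0:
        opponent_pool.append(copy.deepcopy(policy))
        opponent_pool = opponent_pool[-window:]
```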
## The ELO Score to evaluate our agent
### What is ELO Score?
In adversarial games, tracking the **cumulative reward is not always a meaningful metric for tracking learning progress**, because this metric **depends on the skill of the opponent.**
Instead, we're using an ***ELO rating system*** (named after Arpad Elo) that calculates the **relative skill level** between 2 players from a given population in a zero-sum game.
In a zero-sum game, one agent wins and the other agent loses. It's a mathematical representation of a situation in which each participant's gain or loss of utility **is exactly balanced by the gain or loss of utility of the other participants.** We talk about zero-sum games because the sum of utility is equal to zero.
This ELO score (starting at a specific value, frequently 1200) can decrease initially but should increase progressively during training.
The Elo system is **inferred from the wins, losses, and draws against other players.** This means that player ratings depend **on the ratings of their opponents and the results scored against them.**
The Elo score measures the relative skill of a player in a zero-sum game. **We say relative because it depends on the performance of opponents.**
The central idea is to think of the performance of a player **as a random variable that is normally distributed.**
The difference in rating between 2 players serves as **the predictor of the outcome of a match.** If a player wins but its probability of winning was already high, it will only take a few points from its opponent, since this outcome was expected.
After every game:
- The winning player takes **points from the losing one.**
- The number of points is determined **by the difference in the 2 players ratings (hence relative).**
- If the higher-rated player wins → few points will be taken from the lower-rated player.
- If the lower-rated player wins → a lot of points will be taken from the high-rated player.
- If its a draw → the lower-rated player gains a few points from the higher.
So if players A and B have ratings \\(R_{A}\\) and \\(R_{B}\\), then the **expected scores are** given by:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/elo1.png" alt="ELO Score"/>
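Written out (the standard Elo expected-score formula, matching the image above):

\\(E_{A} = \frac{1}{1+10^{(R_{B}-R_{A})/400}} \\)

\\(E_{B} = \frac{1}{1+10^{(R_{A}-R_{B})/400}} \\)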
Then, at the end of the game, we need to update the player's actual Elo score. We use a linear adjustment **proportional to the amount by which the player over-performed or under-performed.**
We also define a maximum adjustment per game: the K-factor.
- K=16 for masters.
- K=32 for weaker players.
If player A was expected to score \\(E_{A}\\) points but actually scored \\(S_{A}\\) points, the player's rating is updated using the formula:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/elo2.png" alt="ELO Score"/>
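In symbols, this is the standard linear Elo update:

\\(R'_{A} = R_{A} + K(S_{A} - E_{A}) \\)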
### Example
Let's take an example:
- Player A has a rating of 2600
- Player B has a rating of 2300
- We first calculate the expected score:
\\(E_{A} = \frac{1}{1+10^{(2300-2600)/400}} = 0.849 \\)
\\(E_{B} = \frac{1}{1+10^{(2600-2300)/400}} = 0.151 \\)
- If the organizers determined that K=16 and A wins, the new rating would be:
\\(ELO_A = 2600 + 16*(1-0.849) = 2602 \\)
\\(ELO_B = 2300 + 16*(0-0.151) = 2298 \\)
- If the organizers determined that K=16 and B wins, the new rating would be:
\\(ELO_A = 2600 + 16*(0-0.849) = 2586 \\)
\\(ELO_B = 2300 + 16 *(1-0.151) = 2314 \\)
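You can verify these numbers with a few lines of Python (a direct transcription of the two formulas above):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of the player rated r_a against the player rated r_b."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(rating: float, expected: float, actual: float, k: float = 16) -> float:
    """Linear Elo adjustment, bounded by the K-factor."""
    return rating + k * (actual - expected)

e_a = expected_score(2600, 2300)   # ≈ 0.849
e_b = expected_score(2300, 2600)   # ≈ 0.151

print(round(update(2600, e_a, 1)), round(update(2300, e_b, 0)))  # A wins: 2602 2298
print(round(update(2600, e_a, 0)), round(update(2300, e_b, 1)))  # B wins: 2586 2314
```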
### The Advantages
Using ELO score has multiple advantages:
- Points are **always balanced** (more points are exchanged when there is an unexpected outcome, but the sum is always the same).
- It is a **self-correcting system**: a player who wins against a weaker player only gains a few points.
- It **works with team games**: we calculate the average rating for each team and use it in the Elo computation.
### The Disadvantages
- ELO **does not take into account the individual contribution** of each player in the team.
- Rating deflation: **a good rating requires more skill over time** to achieve the same rating.
- **Ratings cannot be compared across different eras**.
View File
@@ -1,6 +1,6 @@
# Introduction [[introduction]]
In this bonus unit, we'll reinforce what we learned in the first unit by teaching Huggy the Dog to fetch the stick and then [play with him directly in your browser](https://huggingface.co/spaces/ThomasSimonini/Huggy) 🐶
In this bonus unit, we'll reinforce what we learned in the first unit by teaching Huggy the Dog to fetch the stick and then [play with him directly in your browser](https://singularite.itch.io/huggy) 🐶
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit2/thumbnail.png" alt="Unit bonus 1 thumbnail" width="100%">
View File
@@ -4,7 +4,7 @@ Now that you've trained Huggy and pushed it to the Hub. **You will be able to pl
For this step its simple:
- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
- Open the game Huggy in your browser: https://singularite.itch.io/huggy
- Click on Play with my Huggy model
View File
@@ -236,7 +236,7 @@ But now comes the best: **being able to play with Huggy online 👀.**
This step is the simplest:
- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
- Open the game Huggy in your browser: https://singularite.itch.io/huggy
- Click on Play with my Huggy model