Mirror of https://github.com/huggingface/deep-rl-class.git (synced 2026-02-08 12:54:32 +08:00)

Merge branch 'main' into ThomasSimonini/A2C
@@ -13,7 +13,7 @@ This repository contains the Deep Reinforcement Learning Course mdx files and no
|
||||
<br>
|
||||
<br>
|
||||
|
||||
# The documentation below is for v1.0 (depreciated)
|
||||
# The documentation below is for v1.0 (deprecated)
|
||||
|
||||
We're launching a **new version (v2.0) of the course starting December the 5th,**
|
||||
|
||||
@@ -26,7 +26,7 @@ The syllabus 📚: https://simoninithomas.github.io/deep-rl-course
|
||||
<br>
|
||||
<br>
|
||||
|
||||
# The documentation below is for v1.0 (depreciated)
|
||||
# The documentation below is for v1.0 (deprecated)
|
||||
|
||||
In this free course, you will:
|
||||
|
||||
|
||||
@@ -509,7 +509,7 @@
|
||||
"\n",
|
||||
"This step is the simplest:\n",
|
||||
"\n",
|
||||
"- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy\n",
|
||||
"- Open the game Huggy in your browser: https://singularite.itch.io/huggy\n",
|
||||
"\n",
|
||||
"- Click on Play with my Huggy model\n",
|
||||
"\n",
|
||||
@@ -569,4 +569,4 @@
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
}
|
||||
|
||||
notebooks/unit1/requirements-unit1.txt (new file, 5 lines)
@@ -0,0 +1,5 @@
|
||||
stable-baselines3[extra]
|
||||
box2d
|
||||
box2d-kengz
|
||||
huggingface_sb3
|
||||
pyglet==1.5.1
|
||||
@@ -230,15 +230,6 @@
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"TODO CHANGE LINK OF THE REQUIREMENTS"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "32e3NPYgH5ET"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
@@ -247,7 +238,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit1.txt"
|
||||
"!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1155,4 +1146,4 @@
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
}
|
||||
|
||||
@@ -127,7 +127,7 @@
|
||||
"source": [
|
||||
"# Let's train a Deep Q-Learning agent playing Atari' Space Invaders 👾 and upload it to the Hub.\n",
|
||||
"\n",
|
||||
"To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.\n",
|
||||
"To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 200**.\n",
|
||||
"\n",
|
||||
"To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**\n",
|
||||
"\n",
|
||||
@@ -799,4 +799,4 @@
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
}
|
||||
|
||||
notebooks/unit4/requirements-unit4.txt (new file, 6 lines)
@@ -0,0 +1,6 @@
|
||||
gym
|
||||
git+https://github.com/ntasfi/PyGame-Learning-Environment.git
|
||||
git+https://github.com/qlan3/gym-games.git
|
||||
huggingface_hub
|
||||
imageio-ffmpeg
|
||||
pyyaml==6.0
|
||||
notebooks/unit4/unit4.ipynb (new file, 1614 lines)
File diff suppressed because it is too large
notebooks/unit5/unit5.ipynb (new file, 844 lines)
@@ -0,0 +1,844 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "view-in-github",
|
||||
"colab_type": "text"
|
||||
},
|
||||
"source": [
|
||||
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/ThomasSimonini%2FMLAgents/notebooks/unit5/unit5.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "2D3NL_e4crQv"
|
||||
},
|
||||
"source": [
|
||||
"# Unit 5: An Introduction to ML-Agents\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png\" alt=\"Thumbnail\"/>\n",
|
||||
"\n",
|
||||
"In this notebook, you'll learn about ML-Agents and train two agents.\n",
|
||||
"\n",
|
||||
"- The first one will learn to **shoot snowballs onto spawning targets**.\n",
|
||||
"- The second need to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, **and move to the gold brick at the top**. To do that, it will need to explore its environment, and we will use a technique called curiosity.\n",
|
||||
"\n",
|
||||
"After that, you'll be able **to watch your agents playing directly on your browser**.\n",
|
||||
"\n",
|
||||
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "97ZiytXEgqIz"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"⬇️ Here is an example of what **you will achieve at the end of this unit.** ⬇️\n"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "FMYrDriDujzX"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.gif\" alt=\"Pyramids\"/>\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget.gif\" alt=\"SnowballTarget\"/>"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "cBmFlh8suma-"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### 🎮 Environments: \n",
|
||||
"\n",
|
||||
"- [Pyramids](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md#pyramids)\n",
|
||||
"- SnowballTarget\n",
|
||||
"\n",
|
||||
"### 📚 RL-Library: \n",
|
||||
"\n",
|
||||
"- [ML-Agents (HuggingFace Experimental Version)](https://github.com/huggingface/ml-agents)\n",
|
||||
"\n",
|
||||
"⚠ We're going to use an experimental version of ML-Agents were you can push to hub and load from hub Unity ML-Agents Models **you need to install the same version**"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "A-cYE0K5iL-w"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "qEhtaFh9i31S"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Objectives of this notebook 🏆\n",
|
||||
"\n",
|
||||
"At the end of the notebook, you will:\n",
|
||||
"\n",
|
||||
"- Understand how works **ML-Agents**, the environment library.\n",
|
||||
"- Be able to **train agents in Unity Environments**.\n"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "j7f63r3Yi5vE"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## This notebook is from the Deep Reinforcement Learning Course\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "viNzVbVaYvY3"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "6p5HnEefISCB"
|
||||
},
|
||||
"source": [
|
||||
"In this free course, you will:\n",
|
||||
"\n",
|
||||
"- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
|
||||
"- 🧑💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
|
||||
"- 🤖 Train **agents in unique environments** \n",
|
||||
"\n",
|
||||
"And more check 📚 the syllabus 👉 https://huggingface.co/deep-rl-course/communication/publishing-schedule\n",
|
||||
"\n",
|
||||
"Don’t forget to **<a href=\"http://eepurl.com/ic5ZUD\">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "Y-mo_6rXIjRi"
|
||||
},
|
||||
"source": [
|
||||
"## Prerequisites 🏗️\n",
|
||||
"Before diving into the notebook, you need to:\n",
|
||||
"\n",
|
||||
"🔲 📚 **Study [what is ML-Agents and how it works by reading Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction)** 🤗 "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"# Let's train our agents 🚀\n",
|
||||
"\n",
|
||||
"The ML-Agents integration on the Hub is **still experimental**, some features will be added in the future. \n",
|
||||
"\n",
|
||||
"But for now, **to validate this hands-on for the certification process, you just need to push your trained models to the Hub**. There’s no results to attain to validate this one. But if you want to get nice results you can try to attain:\n",
|
||||
"\n",
|
||||
"- For `Pyramids` : Mean Reward = 1.75\n",
|
||||
"- For `SnowballTarget` : Mean Reward = 15 or 30 targets hit in an episode.\n"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "xYO1uD5Ujgdh"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Set the GPU 💪\n",
|
||||
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "DssdIjk_8vZE"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"- `Hardware Accelerator > GPU`\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "sTfCXHy68xBv"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "an3ByrXYQ4iK"
|
||||
},
|
||||
"source": [
|
||||
"## Clone the repository and install the dependencies 🔽\n",
|
||||
"- We need to clone the repository, that **contains the experimental version of the library that allows you to push your trained agent to the Hub.**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "6WNoL04M7rTa"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%capture\n",
|
||||
"# Clone the repository\n",
|
||||
"!git clone --depth 1 https://github.com/huggingface/ml-agents/ "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "d8wmVcMk7xKo"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%capture\n",
|
||||
"# Go inside the repository and install the package\n",
|
||||
"%cd ml-agents\n",
|
||||
"!pip3 install -e ./ml-agents-envs\n",
|
||||
"!pip3 install -e ./ml-agents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## SnowballTarget ⛄\n",
|
||||
"\n",
|
||||
"If you need a refresher on how this environments work check this section 👉\n",
|
||||
"https://huggingface.co/deep-rl-course/unit5/snowball-target"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "R5_7Ptd_kEcG"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "HRY5ufKUKfhI"
|
||||
},
|
||||
"source": [
|
||||
"### Download and move the environment zip file in `./training-envs-executables/linux/`\n",
|
||||
"- Our environment executable is in a zip file.\n",
|
||||
"- We need to download it and place it to `./training-envs-executables/linux/`\n",
|
||||
"- We use a linux executable because we use colab, and colab machines OS is Ubuntu (linux)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "C9Ls6_6eOKiA"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Here, we create training-envs-executables and linux\n",
|
||||
"!mkdir ./training-envs-executables\n",
|
||||
"!mkdir ./training-envs-executables/linux"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "jsoZGxr1MIXY"
|
||||
},
|
||||
"source": [
|
||||
"Download the file SnowballTarget.zip from https://drive.google.com/file/d/1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5 using `wget`. \n",
|
||||
"\n",
|
||||
"Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "QU6gi8CmWhnA"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!wget --load-cookies /tmp/cookies.txt \"https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id=1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5\" -O ./training-envs-executables/linux/SnowballTarget.zip && rm -rf /tmp/cookies.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"We unzip the executable.zip file"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "_LLVaEEK3ayi"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "8FPx0an9IAwO"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%capture\n",
|
||||
"!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/SnowballTarget.zip"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "nyumV5XfPKzu"
|
||||
},
|
||||
"source": [
|
||||
"Make sure your file is accessible "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "EdFsLJ11JvQf"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!chmod -R 755 ./training-envs-executables/linux/SnowballTarget"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Define the SnowballTarget config file\n",
|
||||
"- In ML-Agents, you define the **training hyperparameters into config.yaml files.**\n",
|
||||
"\n",
|
||||
"There are multiple hyperparameters. To know them better, you should check for each explanation with [the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"So you need to create a `SnowballTarget.yaml` config file in ./content/ml-agents/config/ppo/\n",
|
||||
"\n",
|
||||
"We'll give you here a first version of this config (to copy and paste into your `SnowballTarget.yaml file`), **but you should modify it**.\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"behaviors:\n",
|
||||
" SnowballTarget:\n",
|
||||
" trainer_type: ppo\n",
|
||||
" summary_freq: 10000\n",
|
||||
" keep_checkpoints: 10\n",
|
||||
" checkpoint_interval: 50000\n",
|
||||
" max_steps: 200000\n",
|
||||
" time_horizon: 64\n",
|
||||
" threaded: true\n",
|
||||
" hyperparameters:\n",
|
||||
" learning_rate: 0.0003\n",
|
||||
" learning_rate_schedule: linear\n",
|
||||
" batch_size: 128\n",
|
||||
" buffer_size: 2048\n",
|
||||
" beta: 0.005\n",
|
||||
" epsilon: 0.2\n",
|
||||
" lambd: 0.95\n",
|
||||
" num_epoch: 3\n",
|
||||
" network_settings:\n",
|
||||
" normalize: false\n",
|
||||
" hidden_units: 256\n",
|
||||
" num_layers: 2\n",
|
||||
" vis_encode_type: simple\n",
|
||||
" reward_signals:\n",
|
||||
" extrinsic:\n",
|
||||
" gamma: 0.99\n",
|
||||
" strength: 1.0\n",
|
||||
"```"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "NAuEq32Mwvtz"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config1.png\" alt=\"Config SnowballTarget\"/>\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config2.png\" alt=\"Config SnowballTarget\"/>"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "4U3sRH4N4h_l"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"As an experimentation, you should also try to modify some other hyperparameters. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).\n",
|
||||
"\n",
|
||||
"Now that you've created the config file and understand what most hyperparameters do, we're ready to train our agent 🔥."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "JJJdo_5AyoGo"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "f9fI555bO12v"
|
||||
},
|
||||
"source": [
|
||||
"### Train the agent\n",
|
||||
"\n",
|
||||
"To train our agent, we just need to **launch mlagents-learn and select the executable containing the environment.**\n",
|
||||
"\n",
|
||||
"We define four parameters:\n",
|
||||
"\n",
|
||||
"1. `mlagents-learn <config>`: the path where the hyperparameter config file is.\n",
|
||||
"2. `--env`: where the environment executable is.\n",
|
||||
"3. `--run_id`: the name you want to give to your training run id.\n",
|
||||
"4. `--no-graphics`: to not launch the visualization during the training.\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentslearn.png\" alt=\"MlAgents learn\"/>\n",
|
||||
"\n",
|
||||
"Train the model and use the `--resume` flag to continue training in case of interruption. \n",
|
||||
"\n",
|
||||
"> It will fail first time if and when you use `--resume`, try running the block again to bypass the error. \n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"The training will take 10 to 35min depending on your config, go take a ☕️you deserve it 🤗."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "lN32oWF8zPjs"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "bS-Yh1UdHfzy"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id=\"SnowballTarget1\" --no-graphics"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "5Vue94AzPy1t"
|
||||
},
|
||||
"source": [
|
||||
"### Push the agent to the 🤗 Hub\n",
|
||||
"\n",
|
||||
"- Now that we trained our agent, we’re **ready to push it to the Hub to be able to visualize it playing on your browser🔥.**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"To be able to share your model with the community there are three more steps to follow:\n",
|
||||
"\n",
|
||||
"1️⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join\n",
|
||||
"\n",
|
||||
"2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.\n",
|
||||
"- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
|
||||
"\n",
|
||||
"- Copy the token \n",
|
||||
"- Run the cell below and paste the token"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "izT6FpgNzZ6R"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "rKt2vsYoK56o"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from huggingface_hub import notebook_login\n",
|
||||
"notebook_login()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "aSU9qD9_6dem"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Then, we simply need to run `mlagents-push-to-hf`.\n",
|
||||
"\n",
|
||||
"And we define 4 parameters:\n",
|
||||
"\n",
|
||||
"1. `--run-id`: the name of the training run id.\n",
|
||||
"2. `--local-dir`: where the agent was saved, it’s results/<run_id name>, so in my case results/First Training.\n",
|
||||
"3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always <your huggingface username>/<the repo name>\n",
|
||||
"If the repo does not exist **it will be created automatically**\n",
|
||||
"4. `--commit-message`: since HF repos are git repository you need to define a commit message.\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentspushtohub.png\" alt=\"Push to Hub\"/>\n",
|
||||
"\n",
|
||||
"For instance:\n",
|
||||
"\n",
|
||||
"`!mlagents-push-to-hf --run-id=\"SnowballTarget1\" --local-dir=\"./results/SnowballTarget1\" --repo-id=\"ThomasSimonini/ppo-SnowballTarget\" --commit-message=\"First Push\"`"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "KK4fPfnczunT"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "dGEFAIboLVc6"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Else, if everything worked you should have this at the end of the process(but with a different url 😆) :\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-SnowballTarget\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"It’s the link to your model, it contains a model card that explains how to use it, your Tensorboard and your config file. **What’s awesome is that it’s a git repository, that means you can have different commits, update your repository with a new push etc.**"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "yborB0850FTM"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"But now comes the best: **being able to visualize your agent online 👀.**"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "5Uaon2cg0NrL"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Watch your agent playing 👀\n",
|
||||
"\n",
|
||||
"For this step it’s simple:\n",
|
||||
"\n",
|
||||
"1. Remember your repo-id\n",
|
||||
"\n",
|
||||
"2. Go here: https://singularite.itch.io/snowballtarget\n",
|
||||
"\n",
|
||||
"3. Launch the game and put it in full screen by clicking on the bottom right button\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_load.png\" alt=\"Snowballtarget load\"/>"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "VMc4oOsE0QiZ"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-SnowballTarget).\n",
|
||||
"\n",
|
||||
"2. In step 2, **choose what model you want to replay**:\n",
|
||||
" - I have multiple one, since we saved a model every 500000 timesteps. \n",
|
||||
" - But if I want the more recent I choose `SnowballTarget.onnx`\n",
|
||||
"\n",
|
||||
"👉 What’s nice **is to try with different models step to see the improvement of the agent.**\n",
|
||||
"\n",
|
||||
"And don't hesitate to share the best score your agent gets on discord in #rl-i-made-this channel 🔥\n",
|
||||
"\n",
|
||||
"Let's now try a harder environment called Pyramids..."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "Djs8c5rR0Z8a"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Pyramids 🏆\n",
|
||||
"\n",
|
||||
"### Download and move the environment zip file in `./training-envs-executables/linux/`\n",
|
||||
"- Our environment executable is in a zip file.\n",
|
||||
"- We need to download it and place it to `./training-envs-executables/linux/`\n",
|
||||
"- We use a linux executable because we use colab, and colab machines OS is Ubuntu (linux)"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "rVMwRi4y_tmx"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "NyqYYkLyAVMK"
|
||||
},
|
||||
"source": [
|
||||
"Download the file Pyramids.zip from https://drive.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "AxojCsSVAVMP"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!wget --load-cookies /tmp/cookies.txt \"https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H\" -O ./training-envs-executables/linux/Pyramids.zip && rm -rf /tmp/cookies.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "bfs6CTJ1AVMP"
|
||||
},
|
||||
"source": [
|
||||
"**OR** Download directly to local machine and then drag and drop the file from local machine to `./training-envs-executables/linux`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "H7JmgOwcSSmF"
|
||||
},
|
||||
"source": [
|
||||
"Wait for the upload to finish and then run the command below. \n",
|
||||
"\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Unzip it"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "iWUUcs0_794U"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "i2E3K4V2AVMP"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%capture\n",
|
||||
"!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/Pyramids.zip"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "KmKYBgHTAVMP"
|
||||
},
|
||||
"source": [
|
||||
"Make sure your file is accessible "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "Im-nwvLPAVMP"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!chmod -R 755 ./training-envs-executables/linux/Pyramids/Pyramids"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Modify the PyramidsRND config file\n",
|
||||
"- Contrary to the first environment which was a custom one, **Pyramids was made by the Unity team**.\n",
|
||||
"- So the PyramidsRND config file already exists and is in ./content/ml-agents/config/ppo/PyramidsRND.yaml\n",
|
||||
"- You might asked why \"RND\" in PyramidsRND. RND stands for *random network distillation* it's a way to generate curiosity rewards. If you want to know more on that we wrote an article explaning this technique: https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938\n",
|
||||
"\n",
|
||||
"For this training, we’ll modify one thing:\n",
|
||||
"- The total training steps hyperparameter is too high since we can hit the benchmark (mean reward = 1.75) in only 1M training steps.\n",
|
||||
"👉 To do that, we go to config/ppo/PyramidsRND.yaml,**and modify these to max_steps to 1000000.**\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-config.png\" alt=\"Pyramids config\"/>"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "fqceIATXAgih"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"As an experimentation, you should also try to modify some other hyperparameters, Unity provides a very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).\n",
|
||||
"\n",
|
||||
"We’re now ready to train our agent 🔥."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "RI-5aPL7BWVk"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Train the agent\n",
|
||||
"\n",
|
||||
"The training will take 30 to 45min depending on your machine, go take a ☕️you deserve it 🤗."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "s5hr1rvIBdZH"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "fXi4-IaHBhqD"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!mlagents-learn ./config/ppo/PyramidsRND.yaml --env=./training-envs-executables/linux/Pyramids/Pyramids --run-id=\"Pyramids Training\" --no-graphics"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "txonKxuSByut"
|
||||
},
|
||||
"source": [
|
||||
"### Push the agent to the 🤗 Hub\n",
|
||||
"\n",
|
||||
"- Now that we trained our agent, we’re **ready to push it to the Hub to be able to visualize it playing on your browser🔥.**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [],
|
||||
"metadata": {
|
||||
"id": "JZ53caJ99sX_"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "yiEQbv7rB4mU"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Watch your agent playing 👀\n",
|
||||
"\n",
|
||||
"The temporary link for Pyramids demo is: https://singularite.itch.io/pyramids"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "7aZfgxo-CDeQ"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### 🎁 Bonus: Why not train on another environment?\n",
|
||||
"Now that you know how to train an agent using MLAgents, **why not try another environment?** \n",
|
||||
"\n",
|
||||
"MLAgents provides 18 different and we’re building some custom ones. The best way to learn is to try things of your own, have fun.\n",
|
||||
"\n"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "hGG_oq2n0wjB"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"metadata": {
|
||||
"id": "KSAkJxSr0z6-"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"You have the full list of the one currently available on Hugging Face here 👉 https://github.com/huggingface/ml-agents#the-environments\n",
|
||||
"\n",
|
||||
"For the demos to visualize your agent, the temporary link is: https://singularite.itch.io (temporary because we'll also put the demos on Hugging Face Space)\n",
|
||||
"\n",
|
||||
"For now we have integrated: \n",
|
||||
"- [Worm](https://singularite.itch.io/worm) demo where you teach a **worm to crawl**.\n",
|
||||
"- [Walker](https://singularite.itch.io/walker) demo where you teach an agent **to walk towards a goal**.\n",
|
||||
"\n",
|
||||
"If you want new demos to be added, please open an issue: https://github.com/huggingface/deep-rl-class 🤗"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "YiyF4FX-04JB"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"That’s all for today. Congrats on finishing this tutorial!\n",
|
||||
"\n",
|
||||
"The best way to learn is to practice and try stuff. Why not try another environment? ML-Agents has 18 different environments, but you can also create your own? Check the documentation and have fun!\n",
|
||||
"\n",
|
||||
"See you on Unit 6 🔥,\n",
|
||||
"\n",
|
||||
"## Keep Learning, Stay awesome 🤗"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "PI6dPWmh064H"
|
||||
}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"accelerator": "GPU",
|
||||
"colab": {
|
||||
"provenance": [],
|
||||
"private_outputs": true,
|
||||
"include_colab_link": true
|
||||
},
|
||||
"gpuClass": "standard",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
@@ -365,8 +365,6 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import gym\n",
|
||||
"\n",
|
||||
"# First, we create our environment called LunarLander-v2\n",
|
||||
"env = gym.make(\"LunarLander-v2\")\n",
|
||||
"\n",
|
||||
|
||||
unit5/unit5.ipynb (3194 lines changed)
File diff suppressed because one or more lines are too long
@@ -46,6 +46,10 @@
|
||||
title: Play with Huggy
|
||||
- local: unitbonus1/conclusion
|
||||
title: Conclusion
|
||||
- title: Live 1. How the course works, Q&A, and playing with Huggy
|
||||
sections:
|
||||
- local: live1/live1
|
||||
title: Live 1. How the course works, Q&A, and playing with Huggy 🐶
|
||||
- title: Unit 2. Introduction to Q-Learning
|
||||
sections:
|
||||
- local: unit2/introduction
|
||||
@@ -88,6 +92,8 @@
|
||||
title: The Deep Q-Network (DQN)
|
||||
- local: unit3/deep-q-algorithm
|
||||
title: The Deep Q Algorithm
|
||||
- local: unit3/glossary
|
||||
title: Glossary
|
||||
- local: unit3/hands-on
|
||||
title: Hands-on
|
||||
- local: unit3/quiz
|
||||
@@ -96,7 +102,7 @@
|
||||
title: Conclusion
|
||||
- local: unit3/additional-readings
|
||||
title: Additional Readings
|
||||
- title: Unit Bonus 2. Automatic Hyperparameter Tuning with Optuna
|
||||
- title: Bonus Unit 2. Automatic Hyperparameter Tuning with Optuna
|
||||
sections:
|
||||
- local: unitbonus2/introduction
|
||||
title: Introduction
|
||||
@@ -104,6 +110,44 @@
|
||||
title: Optuna
|
||||
- local: unitbonus2/hands-on
|
||||
title: Hands-on
|
||||
- title: Unit 4. Policy Gradient with PyTorch
|
||||
sections:
|
||||
- local: unit4/introduction
|
||||
title: Introduction
|
||||
- local: unit4/what-are-policy-based-methods
|
||||
title: What are the policy-based methods?
|
||||
- local: unit4/advantages-disadvantages
|
||||
title: The advantages and disadvantages of policy-gradient methods
|
||||
- local: unit4/policy-gradient
|
||||
title: Diving deeper into policy-gradient
|
||||
- local: unit4/pg-theorem
|
||||
title: (Optional) the Policy Gradient Theorem
|
||||
- local: unit4/hands-on
|
||||
title: Hands-on
|
||||
- local: unit4/quiz
|
||||
title: Quiz
|
||||
- local: unit4/conclusion
|
||||
title: Conclusion
|
||||
- local: unit4/additional-readings
|
||||
title: Additional Readings
|
||||
- title: Unit 5. Introduction to Unity ML-Agents
|
||||
sections:
|
||||
- local: unit5/introduction
|
||||
title: Introduction
|
||||
- local: unit5/how-mlagents-works
|
||||
title: How ML-Agents works
|
||||
- local: unit5/snowball-target
|
||||
title: The SnowballTarget environment
|
||||
- local: unit5/pyramids
|
||||
title: The Pyramids environment
|
||||
- local: unit5/curiosity
|
||||
title: (Optional) What is curiosity in Deep Reinforcement Learning?
|
||||
- local: unit5/hands-on
|
||||
title: Hands-on
|
||||
- local: unit5/bonus
|
||||
title: Bonus. Learn to create your own environments with Unity and MLAgents
|
||||
- local: unit5/conclusion
|
||||
title: Conclusion
|
||||
- title: Unit 6. Actor Critic methods with Robotics environments
|
||||
sections:
|
||||
- local: unit6/introduction
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Publishing Schedule [[publishing-schedule]]
|
||||
|
||||
We publish a **new unit every Monday** (except Monday, the 26th of December).
|
||||
We publish a **new unit every Tuesday**.
|
||||
|
||||
If you don't want to miss any of the updates, don't forget to:
|
||||
|
||||
|
||||
units/en/live1/live1.mdx (new file, 9 lines)
@@ -0,0 +1,9 @@
|
||||
# Live 1: How the course works, Q&A, and playing with Huggy
|
||||
|
||||
In this first live stream, we explained how the course works (scope, units, challenges, and more) and answered your questions.
|
||||
|
||||
And finally, we watched some LunarLander agents you've trained and played with your Huggies 🐶.
|
||||
|
||||
<Youtube id="JeJIswxyrsM" />
|
||||
|
||||
To know when the next live stream is scheduled, **check the Discord server**. We will also send **you an email**. If you can't participate, don't worry, we record the live sessions.
|
||||
@@ -9,7 +9,13 @@ Discord is a free chat platform. If you've used Slack, **it's quite similar**. T
|
||||
|
||||
Starting in Discord can be a bit intimidating, so let me take you through it.
|
||||
|
||||
When you sign-up to our Discord server, you'll need to specify which topics you're interested in by **clicking #role-assignment at the left**. Here, you can pick different categories. Make sure to **click "Reinforcement Learning"**! :fire:. You'll then get to **introduce yourself in the `#introduction-yourself` channel**.
|
||||
When you sign-up to our Discord server, you'll need to specify which topics you're interested in by **clicking #role-assignment at the left**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord1.jpg" alt="Discord"/>
|
||||
|
||||
In #role-assignment, you can pick different categories. Make sure to **click "Reinforcement Learning"**. You'll then get to **introduce yourself in the `#introduction-yourself` channel**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord2.jpg" alt="Discord"/>
|
||||
|
||||
## So which channels are interesting to me? [[channels]]
|
||||
|
||||
|
||||
@@ -23,7 +23,7 @@ In this course, you will:
|
||||
|
||||
- 📖 Study Deep Reinforcement Learning in **theory and practice.**
|
||||
- 🧑💻 Learn to **use famous Deep RL libraries** such as [Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/), [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), [Sample Factory](https://samplefactory.dev/) and [CleanRL](https://github.com/vwxyzjn/cleanrl).
|
||||
- 🤖 **Train agents in unique environments** such as [SnowballFight](https://huggingface.co/spaces/ThomasSimonini/SnowballFight), [Huggy the Doggo 🐶](https://huggingface.co/spaces/ThomasSimonini/Huggy), [MineRL (Minecraft ⛏️)](https://minerl.io/), [VizDoom (Doom)](https://vizdoom.cs.put.edu.pl/) and classical ones such as [Space Invaders](https://www.gymlibrary.dev/environments/atari/) and [PyBullet](https://pybullet.org/wordpress/).
|
||||
- 🤖 **Train agents in unique environments** such as [SnowballFight](https://huggingface.co/spaces/ThomasSimonini/SnowballFight), [Huggy the Doggo 🐶](https://singularite.itch.io/huggy), [VizDoom (Doom)](https://vizdoom.cs.put.edu.pl/) and classical ones such as [Space Invaders](https://www.gymlibrary.dev/environments/atari/), [PyBullet](https://pybullet.org/wordpress/) and more.
|
||||
- 💾 Share your **trained agents with one line of code to the Hub** and also download powerful agents from the community.
|
||||
- 🏆 Participate in challenges where you will **evaluate your agents against other teams. You'll also get to play against the agents you'll train.**
|
||||
|
||||
@@ -52,20 +52,21 @@ The course is composed of:
|
||||
|
||||
You can choose to follow this course either:
|
||||
|
||||
- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
|
||||
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of March 2023.
|
||||
- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of April 2023.
|
||||
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of April 2023.
|
||||
- *As a simple audit*: you can participate in all challenges and do assignments if you want, but you have no deadlines.
|
||||
|
||||
Both paths **are completely free**.
|
||||
Whatever path you choose, we advise you **to follow the recommended pace to enjoy the course and challenges with your fellow classmates.**
|
||||
You don't need to tell us which path you choose. At the end of March, when we verify the assignments **if you get more than 80% of the assignments done, you'll get a certificate.**
|
||||
|
||||
You don't need to tell us which path you choose. At the end of March, when we will verify the assignments **if you get more than 80% of the assignments done, you'll get a certificate.**
|
||||
|
||||
## The Certification Process [[certification-process]]
|
||||
|
||||
The certification process is **completely free**:
|
||||
|
||||
- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
|
||||
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of March 2023.
|
||||
- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of April 2023.
|
||||
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of April 2023.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/certification.jpg" alt="Course certification" width="100%"/>
|
||||
|
||||
@@ -92,7 +93,7 @@ You need only 3 things:
|
||||
## What is the publishing schedule? [[publishing-schedule]]
|
||||
|
||||
|
||||
We publish **a new unit every Monday** (except Monday, the 26th of December).
|
||||
We publish **a new unit every Tuesday**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule1.png" alt="Schedule 1" width="100%"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule2.png" alt="Schedule 2" width="100%"/>
|
||||
@@ -128,7 +129,7 @@ In this new version of the course, you have two types of challenges:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/challenges.jpg" alt="Challenges" width="100%"/>
|
||||
|
||||
These AI vs.AI challenges will be announced **later in December**.
|
||||
These AI vs. AI challenges will be announced **in January**.
|
||||
|
||||
|
||||
## I found a bug, or I want to improve the course [[contribute]]
|
||||
|
||||
@@ -18,9 +18,10 @@ You can now sign up for our Discord Server. This is the place where you **can ex
|
||||
When you join, remember to introduce yourself in #introduce-yourself and sign-up for reinforcement channels in #role-assignments.
|
||||
|
||||
We have multiple RL-related channels:
|
||||
- `rl-announcements`: where we give the last information about the course.
|
||||
- `rl-announcements`: where we give the latest information about the course.
|
||||
- `rl-discussions`: where you can exchange about RL and share information.
|
||||
- `rl-study-group`: where you can create and join study groups.
|
||||
- `rl-i-made-this`: where you can share your projects and models.
|
||||
|
||||
If this is your first time using Discord, we wrote a Discord 101 to get the best practices. Check the next section.
|
||||
|
||||
|
||||
@@ -12,5 +12,10 @@ In the next (bonus) unit, we’re going to reinforce what we just learned by **t
|
||||
|
||||
You will then be able to play with him 🤗.
|
||||
|
||||
<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.mp4" alt="Huggy" type="video/mp4">
|
||||
</video>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.jpg" alt="Huggy"/>
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
|
||||
|
||||
@@ -24,6 +24,8 @@ To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggi
|
||||
|
||||
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
|
||||
|
||||
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
|
||||
|
||||
So let's get started! 🚀
|
||||
|
||||
**To start the hands-on click on Open In Colab button** 👇 :
|
||||
@@ -139,7 +141,7 @@ To make things easier, we created a script to install all these dependencies.
|
||||
```
|
||||
|
||||
```python
|
||||
!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit1.txt
|
||||
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt
|
||||
```
|
||||
|
||||
During the notebook, we'll need to generate a replay video. To do so with Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames). A typical setup is sketched below.
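A minimal sketch of such a setup, assuming a Colab-like machine where `apt-get` is available and using the `pyvirtualdisplay` package (the notebook's own install cell may differ):

```python
# Install a virtual display server and its Python wrapper (sketch, Colab-style shell magics)
!apt-get update -qq && apt-get install -y -qq xvfb
!pip install -q pyvirtualdisplay

from pyvirtualdisplay import Display

# Start a virtual screen so the environment can render frames for the replay video
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```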
|
||||
|
||||
@@ -22,7 +22,6 @@ It's essential **to master these elements** before diving into implementing Dee
|
||||
|
||||
After this unit, in a bonus unit, you'll be **able to train Huggy the Dog 🐶 to fetch the stick and play with him 🤗**.
|
||||
|
||||
<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.mp4" alt="Huggy" type="video/mp4">
|
||||
</video>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.jpg" alt="Huggy"/>
|
||||
|
||||
So let's get started! 🚀
|
||||
|
||||
@@ -15,5 +15,7 @@ In the next chapter, we’re going to dive deeper by studying our first Deep Rei
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Atari environments"/>
|
||||
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
|
||||
@@ -13,9 +13,24 @@ This is a community-created glossary. Contributions are welcomed!
|
||||
- **The state-value function.** For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
|
||||
- **The action-value function.** In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state and takes an action. Then it follows the policy forever after.
|
||||
|
||||
### Epsilon-greedy strategy:
|
||||
|
||||
- Common exploration strategy used in reinforcement learning that involves balancing exploration and exploitation.
|
||||
- Chooses the action with the highest expected reward with a probability of 1-epsilon.
|
||||
- Chooses a random action with a probability of epsilon.
|
||||
- Epsilon is typically decreased over time to shift focus towards exploitation.
|
||||
|
||||
### Greedy strategy:
|
||||
|
||||
- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (only exploitation)
|
||||
- Always chooses the action with the highest expected reward.
|
||||
- Does not include any exploration.
|
||||
- Can be disadvantageous in environments with uncertainty or unknown optimal actions.
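To make the two strategies above concrete, here is a minimal sketch of epsilon-greedy action selection over a hypothetical list of estimated Q-values (setting `epsilon = 0` recovers the pure greedy strategy):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # exploration: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation: greedy action

q_values = [0.1, 0.5, 0.2]  # hypothetical action-value estimates for one state
epsilon = 1.0
for step in range(1000):
    action = epsilon_greedy_action(q_values, epsilon)
    epsilon = max(0.05, epsilon * 0.995)  # decay epsilon to shift towards exploitation
```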
|
||||
|
||||
|
||||
If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
|
||||
|
||||
This glossary was made possible thanks to:
|
||||
|
||||
- [Ramón Rueda](https://github.com/ramon-rd)
|
||||
- [Hasarindu Perera](https://github.com/hasarinduperera/)
|
||||
|
||||
@@ -22,6 +22,8 @@ To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggi
|
||||
|
||||
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
|
||||
|
||||
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
|
||||
|
||||
|
||||
**To start the hands-on click on Open In Colab button** 👇 :
|
||||
|
||||
|
||||
@@ -62,7 +62,7 @@ For each state, the state-value function outputs the expected return if the agen
|
||||
|
||||
In the action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state and takes action, and then follows the policy forever after.
|
||||
|
||||
The value of taking action an in state \\(s\\) under a policy \\(π\\) is:
|
||||
The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" alt="Action State value function"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" alt="Action State value function"/>
|
||||
|
||||
@@ -11,4 +11,7 @@ Don't hesitate to train your agent in other environments (Pong, Seaquest, QBert,
|
||||
|
||||
In the next unit, **we're going to learn about Optuna**. One of the most critical tasks in Deep Reinforcement Learning is to find a good set of training hyperparameters. And Optuna is a library that helps you automate the search.
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
|
||||
@@ -30,7 +30,7 @@ No, because one frame is not enough to have a sense of motion! But what if I add
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal Limitation"/>
|
||||
That’s why, to capture temporal information, we stack four frames together.
|
||||
|
||||
Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because frames are stacked together, **you can exploit some spatial properties across those frames**.
|
||||
Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because frames are stacked together, **you can exploit some temporal properties across those frames**.
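As a rough illustration of the stacking step (a sketch assuming 84x84 grayscale frames, as in the classic DQN preprocessing; the wrapper used in the hands-on handles this for you):

```python
import numpy as np
from collections import deque

# Keep only the four most recent preprocessed frames
frame_buffer = deque(maxlen=4)
for _ in range(4):
    frame = np.zeros((84, 84), dtype=np.uint8)  # placeholder for a preprocessed frame
    frame_buffer.append(frame)

# The network input has shape (4, 84, 84): spatial info within each frame,
# temporal info across the stacked frames
stacked_state = np.stack(frame_buffer, axis=0)
print(stacked_state.shape)  # (4, 84, 84)
```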
|
||||
|
||||
If you don't know what convolutional layers are, don't worry. You can check [Lesson 4 of this free Deep Reinforcement Learning Course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188).
|
||||
|
||||
|
||||
@@ -13,12 +13,12 @@ Internally, our Q-function has **a Q-table, a table where each cell corresponds
|
||||
The problem is that Q-Learning is a *tabular method*. This becomes a problem if the state and action spaces **are not small enough to be represented efficiently as arrays and tables**. In other words, it is **not scalable**.
|
||||
Q-Learning worked well with small state space environments like:
|
||||
|
||||
- FrozenLake, we had 14 states.
|
||||
- FrozenLake, we had 16 states.
|
||||
- Taxi-v3, we had 500 states.
|
||||
|
||||
But think of what we're going to do today: we will train an agent to learn to play Space Invaders a more complex game, using the frames as input.
|
||||
|
||||
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255 so that gives us \\(256^{210x160x3} = 256^{100800}\\) (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).
|
||||
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255 so that gives us \\(256^{210 \times 160 \times 3} = 256^{100800}\\) (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).
|
||||
|
||||
* A single frame in Atari is composed of an image of 210x160 pixels. Given the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.
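As a quick sanity check on that number (a sketch):

```python
import math

values_per_frame = 210 * 160 * 3          # pixels x channels, each value in [0, 255]
log10_states = values_per_frame * math.log10(256)
print(f"256^{values_per_frame} ≈ 10^{log10_states:.0f}")
# vastly larger than the ~10^80 atoms in the observable universe
```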
|
||||
|
||||
|
||||
units/en/unit3/glossary.mdx (new file, 39 lines)
@@ -0,0 +1,39 @@
|
||||
# Glossary
|
||||
|
||||
This is a community-created glossary. Contributions are welcomed!
|
||||
|
||||
- **Tabular Method:** type of problem in which the state and action spaces are small enough for the value functions to be represented as arrays and tables.
|
||||
**Q-learning** is an example of a tabular method, since a table is used to represent the values for the different state-action pairs.
|
||||
|
||||
- **Deep Q-Learning:** method that trains a neural network to approximate, given a state, the different **Q-values** for each possible action at that state.
|
||||
It is used to solve problems where the observation space is too big to apply a tabular Q-Learning approach.
|
||||
|
||||
- **Temporal Limitation:** a difficulty that arises when the environment state is represented by frames. A frame by itself does not provide temporal information.
|
||||
In order to obtain temporal information, we need to **stack** a number of frames together.
|
||||
|
||||
- **Phases of Deep Q-Learning:**
|
||||
- **Sampling:** actions are performed, and observed experience tuples are stored in a **replay memory**.
|
||||
- **Training:** batches of tuples are selected randomly and the neural network updates its weights using gradient descent.
|
||||
|
||||
- **Solutions to stabilize Deep Q-Learning:**
|
||||
- **Experience Replay:** a replay memory is created to save experiences samples that can be reused during training.
|
||||
This allows the agent to learn from the same experiences multiple times. It also helps the agent avoid forgetting previous experiences as it gets new ones.
|
||||
**Random sampling** from the replay buffer removes correlation in the observation sequences and prevents action values from oscillating or diverging
|
||||
catastrophically.
|
||||
|
||||
- **Fixed Q-Target:** In order to calculate the **Q-Target** we need to estimate the discounted optimal **Q-value** of the next state by using Bellman equation. The problem
|
||||
is that the same network weights are used to calculate the **Q-Target** and the **Q-value**. This means that every time we modify the **Q-value**, the **Q-Target** also moves with it.
|
||||
To avoid this issue, a separate network with fixed parameters is used for estimating the Temporal Difference Target. The target network is updated by copying parameters from
|
||||
our Deep Q-Network every **C steps**.
|
||||
|
||||
- **Double DQN:** method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Q-Value generation**:
|
||||
- **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**).
|
||||
- **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
|
||||
This approach reduces **Q-Value** overestimation, helps to train faster, and leads to more stable learning.
|
||||
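Here is a schematic PyTorch sketch of these last two ideas together, a fixed target network plus the Double DQN target (dimensions and names are illustrative, not a reference implementation):

```python
import copy
import torch
import torch.nn as nn

# Online network and a frozen copy used as the fixed Q-Target network
q_net = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 4))
target_net = copy.deepcopy(q_net)

def double_dqn_target(rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # 1) the online DQN selects the best action for the next state...
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # 2) ...the target network evaluates that action's Q-value
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    # Bellman target: reward + discounted value of the next state (0 if done)
    return rewards + gamma * (1 - dones) * next_q

# Every C training steps, copy the online weights into the target network:
# target_net.load_state_dict(q_net.state_dict())
```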
|
||||
|
||||
If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
|
||||
|
||||
This glossary was made possible thanks to:
|
||||
|
||||
- [Dario Paez](https://github.com/dario248)
|
||||
@@ -18,12 +18,15 @@ We're using the [RL-Baselines-3 Zoo integration](https://github.com/DLR-RM/rl-ba
|
||||
|
||||
Also, **if you want to learn to implement Deep Q-Learning by yourself after this hands-on**, you definitely should look at CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
|
||||
|
||||
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.
|
||||
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 200**.
|
||||
|
||||
To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**
|
||||
|
||||
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
|
||||
|
||||
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
|
||||
|
||||
|
||||
**To start the hands-on click on Open In Colab button** 👇 :
|
||||
|
||||
[](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit3/unit3.ipynb)
|
||||
@@ -68,13 +71,6 @@ Before diving into the notebook, you need to:
|
||||
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).
|
||||
|
||||
# Let's train a Deep Q-Learning agent playing Atari' Space Invaders 👾 and upload it to the Hub.
|
||||
|
||||
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.
|
||||
|
||||
To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**
|
||||
|
||||
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
|
||||
|
||||
## Set the GPU 💪
|
||||
|
||||
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
|
||||
@@ -141,7 +137,7 @@ To train an agent with RL-Baselines3-Zoo, we just need to do two things:
|
||||
|
||||
|
||||
Here we see that:
|
||||
- We use the `Atari Wrapper` that does the pre-processing (Frame reduction, grayscale, stack four frames frames),
|
||||
- We use the `Atari Wrapper` that does the pre-processing (Frame reduction, grayscale, stack four frames),
|
||||
- We use `CnnPolicy`, since we use Convolutional layers to process the frames.
|
||||
- We train the model for 10 million `n_timesteps`.
|
||||
- Memory (Experience Replay) size is 100000, i.e. the number of experience steps saved so that the agent can train on them again.
|
||||
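To make this concrete, a hyperparameter entry of this shape can be sketched as follows (this is an illustrative sketch, not the zoo's exact file; only the fields discussed above are grounded, so check `rl-baselines3-zoo`'s own `dqn.yml` for the authoritative values):

```yaml
# Sketch of an RL-Baselines3-Zoo hyperparameter entry for Atari DQN
atari:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper  # frame reduction, grayscale
  frame_stack: 4          # stack four frames
  policy: 'CnnPolicy'     # convolutional layers to process the frames
  n_timesteps: !!float 1e7
  buffer_size: 100000     # experience replay size
```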
|
||||
@@ -76,8 +76,8 @@ For instance, in pong, our agent **will be unable to know the ball direction if
|
||||
|
||||
**1. Make more efficient use of the experiences during the training**
|
||||
|
||||
Usually, in online reinforcement learning, the agent interacts in the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient
|
||||
But with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.
|
||||
Usually, in online reinforcement learning, the agent interacts in the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.
|
||||
But, with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.
|
||||
|
||||
**2. Avoid forgetting previous experiences and reduce the correlation between experiences**
|
||||
|
||||
|
||||
20
units/en/unit4/additional-readings.mdx
Normal file
@@ -0,0 +1,20 @@
|
||||
# Additional Readings
|
||||
|
||||
These are **optional readings** if you want to go deeper.
|
||||
|
||||
|
||||
## Introduction to Policy Optimization
|
||||
|
||||
- [Part 3: Intro to Policy Optimization - Spinning Up documentation](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)
|
||||
|
||||
|
||||
## Policy Gradient
|
||||
|
||||
- [https://johnwlambert.github.io/policy-gradients/](https://johnwlambert.github.io/policy-gradients/)
|
||||
- [RL - Policy Gradient Explained](https://jonathan-hui.medium.com/rl-policy-gradients-explained-9b13b688b146)
|
||||
- [Chapter 13, Policy Gradient Methods; Reinforcement Learning, an introduction by Richard Sutton and Andrew G. Barto](http://incompleteideas.net/book/RLbook2020.pdf)
|
||||
|
||||
## Implementation
|
||||
|
||||
- [PyTorch Reinforce implementation](https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py)
|
||||
- [Implementations from DDPG to PPO](https://github.com/MrSyee/pg-is-all-you-need)
|
||||
74
units/en/unit4/advantages-disadvantages.mdx
Normal file
@@ -0,0 +1,74 @@
|
||||
# The advantages and disadvantages of policy-gradient methods
|
||||
|
||||
At this point, you might ask, "but Deep Q-Learning is excellent! Why use policy-gradient methods?". To answer this question, let's study the **advantages and disadvantages of policy-gradient methods**.
|
||||
|
||||
## Advantages
|
||||
|
||||
There are multiple advantages over value-based methods. Let's see some of them:
|
||||
|
||||
### The simplicity of integration
|
||||
|
||||
We can estimate the policy directly without storing additional data (action values).
|
||||
|
||||
### Policy-gradient methods can learn a stochastic policy
|
||||
|
||||
Policy-gradient methods can **learn a stochastic policy while value functions can't**.
|
||||
|
||||
This has two consequences:
|
||||
|
||||
1. We **don't need to implement an exploration/exploitation trade-off by hand**. Since we output a probability distribution over actions, the agent explores **the state space without always taking the same trajectory.**
|
||||
|
||||
2. We also get rid of the problem of **perceptual aliasing**. Perceptual aliasing is when two states seem (or are) the same but need different actions.
|
||||
|
||||
Let's take an example: we have an intelligent vacuum cleaner whose goal is to suck the dust and avoid killing the hamsters.
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster1.jpg" alt="Hamster 1"/>
|
||||
</figure>
|
||||
|
||||
Our vacuum cleaner can only perceive where the walls are.
|
||||
|
||||
The problem is that the **two rose cases are aliased states because the agent perceives an upper and lower wall for each**.
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster2.jpg" alt="Hamster 1"/>
|
||||
</figure>
|
||||
|
||||
Under a deterministic policy, the agent will either always move right or always move left when in a rose state. **Either case will cause our agent to get stuck and never suck the dust**.
|
||||
|
||||
Under a value-based Reinforcement learning algorithm, we learn a **quasi-deterministic policy** ("epsilon-greedy strategy"). Consequently, our agent can **spend a lot of time before finding the dust**.
|
||||
|
||||
On the other hand, an optimal stochastic policy **will randomly move left or right in rose states**. Consequently, **it will not be stuck and will reach the goal state with a high probability**.
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster3.jpg" alt="Hamster 1"/>
|
||||
</figure>
|
||||
|
||||
### Policy-gradient methods are more effective in high-dimensional action spaces and continuous actions spaces
|
||||
|
||||
The problem with Deep Q-learning is that its **predictions assign a score (maximum expected future reward) to each possible action**, at each time step, given the current state.
|
||||
|
||||
But what if we have an infinite possibility of actions?
|
||||
|
||||
For instance, with a self-driving car, at each state, you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). **We'll need to output a Q-value for each possible action**! And **taking the max action of a continuous output is an optimization problem itself**!
|
||||
|
||||
Instead, with policy-gradient methods, we output a **probability distribution over actions.**
|
||||
|
||||
### Policy-gradient methods have better convergence properties
|
||||
|
||||
In value-based methods, we use an aggressive operator to **change the value function: we take the maximum over Q-estimates**.
|
||||
Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
|
||||
|
||||
For instance, if during training the best action was left (with a Q-value of 0.22) and, one training step later, it becomes right (since the Q-value of right rises to 0.23), the policy changes dramatically: it will now take right most of the time instead of left.
|
||||
|
||||
On the other hand, in policy-gradient methods, stochastic policy action preferences (probability of taking action) **change smoothly over time**.
|
||||
|
||||
## Disadvantages
|
||||
|
||||
Naturally, policy-gradient methods also have some disadvantages:
|
||||
|
||||
- **Frequently, policy-gradient converges on a local maximum instead of a global optimum.**
|
||||
- Policy-gradient goes slower, **step by step: it can take longer to train (inefficient).**
|
||||
- Policy-gradient can have high variance. We'll see in the actor-critic unit why this happens and how we can solve this problem.
|
||||
|
||||
👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).
|
||||
17
units/en/unit4/conclusion.mdx
Normal file
@@ -0,0 +1,17 @@
|
||||
# Conclusion
|
||||
|
||||
|
||||
**Congrats on finishing this unit**! There was a lot of information.
|
||||
And congrats on finishing the tutorial. You've just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub 🥳.
|
||||
|
||||
Don't hesitate to iterate on this unit **by improving the implementation for more complex environments** (for instance, what about changing the network to a Convolutional Neural Network to handle
|
||||
frames as observations?)
|
||||
|
||||
In the next unit, **we're going to learn more about Unity MLAgents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents
|
||||
to compete against other agents in a snowball fight and a soccer game.**
|
||||
|
||||
Sounds fun? See you next time!
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
1015
units/en/unit4/hands-on.mdx
Normal file
File diff suppressed because it is too large
Load Diff
24
units/en/unit4/introduction.mdx
Normal file
@@ -0,0 +1,24 @@
|
||||
# Introduction [[introduction]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/thumbnail.png" alt="thumbnail"/>
|
||||
|
||||
In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**
|
||||
|
||||
Since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
|
||||
|
||||
In value-based methods, the policy ** \\(π\\) only exists because of the action value estimates since the policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
|
||||
|
||||
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
|
||||
|
||||
So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
|
||||
Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
|
||||
|
||||
You'll then be able to iterate and improve this implementation for more advanced environments.
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
|
||||
</figure>
|
||||
|
||||
Let's get started,
|
||||
83
units/en/unit4/pg-theorem.mdx
Normal file
@@ -0,0 +1,83 @@
|
||||
# (Optional) the Policy Gradient Theorem
|
||||
|
||||
In this optional section, we're **going to study how we differentiate the objective function that we will use to approximate the policy gradient**.
|
||||
|
||||
Let's first recap our different formulas:
|
||||
|
||||
1. The Objective function
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
|
||||
|
||||
|
||||
2. The probability of a trajectory (given that action comes from \\(\pi_\theta\\)):
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
|
||||
|
||||
|
||||
So we have:
|
||||
|
||||
\\(\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)\\)
|
||||
|
||||
|
||||
We can rewrite the gradient of the sum as the sum of the gradient:
|
||||
|
||||
\\( = \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau) \\)
|
||||
|
||||
We then multiply every term in the sum by \\(\frac{P(\tau;\theta)}{P(\tau;\theta)}\\) (which is possible since it equals 1):
|
||||
|
||||
\\( = \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau) \\)
|
||||
|
||||
We can simplify this further since \\( \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\)
|
||||
|
||||
\\(= \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \\)
|
||||
|
||||
We can then use the *derivative log trick* (also called *likelihood ratio trick* or *REINFORCE trick*), a simple rule in calculus that implies that \\( \nabla_x log f(x) = \frac{\nabla_x f(x)}{f(x)} \\)
|
||||
|
||||
So, given \\(\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\), we can rewrite it as \\(\nabla_\theta log P(\tau|\theta) \\)
|
||||
|
||||
|
||||
|
||||
So this is our likelihood policy gradient:
|
||||
|
||||
\\( \nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta log P(\tau;\theta) R(\tau) \\)
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Thanks to this new formula, we can estimate the gradient using trajectory samples (we approximate the likelihood ratio policy gradient with a sample-based estimate, if you prefer).
|
||||
|
||||
\\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\) where each \\(\tau^{(i)}\\) is a sampled trajectory.
|
||||
|
||||
|
||||
But we still have some mathematical work to do: we need to simplify \\( \nabla_\theta log P(\tau|\theta) \\)
|
||||
|
||||
We know that:
|
||||
|
||||
\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})]\\)
|
||||
|
||||
Where \\(\mu(s_0)\\) is the initial state distribution and \\( P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \\) is the state transition dynamics of the MDP.
|
||||
|
||||
We know that the log of a product is equal to the sum of the logs:
|
||||
|
||||
\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta \left[log \mu(s_0) + \sum\limits_{t=0}^{H}log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) + \sum\limits_{t=0}^{H}log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right] \\)
|
||||
|
||||
We also know that the gradient of the sum is equal to the sum of gradient:
|
||||
|
||||
\\( \nabla_\theta log P(\tau^{(i)};\theta)=\nabla_\theta log\mu(s_0) + \nabla_\theta \sum\limits_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) + \nabla_\theta \sum\limits_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)
|
||||
|
||||
|
||||
Since neither the initial state distribution nor the state transition dynamics of the MDP depend on \\(\theta\\), the derivative of both terms is 0. So we can remove them:
|
||||
|
||||
Since:
|
||||
\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and \\( \nabla_\theta \mu(s_0) = 0\\)
|
||||
|
||||
\\(\nabla_\theta log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\\)
|
||||
|
||||
We can rewrite the gradient of the sum as the sum of gradients:
|
||||
|
||||
\\( \nabla_\theta log P(\tau^{(i)};\theta)= \sum_{t=0}^{H} \nabla_\theta log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)
|
||||
|
||||
So, the final formula for estimating the policy gradient is:
|
||||
|
||||
\\( \nabla_{\theta} J(\theta) = \hat{g} = \frac{1}{m} \sum^{m}_{i=1} \sum^{H}_{t=0} \nabla_\theta \log \pi_\theta(a^{(i)}_{t} | s_{t}^{(i)})R(\tau^{(i)}) \\)
|
||||
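To make the estimator concrete, here is a minimal PyTorch sketch (tensor names are illustrative, not the hands-on's exact code) that turns the sample-based estimate into a surrogate loss whose gradient is \\(\hat{g}\\), letting autograd do the differentiation:

```python
import torch

def policy_gradient_loss(log_probs, returns):
    """
    log_probs: list of tensors, one per sampled trajectory, each of shape (H,),
               containing log pi_theta(a_t | s_t) for every step of that trajectory.
    returns:   list of scalars, R(tau) for each trajectory.
    Minimizing the returned loss performs gradient ascent on J(theta).
    """
    m = len(log_probs)
    # (1/m) * sum_i [ R(tau_i) * sum_t log pi_theta(a_t | s_t) ]
    surrogate = sum(R * lp.sum() for lp, R in zip(log_probs, returns)) / m
    return -surrogate  # negative sign because optimizers minimize
```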
120
units/en/unit4/policy-gradient.mdx
Normal file
@@ -0,0 +1,120 @@
|
||||
# Diving deeper into policy-gradient methods
|
||||
|
||||
## Getting the big picture
|
||||
|
||||
We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.
|
||||
|
||||
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
|
||||
|
||||
If we take the example of CartPole-v1:
|
||||
- As input, we have a state.
|
||||
- As output, we have a probability distribution over actions at that state.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
|
||||
|
||||
Our goal with policy-gradient is to **control the probability distribution of actions** by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.**
|
||||
Each time the agent interacts with the environment, we tweak the parameters such that good actions will be sampled more likely in the future.
|
||||
|
||||
But **how are we going to optimize the weights using the expected return**?
|
||||
|
||||
The idea is that we're going to **let the agent interact during an episode**. If we win the episode, we consider that each action taken was good and must be sampled more often in the future,
|
||||
since they led to the win.
|
||||
|
||||
So for each state-action pair, we want to increase the \\(P(a|s)\\): the probability of taking that action at that state. Or decrease if we lost.
|
||||
|
||||
The Policy-gradient algorithm (simplified) looks like this:
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg_bigpicture.jpg" alt="Policy Gradient Big Picture"/>
|
||||
</figure>
|
||||
|
||||
Now that we got the big picture, let's dive deeper into policy-gradient methods.
|
||||
|
||||
## Diving deeper into policy-gradient methods
|
||||
|
||||
We have our stochastic policy \\(\pi\\) which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution of actions**.
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="Policy"/>
|
||||
</figure>
|
||||
|
||||
Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\) given our policy.
|
||||
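As a concrete illustration, here is a minimal sketch of such a parameterized stochastic policy in PyTorch (assuming CartPole-v1's 4-dimensional state and 2 actions; names and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """pi_theta: state -> probability distribution over actions."""
    def __init__(self, state_dim: int = 4, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns the scores into a probability distribution over actions
        return torch.softmax(self.net(state), dim=-1)

policy = Policy()
probs = policy(torch.randn(1, 4))            # e.g. tensor([[0.53, 0.47]])
dist = torch.distributions.Categorical(probs)
action = dist.sample()                        # sampled action a_t
log_prob = dist.log_prob(action)              # log pi_theta(a_t | s_t)
```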
|
||||
**But how do we know if our policy is good?** We need to have a way to measure it. To know that, we define a score/objective function called \\(J(\theta)\\).
|
||||
|
||||
### The objective function
|
||||
|
||||
The *objective function* gives us the **performance of the agent** given a trajectory (a state-action sequence, without considering the reward, contrary to an episode), and it outputs the *expected cumulative reward*.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>
|
||||
|
||||
Let's detail this formula a little bit more:
|
||||
- The *expected return* (also called expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
|
||||
|
||||
|
||||
- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
|
||||
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\( \theta\\), since it defines the policy used to select the actions of the trajectory, which also has an impact on the states visited).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
|
||||
|
||||
- \\(J(\theta)\\) : Expected return. We calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta \\) multiplied by the return of that trajectory.
|
||||
|
||||
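Putting these pieces together, the objective shown in the images above can be written compactly as \\(J(\theta) = \sum_{\tau} P(\tau;\theta)R(\tau)\\).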
Our objective then is to maximize the expected cumulative reward by finding \\(\theta \\) that will output the best action probability distributions:
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/max_objective.png" alt="Max objective"/>
|
||||
|
||||
|
||||
## Gradient Ascent and the Policy-gradient Theorem
|
||||
|
||||
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent*, since it gives the direction of the steepest increase of \\(J(\theta)\\).
|
||||
|
||||
(If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).
|
||||
|
||||
Our update step for gradient-ascent is:
|
||||
|
||||
\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
|
||||
|
||||
We can repeatedly apply this update step in the hope that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
|
||||
|
||||
However, we have two problems to obtain the derivative of \\(J(\theta)\\):
|
||||
1. We can't calculate the true gradient of the objective function, since it would imply calculating the probability of each possible trajectory, which is computationally very expensive.
|
||||
We therefore want to **calculate a gradient estimation with a sample-based estimate (collecting some trajectories)**.
|
||||
|
||||
2. We have another problem that I detail in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
|
||||
|
||||
Fortunately, we're going to use a solution called the Policy Gradient Theorem, which helps us reformulate the objective function into a differentiable function that does not involve differentiating the state distribution.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
|
||||
|
||||
If you want to understand how we derive this formula that we will use to approximate the gradient, check the next (optional) section.
|
||||
|
||||
## The Reinforce algorithm (Monte Carlo Reinforce)
|
||||
|
||||
The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
|
||||
|
||||
In a loop:
|
||||
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
|
||||
- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_one.png" alt="Policy Gradient"/>
|
||||
</figure>
|
||||
|
||||
- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
|
||||
|
||||
The interpretation we can make is this one:
|
||||
- \\(\nabla_\theta log \pi_\theta(a_t|s_t)\\) is the direction of **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
|
||||
This tells us **how we should change the weights of policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
|
||||
- \\(R(\tau)\\): is the scoring function:
|
||||
- If the return is high, it will **push up the probabilities** of the (state, action) combinations.
|
||||
- Else, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
|
||||
|
||||
|
||||
We can also **collect multiple episodes (trajectories)** to estimate the gradient:
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_multiple.png" alt="Policy Gradient"/>
|
||||
</figure>
|
||||
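To tie the pieces together, here is a compact sketch of one Reinforce update in PyTorch. It assumes a `policy` module that returns action probabilities, a classic Gym-style `env` (where `step` returns `(next_state, reward, done, info)`), and an `optimizer` over the policy parameters; names are illustrative and this is a simplified sketch, not the hands-on's exact implementation.

```python
import torch

def reinforce_update(policy, optimizer, env, gamma=0.99):
    """One Reinforce step: collect an episode with pi_theta, then theta <- theta + alpha * g_hat."""
    state = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())  # classic Gym API
        rewards.append(reward)

    # R(tau): discounted return of the whole episode
    episode_return = sum(gamma**t * r for t, r in enumerate(rewards))

    # Minimizing -R(tau) * sum_t log pi_theta(a_t|s_t) is gradient ascent on J(theta)
    loss = -episode_return * torch.cat(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```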
82
units/en/unit4/quiz.mdx
Normal file
@@ -0,0 +1,82 @@
|
||||
# Quiz
|
||||
|
||||
The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
|
||||
|
||||
|
||||
### Q1: What are the advantages of policy-gradient over value-based methods? (Check all that apply)
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "Policy-gradient methods can learn a stochastic policy",
|
||||
explain: "",
|
||||
correct: true,
|
||||
},
|
||||
{
|
||||
text: "Policy-gradient methods are more effective in high-dimensional action spaces and continuous actions spaces",
|
||||
explain: "",
|
||||
correct: true,
|
||||
},
|
||||
{
|
||||
text: "Policy-gradient converges most of the time on a global maximum.",
|
||||
explain: "No, frequently, policy-gradient converges on a local maximum instead of a global optimum.",
|
||||
},
|
||||
]}
|
||||
/>
|
||||
|
||||
### Q2: What is the Policy Gradient Theorem?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
*The Policy Gradient Theorem* is a formula that will help us to reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
### Q3: What's the difference between policy-based methods and policy-gradient methods? (Check all that apply)
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "Policy-based methods are a subset of policy-gradient methods.",
|
||||
explain: "",
|
||||
},
|
||||
{
|
||||
text: "Policy-gradient methods are a subset of policy-based methods.",
|
||||
explain: "",
|
||||
correct: true,
|
||||
},
|
||||
{
|
||||
text: "In Policy-based methods, we can optimize the parameter θ **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.",
|
||||
explain: "",
|
||||
correct: true,
|
||||
},
|
||||
{
|
||||
text: "In Policy-gradient methods, we optimize the parameter θ **directly** by performing the gradient ascent on the performance of the objective function.",
|
||||
explain: "",
|
||||
correct: true,
|
||||
},
|
||||
]}
|
||||
/>
|
||||
|
||||
|
||||
### Q4: Why do we use gradient ascent instead of gradient descent to optimize J(θ)?
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "We want to minimize J(θ) and gradient ascent gives us the gives the direction of the steepest increase of J(θ)",
|
||||
explain: "",
|
||||
},
|
||||
{
|
||||
text: "We want to maximize J(θ) and gradient ascent gives us the gives the direction of the steepest increase of J(θ)",
|
||||
explain: "",
|
||||
correct: true
|
||||
},
|
||||
]}
|
||||
/>
|
||||
|
||||
Congrats on finishing this Quiz 🥳, if you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
|
||||
42
units/en/unit4/what-are-policy-based-methods.mdx
Normal file
@@ -0,0 +1,42 @@
|
||||
# What are the policy-based methods?
|
||||
|
||||
The main goal of Reinforcement learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**.
|
||||
This is because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
|
||||
|
||||
For instance, in a soccer game (where you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
|
||||
**maximizing the number of goals scored** (when the ball crosses the goal line) in your opponent's goal, and **minimizing the number of goals scored in your own goal**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/soccer.jpg" alt="Soccer" />
|
||||
|
||||
## Value-based, Policy-based, and Actor-critic methods
|
||||
|
||||
In the first unit, we studied two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
|
||||
|
||||
- In *value-based methods*, we learn a value function.
|
||||
- The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
|
||||
- Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
|
||||
- We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
|
||||
|
||||
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
|
||||
- The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
|
||||
- <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
|
||||
- Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
|
||||
- To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
|
||||
|
||||
- Finally, next time we'll study *actor-critic*, which is a combination of value-based and policy-based methods.
|
||||
|
||||
Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
|
||||
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.
|
||||
|
||||
## The difference between policy-based and policy-gradient methods
|
||||
|
||||
Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
|
||||
|
||||
The difference between these two methods **lies in how we optimize the parameter** \\(\theta\\):
|
||||
|
||||
- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
|
||||
- In *policy-gradient methods*, because it is a subclass of the policy-based methods, we search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly** by performing the gradient ascent on the performance of the objective function \\(J(\theta)\\).
|
||||
|
||||
Before diving more into how policy-gradient methods work (the objective function, policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.
|
||||
19
units/en/unit5/bonus.mdx
Normal file
@@ -0,0 +1,19 @@
|
||||
# Bonus: Learn to create your own environments with Unity and MLAgents
|
||||
|
||||
**You can create your own reinforcement learning environments with Unity and MLAgents**. Using a game engine such as Unity can be intimidating at first, so here are the steps you can follow to learn smoothly.
|
||||
|
||||
## Step 1: Know how to use Unity
|
||||
|
||||
- The best way to learn Unity is to do ["Create with Code" course](https://learn.unity.com/course/create-with-code): it's a series of videos for beginners where **you will create 5 small games with Unity**.
|
||||
|
||||
## Step 2: Create the simplest environment with this tutorial
|
||||
|
||||
- Then, when you know how to use Unity, you can create your [first basic RL environment using this tutorial](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Learning-Environment-Create-New.md).
|
||||
|
||||
## Step 3: Iterate and create nice environments
|
||||
|
||||
- Now that you've created a first simple environment, you can iterate and create more complex ones using the [MLAgents documentation (especially the Designing Agents and Agent parts)](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/)
|
||||
- In addition, you can follow this free course ["Create a hummingbird environment"](https://learn.unity.com/course/ml-agents-hummingbirds) by [Adam Kelly](https://twitter.com/aktwelve)
|
||||
|
||||
|
||||
Have fun! And if you create custom environments, don't hesitate to share them in the `#rl-i-made-this` Discord channel.
|
||||
22
units/en/unit5/conclusion.mdx
Normal file
@@ -0,0 +1,22 @@
|
||||
# Conclusion
|
||||
|
||||
Congrats on finishing this unit! You’ve just trained your first ML-Agents and shared it to the Hub 🥳.
|
||||
|
||||
The best way to learn is to **practice and try stuff**. Why not try another environment? [ML-Agents has 18 different environments](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md).
|
||||
|
||||
For instance:
|
||||
- [Worm](https://singularite.itch.io/worm), where you teach a worm to crawl.
|
||||
- [Walker](https://singularite.itch.io/walker): teach an agent to walk towards a goal.
|
||||
|
||||
Check the documentation to find how to train them and the list of already integrated MLAgents environments on the Hub: https://github.com/huggingface/ml-agents#getting-started
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/envs-unity.jpeg" alt="Example envs"/>
|
||||
|
||||
|
||||
In the next unit, we're going to learn about multi-agents. You're going to train your first multi-agents to compete in Soccer and Snowball Fight against other classmates' agents.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight.gif" alt="Snownball fight"/>
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
50
units/en/unit5/curiosity.mdx
Normal file
@@ -0,0 +1,50 @@
|
||||
# (Optional) What is Curiosity in Deep Reinforcement Learning?
|
||||
|
||||
This is an (optional) introduction to Curiosity. If you want to learn more, you can read two additional articles where we dive into the mathematical details:
|
||||
|
||||
- [Curiosity-Driven Learning through Next State Prediction](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-next-state-prediction-f7f4e2f592fa)
|
||||
- [Random Network Distillation: a new take on Curiosity-Driven Learning](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938)
|
||||
|
||||
## Two Major Problems in Modern RL
|
||||
|
||||
To understand what Curiosity is, we first need to understand the two major problems with RL:
|
||||
|
||||
First, the *sparse rewards problem:* that is, **most rewards do not contain information, and hence are set to zero**.
|
||||
|
||||
Remember that RL is based on the *reward hypothesis*, which is the idea that each goal can be described as the maximization of the rewards. Therefore, rewards act as feedback for RL agents; **if they don’t receive any, their knowledge of which action is appropriate (or not) cannot change**.
|
||||
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity1.png" alt="Curiosity"/>
|
||||
<figcaption>Source: Thanks to the reward, our agent knows that this action at that state was good</figcaption>
|
||||
</figure>
|
||||
|
||||
|
||||
For instance, in [Vizdoom](https://vizdoom.cs.put.edu.pl/), a set of environments based on the game Doom, in the "DoomMyWayHome" scenario your agent is only rewarded **if it finds the vest**.
|
||||
However, the vest is far away from your starting point, so most of your rewards will be zero. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy, and **it can spend time turning around without finding the goal**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity2.png" alt="Curiosity"/>
|
||||
|
||||
The second big problem is that **the extrinsic reward function is handmade; in each environment, a human has to implement a reward function**. But how can we scale that to big and complex environments?
|
||||
|
||||
## So what is Curiosity?
|
||||
|
||||
A solution to these problems is **to develop a reward function intrinsic to the agent, i.e., generated by the agent itself**. The agent will act as a self-learner since it will be the student and its own feedback master.
|
||||
|
||||
**This intrinsic reward mechanism is known as Curiosity** because this reward pushes the agent to explore states that are novel/unfamiliar. To achieve that, our agent will receive a high reward when exploring new trajectories.
|
||||
|
||||
This reward is inspired by how humans act: **we naturally have an intrinsic desire to explore environments and discover new things**.
|
||||
|
||||
There are different ways to calculate this intrinsic reward. The classical approach (Curiosity through next-state prediction) is to calculate Curiosity **as the error of our agent in predicting the next state, given the current state and action taken**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity3.png" alt="Curiosity"/>
|
||||
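As a rough sketch (not ML-Agents' actual implementation; names and dimensions are illustrative), curiosity through next-state prediction could look like this in PyTorch, where the intrinsic reward is simply the forward model's prediction error:

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state from the current state and the action taken."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, state, action):
        # action is assumed to be one-hot encoded here
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state):
    # The worse the prediction, the more "novel" the transition -> higher curiosity reward
    predicted = model(state, action)
    return ((predicted - next_state) ** 2).mean(dim=-1)
```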
|
||||
The idea of Curiosity is to **encourage our agent to perform actions that reduce the uncertainty in the agent’s ability to predict the consequences of its actions** (uncertainty will be higher in areas where the agent has spent less time or in areas with complex dynamics).
|
||||
|
||||
If the agent spends a lot of time in these states, it will be good at predicting the next state (low Curiosity). On the other hand, in a new, unexplored state, it will be bad at predicting the following state (high Curiosity).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity4.png" alt="Curiosity"/>
|
||||
|
||||
Using Curiosity will push our agent to favor transitions with high prediction error (which will be higher in areas where the agent has spent less time, or in areas with complex dynamics) and **consequently better explore our environment**.
|
||||
|
||||
There are also **other curiosity calculation methods**. ML-Agents uses a more advanced one called Curiosity through random network distillation. This is out of the scope of this tutorial, but if you’re interested, [I wrote an article explaining it in detail](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938).
|
||||
370
units/en/unit5/hands-on.mdx
Normal file
@@ -0,0 +1,370 @@
|
||||
# Hands-on
|
||||
|
||||
<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
|
||||
notebooks={[
|
||||
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit5/unit5.ipynb"}
|
||||
]}
|
||||
askForHelpUrl="http://hf.co/join/discord" />
|
||||
|
||||
|
||||
We learned what ML-Agents is and how it works. We also studied the two environments we're going to use. Now we're ready to train our agents!
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/envs.png" alt="Environments" />
|
||||
|
||||
The ML-Agents integration on the Hub **is still experimental**. Some features will be added in the future. But for now, to validate this hands-on for the certification process, you just need to push your trained models to the Hub.
|
||||
There are no minimum results to attain to validate this Hands On. But if you want to get nice results, you can try to reach the following:
|
||||
|
||||
- For [Pyramids](https://singularite.itch.io/pyramids): Mean Reward = 1.75
|
||||
- For [SnowballTarget](https://singularite.itch.io/snowballtarget): Mean Reward = 15 or 30 targets shot in an episode.
|
||||
|
||||
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
|
||||
|
||||
**To start the hands-on, click on Open In Colab button** 👇 :
|
||||
|
||||
[](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit5/unit5.ipynb)
|
||||
|
||||
# Unit 5: An Introduction to ML-Agents
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png" alt="Thumbnail"/>
|
||||
|
||||
In this notebook, you'll learn about ML-Agents and train two agents.
|
||||
|
||||
- The first one will learn to **shoot snowballs onto spawning targets**.
|
||||
- The second needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, **and move to the gold brick at the top**. To do that, it will need to explore its environment, and we will use a technique called curiosity.
|
||||
|
||||
After that, you'll be able **to watch your agents playing directly on your browser**.
|
||||
|
||||
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
|
||||
|
||||
⬇️ Here is an example of what **you will achieve at the end of this unit.** ⬇️
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.gif" alt="Pyramids"/>
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget.gif" alt="SnowballTarget"/>
|
||||
|
||||
### 🎮 Environments:
|
||||
|
||||
- [Pyramids](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md#pyramids)
|
||||
- SnowballTarget
|
||||
|
||||
### 📚 RL-Library:
|
||||
|
||||
- [ML-Agents (HuggingFace Experimental Version)](https://github.com/huggingface/ml-agents)
|
||||
|
||||
⚠ We're going to use an experimental version of ML-Agents that lets you push Unity ML-Agents models to the Hub and load them from the Hub; **you need to install this same version**.
|
||||
|
||||
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
|
||||
|
||||
## Objectives of this notebook 🏆
|
||||
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Understand how **ML-Agents**, the environment library, works.
|
||||
- Be able to **train agents in Unity Environments**.
|
||||
|
||||
## Prerequisites 🏗️
|
||||
Before diving into the notebook, you need to:
|
||||
|
||||
🔲 📚 **Study [what is ML-Agents and how it works by reading Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction)** 🤗
|
||||
|
||||
# Let's train our agents 🚀
|
||||
|
||||
## Set the GPU 💪
|
||||
|
||||
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">
|
||||
|
||||
- `Hardware Accelerator > GPU`
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
|
||||
|
||||
## Clone the repository and install the dependencies 🔽
|
||||
- We need to clone the repository that **contains the experimental version of the library that allows you to push your trained agent to the Hub.**
|
||||
|
||||
```python
|
||||
%%capture
|
||||
# Clone the repository
|
||||
!git clone --depth 1 https://github.com/huggingface/ml-agents/
|
||||
```
|
||||
|
||||
```python
|
||||
%%capture
|
||||
# Go inside the repository and install the package
|
||||
%cd ml-agents
|
||||
!pip3 install -e ./ml-agents-envs
|
||||
!pip3 install -e ./ml-agents
|
||||
```
|
||||
|
||||
## SnowballTarget ⛄
|
||||
|
||||
If you need a refresher on how this environment works check this section 👉
|
||||
https://huggingface.co/deep-rl-course/unit5/snowball-target
|
||||
|
||||
### Download and move the environment zip file to `./training-envs-executables/linux/`
|
||||
- Our environment executable is in a zip file.
|
||||
- We need to download it and place it in `./training-envs-executables/linux/`
|
||||
- We use a Linux executable because we use Colab, and the Colab machine's OS is Ubuntu (Linux)
|
||||
|
||||
```python
|
||||
# Here, we create training-envs-executables and linux
|
||||
!mkdir ./training-envs-executables
|
||||
!mkdir ./training-envs-executables/linux
|
||||
```
|
||||
|
||||
Download the file SnowballTarget.zip from https://drive.google.com/file/d/1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5 using `wget`.
|
||||
|
||||
Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)
|
||||
|
||||
```python
|
||||
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1YHHLjyj6gaZ3Gemx1hQgqrPgSS2ZhmB5" -O ./training-envs-executables/linux/SnowballTarget.zip && rm -rf /tmp/cookies.txt
|
||||
```
|
||||
|
||||
We unzip the executable.zip file
|
||||
|
||||
```python
|
||||
%%capture
|
||||
!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/SnowballTarget.zip
|
||||
```
|
||||
|
||||
Make sure your file is accessible
|
||||
|
||||
```python
|
||||
!chmod -R 755 ./training-envs-executables/linux/SnowballTarget
|
||||
```
|
||||
|
||||
### Define the SnowballTarget config file
|
||||
- In ML-Agents, you define the **training hyperparameters in config.yaml files.**
|
||||
|
||||
There are multiple hyperparameters. To know them better, you should check the explanation for each of them in [the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)
|
||||
|
||||
|
||||
You need to create a `SnowballTarget.yaml` config file in ./content/ml-agents/config/ppo/
|
||||
|
||||
We'll give you a first version of this config here (to copy and paste into your `SnowballTarget.yaml` file), **but you should modify it**.
|
||||
|
||||
```yaml
|
||||
behaviors:
|
||||
SnowballTarget:
|
||||
trainer_type: ppo
|
||||
summary_freq: 10000
|
||||
keep_checkpoints: 10
|
||||
checkpoint_interval: 50000
|
||||
max_steps: 200000
|
||||
time_horizon: 64
|
||||
threaded: true
|
||||
hyperparameters:
|
||||
learning_rate: 0.0003
|
||||
learning_rate_schedule: linear
|
||||
batch_size: 128
|
||||
buffer_size: 2048
|
||||
beta: 0.005
|
||||
epsilon: 0.2
|
||||
lambd: 0.95
|
||||
num_epoch: 3
|
||||
network_settings:
|
||||
normalize: false
|
||||
hidden_units: 256
|
||||
num_layers: 2
|
||||
vis_encode_type: simple
|
||||
reward_signals:
|
||||
extrinsic:
|
||||
gamma: 0.99
|
||||
strength: 1.0
|
||||
```
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config1.png" alt="Config SnowballTarget"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config2.png" alt="Config SnowballTarget"/>
|
||||
|
||||
As an experiment, try to modify some other hyperparameters. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
|
||||
|
||||
Now that you've created the config file and understand what most hyperparameters do, we're ready to train our agent 🔥.
|
||||
|
||||
### Train the agent
|
||||
|
||||
To train our agent, we need to **launch mlagents-learn and select the executable containing the environment.**
|
||||
|
||||
We define four parameters:
|
||||
|
||||
1. `mlagents-learn <config>`: the path where the hyperparameter config file is.
|
||||
2. `--env`: where the environment executable is.
|
||||
3. `--run-id`: the name you want to give to your training run id.
|
||||
4. `--no-graphics`: to not launch the visualization during the training.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentslearn.png" alt="MlAgents learn"/>
|
||||
|
||||
Train the model and use the `--resume` flag to continue training in case of interruption.
|
||||
|
||||
> When you use `--resume`, it will fail the first time; try rerunning the block to bypass the error.
|
||||
|
||||
The training will take 10 to 35 minutes depending on your config. Go take a ☕️, you deserve it 🤗.
|
||||
|
||||
```bash
|
||||
!mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id="SnowballTarget1" --no-graphics
|
||||
```
|
||||
|
||||
### Push the agent to the Hugging Face Hub
|
||||
|
||||
- Now that we've trained our agent, we're **ready to push it to the Hub so you can visualize it playing in your browser 🔥.**
|
||||
|
||||
To be able to share your model with the community, there are three more steps to follow:
|
||||
|
||||
1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
|
||||
|
||||
2️⃣ Sign in and store your authentication token from the Hugging Face website.
|
||||
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
|
||||
|
||||
- Copy the token
|
||||
- Run the cell below and paste the token
|
||||
|
||||
```python
|
||||
from huggingface_hub import notebook_login
|
||||
|
||||
notebook_login()
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
|
||||
Then, we need to run `mlagents-push-to-hf`.
|
||||
|
||||
And we define four parameters:
|
||||
|
||||
1. `--run-id`: the name of the training run id.
|
||||
2. `--local-dir`: where the agent was saved; it’s `results/<run_id name>`, so in my case `results/First Training`.
|
||||
3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always <your huggingface username>/<the repo name>
|
||||
If the repo does not exist **it will be created automatically**
|
||||
4. `--commit-message`: since HF repos are git repositories, you need to define a commit message.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentspushtohub.png" alt="Push to Hub"/>
|
||||
|
||||
For instance:
|
||||
|
||||
`!mlagents-push-to-hf --run-id="SnowballTarget1" --local-dir="./results/SnowballTarget1" --repo-id="ThomasSimonini/ppo-SnowballTarget" --commit-message="First Push"`
|
||||
|
||||
```python
!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message
```
|
||||
|
||||
If everything worked, you should see this at the end of the process (but with a different URL 😆):
|
||||
|
||||
|
||||
|
||||
```
Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-SnowballTarget
```
|
||||
|
||||
It's the link to your model. It contains a model card that explains how to use it, your Tensorboard, and your config file. **What's awesome is that it's a git repository, which means you can have different commits, update your repository with a new push, etc.**
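For instance, since the repo is a git repository, a later push with the same flags and a new commit message updates it in place. A hypothetical follow-up push could look like this:

```python
!mlagents-push-to-hf --run-id="SnowballTarget1" --local-dir="./results/SnowballTarget1" --repo-id="ThomasSimonini/ppo-SnowballTarget" --commit-message="Push updated checkpoints"
```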
|
||||
|
||||
But now comes the best: **being able to visualize your agent online 👀.**
|
||||
|
||||
### Watch your agent playing 👀
|
||||
|
||||
This step is simple:
|
||||
|
||||
1. Remember your repo-id
|
||||
|
||||
2. Go here: https://singularite.itch.io/snowballtarget
|
||||
|
||||
3. Launch the game and put it in full screen by clicking on the bottom right button
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_load.png" alt="Snowballtarget load"/>
|
||||
|
||||
4. In step 1, choose your model repository, which is the model id (in my case ThomasSimonini/ppo-SnowballTarget).
|
||||
|
||||
5. In step 2, **choose what model you want to replay**:
|
||||
- I have multiple ones since we saved a model every 500000 timesteps.
|
||||
- But if I want the most recent one, I choose `SnowballTarget.onnx`
|
||||
|
||||
👉 What's nice **is to try different model checkpoints to see how the agent improved.**
|
||||
|
||||
And don't hesitate to share the best score your agent gets on Discord in the #rl-i-made-this channel 🔥
|
||||
|
||||
Let's now try a more challenging environment called Pyramids.
|
||||
|
||||
## Pyramids 🏆
|
||||
|
||||
### Download and move the environment zip file in `./training-envs-executables/linux/`
|
||||
- Our environment executable is in a zip file.
|
||||
- We need to download it and place it in `./training-envs-executables/linux/`
|
||||
- We use a Linux executable because we use Colab, and the Colab machines' OS is Ubuntu (Linux)
|
||||
|
||||
Download the file Pyramids.zip from https://drive.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)
|
||||
|
||||
```python
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H" -O ./training-envs-executables/linux/Pyramids.zip && rm -rf /tmp/cookies.txt
```
|
||||
|
||||
Unzip it
|
||||
|
||||
```python
%%capture
!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/Pyramids.zip
```
|
||||
|
||||
Make sure your file is accessible
|
||||
|
||||
```python
!chmod -R 755 ./training-envs-executables/linux/Pyramids/Pyramids
```
|
||||
|
||||
### Modify the PyramidsRND config file
|
||||
- Unlike the first environment, which was a custom one, **Pyramids was made by the Unity team**.
|
||||
- So the PyramidsRND config file already exists and is in ./content/ml-agents/config/ppo/PyramidsRND.yaml
|
||||
- You might ask why "RND" is in PyramidsRND. RND stands for *random network distillation*; it's a way to generate curiosity rewards. If you want to know more about that, we wrote an article explaining this technique: https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
|
||||
|
||||
For this training, we’ll modify one thing:
|
||||
- The total training steps hyperparameter is too high since we can hit the benchmark (mean reward = 1.75) in only 1M training steps.
|
||||
👉 To do that, we go to config/ppo/PyramidsRND.yaml **and change max_steps to 1000000.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-config.png" alt="Pyramids config"/>
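If you prefer to edit the file from code rather than by hand, here is a minimal sketch using `pyyaml`. It assumes the behavior in the config is named `Pyramids` and that the file lives at the path above:

```python
import yaml

config_path = "./config/ppo/PyramidsRND.yaml"

# Load the existing config, lower the training budget, and write it back.
with open(config_path) as f:
    config = yaml.safe_load(f)

# 1M steps is enough to reach the benchmark (mean reward = 1.75).
config["behaviors"]["Pyramids"]["max_steps"] = 1000000

with open(config_path, "w") as f:
    yaml.dump(config, f, default_flow_style=False)
```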
|
||||
|
||||
As an experiment, you should also try to modify some other hyperparameters. Unity provides [very good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
|
||||
|
||||
We’re now ready to train our agent 🔥.
|
||||
|
||||
### Train the agent
|
||||
|
||||
The training will take 30 to 45 minutes depending on your machine; go take a ☕️, you deserve it 🤗.
|
||||
|
||||
```python
!mlagents-learn ./config/ppo/PyramidsRND.yaml --env=./training-envs-executables/linux/Pyramids/Pyramids --run-id="Pyramids Training" --no-graphics
```
|
||||
|
||||
### Push the agent to the Hugging Face Hub
|
||||
|
||||
- Now that we've trained our agent, we’re **ready to push it to the Hub so you can visualize it playing in your browser 🔥.**
|
||||
|
||||
```bash
!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message
```
|
||||
|
||||
### Watch your agent playing 👀
|
||||
|
||||
The temporary link for the Pyramids demo is: https://singularite.itch.io/pyramids
|
||||
|
||||
### 🎁 Bonus: Why not train on another environment?
|
||||
Now that you know how to train an agent using MLAgents, **why not try another environment?**
|
||||
|
||||
ML-Agents provides 18 different environments, and we’re building some custom ones. The best way to learn is to try things on your own; have fun.
|
||||
|
||||

|
||||
|
||||
You can find the full list of the ones currently available on Hugging Face here 👉 https://github.com/huggingface/ml-agents#the-environments
|
||||
|
||||
For the demos to visualize your agent, the temporary link is: https://singularite.itch.io (temporary because we'll also put the demos on Hugging Face Spaces)
|
||||
|
||||
For now we have integrated:
|
||||
- [Worm](https://singularite.itch.io/worm) demo where you teach a **worm to crawl**.
|
||||
- [Walker](https://singularite.itch.io/walker) demo where you teach an agent **to walk towards a goal**.
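If you want to try one of them, here is a rough sketch of what the Worm workflow could look like. The `Worm.yaml` config ships with ML-Agents, but the executable path, run id, and repo id below are placeholders you would adapt to your own setup:

```python
# Hypothetical commands: adapt the executable path, run id and repo id to your setup.
!mlagents-learn ./config/ppo/Worm.yaml --env=./training-envs-executables/linux/Worm/Worm --run-id="Worm1" --no-graphics
!mlagents-push-to-hf --run-id="Worm1" --local-dir="./results/Worm1" --repo-id="<your-username>/ppo-Worm" --commit-message="First push"
```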
|
||||
|
||||
If you want new demos to be added, please open an issue: https://github.com/huggingface/deep-rl-class 🤗
|
||||
|
||||
That’s all for today. Congrats on finishing this tutorial!
|
||||
|
||||
The best way to learn is to practice and try stuff. Why not try another environment? ML-Agents has 18 different environments, but you can also create your own. Check the documentation and have fun!
|
||||
|
||||
See you on Unit 6 🔥,
|
||||
|
||||
## Keep Learning, Stay awesome 🤗
|
||||
68
units/en/unit5/how-mlagents-works.mdx
Normal file
68
units/en/unit5/how-mlagents-works.mdx
Normal file
@@ -0,0 +1,68 @@
|
||||
# How do Unity ML-Agents work? [[how-mlagents-works]]
|
||||
|
||||
Before training our agent, we need to understand **what ML-Agents is and how it works**.
|
||||
|
||||
## What is Unity ML-Agents? [[what-is-mlagents]]
|
||||
|
||||
[Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents) is a toolkit for the game engine Unity that **allows us to create environments using Unity or use pre-made environments to train our agents**.
|
||||
|
||||
It’s developed by [Unity Technologies](https://unity.com/), the developers of Unity, one of the most famous Game Engines used by the creators of Firewatch, Cuphead, and Cities: Skylines.
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/firewatch.jpeg" alt="Firewatch"/>
|
||||
<figcaption>Firewatch was made with Unity</figcaption>
|
||||
</figure>
|
||||
|
||||
## The six components [[six-components]]
|
||||
|
||||
With Unity ML-Agents, you have six essential components:
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/mlagents-1.png" alt="MLAgents"/>
|
||||
<figcaption>Source: <a href="https://unity-technologies.github.io/ml-agents/">Unity ML-Agents Documentation</a> </figcaption>
|
||||
</figure>
|
||||
|
||||
- The first is the *Learning Environment*, which contains **the Unity scene (the environment) and the environment elements** (game characters).
|
||||
- The second is the *Python Low-level API*, which contains **the low-level Python interface for interacting and manipulating the environment**. It’s the API we use to launch the training.
|
||||
- Then, we have the *External Communicator* that **connects the Learning Environment (made with C#) with the low level Python API (Python)**.
|
||||
- The *Python trainers*: the **Reinforcement Learning algorithms made with PyTorch (PPO, SAC…)**.
|
||||
- The *Gym wrapper*: to encapsulate the RL environment in a gym-compatible wrapper.
|
||||
- The *PettingZoo wrapper*: PettingZoo is the multi-agent version of the gym wrapper.
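As a rough sketch of how the Gym wrapper is used (the import path and options vary across ML-Agents versions, the executable path is a placeholder, and the wrapper expects a single-agent environment):

```python
from mlagents_envs.environment import UnityEnvironment
# The import path may differ in older releases (e.g. `from gym_unity.envs import UnityToGymWrapper`).
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

# Placeholder path: point it to a single-agent Unity environment executable.
unity_env = UnityEnvironment("./path/to/YourEnvironment", no_graphics=True)
env = UnityToGymWrapper(unity_env)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```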
|
||||
|
||||
## Inside the Learning Component [[inside-learning-component]]
|
||||
|
||||
Inside the Learning Component, we have **two important elements**:
|
||||
|
||||
- The first is the *agent component*, the actor of the scene. We’ll **train the agent by optimizing its policy** (which will tell us what action to take in each state). The policy is called *Brain*.
|
||||
- Finally, there is the *Academy*. This component **orchestrates agents and their decision-making processes**. Think of this Academy as a teacher who handles Python API requests.
|
||||
|
||||
To better understand its role, let’s remember the RL process. This can be modeled as a loop that works like this:
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%">
|
||||
<figcaption>The RL Process: a loop of state, action, reward and next state</figcaption>
|
||||
<figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto</a></figcaption>
|
||||
</figure>
|
||||
|
||||
Now, let’s imagine an agent learning to play a platform game. The RL process looks like this:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
|
||||
|
||||
- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
|
||||
- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
|
||||
- Environment goes to a **new** **state \\(S_1\\)** — new frame.
|
||||
- The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
|
||||
|
||||
This RL loop outputs a sequence of **state, action, reward and next state.** The goal of the agent is to **maximize the expected cumulative reward**.
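To make the loop concrete, here is a minimal sketch of it in Python, assuming the classic Gym API and using a random policy as a stand-in for a learned one:

```python
import gym

env = gym.make("CartPole-v1")

state = env.reset()
done = False
cumulative_reward = 0.0

while not done:
    action = env.action_space.sample()             # a random policy stands in for a learned one
    state, reward, done, info = env.step(action)   # the environment returns the next state and a reward
    cumulative_reward += reward                    # the quantity the agent learns to maximize

print(f"Episode return: {cumulative_reward}")
```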
|
||||
|
||||
The Academy is the component that **sends orders to our Agents and ensures that agents are in sync**:
|
||||
|
||||
- Collect Observations
|
||||
- Select your action using your policy
|
||||
- Take the Action
|
||||
- Reset if you reached the max step or if you’re done.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/academy.png" alt="The MLAgents Academy" width="100%">
|
||||
|
||||
|
||||
Now that we understand how ML-Agents works, **we’re ready to train our agents.**
|
||||
31
units/en/unit5/introduction.mdx
Normal file
31
units/en/unit5/introduction.mdx
Normal file
@@ -0,0 +1,31 @@
|
||||
# An Introduction to Unity ML-Agents [[introduction-to-ml-agents]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png" alt="thumbnail"/>
|
||||
|
||||
One of the challenges in Reinforcement Learning is **creating environments**. Fortunately for us, we can use game engines to do so.
|
||||
These engines, such as [Unity](https://unity.com/), [Godot](https://godotengine.org/) or [Unreal Engine](https://www.unrealengine.com/), are programs made to create video games. They are perfectly suited
|
||||
for creating environments: they provide physics systems, 2D/3D rendering, and more.
|
||||
|
||||
|
||||
One of them, [Unity](https://unity.com/), created the [Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents), a plugin based on the game engine Unity that allows us **to use the Unity Game Engine as an environment builder to train agents**. In the first bonus unit, this is what we used to train Huggy to catch a stick!
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/example-envs.png" alt="MLAgents environments"/>
|
||||
<figcaption>Source: <a href="https://github.com/Unity-Technologies/ml-agents">ML-Agents documentation</a></figcaption>
|
||||
</figure>
|
||||
|
||||
The Unity ML-Agents Toolkit provides many exceptional pre-made environments, from playing football (soccer) to learning to walk and jumping over big walls.
|
||||
|
||||
In this Unit, we'll learn to use ML-Agents, but **don't worry if you don't know how to use the Unity Game Engine**: you don't need to use it to train your agents.
|
||||
|
||||
So, today, we're going to train two agents:
|
||||
- The first one will learn to **shoot snowballs onto spawning targets**.
|
||||
- The second needs to **press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top**. To do that, it will need to explore its environment, which will be achieved using a technique called curiosity.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/envs.png" alt="Environments" />
|
||||
|
||||
Then, after training, **you'll push the trained agents to the Hugging Face Hub**, and you'll be able to **visualize them playing directly in your browser without having to use the Unity Editor**.
|
||||
|
||||
Doing this Unit will **prepare you for the next challenge: AI vs. AI, where you will train agents in multi-agent environments and compete against your classmates' agents**.
|
||||
|
||||
Sounds exciting? Let's get started!
|
||||
39
units/en/unit5/pyramids.mdx
Normal file
39
units/en/unit5/pyramids.mdx
Normal file
@@ -0,0 +1,39 @@
|
||||
# The Pyramid environment
|
||||
|
||||
The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.png" alt="Pyramids Environment"/>
|
||||
|
||||
|
||||
## The reward function
|
||||
|
||||
The reward function is:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward.png" alt="Pyramids Environment"/>
|
||||
|
||||
In terms of code, it looks like this:
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward-code.png" alt="Pyramids Reward"/>
|
||||
|
||||
To train this new agent, which seeks the button and then the Pyramid to knock over, we’ll use a combination of two types of rewards:
|
||||
|
||||
- The *extrinsic one* given by the environment (illustration above).
|
||||
- But also an *intrinsic* one called **curiosity**. This second one will **push our agent to be curious, or in other words, to better explore its environment**.
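As a purely illustrative sketch of the idea (the strength values here are hypothetical; in ML-Agents the actual weighting comes from the reward-signal settings in the config file):

```python
# Illustration only: the reward the agent optimizes is a weighted sum of the
# extrinsic signal and the intrinsic (curiosity) signal.
def total_reward(extrinsic: float, intrinsic: float,
                 extrinsic_strength: float = 1.0, curiosity_strength: float = 0.02) -> float:
    return extrinsic_strength * extrinsic + curiosity_strength * intrinsic

# Even with no extrinsic reward this step, a novel observation still gives a learning signal.
print(total_reward(extrinsic=0.0, intrinsic=0.5))
```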
|
||||
|
||||
If you want to know more about curiosity, the next section (optional) will explain the basics.
|
||||
|
||||
## The observation space
|
||||
|
||||
In terms of observation, we **use 148 raycasts that can each detect objects** (switch, bricks, golden brick, and walls).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids_raycasts.png"/>
|
||||
|
||||
We also use a **boolean variable indicating the switch state** (did we turn on or off the switch to spawn the Pyramid) and a vector that **contains the agent’s speed**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-obs-code.png" alt="Pyramids obs code"/>
|
||||
|
||||
|
||||
## The action space
|
||||
|
||||
The action space is **discrete** with four possible actions:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-action.png" alt="Pyramids Environment"/>
|
||||
57
units/en/unit5/snowball-target.mdx
Normal file
57
units/en/unit5/snowball-target.mdx
Normal file
@@ -0,0 +1,57 @@
|
||||
# The SnowballTarget Environment
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget.gif" alt="SnowballTarget"/>
|
||||
|
||||
SnowballTarget is an environment we created at Hugging Face using assets from [Kay Lousberg](https://kaylousberg.com/). We have an optional section at the end of this Unit **if you want to learn to use Unity and create your environments**.
|
||||
|
||||
## The agent's Goal
|
||||
|
||||
The first agent you're going to train is called Julien the bear 🐻. Julien is trained **to hit targets with snowballs**.
|
||||
|
||||
The Goal in this environment is for Julien to **hit as many targets as possible in the limited time** (1000 timesteps). To do that, it will need **to position itself correctly relative to the target and shoot**.
|
||||
|
||||
In addition, to avoid "snowball spamming" (aka shooting a snowball every timestep), **Julien has a "cool off" system** (it needs to wait 0.5 seconds after a shot to be able to shoot again).
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/cooloffsystem.gif" alt="Cool Off System"/>
|
||||
<figcaption>The agent needs to wait 0.5s before being able to shoot a snowball again</figcaption>
|
||||
</figure>
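The real logic lives in the environment's C# code, but the idea can be sketched as a simple cooldown timer (hypothetical, for illustration only):

```python
class CoolOff:
    """Illustrative cooldown timer: the agent may only shoot every `cooldown` seconds."""

    def __init__(self, cooldown: float = 0.5):
        self.cooldown = cooldown
        self.time_since_last_shot = cooldown  # allowed to shoot at the start

    def step(self, dt: float) -> None:
        self.time_since_last_shot += dt

    def try_shoot(self) -> bool:
        if self.time_since_last_shot >= self.cooldown:
            self.time_since_last_shot = 0.0
            return True   # snowball fired
        return False      # still cooling off
```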
|
||||
|
||||
## The reward function and the reward engineering problem
|
||||
|
||||
The reward function is simple. **The environment gives a +1 reward every time the agent's snowball hits a target**. Because the agent's Goal is to maximize the expected cumulative reward, **it will try to hit as many targets as possible**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_reward.png" alt="Reward system"/>
|
||||
|
||||
We could have a more complex reward function (with a penalty to push the agent to go faster, for example). But when you design an environment, you need to avoid the *reward engineering problem*, which is having a reward function that is too complex in order to force your agent to behave as you want it to.
|
||||
Why? Because by doing that, **you might miss interesting strategies that the agent will find with a simpler reward function**.
|
||||
|
||||
In terms of code, it looks like this:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget-reward-code.png" alt="Reward"/>
|
||||
|
||||
|
||||
## The observation space
|
||||
|
||||
Regarding observations, we don't use normal vision (frame), but **we use raycasts**.
|
||||
|
||||
Think of raycasts as lasers that will detect if they pass through an object.
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/raycasts.png" alt="Raycasts"/>
|
||||
<figcaption>Source: <a href="https://github.com/Unity-Technologies/ml-agents">ML-Agents documentation</a></figcaption>
|
||||
</figure>
|
||||
|
||||
|
||||
In this environment, our agent has multiple sets of raycasts:
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowball_target_raycasts.png" alt="Raycasts"/>
|
||||
|
||||
In addition to raycasts, the agent gets a "can I shoot" bool as observation.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget-obs-code.png" alt="Obs"/>
|
||||
|
||||
## The action space
|
||||
|
||||
The action space is discrete:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_action_space.png" alt="Action Space"/>
|
||||
@@ -6,5 +6,7 @@ You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to sp
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-cover.jpeg" alt="Huggy cover" width="100%">
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
### Keep Learning, Stay Awesome 🤗
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Introduction [[introduction]]
|
||||
|
||||
In this bonus unit, we'll reinforce what we learned in the first unit by teaching Huggy the Dog to fetch the stick and then [play with him directly in your browser](https://huggingface.co/spaces/ThomasSimonini/Huggy) 🐶
|
||||
In this bonus unit, we'll reinforce what we learned in the first unit by teaching Huggy the Dog to fetch the stick and then [play with him directly in your browser](https://singularite.itch.io/huggy) 🐶
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit2/thumbnail.png" alt="Unit bonus 1 thumbnail" width="100%">
|
||||
|
||||
|
||||
@@ -4,7 +4,7 @@ Now that you've trained Huggy and pushed it to the Hub. **You will be able to pl
|
||||
|
||||
For this step it’s simple:
|
||||
|
||||
- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
|
||||
- Open the game Huggy in your browser: https://singularite.itch.io/huggy
|
||||
|
||||
- Click on Play with my Huggy model
|
||||
|
||||
|
||||
@@ -236,7 +236,7 @@ But now comes the best: **being able to play with Huggy online 👀.**
|
||||
|
||||
This step is the simplest:
|
||||
|
||||
- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
|
||||
- Open the game Huggy in your browser: https://singularite.itch.io/huggy
|
||||
|
||||
- Click on Play with my Huggy model
|
||||
|
||||
|
||||
@@ -9,3 +9,8 @@ Now that you've learned to use Optuna, we give you some ideas to apply what you'
|
||||
By doing that, you're going to see how Optuna is valuable and powerful in training better agents,
|
||||
|
||||
Have fun,
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
|
||||