Merge branch 'main' into ThomasSimonini/CertificationAndNext

This commit is contained in:
Thomas Simonini
2023-02-28 09:33:38 +01:00
committed by GitHub
54 changed files with 4516 additions and 46 deletions

View File

@@ -1,6 +1,8 @@
# [The Hugging Face Deep Reinforcement Learning Course 🤗 (v2.0)](https://huggingface.co/deep-rl-course/unit0/introduction)
This repository contains the Deep Reinforcement Learning Course mdx files and notebooks. The website is here: https://huggingface.co/deep-rl-course/unit0/introduction?fw=pt
If you like the course, don't hesitate to **⭐ star this repository. This helps us 🤗**.
This repository contains the Deep Reinforcement Learning Course mdx files and notebooks. **The website is here**: https://huggingface.co/deep-rl-course/unit0/introduction?fw=pt
- The syllabus 📚: https://simoninithomas.github.io/deep-rl-course

View File

@@ -1099,7 +1099,7 @@
"\n",
"Take time to really **grasp the material before continuing and try the additional challenges**. It's important to master these elements and have a solid foundation.\n",
"\n",
"Naturally, during the course, we're going to use and explain these terms again in more depth, but **it's better to have a good understanding of them now before diving into the next chapters.**\n"
"Naturally, during the course, we're going to dive deeper into these concepts but **it's better to have a good understanding of them now before diving into the next chapters.**\n\n"
]
},
{

View File

@@ -511,7 +511,7 @@
},
"outputs": [],
"source": [
"# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros\n",
"# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros. np.zeros needs a tuple (a,b)\n",
"def initialize_q_table(state_space, action_space):\n",
" Qtable = \n",
" return Qtable"
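For reference, one possible completed version of this exercise (a sketch only; the course expects you to fill in the blank yourself) looks like this:

```python
import numpy as np

def initialize_q_table(state_space, action_space):
    # np.zeros needs the shape as a single tuple: (state_space, action_space)
    Qtable = np.zeros((state_space, action_space))
    return Qtable

# e.g. FrozenLake-v1 4x4 has 16 states and 4 actions
Qtable_frozenlake = initialize_q_table(16, 4)
print(Qtable_frozenlake.shape)  # (16, 4)
```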

View File

@@ -301,7 +301,7 @@
"## Train our Deep Q-Learning Agent to Play Space Invaders 👾\n",
"\n",
"To train an agent with RL-Baselines3-Zoo, we just need to do two things:\n",
"1. We define the hyperparameters in `rl-baselines3-zoo/hyperparams/dqn.yml`\n",
"1. We define the hyperparameters in `/content/rl-baselines3-zoo/hyperparams/dqn.yml`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/hyperparameters.png\" alt=\"DQN Hyperparameters\">\n"
]
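For illustration, an entry in `dqn.yml` is keyed by the environment id and lists Stable-Baselines3 hyperparameters. The values below are hypothetical placeholders to show the file's shape, not the tuned ones shipped with the zoo:

```yaml
# Illustrative dqn.yml entry (placeholder values, not the zoo's tuned defaults)
SpaceInvadersNoFrameskip-v4:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
  frame_stack: 4
  policy: 'CnnPolicy'
  n_timesteps: !!float 1e7
  buffer_size: 100000
  learning_rate: !!float 1e-4
  batch_size: 32
  exploration_fraction: 0.1
  exploration_final_eps: 0.01
```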

View File

@@ -431,7 +431,7 @@
"source": [
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)"
"env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)"
],
"metadata": {
"id": "2O67mqgC-hol"
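Setting `norm_reward=True` makes `VecNormalize` rescale rewards using running statistics, just as it already normalizes observations (and clips them at `clip_obs=10.`). A minimal plain-Python sketch of the running-statistics idea (an illustration of the mechanism, not SB3's actual implementation):

```python
class RunningNormalizer:
    """Tracks a running mean/variance (Welford's algorithm) and normalizes
    values, mimicking the idea behind VecNormalize's obs/reward scaling."""
    def __init__(self, eps=1e-8, clip=10.0):
        self.mean, self.m2, self.count = 0.0, 0.0, 0
        self.eps, self.clip = eps, clip

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count, 1)
        z = (x - self.mean) / ((var + self.eps) ** 0.5)
        return max(-self.clip, min(self.clip, z))  # clip like clip_obs=10.

norm = RunningNormalizer()
for r in [1.0, 2.0, 3.0, 4.0]:
    norm.update(r)
print(norm.mean)  # 2.5
```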

File diff suppressed because it is too large

View File

@@ -0,0 +1,680 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit8/unit8_part2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OVx1gdg9wt9t"
},
"source": [
"# Unit 8 Part 2: Advanced Deep Reinforcement Learning. Using Sample Factory to play Doom from pixels\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail2.png\" alt=\"Thumbnail\"/>\n",
"\n",
"In this notebook, we will learn how to train a Deep Neural Network to collect objects in a 3D environment based on the game of Doom; a video of the resulting policy is shown below. We train this policy using [Sample Factory](https://www.samplefactory.dev/), an asynchronous implementation of the PPO algorithm.\n",
"\n",
"Please note the following points:\n",
"\n",
"* [Sample Factory](https://www.samplefactory.dev/) is an advanced RL framework and **only functions on Linux and Mac** (not Windows).\n",
"\n",
"* The framework performs best on a **GPU machine with many CPU cores**, where it can achieve speeds of 100k interactions per second. The resources available on a standard Colab notebook **limit the performance of this library**. So the speed in this setting **does not reflect the real-world performance**.\n",
"* Benchmarks for Sample Factory are available in a number of settings, check out the [examples](https://github.com/alex-petrenko/sample-factory/tree/master/sf_examples) if you want to find out more.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "I6_67HfI1CKg"
},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"\n",
"HTML('''<video width=\"640\" height=\"480\" controls>\n",
" <source src=\"https://huggingface.co/edbeeching/doom_health_gathering_supreme_3333/resolve/main/replay.mp4\" \n",
" type=\"video/mp4\">Your browser does not support the video tag.</video>'''\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DgHRAsYEXdyw"
},
"source": [
"To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push one model:\n",
"\n",
"- `doom_health_gathering_supreme`: get a result of >= 5.\n",
"\n",
"To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model. **The result = mean_reward - std of reward**\n",
"\n",
"If you don't find your model, **go to the bottom of the page and click on the refresh button**.\n",
"\n",
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
]
},
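As a quick worked example of how that score is computed (with hypothetical episode rewards; the leaderboard computes this from your model's evaluation episodes):

```python
from statistics import mean, pstdev

# Hypothetical episode rewards collected during evaluation
episode_rewards = [8.0, 10.0, 12.0, 10.0]

mean_reward = mean(episode_rewards)   # 10.0
std_reward = pstdev(episode_rewards)  # population standard deviation
result = mean_reward - std_reward
print(round(result, 2))  # 8.59 -> above the 5.0 threshold
```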
{
"cell_type": "markdown",
"metadata": {
"id": "PU4FVzaoM6fC"
},
"source": [
"## Set the GPU 💪\n",
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KV0NyFdQM9ZG"
},
"source": [
"- `Hardware Accelerator > GPU`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-fSy5HzUcMWB"
},
"source": [
"Before starting to train our agent, let's **study the library and environments we're going to use**.\n",
"\n",
"## Sample Factory\n",
"\n",
"[Sample Factory](https://www.samplefactory.dev/) is one of the **fastest RL libraries focused on very efficient synchronous and asynchronous implementations of policy gradients (PPO)**.\n",
"\n",
"Sample Factory is thoroughly **tested, used by many researchers and practitioners**, and is actively maintained. Our implementation is known to **reach SOTA performance in a variety of domains while minimizing RL experiment training time and hardware requirements**.\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/samplefactoryenvs.png\" alt=\"Sample factory\"/>\n",
"\n",
"\n",
"\n",
"### Key features\n",
"\n",
"- Highly optimized algorithm [architecture](https://www.samplefactory.dev/06-architecture/overview/) for maximum learning throughput\n",
"- [Synchronous and asynchronous](https://www.samplefactory.dev/07-advanced-topics/sync-async/) training regimes\n",
"- [Serial (single-process) mode](https://www.samplefactory.dev/07-advanced-topics/serial-mode/) for easy debugging\n",
"- Optimal performance in both CPU-based and [GPU-accelerated environments](https://www.samplefactory.dev/09-environment-integrations/isaacgym/)\n",
"- Single- & multi-agent training, self-play, supports [training multiple policies](https://www.samplefactory.dev/07-advanced-topics/multi-policy-training/) at once on one or many GPUs\n",
"- Population-Based Training ([PBT](https://www.samplefactory.dev/07-advanced-topics/pbt/))\n",
"- Discrete, continuous, hybrid action spaces\n",
"- Vector-based, image-based, dictionary observation spaces\n",
"- Automatically creates a model architecture by parsing action/observation space specification. Supports [custom model architectures](https://www.samplefactory.dev/03-customization/custom-models/)\n",
"- Designed to be imported into other projects, [custom environments](https://www.samplefactory.dev/03-customization/custom-environments/) are first-class citizens\n",
"- Detailed [WandB and Tensorboard summaries](https://www.samplefactory.dev/05-monitoring/metrics-reference/), [custom metrics](https://www.samplefactory.dev/05-monitoring/custom-metrics/)\n",
"- [HuggingFace 🤗 integration](https://www.samplefactory.dev/10-huggingface/huggingface/) (upload trained models and metrics to the Hub)\n",
"- [Multiple](https://www.samplefactory.dev/09-environment-integrations/mujoco/) [example](https://www.samplefactory.dev/09-environment-integrations/atari/) [environment](https://www.samplefactory.dev/09-environment-integrations/vizdoom/) [integrations](https://www.samplefactory.dev/09-environment-integrations/dmlab/) with tuned parameters and trained models\n",
"\n",
"All of the above policies are available on the 🤗 hub. Search for the tag [sample-factory](https://huggingface.co/models?library=sample-factory&sort=downloads)\n",
"\n",
"### How sample-factory works\n",
"\n",
"Sample-factory is one of the **most highly optimized RL implementations available to the community**. \n",
"\n",
"It works by **spawning multiple processes that run rollout workers, inference workers and a learner worker**. \n",
"\n",
"The *workers* **communicate through shared memory, which lowers the communication cost between processes**. \n",
"\n",
"The *rollout workers* interact with the environment and send observations to the *inference workers*. \n",
"\n",
"The *inference workers* query a fixed version of the policy and **send actions back to the rollout worker**. \n",
"\n",
"After *k* steps, the rollout workers send a trajectory of experience to the learner worker, **which it uses to update the agent's policy network**.\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/samplefactory.png\" alt=\"Sample factory\"/>"
]
},
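The rollout → inference → learner data flow described above can be sketched with plain queues standing in for the shared-memory channels (an illustration of the pattern only, not Sample Factory's actual multi-process implementation):

```python
import queue
import random
import threading

# Toy stand-ins for the shared-memory channels between workers
obs_q, act_q, traj_q = queue.Queue(), queue.Queue(), queue.Queue()
policy_version = 0  # stands in for the learner's policy parameters

def rollout_worker(k=4):
    """Interacts with a toy environment; after k steps, ships the trajectory."""
    trajectory, obs = [], 0.0
    for _ in range(k):
        obs_q.put(obs)            # send observation to the inference worker
        action = act_q.get()      # receive the chosen action back
        next_obs = obs + action   # toy environment transition
        trajectory.append((obs, action, next_obs))
        obs = next_obs
    traj_q.put(trajectory)        # send experience to the learner worker

def inference_worker(k=4):
    """Queries a fixed version of the policy and sends actions back."""
    for _ in range(k):
        obs_q.get()
        act_q.put(random.choice([-1.0, 1.0]))  # toy stochastic policy

def learner_worker():
    """Consumes a trajectory and 'updates' the policy."""
    global policy_version
    trajectory = traj_q.get()
    policy_version += 1           # stands in for a gradient update
    return len(trajectory)

t = threading.Thread(target=inference_worker)
t.start()
rollout_worker()
t.join()
n = learner_worker()
print(n, policy_version)  # 4 transitions consumed, policy updated once
```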
{
"cell_type": "markdown",
"metadata": {
"id": "nB68Eb9UgC94"
},
"source": [
"### Actor Critic models in Sample-factory\n",
"\n",
"Actor Critic models in Sample Factory are composed of three components:\n",
"\n",
"- **Encoder** - Process input observations (images, vectors) and map them to a vector. This is the part of the model you will most likely want to customize.\n",
"- **Core** - Integrate vectors from one or more encoders; can optionally include a single- or multi-layer LSTM/GRU in a memory-based agent.\n",
"- **Decoder** - Apply additional layers to the output of the model core before computing the policy and value outputs.\n",
"\n",
"The library has been designed to automatically support any observation and action spaces. Users can easily add their custom models. You can find out more in the [documentation](https://www.samplefactory.dev/03-customization/custom-models/#actor-critic-models-in-sample-factory)."
]
},
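The three-component structure can be sketched as a simple composition (plain-Python stand-ins with toy arithmetic; Sample Factory's real components are PyTorch modules):

```python
class Encoder:
    """Maps a raw observation (here, a list of pixel values) to a feature vector."""
    def __call__(self, obs):
        return [sum(obs) / len(obs), max(obs)]  # toy 2-d feature vector

class Core:
    """Integrates encoder outputs; a real core may wrap an LSTM/GRU for memory."""
    def __call__(self, features):
        return [f * 0.5 for f in features]

class Decoder:
    """Extra layers before the policy (actor) and value (critic) outputs."""
    def __call__(self, core_out):
        policy_logits = [core_out[0], -core_out[0]]
        value = core_out[1]
        return policy_logits, value

encoder, core, decoder = Encoder(), Core(), Decoder()
logits, value = decoder(core(encoder([0.0, 0.5, 1.0])))
print(logits, value)  # [0.25, -0.25] 0.5
```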
{
"cell_type": "markdown",
"metadata": {
"id": "ez5UhUtYcWXF"
},
"source": [
"## ViZDoom\n",
"\n",
"[ViZDoom](https://vizdoom.cs.put.edu.pl/) is an **open-source Python interface for the Doom Engine**. \n",
"\n",
"The library was created in 2016 by Marek Wydmuch and Michal Kempka at the Institute of Computing Science, Poznan University of Technology, Poland. \n",
"\n",
"The library enables the **training of agents directly from the screen pixels in a number of scenarios**, including team deathmatch, shown in the video below. Because the ViZDoom environment is based on a game that was created in the 90s, it can be run on modern hardware at accelerated speeds, **allowing us to learn complex AI behaviors fairly quickly**.\n",
"\n",
"The library includes features such as:\n",
"\n",
"- Multi-platform (Linux, macOS, Windows),\n",
"- API for Python and C++,\n",
"- [OpenAI Gym](https://www.gymlibrary.dev/) environment wrappers\n",
"- Easy-to-create custom scenarios (visual editors, scripting language, and examples available),\n",
"- Async and sync single-player and multiplayer modes,\n",
"- Lightweight (few MBs) and fast (up to 7000 fps in sync mode, single-threaded),\n",
"- Customizable resolution and rendering parameters,\n",
"- Access to the depth buffer (3D vision),\n",
"- Automatic labeling of game objects visible in the frame,\n",
"- Access to the audio buffer\n",
"- Access to the list of actors/objects and map geometry,\n",
"- Off-screen rendering and episode recording,\n",
"- Time scaling in async mode."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wAMwza0d5QVj"
},
"source": [
"## We first need to install some dependencies that are required for the ViZDoom environment\n",
"\n",
"Now that our Colab runtime is set up, we can start by installing the dependencies required to run ViZDoom on Linux. \n",
"\n",
"If you are following along on your own Mac machine, you will want to follow the installation instructions on the [github page](https://github.com/Farama-Foundation/ViZDoom/blob/master/doc/Quickstart.md#-quickstart-for-macos-and-anaconda3-python-36)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RJMxkaldwIVx"
},
"outputs": [],
"source": [
"%%capture\n",
"%%bash\n",
"# Install ViZDoom deps from \n",
"# https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md#-linux\n",
"\n",
"apt-get install build-essential zlib1g-dev libsdl2-dev libjpeg-dev \\\n",
"nasm tar libbz2-dev libgtk2.0-dev cmake git libfluidsynth-dev libgme-dev \\\n",
"libopenal-dev timidity libwildmidi-dev unzip ffmpeg\n",
"\n",
"# Boost libraries\n",
"apt-get install libboost-all-dev\n",
"\n",
"# Lua binding dependencies\n",
"apt-get install liblua5.1-dev"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JT4att2c57MW"
},
"source": [
"## Then we can install Sample Factory and ViZDoom\n",
"- This can take around 7 minutes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bbqfPZnIsvA6"
},
"outputs": [],
"source": [
"# install python libraries\n",
"# thanks toinsson\n",
"!pip install sample-factory==2.0.2\n",
"!pip install faster-fifo==1.4.2\n",
"!pip install vizdoom"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1jizouGpghUZ"
},
"source": [
"## Setting up the Doom Environment in sample-factory"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bCgZbeiavcDU"
},
"outputs": [],
"source": [
"import functools\n",
"\n",
"from sample_factory.algo.utils.context import global_model_factory\n",
"from sample_factory.cfg.arguments import parse_full_cfg, parse_sf_args\n",
"from sample_factory.envs.env_utils import register_env\n",
"from sample_factory.train import run_rl\n",
"\n",
"from sf_examples.vizdoom.doom.doom_model import make_vizdoom_encoder\n",
"from sf_examples.vizdoom.doom.doom_params import add_doom_env_args, doom_override_defaults\n",
"from sf_examples.vizdoom.doom.doom_utils import DOOM_ENVS, make_doom_env_from_spec\n",
"\n",
"\n",
"# Registers all the ViZDoom environments\n",
"def register_vizdoom_envs():\n",
" for env_spec in DOOM_ENVS:\n",
" make_env_func = functools.partial(make_doom_env_from_spec, env_spec)\n",
" register_env(env_spec.name, make_env_func)\n",
"\n",
"# Sample Factory allows the registration of a custom Neural Network architecture\n",
"# See https://github.com/alex-petrenko/sample-factory/blob/master/sf_examples/vizdoom/doom/doom_model.py for more details\n",
"def register_vizdoom_models():\n",
" global_model_factory().register_encoder_factory(make_vizdoom_encoder)\n",
"\n",
"\n",
"def register_vizdoom_components():\n",
" register_vizdoom_envs()\n",
" register_vizdoom_models()\n",
"\n",
"# parse the command line args and create a config\n",
"def parse_vizdoom_cfg(argv=None, evaluation=False):\n",
" parser, _ = parse_sf_args(argv=argv, evaluation=evaluation)\n",
" # parameters specific to Doom envs\n",
" add_doom_env_args(parser)\n",
" # override Doom default values for algo parameters\n",
" doom_override_defaults(parser)\n",
" # second parsing pass yields the final configuration\n",
" final_cfg = parse_full_cfg(parser, argv)\n",
" return final_cfg"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sgRy6wnrgnij"
},
"source": [
"Now that the setup is complete, we can train the agent. We have chosen here to learn a ViZDoom task called `Health Gathering Supreme`.\n",
"\n",
"### The scenario: Health Gathering Supreme\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/Health-Gathering-Supreme.png\" alt=\"Health-Gathering-Supreme\"/>\n",
"\n",
"\n",
"\n",
"The objective of this scenario is to **teach the agent how to survive without knowing what makes it survive**. The agent knows only that **life is precious** and death is bad, so **it must learn what prolongs its existence and that its health is connected with it**.\n",
"\n",
"The map is a rectangle containing walls, with a green, acidic floor which **hurts the player periodically**. Initially, some medkits are spread uniformly over the map, and a new medkit falls from the sky every now and then. **Medkits heal some portion of the player's health**; to survive, the agent needs to pick them up. The episode finishes after the player's death or on timeout.\n",
"\n",
"Further configuration:\n",
"- Living_reward = 1\n",
"- 3 available buttons: turn left, turn right, move forward\n",
"- 1 available game variable: HEALTH\n",
"- death penalty = 100\n",
"\n",
"You can find out more about the scenarios available in ViZDoom [here](https://github.com/Farama-Foundation/ViZDoom/tree/master/scenarios). \n",
"\n",
"There are also a number of more complex scenarios that have been created for ViZDoom, such as the ones detailed on [this github page](https://github.com/edbeeching/3d_control_deep_rl).\n",
"\n"
]
},
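The configuration listed above maps onto a ViZDoom scenario `.cfg` file. An illustrative sketch follows (field names follow ViZDoom's config format; the exact contents of the shipped scenario file may differ):

```
# Illustrative ViZDoom scenario config (not the shipped file verbatim)
doom_scenario_path = health_gathering_supreme.wad

living_reward = 1
death_penalty = 100

available_buttons = { TURN_LEFT TURN_RIGHT MOVE_FORWARD }
available_game_variables = { HEALTH }

mode = PLAYER
```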
{
"cell_type": "markdown",
"metadata": {
"id": "siHZZ34DiZEp"
},
"source": [
"## Training the agent\n",
"- We're going to train the agent for 4,000,000 steps; this will take approximately 20 minutes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "y_TeicMvyKHP"
},
"outputs": [],
"source": [
"## Start the training, this should take around 15 minutes\n",
"register_vizdoom_components()\n",
"\n",
"# The scenario we train on today is health gathering\n",
"# other scenarios include \"doom_basic\", \"doom_two_colors_easy\", \"doom_dm\", \"doom_dwango5\", \"doom_my_way_home\", \"doom_deadly_corridor\", \"doom_defend_the_center\", \"doom_defend_the_line\"\n",
"env = \"doom_health_gathering_supreme\"\n",
"cfg = parse_vizdoom_cfg(argv=[f\"--env={env}\", \"--num_workers=8\", \"--num_envs_per_worker=4\", \"--train_for_env_steps=4000000\"])\n",
"\n",
"status = run_rl(cfg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5L0nBS9e_jqC"
},
"source": [
"## Let's take a look at the performance of the trained policy and output a video of the agent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MGSA4Kg5_i0j"
},
"outputs": [],
"source": [
"from sample_factory.enjoy import enjoy\n",
"cfg = parse_vizdoom_cfg(argv=[f\"--env={env}\", \"--num_workers=1\", \"--save_video\", \"--no_render\", \"--max_num_episodes=10\"], evaluation=True)\n",
"status = enjoy(cfg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Lj5L1x0WLxwB"
},
"source": [
"## Now let's visualize the performance of the agent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WsXhBY7JNOdJ"
},
"outputs": [],
"source": [
"from base64 import b64encode\n",
"from IPython.display import HTML\n",
"\n",
"mp4 = open('/content/train_dir/default_experiment/replay.mp4','rb').read()\n",
"data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
"HTML(\"\"\"\n",
"<video width=640 controls>\n",
" <source src=\"%s\" type=\"video/mp4\">\n",
"</video>\n",
"\"\"\" % data_url)"
]
},
{
"cell_type": "markdown",
"source": [
"The agent has learned something, but its performance could be better. We would clearly need to train for longer. But let's upload this model to the Hub."
],
"metadata": {
"id": "2A4pf_1VwPqR"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "CSQVWF0kNuy9"
},
"source": [
"## Now let's upload your checkpoint and video to the Hugging Face Hub\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JquRrWytA6eo"
},
"source": [
"To be able to share your model with the community there are three more steps to follow:\n",
"\n",
"1⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join\n",
"\n",
"2⃣ Sign in and then store your authentication token from the Hugging Face website.\n",
"- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
"\n",
"- Copy the token \n",
"- Run the cell below and paste the token"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_tsf2uv0g_4p"
},
"source": [
"If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GoQm_jYSOts0"
},
"outputs": [],
"source": [
"from huggingface_hub import notebook_login\n",
"notebook_login()\n",
"!git config --global credential.helper store"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sEawW_i0OvJV"
},
"outputs": [],
"source": [
"from sample_factory.enjoy import enjoy\n",
"\n",
"hf_username = \"ThomasSimonini\" # insert your HuggingFace username here\n",
"\n",
"cfg = parse_vizdoom_cfg(argv=[f\"--env={env}\", \"--num_workers=1\", \"--save_video\", \"--no_render\", \"--max_num_episodes=10\", \"--max_num_frames=100000\", \"--push_to_hub\", f\"--hf_repository={hf_username}/rl_course_vizdoom_health_gathering_supreme\"], evaluation=True)\n",
"status = enjoy(cfg)"
]
},
{
"cell_type": "markdown",
"source": [
"## Let's load another model\n",
"\n",
"\n"
],
"metadata": {
"id": "9PzeXx-qxVvw"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "mHZAWSgL5F7P"
},
"source": [
"This agent's performance was good, but we can do better! Let's download and visualize an agent trained for 10B timesteps from the hub."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ud6DwAUl5S-l"
},
"outputs": [],
"source": [
"# Download the agent from the hub\n",
"!python -m sample_factory.huggingface.load_from_hub -r edbeeching/doom_health_gathering_supreme_2222 -d ./train_dir\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qoUJhL6x6sY5"
},
"outputs": [],
"source": [
"!ls train_dir/doom_health_gathering_supreme_2222"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lZskc8LG8qr8"
},
"outputs": [],
"source": [
"env = \"doom_health_gathering_supreme\"\n",
"cfg = parse_vizdoom_cfg(argv=[f\"--env={env}\", \"--num_workers=1\", \"--save_video\", \"--no_render\", \"--max_num_episodes=10\", \"--experiment=doom_health_gathering_supreme_2222\", \"--train_dir=train_dir\"], evaluation=True)\n",
"status = enjoy(cfg)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BtzXBoj65Wmq"
},
"outputs": [],
"source": [
"mp4 = open('/content/train_dir/doom_health_gathering_supreme_2222/replay.mp4','rb').read()\n",
"data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
"HTML(\"\"\"\n",
"<video width=640 controls>\n",
" <source src=\"%s\" type=\"video/mp4\">\n",
"</video>\n",
"\"\"\" % data_url)"
]
},
{
"cell_type": "markdown",
"source": [
"## Some additional challenges 🏆: Doom Deathmatch\n",
"\n",
"Training an agent to play a Doom deathmatch **takes many hours on a more beefy machine than is available in Colab**. \n",
"\n",
"Fortunately, we have **already trained an agent in this scenario and it is available in the 🤗 Hub!** Let's download the model and visualize the agent's performance."
],
"metadata": {
"id": "ie5YWC3NyKO8"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fq3WFeus81iI"
},
"outputs": [],
"source": [
"# Download the agent from the hub\n",
"!python -m sample_factory.huggingface.load_from_hub -r edbeeching/doom_deathmatch_bots_2222 -d ./train_dir"
]
},
{
"cell_type": "markdown",
"source": [
"Given that the agent plays for a long time, the video generation can take **10 minutes**."
],
"metadata": {
"id": "7AX_LwxR2FQ0"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0hq6XL__85Bv"
},
"outputs": [],
"source": [
"\n",
"from sample_factory.enjoy import enjoy\n",
"register_vizdoom_components()\n",
"env = \"doom_deathmatch_bots\"\n",
"cfg = parse_vizdoom_cfg(argv=[f\"--env={env}\", \"--num_workers=1\", \"--save_video\", \"--no_render\", \"--max_num_episodes=1\", \"--experiment=doom_deathmatch_bots_2222\", \"--train_dir=train_dir\"], evaluation=True)\n",
"status = enjoy(cfg)\n",
"mp4 = open('/content/train_dir/doom_deathmatch_bots_2222/replay.mp4','rb').read()\n",
"data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
"HTML(\"\"\"\n",
"<video width=640 controls>\n",
" <source src=\"%s\" type=\"video/mp4\">\n",
"</video>\n",
"\"\"\" % data_url)"
]
},
{
"cell_type": "markdown",
"source": [
"\n",
"You **can try to train your agent in this environment** using the code above, but not on Colab.\n",
"**Good luck 🤞**"
],
"metadata": {
"id": "N6mEC-4zyihx"
}
},
{
"cell_type": "markdown",
"source": [
"If you prefer an easier scenario, **why not try training in another ViZDoom scenario such as `doom_deadly_corridor` or `doom_defend_the_center`.**\n",
"\n",
"\n",
"\n",
"---\n",
"\n",
"\n",
"This concludes the last unit. But we are not finished yet! 🤗 The following **bonus sections include some of the most interesting, advanced, and cutting-edge work in Deep Reinforcement Learning**.\n",
"\n",
"## Keep learning, stay awesome 🤗"
],
"metadata": {
"id": "YnDAngN6zeeI"
}
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"provenance": [],
"collapsed_sections": [
"PU4FVzaoM6fC",
"nB68Eb9UgC94",
"ez5UhUtYcWXF",
"sgRy6wnrgnij"
],
"private_outputs": true,
"include_colab_link": true
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@@ -1,4 +1,8 @@
# Unit 1: Introduction to Deep Reinforcement Learning 🚀
# DEPRECATED, THE NEW UNIT 1 IS HERE: https://huggingface.co/deep-rl-course/unit1/introduction
**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit1/introduction
# Unit 1: Introduction to Deep Reinforcement Learning 🚀 (DEPRECATED)
![cover](assets/img/thumbnail.png)

View File

@@ -16,6 +16,11 @@
"id": "njb_ProuHiOe"
},
"source": [
"# DEPRECATED NOTEBOOK, THE NEW UNIT 1 IS HERE: https://huggingface.co/deep-rl-course/unit1/introduction",
"\n",
"**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit1/introduction",
"\n",
"\n",
"# Unit 1: Train your first Deep Reinforcement Learning Agent 🚀\n",
"![Cover](https://github.com/huggingface/deep-rl-class/blob/main/unit1/assets/img/thumbnail.png?raw=true)\n",
"\n",

View File

@@ -1,3 +1,7 @@
# DEPRECATED UNIT, THE NEW UNIT 2 IS HERE: https://huggingface.co/deep-rl-course/unit2/introduction
**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit2/introduction
# Unit 2: Introduction to Q-Learning
In this Unit, we're going to dive deeper into one of the Reinforcement Learning methods: value-based methods and **study our first RL algorithm: Q-Learning**.
@@ -70,7 +74,7 @@ You can work directly **with the colab notebook, which allows you not to have to
- To dive deeper on Monte Carlo and Temporal Difference Learning:
- [Why do temporal difference (TD) methods have lower variance than Monte Carlo methods?](https://stats.stackexchange.com/questions/355820/why-do-temporal-difference-td-methods-have-lower-variance-than-monte-carlo-met)
- [When are Monte Carlo methods preferred over temporal difference ones?](https://stats.stackexchange.com/questions/336974/when-are-monte-carlo-methods-preferred-over-temporal-difference-ones)
## How to make the most of this course
To make the most of the course, my advice is to:

View File

@@ -16,6 +16,11 @@
"id": "njb_ProuHiOe"
},
"source": [
"# DEPRECATED NOTEBOOK, THE NEW UNIT 2 IS HERE: https://huggingface.co/deep-rl-course/unit2/introduction",
"\n",
"**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit2/introduction",
"\n",
"\n",
"# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕\n",
"\n",
"In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations\n",

View File

@@ -1,6 +1,10 @@
# DEPRECATED, THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit3/introduction
**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit3/introduction
# Unit 3: Deep Q-Learning with Atari Games 👾
In this Unit, **we'll study our first Deep Reinforcement Learning agent**: Deep Q-Learning.
In this Unit, **we'll study our first Deep Reinforcement Learning agent**: Deep Q-Learning.
And **we'll train it to play Space Invaders and other Atari environments using [RL-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)**, a training framework for RL using Stable-Baselines that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results, and recording videos.

View File

@@ -16,6 +16,11 @@
"id": "k7xBVPzoXxOg"
},
"source": [
"# DEPRECATED NOTEBOOK, THE NEW UNIT 3 IS HERE: https://huggingface.co/deep-rl-course/unit3/introduction",
"\n",
"**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit3/introduction",
"\n",
"\n",
"# Unit 3: Deep Q-Learning with Atari Games 👾 using RL Baselines3 Zoo\n",
"\n",
"In this notebook, **you'll train a Deep Q-Learning agent** playing Space Invaders using [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework based on [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.\n",

View File

@@ -1,3 +1,7 @@
# DEPRECATED, THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit5/introduction
**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit5/introduction
# Unit 4: An Introduction to Unity MLAgents with Hugging Face 🤗
![cover](https://miro.medium.com/max/1400/1*8DV9EFl-vdijvcTHilHuEw.png)

View File

@@ -16,6 +16,11 @@
"id": "2D3NL_e4crQv"
},
"source": [
"# DEPRECATED NOTEBOOK, THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit5/introduction",
"\n",
"**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit5/introduction",
"\n",
"\n",
"# Unit 4: Let's learn about Unity ML-Agents with Hugging Face 🤗\n",
"\n"
]
@@ -561,4 +566,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}

View File

@@ -1,6 +1,10 @@
# Unit 5: Policy Gradient with PyTorch
# DEPRECATED, THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit4/introduction
**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit4/introduction
In this Unit, **we'll study Policy Gradient Methods**.
# Unit 5: Policy Gradient with PyTorch
In this Unit, **we'll study Policy Gradient Methods**.
And we'll **implement Reinforce (a policy gradient method) from scratch using PyTorch**, before testing its robustness on CartPole-v1, PixelCopter, and Pong.

View File

@@ -16,6 +16,11 @@
"id": "CjRWziAVU2lZ"
},
"source": [
"# DEPRECIATED NOTEBOOK, THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit4/introduction",
"\n",
"**Everything under is depreciated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit4/introduction",
"\n",
"\n",
"# Unit 5: Code your first Deep Reinforcement Learning Algorithm with PyTorch: Reinforce. And test its robustness 💪\n",
"In this notebook, you'll code your first Deep Reinforcement Learning algorithm from scratch: Reinforce (also called Monte Carlo Policy Gradient).\n",
"\n",

View File

@@ -1,3 +1,7 @@
# DEPRECATED: THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit6/introduction
**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit6/introduction
# Unit 7: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet 🤖
One of the major industries that use Reinforcement Learning is robotics. Unfortunately, **having access to robot equipment is very expensive**. Fortunately, some simulations exist to train Robots:
@@ -32,7 +36,7 @@ Thanks to a leaderboard, you'll be able to compare your results with other class
The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
## Additional readings 📚
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
- [Foundations of Deep RL Series, L3 Policy Gradients and Advantage Estimation by Pieter Abbeel](https://youtu.be/AKbX1Zvo7r8)

View File

@@ -34,6 +34,11 @@
{
"cell_type": "markdown",
"source": [
"# DEPRECIATED NOTEBOOK, THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit6/introduction",
"\n",
"**Everything under is depreciated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit6/introduction",
"\n",
"\n",
"# Unit 7: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet 🤖\n",
"In this small notebook you'll learn to use A2C with PyBullet. And train an agent to walk. More precisely a spider (they say Ant but come on... it's a spider 😆) 🕸️\n",
"\n",
@@ -533,4 +538,4 @@
}
}
]
}
}

View File

@@ -1,3 +1,7 @@
# DEPRECATED: THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit8/introduction
**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit8/introduction
# Unit 8: Proximal Policy Optimization (PPO) with PyTorch
Today we'll learn about Proximal Policy Optimization (PPO), an architecture that improves our agent's training stability by avoiding policy updates that are too large. To do that, we use a ratio that indicates the difference between our current and old policies and clip this ratio to a specific range $[1 - \epsilon, 1 + \epsilon]$. Doing this ensures that our policy update will not be too large and that training is more stable.
@@ -29,7 +33,7 @@ Thanks to a leaderboard, you'll be able to compare your results with other class
The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
## Additional readings 📚
- [Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization by Daniel Bick](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf)
- [Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization by Daniel Bick](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf)
- [What is the way to understand Proximal Policy Optimization Algorithm in RL?](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl)
- [Foundations of Deep RL Series, L4 TRPO and PPO by Pieter Abbeel](https://youtu.be/KjWF8VIMGiY)
- [OpenAI PPO Blogpost](https://openai.com/blog/openai-baselines-ppo/)

View File

@@ -16,6 +16,11 @@
"id": "-cf5-oDPjwf8"
},
"source": [
"# DEPRECIATED NOTEBOOK, THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unit8/introduction",
"\n",
"**Everything under is depreciated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unit8/introduction",
"\n",
"\n",
"# Unit 8: Proximal Policy Gradient (PPO) with PyTorch 🤖\n",
"\n",
"In this unit, you'll learn to **code your PPO agent from scratch with PyTorch**.\n",

View File

@@ -1,8 +1,13 @@
# DEPRECATED: THE NEW VERSION OF THIS UNIT IS HERE: https://huggingface.co/deep-rl-course/unitbonus3/decision-transformers
**Everything below is deprecated** 👇, the new version of the course is here: https://huggingface.co/deep-rl-course/unitbonus3/decision-transformers
# Unit 9: Decision Transformers and offline Reinforcement Learning 🤖
![cover](assets/img/thumbnail.gif)
In this Unit, you'll learn what Decision Transformers and Offline Reinforcement Learning are. Then, you'll train your first Offline Decision Transformer model from scratch to make a half-cheetah run.
In this Unit, you'll learn what Decision Transformers and Offline Reinforcement Learning are. Then, you'll train your first Offline Decision Transformer model from scratch to make a half-cheetah run.
This course is **self-paced**, you can start whenever you want.
@@ -18,12 +23,12 @@ Here are the steps for this Unit:
2⃣ 👩‍💻 Then dive into the first hands-on.
👩‍💻 The hands-on 👉 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1K3UuajwoPY1MzRKNkONNRS3gS5DxZ-qF?usp=sharing)
3⃣ 📖 Read [Train your first Decision Transformer](https://huggingface.co/blog/train-decision-transformers)
4⃣ 👩‍💻 Then dive into the hands-on, where **you'll train your first Offline Decision Transformer model from scratch to make a half-cheetah run**.
4⃣ 👩‍💻 Then dive into the hands-on, where **you'll train your first Offline Decision Transformer model from scratch to make a half-cheetah run**.
👩‍💻 The hands-on 👉 https://github.com/huggingface/blog/blob/main/notebooks/101_train-decision-transformers.ipynb
## How to make the most of this course
To make the most of the course, my advice is to:

View File

@@ -178,6 +178,52 @@
title: Conclusion
- local: unit7/additional-readings
title: Additional Readings
- title: Unit 8. Part 1 Proximal Policy Optimization (PPO)
sections:
- local: unit8/introduction
title: Introduction
- local: unit8/intuition-behind-ppo
title: The intuition behind PPO
- local: unit8/clipped-surrogate-objective
title: Introducing the Clipped Surrogate Objective Function
- local: unit8/visualize
title: Visualize the Clipped Surrogate Objective Function
- local: unit8/hands-on-cleanrl
title: PPO with CleanRL
- local: unit8/conclusion
title: Conclusion
- local: unit8/additional-readings
title: Additional Readings
- title: Unit 8. Part 2 Proximal Policy Optimization (PPO) with Doom
sections:
- local: unit8/introduction-sf
title: Introduction
- local: unit8/hands-on-sf
title: PPO with Sample Factory and Doom
- local: unit8/conclusion-sf
title: Conclusion
- title: Bonus Unit 3. Advanced Topics in Reinforcement Learning
sections:
- local: unitbonus3/introduction
title: Introduction
- local: unitbonus3/model-based
title: Model-Based Reinforcement Learning
- local: unitbonus3/offline-online
title: Offline vs. Online Reinforcement Learning
- local: unitbonus3/rlhf
title: Reinforcement Learning from Human Feedback
- local: unitbonus3/decision-transformers
title: Decision Transformers and Offline RL
- local: unitbonus3/language-models
title: Language models in RL
- local: unitbonus3/curriculum-learning
title: (Automatic) Curriculum Learning for RL
- local: unitbonus3/envs-to-try
title: Interesting environments to try
- local: unitbonus3/godotrl
title: An Introduction to Godot RL
- local: unitbonus3/rl-documentation
title: Brief introduction to RL documentation
- title: Certification and congratulations
sections:
- local: communication/conclusion

View File

@@ -9,11 +9,11 @@ Discord is a free chat platform. If you've used Slack, **it's quite similar**. T
Starting in Discord can be a bit intimidating, so let me take you through it.
When you sign up for our Discord server, you'll need to specify which topics you're interested in by **clicking #role-assignment on the left**.
When you sign up for our Discord server, you'll need to specify which topics you're interested in by **clicking #role-assignment on the left**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord1.jpg" alt="Discord"/>
In #role-assignment, you can pick different categories. Make sure to **click "Reinforcement Learning"**. You'll then get to **introduce yourself in the `#introduction-yourself` channel**.
In #role-assignment, you can pick different categories. Make sure to **click "Reinforcement Learning"**. You'll then get to **introduce yourself in the `#introduce-yourself` channel**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord2.jpg" alt="Discord"/>

View File

@@ -22,6 +22,8 @@ To validate this hands-on for the [certification process](https://huggingface.co
To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
**If you don't find your model, go to the bottom of the page and click on the refresh button.**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
@@ -43,13 +45,6 @@ You can either do this hands-on by reading the notebook or following it with the
In this notebook, you'll train your **first Deep Reinforcement Learning agent**: a Lunar Lander agent that will learn to **land correctly on the Moon 🌕** using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/), a Deep Reinforcement Learning library. You'll then share it with the community and experiment with different configurations.
⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
```python
%%html
<video controls autoplay><source src="https://huggingface.co/ThomasSimonini/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>
```
### The environment 🎮
- [LunarLander-v2](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)
@@ -92,7 +87,7 @@ Before diving into the notebook, you need to:
🔲 📝 **Read Unit 0** that gives you all the **information about the course and help you to onboard** 🤗
🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by doing Unit 1
🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** by reading Unit 1
## A small recap of what is Deep Reinforcement Learning 📚
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
@@ -664,8 +659,7 @@ If you're still feeling confused by all these elements... it's totally normal! *
Take time to really **grasp the material before continuing and try the additional challenges**. It's important to master these elements and have a solid foundation.
Naturally, during the course, we're going to use and deeper explain again these terms but **it's better to have a good understanding of them now before diving into the next chapters.**
Naturally, during the course, we're going to dive deeper into these concepts but **it's better to have a good understanding of them now before diving into the next chapters.**
Next time, in the bonus unit 1, you'll train Huggy the Dog to fetch the stick.

View File

@@ -22,6 +22,6 @@ It's essential **to master these elements** before diving into implementing Dee
After this unit, in a bonus unit, you'll be **able to train Huggy the Dog 🐶 to fetch the stick and play with him 🤗**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.jpg" alt="Huggy"/>
<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/huggy.mp4" type="video/mp4" controls autoplay loop mute />
So let's get started! 🚀

View File

@@ -17,7 +17,7 @@ For instance, think about Super Mario Bros: an episode begins at the launch of a
## Continuing tasks [[continuing-tasks]]
These are tasks that continue forever (no terminal state). In this case, the agent must **learn how to choose the best actions and simultaneously interact with the environment.**
These are tasks that continue forever (**no terminal state**). In this case, the agent must **learn how to choose the best actions and simultaneously interact with the environment.**
For instance, an agent that does automated stock trading. For this task, there is no starting point and terminal state. **The agent keeps running until we decide to stop it.**

View File

@@ -8,7 +8,7 @@ In other terms, how to build an RL agent that can **select the actions that ma
## The Policy **π**: the agent's brain [[policy]]
The Policy **π** is the **brain of our Agent**, it's the function that tells us what **action to take given the state we are.** So it **defines the agent's behavior** at a given time.
The Policy **π** is the **brain of our Agent**, it's the function that tells us what **action to take given the state we are in.** So it **defines the agent's behavior** at a given time.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_1.jpg" alt="Policy" />
@@ -67,7 +67,7 @@ If we recap:
## Value-based methods [[value-based]]
In value-based methods, instead of training a policy function, we **train a value function** that maps a state to the expected value **of being at that state.**
In value-based methods, instead of learning a policy function, we **learn a value function** that maps a state to the expected value **of being at that state.**
The value of a state is the **expected discounted return** the agent can get if it **starts in that state and then acts according to our policy.**
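To make the distinction concrete, here is a toy sketch in Python (illustrative only: the grid values and the greedy rule are invented for this example, not part of the course code):

```python
import numpy as np

# A toy state-value function for a 1D grid world: V[s] is the expected
# discounted return when starting in state s and following our policy.
V = np.array([0.1, 0.3, 0.6, 1.0])  # state 3 is closest to the goal

def greedy_policy(state, V):
    """Act greedily w.r.t. the value function: move toward the
    neighboring state with the highest value."""
    left = V[state - 1] if state > 0 else -np.inf
    right = V[state + 1] if state < len(V) - 1 else -np.inf
    return "left" if left > right else "right"

print(greedy_policy(1, V))  # the agent moves toward the higher-valued neighbor
```

This is the core idea of value-based methods: the policy is not learned directly, it is derived from the value function.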

View File

@@ -16,6 +16,8 @@ Now that we studied the Q-Learning algorithm, let's implement it from scratch an
Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 2?
**If you don't find your model, go to the bottom of the page and click on the refresh button.**
To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
@@ -259,7 +261,8 @@ print("There are ", action_space, " possible actions")
```
```python
# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros. np.zeros needs a tuple (a, b)
def initialize_q_table(state_space, action_space):
Qtable =
return Qtable
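The notebook deliberately leaves `Qtable =` blank as an exercise. For readers following along here, one possible solution (not necessarily the notebook's official one):

```python
import numpy as np

def initialize_q_table(state_space, action_space):
    # np.zeros takes the shape as a single tuple, hence the inner parentheses
    Qtable = np.zeros((state_space, action_space))
    return Qtable

# e.g. FrozenLake 4x4 has 16 states and 4 actions
Qtable_frozenlake = initialize_q_table(16, 4)
```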
@@ -369,7 +372,7 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
```
## Define the hyperparameters ⚙️
The exploration related hyperparamters are some of the most important ones.
The exploration related hyperparameters are some of the most important ones.
- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need a progressive decay of epsilon.
- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
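As an illustration of such a progressive decay (the constants below are placeholders for the sketch, not the notebook's actual hyperparameters):

```python
import numpy as np

max_epsilon = 1.0    # exploration probability at the start
min_epsilon = 0.05   # floor, so the agent never stops exploring entirely
decay_rate = 0.0005  # too high, and epsilon collapses before the state space is covered

def epsilon_at(episode):
    # Exponential decay from max_epsilon toward min_epsilon
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

print(epsilon_at(0), epsilon_at(10000))  # near-full exploration at first, mostly greedy later
```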

View File

@@ -22,6 +22,8 @@ To validate this hands-on for the certification process, you need to push your t
To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**
**If you don't find your model, go to the bottom of the page and click on the refresh button.**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
@@ -36,7 +38,7 @@ And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSi
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" alt="Unit 3 Thumbnail">
In this notebook, **you'll train a Deep Q-Learning agent** playing Space Invaders using [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework based on [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
In this notebook, **you'll train a Deep Q-Learning agent** playing Space Invaders using [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework based on [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
We're using the [RL-Baselines-3 Zoo integration, a vanilla version of Deep Q-Learning](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) with no extensions such as Double-DQN, Dueling-DQN, and Prioritized Experience Replay.
@@ -131,7 +133,7 @@ pip install -r requirements.txt
## Train our Deep Q-Learning Agent to Play Space Invaders 👾
To train an agent with RL-Baselines3-Zoo, we just need to do two things:
1. We define the hyperparameters in `rl-baselines3-zoo/hyperparams/dqn.yml`
1. We define the hyperparameters in `/content/rl-baselines3-zoo/hyperparams/dqn.yml`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/hyperparameters.png" alt="DQN Hyperparameters">

View File

@@ -26,6 +26,8 @@ To validate this hands-on for the certification process, you need to push your t
To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go to the bottom of the leaderboard page and click on the refresh button**.
**If you don't find your model, go to the bottom of the page and click on the refresh button.**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
@@ -373,11 +375,11 @@ The second question you may ask is **why do we minimize the loss**? Did you talk
- We want to maximize our utility function $J(\theta)$, but in PyTorch and TensorFlow, it's better to **minimize an objective function.**
- So let's say we want to reinforce action 3 at a certain timestep. Before training, this action's probability P is 0.25.
- So we want to modify $\theta$ such that $\pi_\theta(a_3|s; \theta) > 0.25$
- Because all P must sum to 1, max $\pi_\theta(a_3|s; \theta)$ will **minimize other action probability.**
- So we should tell PyTorch **to min $1 - \pi_\theta(a_3|s; \theta)$.**
- This loss function approaches 0 as $\pi_\theta(a_3|s; \theta)$ nears 1.
- So we are encouraging the gradient to max $\pi_\theta(a_3|s; \theta)$
- So we want to modify \\( \theta \\) such that \\( \pi_\theta(a_3|s; \theta) > 0.25 \\)
- Because all P must sum to 1, maximizing \\( \pi_\theta(a_3|s; \theta) \\) will **minimize the probability of the other actions.**
- So we should tell PyTorch **to minimize \\( 1 - \pi_\theta(a_3|s; \theta) \\).**
- This loss function approaches 0 as \\( \pi_\theta(a_3|s; \theta) \\) nears 1.
- So we are encouraging the gradient to maximize \\( \pi_\theta(a_3|s; \theta) \\)
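This trick can be illustrated with a tiny PyTorch sketch. Illustrative only: the real Reinforce loss minimizes \\( -\log \pi_\theta(a|s) \\) weighted by the return, but the mechanics of "minimize a loss to raise a probability" are the same:

```python
import torch

logits = torch.zeros(4, requires_grad=True)  # 4 actions, uniform start: P = 0.25 each
optimizer = torch.optim.SGD([logits], lr=0.5)

for _ in range(100):
    probs = torch.softmax(logits, dim=0)
    loss = 1 - probs[3]          # minimizing this pushes P(a_3) up...
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

probs = torch.softmax(logits, dim=0)
print(probs[3].item())           # ...and, since probabilities sum to 1, the others down
```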
```python

View File

@@ -16,7 +16,7 @@ On the other hand, your friend (Critic) will also update their way to provide fe
This is the idea behind Actor-Critic. We learn two function approximations:
- *A policy* that **controls how our agent acts**: \\( \pi_{\theta}(s,a) \\)
- *A policy* that **controls how our agent acts**: \\( \pi_{\theta}(s) \\)
- *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\)
@@ -24,7 +24,7 @@ This is the idea behind Actor-Critic. We learn two function approximations:
Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how Actor and Critic improve together during the training.
As we saw, with Actor-Critic methods, there are two function approximations (two neural networks):
- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s,a) \\)
- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s) \\)
- *Critic*, a **value function** parameterized by w: \\( \hat{q}_{w}(s,a) \\)
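As a toy sketch (the sizes and layers here are placeholders, not the course's actual implementation), these two approximations are just two small networks:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4  # placeholder dimensions

# Actor: the policy pi_theta(s) -> a distribution over actions
actor = nn.Sequential(
    nn.Linear(state_dim, 64), nn.Tanh(),
    nn.Linear(64, n_actions), nn.Softmax(dim=-1),
)

# Critic: the action-value function q_w(s, a) -> a scalar
critic = nn.Sequential(
    nn.Linear(state_dim + n_actions, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

s = torch.randn(1, state_dim)
probs = actor(s)                               # pi_theta(s)
a = torch.nn.functional.one_hot(probs.argmax(dim=-1), n_actions).float()
q = critic(torch.cat([s, a], dim=-1))          # q_w(s, a)
print(probs.shape, q.shape)
```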
Let's see the training process to understand how Actor and Critic are optimized:

View File

@@ -28,6 +28,8 @@ To validate this hands-on for the certification process, you need to push your t
To find your result, [go to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
**If you don't find your model, go to the bottom of the page and click on the refresh button.**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
**To start the hands-on click on Open In Colab button** 👇 :
@@ -191,7 +193,7 @@ env = # TODO: Add the wrapper
```python
env = make_vec_env(env_id, n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
```
### Create the A2C Model 🤖

View File

@@ -24,6 +24,8 @@ More precisely, AI vs. AI is three tools:
- A *leaderboard* getting the match history results and displaying the models ELO ratings: [https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos)
- A *Space demo* to visualize your agents playing against others: [https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos](https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos)
In addition to these three tools, your classmate cyllum created a 🤗 SoccerTwos Challenge Analytics where you can check the detailed match results of a model: [https://huggingface.co/spaces/cyllum/soccertwos-analytics](https://huggingface.co/spaces/cyllum/soccertwos-analytics)
We're going to write a blog post to explain this AI vs. AI tool in detail, but to give you the big picture it works this way:
- Every four hours, our algorithm **fetches all the available models for a given environment (in our case ML-Agents-SoccerTwos).**

View File

@@ -0,0 +1,21 @@
# Additional Readings [[additional-readings]]
These are **optional readings** if you want to go deeper.
## PPO Explained
- [Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization by Daniel Bick](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf)
- [What is the way to understand Proximal Policy Optimization Algorithm in RL?](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl)
- [Foundations of Deep RL Series, L4 TRPO and PPO by Pieter Abbeel](https://youtu.be/KjWF8VIMGiY)
- [OpenAI PPO Blogpost](https://openai.com/blog/openai-baselines-ppo/)
- [Spinning Up RL PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
- [Paper Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
## PPO Implementation details
- [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)
- [Part 1 of 3 — Proximal Policy Optimization Implementation: 11 Core Implementation Details](https://www.youtube.com/watch?v=MEt6rrxH8W4)
## Importance Sampling
- [Importance Sampling Explained](https://youtu.be/C3p2wI4RAi8)

View File

@@ -0,0 +1,69 @@
# Introducing the Clipped Surrogate Objective Function
## Recap: The Policy Objective Function
Let's recall the objective function we optimize in Reinforce:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/lpg.jpg" alt="Reinforce"/>
The idea was that by taking a gradient ascent step on this function (equivalent to taking a gradient descent step on its negative), we would **push our agent to take actions that lead to higher rewards and avoid harmful actions.**
However, the problem comes from the step size:
- If the step is too small, **the training process is too slow**
- If the step is too large, **there is too much variability in training**
Here with PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change in a small range using a clip.**
This new function **is designed to avoid destructively large weight updates**:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ppo-surrogate.jpg" alt="PPO surrogate function"/>
Let's study each part to understand how it works.
## The Ratio Function
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ratio1.jpg" alt="Ratio"/>
This ratio is calculated this way:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ratio2.jpg" alt="Ratio"/>
It's the probability of taking action \\( a_t \\) at state \\( s_t \\) under the current policy, divided by the same probability under the previous policy.
As we can see, \\( r_t(\theta) \\) denotes the probability ratio between the current and old policy:
- If \\( r_t(\theta) > 1 \\), the **action \\( a_t \\) at state \\( s_t \\) is more likely in the current policy than the old policy.**
- If \\( r_t(\theta) \\) is between 0 and 1, the **action is less likely for the current policy than for the old one**.
So this probability ratio is an **easy way to estimate the divergence between old and current policy.**
## The unclipped part of the Clipped Surrogate Objective function
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/unclipped1.jpg" alt="PPO"/>
This ratio **can replace the log probability we use in the policy objective function**. This gives us the left part of the new objective function: multiplying the ratio by the advantage.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/unclipped2.jpg" alt="PPO"/>
<figcaption><a href="https://arxiv.org/pdf/1707.06347.pdf">Proximal Policy Optimization Algorithms</a></figcaption>
</figure>
However, without a constraint, if the action taken is much more probable in our current policy than in our former, **this would lead to a significant policy gradient step** and, therefore, an **excessive policy update.**
## The clipped Part of the Clipped Surrogate Objective function
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/clipped.jpg" alt="PPO"/>
Consequently, we need to constrain this objective function by penalizing changes that lead to a ratio far from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).
**By clipping the ratio, we ensure that we do not have too large a policy update, because the current policy can't be too different from the older one.**
To do that, we have two solutions:
- *TRPO (Trust Region Policy Optimization)* uses KL divergence constraints outside the objective function to constrain the policy update. But this method **is complicated to implement and takes more computation time.**
- *PPO* clips the probability ratio directly in the objective function with its **Clipped Surrogate Objective function.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/clipped.jpg" alt="PPO"/>
This clipped part is a version where \\( r_t(\theta) \\) is clipped to the range \\( [1 - \epsilon, 1 + \epsilon] \\).
With the Clipped Surrogate Objective function, we have two probability ratios: one non-clipped and one clipped to the range \\( [1 - \epsilon, 1 + \epsilon] \\), where epsilon is a hyperparameter that defines the clip range (in the paper, \\( \epsilon = 0.2 \\)).
Then, we take the minimum of the clipped and non-clipped objective, **so the final objective is a lower bound (pessimistic bound) of the unclipped objective.**
Taking the minimum of the clipped and non-clipped objective means **we'll select either the clipped or the non-clipped objective based on the ratio and advantage situation**.
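Putting the pieces together, the Clipped Surrogate Objective is only a few lines of code. The sketch below uses made-up inputs and \\( \epsilon = 0.2 \\) as in the paper:

```python
import torch

def clipped_surrogate(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    # r_t(theta): ratio of current to old action probabilities
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # The minimum makes the final objective a pessimistic (lower) bound
    return torch.min(unclipped, clipped)

# An action that became twice as likely, with positive advantage:
obj = clipped_surrogate(torch.log(torch.tensor(0.5)),
                        torch.log(torch.tensor(0.25)),
                        torch.tensor(1.0))
print(obj.item())  # the ratio of 2.0 is clipped to 1 + epsilon = 1.2
```

With a positive advantage, the objective stops rewarding increases in the ratio beyond \\( 1 + \epsilon \\), which is exactly the "no destructive update" behavior described above.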

View File

@@ -0,0 +1,13 @@
# Conclusion
That's all for today. Congrats on finishing this Unit and the tutorial! ⭐️
Now that you've successfully trained your Doom agent, why not try deathmatch? Remember, that's a much more complex level than the one you've just trained on, **but it's a nice experiment and I advise you to try it.**
If you do it, don't hesitate to share your model in the `#rl-i-made-this` channel in our [discord server](https://www.hf.co/join/discord).
This concludes the last unit, but we are not finished yet! 🤗 The following **bonus unit includes some of the most interesting, advanced, and cutting-edge work in Deep Reinforcement Learning**.
See you next time 🔥,
## Keep Learning, Stay awesome 🤗

View File

@@ -0,0 +1,9 @@
# Conclusion [[Conclusion]]
That's all for today. Congrats on finishing this unit and the tutorial!
The best way to learn is to practice and try stuff. **Why not improve the implementation to handle frames as input?**
See you in the second part of this Unit 🔥,
## Keep Learning, Stay awesome 🤗

File diff suppressed because it is too large

View File

@@ -0,0 +1,430 @@
# Hands-on: advanced Deep Reinforcement Learning. Using Sample Factory to play Doom from pixels
<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit8/unit8_part2.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />
The colab notebook:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit8/unit8_part2.ipynb)
# Unit 8 Part 2: Advanced Deep Reinforcement Learning. Using Sample Factory to play Doom from pixels
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail2.png" alt="Thumbnail"/>
In this notebook, we will learn how to train a Deep Neural Network to collect objects in a 3D environment based on the game of Doom, a video of the resulting policy is shown below. We train this policy using [Sample Factory](https://www.samplefactory.dev/), an asynchronous implementation of the PPO algorithm.
Please note the following points:
* [Sample Factory](https://www.samplefactory.dev/) is an advanced RL framework and **only functions on Linux and Mac** (not Windows).
* The framework performs best on a **GPU machine with many CPU cores**, where it can achieve speeds of 100k interactions per second. The resources available on a standard Colab notebook **limit the performance of this library**. So the speed in this setting **does not reflect the real-world performance**.
* Benchmarks for Sample Factory are available in a number of settings, check out the [examples](https://github.com/alex-petrenko/sample-factory/tree/master/sf_examples) if you want to find out more.
```python
from IPython.display import HTML
HTML(
"""<video width="640" height="480" controls>
<source src="https://huggingface.co/edbeeching/doom_health_gathering_supreme_3333/resolve/main/replay.mp4"
type="video/mp4">Your browser does not support the video tag.</video>"""
)
```
To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push one model:
- `doom_health_gathering_supreme` with a result of >= 5.
To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
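For reference, this is how that score is computed from a set of evaluation episode rewards (illustrative numbers; whether the leaderboard uses sample or population standard deviation is an assumption here):

```python
import statistics

episode_rewards = [9.0, 11.0, 10.0, 8.0, 12.0]  # illustrative evaluation rewards

mean_reward = statistics.mean(episode_rewards)
std_reward = statistics.stdev(episode_rewards)  # sample std; the leaderboard may differ

result = mean_reward - std_reward
print(round(result, 2))  # 8.42
```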
If you don't find your model, **go to the bottom of the page and click on the refresh button**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">
- `Hardware Accelerator > GPU`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
Before starting to train our agent, let's **study the library and environments we're going to use**.
## Sample Factory
[Sample Factory](https://www.samplefactory.dev/) is one of the **fastest RL libraries focused on very efficient synchronous and asynchronous implementations of policy gradients (PPO)**.
Sample Factory is thoroughly **tested, used by many researchers and practitioners**, and is actively maintained. Our implementation is known to **reach SOTA performance in a variety of domains while minimizing RL experiment training time and hardware requirements**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/samplefactoryenvs.png" alt="Sample factory"/>
### Key features
- Highly optimized algorithm [architecture](https://www.samplefactory.dev/06-architecture/overview/) for maximum learning throughput
- [Synchronous and asynchronous](https://www.samplefactory.dev/07-advanced-topics/sync-async/) training regimes
- [Serial (single-process) mode](https://www.samplefactory.dev/07-advanced-topics/serial-mode/) for easy debugging
- Optimal performance in both CPU-based and [GPU-accelerated environments](https://www.samplefactory.dev/09-environment-integrations/isaacgym/)
- Single- & multi-agent training, self-play, supports [training multiple policies](https://www.samplefactory.dev/07-advanced-topics/multi-policy-training/) at once on one or many GPUs
- Population-Based Training ([PBT](https://www.samplefactory.dev/07-advanced-topics/pbt/))
- Discrete, continuous, hybrid action spaces
- Vector-based, image-based, dictionary observation spaces
- Automatically creates a model architecture by parsing action/observation space specification. Supports [custom model architectures](https://www.samplefactory.dev/03-customization/custom-models/)
- Designed to be imported into other projects, [custom environments](https://www.samplefactory.dev/03-customization/custom-environments/) are first-class citizens
- Detailed [WandB and Tensorboard summaries](https://www.samplefactory.dev/05-monitoring/metrics-reference/), [custom metrics](https://www.samplefactory.dev/05-monitoring/custom-metrics/)
- [HuggingFace 🤗 integration](https://www.samplefactory.dev/10-huggingface/huggingface/) (upload trained models and metrics to the Hub)
- [Multiple](https://www.samplefactory.dev/09-environment-integrations/mujoco/) [example](https://www.samplefactory.dev/09-environment-integrations/atari/) [environment](https://www.samplefactory.dev/09-environment-integrations/vizdoom/) [integrations](https://www.samplefactory.dev/09-environment-integrations/dmlab/) with tuned parameters and trained models
All of the above policies are available on the 🤗 hub. Search for the tag [sample-factory](https://huggingface.co/models?library=sample-factory&sort=downloads)
### How sample-factory works
Sample-factory is one of the **most highly optimized RL implementations available to the community**.
It works by **spawning multiple processes that run rollout workers, inference workers and a learner worker**.
The *workers* **communicate through shared memory, which lowers the communication cost between processes**.
The *rollout workers* interact with the environment and send observations to the *inference workers*.
The *inference workers* query a fixed version of the policy and **send actions back to the rollout workers**.
After *k* steps, the rollout workers send a trajectory of experience to the learner worker, **which uses it to update the agent's policy network**.
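This cycle can be sketched in a deliberately simplified, single-process form (Sample Factory actually runs these as separate processes communicating through shared memory; all names and the toy environment below are illustrative, not the library's API):

```python
import random

def inference_worker(policy_weight, obs):
    # Queries a fixed snapshot of the policy and sends an action back.
    return 0 if obs * policy_weight > 0 else 1

def rollout_worker(policy_weight, k=8):
    # Interacts with a toy environment for k steps, building a trajectory.
    trajectory = []
    for _ in range(k):
        obs = random.uniform(-1.0, 1.0)
        action = inference_worker(policy_weight, obs)
        reward = 1.0 if action == 0 else 0.0
        trajectory.append((obs, action, reward))
    return trajectory

def learner_worker(policy_weight, trajectory, lr=0.01):
    # Uses the trajectory of experience to update the policy (dummy update here).
    mean_reward = sum(r for _, _, r in trajectory) / len(trajectory)
    return policy_weight + lr * mean_reward

random.seed(0)
policy = 0.5
for _ in range(3):  # three rollout -> learn cycles
    trajectory = rollout_worker(policy, k=8)
    policy = learner_worker(policy, trajectory)
print(len(trajectory))  # 8: one experience tuple per step
```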
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/samplefactory.png" alt="Sample factory"/>
### Actor Critic models in Sample-factory
Actor Critic models in Sample Factory are composed of three components:
- **Encoder** - Processes input observations (images, vectors) and maps them to a vector. This is the part of the model you will most likely want to customize.
- **Core** - Integrates vectors from one or more encoders; can optionally include a single- or multi-layer LSTM/GRU in a memory-based agent.
- **Decoder** - Applies additional layers to the output of the model core before computing the policy and value outputs.
The library has been designed to automatically support any observation and action spaces. Users can easily add their custom models. You can find out more in the [documentation](https://www.samplefactory.dev/03-customization/custom-models/#actor-critic-models-in-sample-factory).
## ViZDoom
[ViZDoom](https://vizdoom.cs.put.edu.pl/) is an **open-source python interface for the Doom Engine**.
The library was created in 2016 by Marek Wydmuch and Michal Kempka at the Institute of Computing Science, Poznan University of Technology, Poland.
The library enables the **training of agents directly from the screen pixels in a number of scenarios**, including team deathmatch, shown in the video below. Because the ViZDoom environment is based on a game that was created in the 90s, it can be run on modern hardware at accelerated speeds, **allowing us to learn complex AI behaviors fairly quickly**.
The library includes features such as:
- Multi-platform (Linux, macOS, Windows),
- API for Python and C++,
- [OpenAI Gym](https://www.gymlibrary.dev/) environment wrappers
- Easy-to-create custom scenarios (visual editors, scripting language, and examples available),
- Async and sync single-player and multiplayer modes,
- Lightweight (few MBs) and fast (up to 7000 fps in sync mode, single-threaded),
- Customizable resolution and rendering parameters,
- Access to the depth buffer (3D vision),
- Automatic labeling of game objects visible in the frame,
- Access to the audio buffer
- Access to the list of actors/objects and map geometry,
- Off-screen rendering and episode recording,
- Time scaling in async mode.
## We first need to install some dependencies that are required for the ViZDoom environment
Now that our Colab runtime is set up, we can start by installing the dependencies required to run ViZDoom on linux.
If you are following along on your own Mac machine, you will want to follow the installation instructions on the [github page](https://github.com/Farama-Foundation/ViZDoom/blob/master/doc/Quickstart.md#-quickstart-for-macos-and-anaconda3-python-36).
```bash
# Install ViZDoom deps from
# https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md#-linux
apt-get install build-essential zlib1g-dev libsdl2-dev libjpeg-dev \
nasm tar libbz2-dev libgtk2.0-dev cmake git libfluidsynth-dev libgme-dev \
libopenal-dev timidity libwildmidi-dev unzip ffmpeg
# Boost libraries
apt-get install libboost-all-dev
# Lua binding dependencies
apt-get install liblua5.1-dev
```
## Then we can install Sample Factory and ViZDoom
- This can take 7 minutes
```bash
pip install sample-factory
pip install vizdoom
```
## Setting up the Doom Environment in sample-factory
```python
import functools
from sample_factory.algo.utils.context import global_model_factory
from sample_factory.cfg.arguments import parse_full_cfg, parse_sf_args
from sample_factory.envs.env_utils import register_env
from sample_factory.train import run_rl
from sf_examples.vizdoom.doom.doom_model import make_vizdoom_encoder
from sf_examples.vizdoom.doom.doom_params import add_doom_env_args, doom_override_defaults
from sf_examples.vizdoom.doom.doom_utils import DOOM_ENVS, make_doom_env_from_spec
# Registers all the ViZDoom environments
def register_vizdoom_envs():
for env_spec in DOOM_ENVS:
make_env_func = functools.partial(make_doom_env_from_spec, env_spec)
register_env(env_spec.name, make_env_func)
# Sample Factory allows the registration of a custom Neural Network architecture
# See https://github.com/alex-petrenko/sample-factory/blob/master/sf_examples/vizdoom/doom/doom_model.py for more details
def register_vizdoom_models():
global_model_factory().register_encoder_factory(make_vizdoom_encoder)
def register_vizdoom_components():
register_vizdoom_envs()
register_vizdoom_models()
# parse the command line args and create a config
def parse_vizdoom_cfg(argv=None, evaluation=False):
parser, _ = parse_sf_args(argv=argv, evaluation=evaluation)
# parameters specific to Doom envs
add_doom_env_args(parser)
# override Doom default values for algo parameters
doom_override_defaults(parser)
# second parsing pass yields the final configuration
final_cfg = parse_full_cfg(parser, argv)
return final_cfg
```
Now that the setup is complete, we can train the agent. We have chosen here to learn a ViZDoom task called `Health Gathering Supreme`.
### The scenario: Health Gathering Supreme
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/Health-Gathering-Supreme.png" alt="Health-Gathering-Supreme"/>
The objective of this scenario is to **teach the agent how to survive without knowing what makes it survive**. The agent knows only that **life is precious** and death is bad, so **it must learn what prolongs its existence and that its health is connected to survival**.
The map is a rectangle containing walls, with a green, acidic floor that **hurts the player periodically**. Initially, some medkits are spread uniformly over the map, and a new medkit falls from the skies every now and then. **Medkits heal some portion of the player's health** - to survive, the agent needs to pick them up. The episode finishes after the player's death or on timeout.
Further configuration:
- Living_reward = 1
- 3 available buttons: turn left, turn right, move forward
- 1 available game variable: HEALTH
- death penalty = 100
You can find out more about the scenarios available in ViZDoom [here](https://github.com/Farama-Foundation/ViZDoom/tree/master/scenarios).
There are also a number of more complex scenarios that have been created for ViZDoom, such as the ones detailed on [this github page](https://github.com/edbeeching/3d_control_deep_rl).
## Training the agent
- We're going to train the agent for 4,000,000 steps. This will take approximately 20 minutes.
```python
## Start the training, this should take around 15 minutes
register_vizdoom_components()
# The scenario we train on today is health gathering
# other scenarios include "doom_basic", "doom_two_colors_easy", "doom_dm", "doom_dwango5", "doom_my_way_home", "doom_deadly_corridor", "doom_defend_the_center", "doom_defend_the_line"
env = "doom_health_gathering_supreme"
cfg = parse_vizdoom_cfg(
argv=[f"--env={env}", "--num_workers=8", "--num_envs_per_worker=4", "--train_for_env_steps=4000000"]
)
status = run_rl(cfg)
```
## Let's take a look at the performance of the trained policy and output a video of the agent.
```python
from sample_factory.enjoy import enjoy
cfg = parse_vizdoom_cfg(
argv=[f"--env={env}", "--num_workers=1", "--save_video", "--no_render", "--max_num_episodes=10"], evaluation=True
)
status = enjoy(cfg)
```
## Now let's visualize the performance of the agent
```python
from base64 import b64encode
from IPython.display import HTML
mp4 = open("/content/train_dir/default_experiment/replay.mp4", "rb").read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML(
"""
<video width=640 controls>
<source src="%s" type="video/mp4">
</video>
"""
% data_url
)
```
The agent has learned something, but its performance could be better. We would clearly need to train for longer. But let's upload this model to the Hub.
## Now let's upload your checkpoint and video to the Hugging Face Hub
To be able to share your model with the community there are three more steps to follow:
1⃣ (If you haven't already) create an account on HF ➡ https://huggingface.co/join
2⃣ Sign in, then store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
- Copy the token
- Run the cell below and paste the token
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
```python
from huggingface_hub import notebook_login
notebook_login()
!git config --global credential.helper store
```
```python
from sample_factory.enjoy import enjoy
hf_username = "ThomasSimonini" # insert your HuggingFace username here
cfg = parse_vizdoom_cfg(
argv=[
f"--env={env}",
"--num_workers=1",
"--save_video",
"--no_render",
"--max_num_episodes=10",
"--max_num_frames=100000",
"--push_to_hub",
f"--hf_repository={hf_username}/rl_course_vizdoom_health_gathering_supreme",
],
evaluation=True,
)
status = enjoy(cfg)
```
## Let's load another model
This agent's performance was good, but we can do better! Let's download and visualize an agent trained for 10B timesteps from the hub.
```bash
#download the agent from the hub
python -m sample_factory.huggingface.load_from_hub -r edbeeching/doom_health_gathering_supreme_2222 -d ./train_dir
```
```bash
ls train_dir/doom_health_gathering_supreme_2222
```
```python
env = "doom_health_gathering_supreme"
cfg = parse_vizdoom_cfg(
argv=[
f"--env={env}",
"--num_workers=1",
"--save_video",
"--no_render",
"--max_num_episodes=10",
"--experiment=doom_health_gathering_supreme_2222",
"--train_dir=train_dir",
],
evaluation=True,
)
status = enjoy(cfg)
```
```python
mp4 = open("/content/train_dir/doom_health_gathering_supreme_2222/replay.mp4", "rb").read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML(
"""
<video width=640 controls>
<source src="%s" type="video/mp4">
</video>
"""
% data_url
)
```
## Some additional challenges 🏆: Doom Deathmatch
Training an agent to play a Doom deathmatch **takes many hours on a beefier machine than is available in Colab**.
Fortunately, we have **already trained an agent in this scenario, and it is available in the 🤗 Hub!** Let's download the model and visualize the agent's performance.
```bash
# Download the agent from the hub
python -m sample_factory.huggingface.load_from_hub -r edbeeching/doom_deathmatch_bots_2222 -d ./train_dir
```
Since the agent plays for a long time, the video generation can take **10 minutes**.
```python
from sample_factory.enjoy import enjoy
register_vizdoom_components()
env = "doom_deathmatch_bots"
cfg = parse_vizdoom_cfg(
argv=[
f"--env={env}",
"--num_workers=1",
"--save_video",
"--no_render",
"--max_num_episodes=1",
"--experiment=doom_deathmatch_bots_2222",
"--train_dir=train_dir",
],
evaluation=True,
)
status = enjoy(cfg)
mp4 = open("/content/train_dir/doom_deathmatch_bots_2222/replay.mp4", "rb").read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML(
"""
<video width=640 controls>
<source src="%s" type="video/mp4">
</video>
"""
% data_url
)
```
You **can try to train your agent in this environment** using the code above, but not on Colab.
**Good luck 🤞**
If you prefer an easier scenario, **why not try training in another ViZDoom scenario such as `doom_deadly_corridor` or `doom_defend_the_center`?**
---
This concludes the last unit. But we are not finished yet! 🤗 The following **bonus section includes some of the most interesting, advanced, and cutting-edge work in Deep Reinforcement Learning**.
## Keep learning, stay awesome 🤗

View File

@@ -0,0 +1,13 @@
# Introduction to PPO with Sample-Factory
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail2.png" alt="thumbnail"/>
In this second part of Unit 8, we'll dive deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/), an **asynchronous implementation of the PPO algorithm**, to train our agent to play [vizdoom](https://vizdoom.cs.put.edu.pl/) (an open source version of Doom).
In the notebook, **you'll train your agent to play the Health Gathering level**, where the agent must collect health packs to avoid dying. After that, you can **train your agent to play more complex levels, such as Deathmatch**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/environments.png" alt="Environment"/>
Sounds exciting? Let's get started! 🚀
The hands-on is made by [Edward Beeching](https://twitter.com/edwardbeeching), a Machine Learning Research Scientist at Hugging Face. He worked on Godot Reinforcement Learning Agents, an open-source interface for developing environments and agents in the Godot Game Engine.

View File

@@ -0,0 +1,23 @@
# Introduction [[introduction]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail.png" alt="Unit 8"/>
In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that help to stabilize the training by reducing the variance with:
- *An Actor* that controls **how our agent behaves** (policy-based method).
- *A Critic* that measures **how good the action taken is** (value-based method).
Today we'll learn about Proximal Policy Optimization (PPO), an architecture that **improves our agent's training stability by avoiding policy updates that are too large**. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio to a specific range \\( [1 - \epsilon, 1 + \epsilon] \\) .
Doing this will ensure **that our policy update will not be too large and that the training is more stable.**
This Unit is in two parts:
- In this first part, you'll learn the theory behind PPO and code your PPO agent from scratch using the [CleanRL](https://github.com/vwxyzjn/cleanrl) implementation, then test its robustness with LunarLander-v2. LunarLander-v2 **is the first environment you used when you started this course**. At that time, you didn't know how PPO worked, and now **you can code it from scratch and train it. How incredible is that 🤩**.
- In the second part, we'll dive deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/) and train an agent to play vizdoom (an open source version of Doom).
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/environments.png" alt="Environment"/>
<figcaption>These are the environments you're going to use to train your agents: VizDoom environments</figcaption>
</figure>
Sounds exciting? Let's get started! 🚀

View File

@@ -0,0 +1,16 @@
# The intuition behind PPO [[the-intuition-behind-ppo]]
The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change we make to the policy at each training epoch: **we want to avoid policy updates that are too large.**
For two reasons:
- We know empirically that smaller policy updates during training are **more likely to converge to an optimal solution.**
- A too-big step in a policy update can result in falling “off the cliff” (getting a bad policy) **and taking a long time, or even having no way, to recover.**
<figure class="image table text-center m-0 w-full">
<img class="center" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/cliff.jpg" alt="Policy Update cliff"/>
<figcaption>Taking smaller policy updates to improve the training stability</figcaption>
<figcaption>Modified version from RL — Proximal Policy Optimization (PPO) <a href="https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12">Explained by Jonathan Hui</a></figcaption>
</figure>
**So with PPO, we update the policy conservatively**. To do so, we need to measure how much the current policy changed compared to the former one using a ratio calculation between the current and former policy. And we clip this ratio in a range \\( [1 - \epsilon, 1 + \epsilon] \\), meaning that we **remove the incentive for the current policy to go too far from the old one (hence the proximal policy term).**
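To make that ratio concrete, here's a minimal sketch (pure Python, illustrative numbers) of computing it from log-probabilities and clipping it:

```python
import math

def ppo_ratio(logprob_new, logprob_old):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t),
    # computed from log-probabilities for numerical stability
    return math.exp(logprob_new - logprob_old)

def clip_ratio(ratio, epsilon=0.2):
    # Keep the ratio inside [1 - epsilon, 1 + epsilon]
    return max(1.0 - epsilon, min(ratio, 1.0 + epsilon))

# The current policy makes the action 1.5x more likely than the old one
r = ppo_ratio(math.log(0.9), math.log(0.6))
print(round(r, 2))    # 1.5
# The clip removes the incentive to move further than 1 + epsilon
print(clip_ratio(r))  # 1.2
```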

View File

@@ -0,0 +1,68 @@
# Visualize the Clipped Surrogate Objective Function
Don't worry. **It's normal if this seems complex to handle right now**. But we're going to see what this Clipped Surrogate Objective Function looks like, and this will help you to visualize better what's going on.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
<figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained
Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
</figure>
We have six different situations. Remember first that we take the minimum between the clipped and unclipped objectives.
## Case 1 and 2: the ratio is within the range
In situations 1 and 2, **the clipping does not apply since the ratio is within the range** \\( [1 - \epsilon, 1 + \epsilon] \\)
In situation 1, we have a positive advantage: the **action is better than the average** of all the actions in that state. Therefore, we should encourage our current policy to increase the probability of taking that action in that state.
Since the ratio is within the range, **we can increase our policy's probability of taking that action at that state.**
In situation 2, we have a negative advantage: the action is worse than the average of all actions at that state. Therefore, we should discourage our current policy from taking that action in that state.
Since the ratio is within the range, **we can decrease the probability that our policy takes that action at that state.**
## Case 3 and 4: the ratio is below the range
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
<figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained
Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
</figure>
If the probability ratio is lower than \\( 1 - \epsilon \\), the probability of taking that action at that state is much lower than with the old policy.
If, like in situation 3, the advantage estimate is positive (A>0), then **you want to increase the probability of taking that action at that state.**
But if, like in situation 4, the advantage estimate is negative, **we don't want to decrease further** the probability of taking that action at that state. Therefore, the gradient is 0 (since we're on a flat line), so we don't update our weights.
## Case 5 and 6: the ratio is above the range
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
<figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained
Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
</figure>
If the probability ratio is higher than \\( 1 + \epsilon \\), the probability of taking that action at that state in the current policy is **much higher than in the former policy.**
If, like in situation 5, the advantage is positive, **we don't want to get too greedy**. We already have a higher probability of taking that action at that state than the former policy. Therefore, the gradient is 0 (since we're on a flat line), so we don't update our weights.
If, like in situation 6, the advantage is negative, we want to decrease the probability of taking that action at that state.
So if we recap, **we only update the policy with the unclipped objective part**. When the minimum is the clipped objective part, we don't update our policy weights since the gradient will equal 0.
So we update our policy only if:
- Our ratio is in the range \\( [1 - \epsilon, 1 + \epsilon] \\)
- Our ratio is outside the range, but **the advantage leads to getting closer to the range**
- Being below the range but the advantage is > 0
- Being above the range but the advantage is < 0
**You might wonder why, when the minimum is the clipped ratio, the gradient is 0.** When the ratio is clipped, the derivative in this case will not be the derivative of the \\( r_t(\theta) * A_t \\) but the derivative of either \\( (1 - \epsilon)* A_t\\) or the derivative of \\( (1 + \epsilon)* A_t\\) which both = 0.
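We can check this numerically with a finite-difference sketch in plain Python (the values are illustrative):

```python
def clipped_objective(r, advantage, eps=0.2):
    # min( r * A_t, clip(r, 1 - eps, 1 + eps) * A_t )
    return min(r * advantage, max(1.0 - eps, min(r, 1.0 + eps)) * advantage)

def grad_wrt_ratio(f, r, h=1e-6):
    # Central finite-difference approximation of df/dr
    return (f(r + h) - f(r - h)) / (2.0 * h)

advantage = 1.0
f = lambda r: clipped_objective(r, advantage)

print(round(grad_wrt_ratio(f, 1.0), 4))  # 1.0 -> inside the range, gradient = A_t
print(round(grad_wrt_ratio(f, 1.5), 4))  # 0.0 -> ratio clipped, gradient = 0
```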
To summarize, thanks to this clipped surrogate objective, **we restrict the range within which the current policy can vary from the old one**, because we remove the incentive for the probability ratio to move outside the interval: the clip zeroes the gradient. If the ratio is > \\( 1 + \epsilon \\) or < \\( 1 - \epsilon \\), the gradient will be equal to 0.
The final Clipped Surrogate Objective Loss for PPO Actor-Critic style looks like this; it's a combination of the Clipped Surrogate Objective function, the Value Loss function, and an Entropy bonus:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ppo-objective.jpg" alt="PPO objective"/>
That was quite complex. Take time to understand these situations by looking at the table and the graph. **You must understand why this makes sense.** If you want to go deeper, the best resource is the article ["Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization" by Daniel Bick, especially part 3.4](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf).

View File

@@ -0,0 +1,54 @@
# (Automatic) Curriculum Learning for RL
While most of the RL methods seen in this course work well in practice, there are some cases where using them alone fails. This is, for instance, the case when:
- the task to learn is hard and requires an **incremental acquisition of skills** (for instance when one wants to make a bipedal agent learn to go through hard obstacles, it must first learn to stand, then walk, then maybe jump…)
- there are variations in the environment (that affect the difficulty) and one wants its agent to be **robust** to them
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/bipedal.gif" alt="Bipedal"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/movable_creepers.gif" alt="Movable creepers"/>
<figcaption> <a href="https://developmentalsystems.org/TeachMyAgent/">TeachMyAgent</a> </figcaption>
</figure>
In such cases, we need to propose different tasks to our RL agent and organize them so that the agent progressively acquires skills. This approach is called **Curriculum Learning** and usually implies a hand-designed curriculum (or set of tasks organized in a specific order). In practice, one can for instance control the generation of the environment, the initial states, or use Self-Play and control the level of opponents proposed to the RL agent.
As designing such a curriculum is not always trivial, the field of **Automatic Curriculum Learning (ACL) proposes to design approaches that learn to create such an organization of tasks in order to maximize the RL agent's performance**. Portelas et al. proposed to define ACL as:
> … a family of mechanisms that automatically adapt the distribution of training data by learning to adjust the selection of learning situations to the capabilities of RL agents.
>
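As a toy illustration of that definition (a simplistic heuristic of our own, not one of the methods cited below): sample more often the tasks whose measured success rate sits at an intermediate level, i.e. at the frontier of the agent's current capabilities:

```python
import random

def sample_task(difficulties, success_rate, target=0.5):
    # Toy automatic-curriculum heuristic: prefer task difficulties where the
    # agent's measured success rate is closest to an intermediate target
    # (neither trivially solved nor hopeless). Real ACL methods are far
    # more sophisticated than this sketch.
    weights = [1.0 / (1e-3 + abs(success_rate[d] - target)) for d in difficulties]
    return random.choices(difficulties, weights=weights, k=1)[0]

success = {"easy": 0.95, "medium": 0.5, "hard": 0.05}
counts = {d: 0 for d in success}
random.seed(0)
for _ in range(1000):
    counts[sample_task(list(success), success)] += 1
print(max(counts, key=counts.get))  # "medium" dominates the sampled curriculum
```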
As an example, OpenAI used **Domain Randomization** (they applied random variations on the environment) to make a robot hand solve Rubik's Cubes.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/dr.jpg" alt="Dr"/>
<figcaption> <a href="https://openai.com/blog/solving-rubiks-cube/">OpenAI - Solving Rubiks Cube with a Robot Hand</a></figcaption>
</figure>
Finally, you can play with the robustness of agents trained in the <a href="https://huggingface.co/spaces/flowers-team/Interactive_DeepRL_Demo">TeachMyAgent</a> benchmark by controlling environment variations or even drawing the terrain 👇
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/demo.png" alt="Demo"/>
<figcaption> <a href="https://huggingface.co/spaces/flowers-team/Interactive_DeepRL_Demo">https://huggingface.co/spaces/flowers-team/Interactive_DeepRL_Demo</a></figcaption>
</figure>
## Further reading
For more information, we recommend you check out the following resources:
### Overview of the field
- [Automatic Curriculum Learning For Deep RL: A Short Survey](https://arxiv.org/pdf/2003.04664.pdf)
- [Curriculum for Reinforcement Learning](https://lilianweng.github.io/posts/2020-01-29-curriculum-rl/)
### Recent methods
- [Evolving Curricula with Regret-Based Environment Design](https://arxiv.org/abs/2203.01302)
- [Curriculum Reinforcement Learning via Constrained Optimal Transport](https://proceedings.mlr.press/v162/klink22a.html)
- [Prioritized Level Replay](https://arxiv.org/abs/2010.03934)
## Author
This section was written by <a href="https://twitter.com/ClementRomac"> Clément Romac </a>

View File

@@ -0,0 +1,31 @@
# Decision Transformers
The Decision Transformer model was introduced by ["Decision Transformer: Reinforcement Learning via Sequence Modeling” by Chen L. et al](https://arxiv.org/abs/2106.01345). It abstracts Reinforcement Learning as a conditional-sequence modeling problem.
The main idea is that instead of training a policy using RL methods, such as fitting a value function, that will tell us what action to take to maximize the return (cumulative reward), **we use a sequence modeling algorithm (Transformer) that, given a desired return, past states, and actions, will generate future actions to achieve this desired return**.
It's an autoregressive model conditioned on the desired return, past states, and actions to generate future actions that achieve the desired return.
This is a complete shift in the Reinforcement Learning paradigm since we use generative trajectory modeling (modeling the joint distribution of the sequence of states, actions, and rewards) to replace conventional RL algorithms. It means that in Decision Transformers, we dont maximize the return but rather generate a series of future actions that achieve the desired return.
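The generation loop can be sketched as follows; `ToyModel` and `ToyEnv` are hypothetical stand-ins for a trained Decision Transformer and a real environment. The key detail is that after each step the observed reward is subtracted from the remaining desired return (the "return-to-go") before predicting the next action:

```python
class ToyModel:
    """Stand-in for a trained Decision Transformer."""
    def predict(self, returns_to_go, states, actions):
        # A real model attends over the whole (return, state, action) history;
        # this stub simply acts while some desired return remains.
        return 1 if returns_to_go[-1] > 0 else 0

class ToyEnv:
    """Stand-in environment: reward 1 for action 1, episode length 5."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        return float(self.t), float(action), self.t >= 5

def generate_episode(model, env, target_return, max_steps=100):
    returns_to_go = [target_return]
    states = [env.reset()]
    actions = []
    for _ in range(max_steps):
        action = model.predict(returns_to_go, states, actions)
        state, reward, done = env.step(action)
        actions.append(action)
        states.append(state)
        # Condition the next prediction on the *remaining* desired return
        returns_to_go.append(returns_to_go[-1] - reward)
        if done:
            break
    return actions

actions = generate_episode(ToyModel(), ToyEnv(), target_return=3.0)
```

With this toy setup the agent acts until the desired return of 3 is achieved, then stops acting for the rest of the episode.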
The 🤗 Transformers team integrated the Decision Transformer, an Offline Reinforcement Learning method, into the library as well as the Hugging Face Hub.
## Learn about Decision Transformers
To learn more about Decision Transformers, you should read the blogpost we wrote about it [Introducing Decision Transformers on Hugging Face](https://huggingface.co/blog/decision-transformers)
## Train your first Decision Transformers
Now that you understand how Decision Transformers work, thanks to [Introducing Decision Transformers on Hugging Face](https://huggingface.co/blog/decision-transformers), you're ready to learn to train your first Offline Decision Transformer model from scratch to make a half-cheetah run.
Start the tutorial here 👉 https://huggingface.co/blog/train-decision-transformers
## Further reading
For more information, we recommend you check out the following resources:
- [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345)
- [Online Decision Transformer](https://arxiv.org/abs/2202.05607)
## Author
This section was written by <a href="https://twitter.com/edwardbeeching">Edward Beeching</a>

View File

@@ -0,0 +1,49 @@
# Interesting Environments to try
We provide here a list of interesting environments you can try to train your agents on:
## MineRL
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/minerl.jpg" alt="MineRL"/>
MineRL is a Python library that provides a Gym interface for interacting with the video game Minecraft, accompanied by datasets of human gameplay.
Every year, there are challenges with this library. Check the [website](https://minerl.io/).
To start using this environment, check these resources:
- [What is MineRL?](https://www.youtube.com/watch?v=z6PTrGifupU)
- [First steps in MineRL](https://www.youtube.com/watch?v=8yIrWcyWGek)
- [MineRL documentation and tutorials](https://minerl.readthedocs.io/en/latest/)
## DonkeyCar Simulator
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/donkeycar.jpg" alt="Donkey Car"/>
Donkey is a Self Driving Car Platform for hobby remote control cars.
This simulator version is built on the Unity game platform. It uses their internal physics and graphics and connects to a donkey Python process to use our trained model to control the simulated Donkey (car).
To start using this environment, check these resources:
- [DonkeyCar Simulator documentation](https://docs.donkeycar.com/guide/deep_learning/simulator/)
- [Learn to Drive Smoothly (Antonin Raffin's tutorial) Part 1](https://www.youtube.com/watch?v=ngK33h00iBE)
- [Learn to Drive Smoothly (Antonin Raffin's tutorial) Part 2](https://www.youtube.com/watch?v=DUqssFvcSOY)
- [Learn to Drive Smoothly (Antonin Raffin's tutorial) Part 3](https://www.youtube.com/watch?v=v8j2bpcE4Rg)
- Pretrained agents:
- https://huggingface.co/araffin/tqc-donkey-mountain-track-v0
- https://huggingface.co/araffin/tqc-donkey-avc-sparkfun-v0
- https://huggingface.co/araffin/tqc-donkey-minimonaco-track-v0
## Starcraft II
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/alphastar.jpg" alt="Alphastar"/>
Starcraft II is a famous *real-time strategy game*. DeepMind has used this game for their Deep Reinforcement Learning research with [Alphastar](https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii).
To start using this environment, check these resources:
- [Starcraft gym](http://starcraftgym.com/)
- [A. I. Learns to Play Starcraft 2 (Reinforcement Learning) tutorial](https://www.youtube.com/watch?v=q59wap1ELQ4)
## Author
This section was written by <a href="https://twitter.com/ThomasSimonini"> Thomas Simonini</a>

View File

@@ -0,0 +1,208 @@
# Godot RL Agents
[Godot RL Agents](https://github.com/edbeeching/godot_rl_agents) is an Open Source package that gives video game creators, AI researchers and hobbyists the opportunity **to learn complex behaviors for their Non Player Characters or agents**.
The library provides:
- An interface between games created in the [Godot Engine](https://godotengine.org/) and Machine Learning algorithms running in Python
- Wrappers for four well known rl frameworks: [StableBaselines3](https://stable-baselines3.readthedocs.io/en/master/), [CleanRL](https://docs.cleanrl.dev/), [Sample Factory](https://www.samplefactory.dev/) and [Ray RLLib](https://docs.ray.io/en/latest/rllib-algorithms.html)
- Support for memory-based agents with LSTM or attention based interfaces
- Support for *2D and 3D games*
- A suite of *AI sensors* to augment your agent's capacity to observe the game world
Godot and Godot RL Agents are **completely free and open source under a very permissive MIT license**. No strings attached, no royalties, nothing.
You can find out more about Godot RL agents on their [GitHub page](https://github.com/edbeeching/godot_rl_agents) or their AAAI-2022 Workshop [paper](https://arxiv.org/abs/2112.03636). The library's creator, [Ed Beeching](https://edbeeching.github.io/), is a Research Scientist here at Hugging Face.
## Create a custom RL environment with Godot RL Agents
In this section, you will **learn how to create a custom environment in the Godot Game Engine** and then implement an AI controller that learns to play with Deep Reinforcement Learning.
The example game we create today is simple, **but shows off many of the features of the Godot Engine and the Godot RL Agents library**. You can then dive into the examples for more complex environments and behaviors.
The environment we will be building today is called Ring Pong: the game of Pong, but the pitch is a ring and the paddle moves around the ring. The **objective is to keep the ball bouncing inside the ring**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ringpong.gif" alt="Ring Pong">
### Installing the Godot Game Engine
The [Godot game engine](https://godotengine.org/) is an open source tool for the **creation of video games, tools and user interfaces**.
Godot Engine is a feature-packed, cross-platform game engine designed to create 2D and 3D games from a unified interface. It provides a comprehensive set of common tools, so users **can focus on making games without having to reinvent the wheel**. Games can be exported in one click to a number of platforms, including the major desktop platforms (Linux, macOS, Windows) as well as mobile (Android, iOS) and web-based (HTML5) platforms.
While we will guide you through the steps to implement your agent, you may wish to learn more about the Godot Game Engine. Their [documentation](https://docs.godotengine.org/en/latest/index.html) is thorough, and there are many tutorials on YouTube; we would also recommend [GDQuest](https://www.gdquest.com/), [KidsCanCode](https://kidscancode.org/godot_recipes/4.x/) and [Bramwell](https://www.youtube.com/channel/UCczi7Aq_dTKrQPF5ZV5J3gg) as sources of information.
In order to create games in Godot, **you must first download the editor**. The latest version of Godot RL Agents was updated to use the Godot 4 beta, as we expect Godot 4 to be released in the next few months.
At the time of writing the latest beta version was beta 14 which can be downloaded at the following links:
- [Windows](https://downloads.tuxfamily.org/godotengine/4.0/beta14/Godot_v4.0-beta14_win64.exe.zip)
- [Mac](https://downloads.tuxfamily.org/godotengine/4.0/beta14/Godot_v4.0-beta14_macos.universal.zip)
- [Linux](https://downloads.tuxfamily.org/godotengine/4.0/beta14/Godot_v4.0-beta14_linux.x86_64.zip)
### Loading the starter project
We provide two versions of the codebase:
- [A starter project, to download and follow along for this tutorial](https://drive.google.com/file/d/1C7xd3TibJHlxFEJPBgBLpksgxrFZ3D8e/view?usp=share_link)
- [A final version of the project, for comparison and debugging.](https://drive.google.com/file/d/1k-b2Bu7uIA6poApbouX4c3sq98xqogpZ/view?usp=share_link)
To load the project, in the Godot Project Manager click **Import**, navigate to where the files are located and load the **project.godot** file.
If you press F5 or play in the editor, you should be able to play the game in human mode. Several instances of the game are running; this is because we want to speed up training our AI agent with many parallel environments.
### Installing the Godot RL Agents plugin
The Godot RL Agents plugin can be installed from the GitHub repo or with the Godot Asset Lib in the editor.
First click on the AssetLib and search for “rl”
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/godot1.png" alt="Godot">
Then click on Godot RL Agents, click Download and deselect the LICENSE and README.md files. Then click Install.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/godot2.png" alt="Godot">
The Godot RL Agents plugin is now downloaded to your machine. Now click on Project → Project Settings and enable the addon:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/godot3.png" alt="Godot">
### Adding the AI controller
We now want to add an AI controller to our game. Open the player.tscn scene, on the left you should see a hierarchy of nodes that looks like this:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/godot4.png" alt="Godot">
Right click the **Player** node and click **Add Child Node**. Many nodes are listed here; search for AIController3D and create it.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/godot5.png" alt="Godot">
The AI Controller Node should have been added to the scene tree; next to it is a scroll icon. Click on it to open the script that is attached to the AIController. The Godot game engine uses a scripting language called GDScript, which is syntactically similar to Python. The script contains methods that need to be implemented in order to get our AI controller working.
```python
#-- Methods that need implementing using the "extend script" option in Godot --#
func get_obs() -> Dictionary:
    assert(false, "the get_obs method is not implemented when extending from ai_controller")
    return {"obs":[]}

func get_reward() -> float:
    assert(false, "the get_reward method is not implemented when extending from ai_controller")
    return 0.0

func get_action_space() -> Dictionary:
    assert(false, "the get_action_space method is not implemented when extending from ai_controller")
    return {
        "example_actions_continous" : {
            "size": 2,
            "action_type": "continuous"
        },
        "example_actions_discrete" : {
            "size": 2,
            "action_type": "discrete"
        },
    }

func set_action(action) -> void:
    assert(false, "the set_action method is not implemented when extending from ai_controller")
# -----------------------------------------------------------------------------#
```
In order to implement these methods, we will need to create a class that inherits from AIController3D. This is easy to do in Godot, and is called “extending” a class.
Right click the AIController3D Node and click “Extend Script” and call the new script `controller.gd`. You should now have an almost empty script file that looks like this:
```python
extends AIController3D

# Called when the node enters the scene tree for the first time.
func _ready():
    pass # Replace with function body.

# Called every frame. 'delta' is the elapsed time since the previous frame.
func _process(delta):
    pass
```
We will now implement the 4 missing methods. Delete this code and replace it with the following:
```python
extends AIController3D

# Stores the action sampled for the agent's policy, running in python
var move_action : float = 0.0

func get_obs() -> Dictionary:
    # get the ball's position and velocity in the paddle's frame of reference
    var ball_pos = to_local(_player.ball.global_position)
    var ball_vel = to_local(_player.ball.linear_velocity)
    var obs = [ball_pos.x, ball_pos.z, ball_vel.x/10.0, ball_vel.z/10.0]
    return {"obs":obs}

func get_reward() -> float:
    return reward

func get_action_space() -> Dictionary:
    return {
        "move_action" : {
            "size": 1,
            "action_type": "continuous"
        },
    }

func set_action(action) -> void:
    move_action = clamp(action["move_action"][0], -1.0, 1.0)
```
We have now defined the agent's observation, which is the position and velocity of the ball in its local coordinate space. We have also defined the action space of the agent: a single continuous value ranging from -1 to +1.
The next step is to update the Player's script to use the actions from the AIController. Edit the Player's script by clicking on the scroll icon next to the player node and update the code in `Player.gd` to the following:
```python
extends Node3D

@export var rotation_speed = 3.0
@onready var ball = get_node("../Ball")
@onready var ai_controller = $AIController3D

func _ready():
    ai_controller.init(self)

func game_over():
    ai_controller.done = true
    ai_controller.needs_reset = true

func _physics_process(delta):
    if ai_controller.needs_reset:
        ai_controller.reset()
        ball.reset()
        return

    var movement : float
    if ai_controller.heuristic == "human":
        movement = Input.get_axis("rotate_anticlockwise", "rotate_clockwise")
    else:
        movement = ai_controller.move_action
    rotate_y(movement*delta*rotation_speed)

func _on_area_3d_body_entered(body):
    ai_controller.reward += 1.0
```
We now need to synchronize the game running in Godot with the neural network being trained in Python. Godot RL Agents provides a node that does just that. Open the train.tscn scene, right click on the root node, and click “Add child node”. Then search for “sync” and add a Godot RL Agents Sync node. This node handles the communication between Python and Godot over TCP.
You can run training live in the editor, but first launch the Python training process with `python examples/clean_rl_example.py --env-id=debug`
In this simple example, a reasonable policy is learned in several minutes. You may wish to speed up training: click on the Sync node in the train scene and you will see there is a “Speed Up” property exposed in the editor:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/godot6.png" alt="Godot">
Try setting this property up to 8 to speed up training. This can be a great benefit on more complex environments, like the multi-player FPS we will learn about in the next chapter.
### There's more!
We have only scratched the surface of what can be achieved with Godot RL Agents: the library includes custom sensors and cameras to enrich the information available to the agent. Take a look at the [examples](https://github.com/edbeeching/godot_rl_agents_examples) to find out more!
## Author
This section was written by <a href="https://twitter.com/edwardbeeching">Edward Beeching</a>

View File

@@ -0,0 +1,11 @@
# Introduction
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/thumbnail.png" alt="Unit bonus 3 thumbnail"/>
Congratulations on finishing this course! **You now have a solid background in Deep Reinforcement Learning**.
But this course was just the beginning of your Deep Reinforcement Learning journey; there are so many subsections to discover. In this optional unit, we **give you resources to explore multiple concepts and research topics in Reinforcement Learning**.
Unlike the other units, this unit is a collective work of multiple people from Hugging Face. We mention the author of each section.
Sound fun? Let's get started 🔥

View File

@@ -0,0 +1,45 @@
# Language models in RL
## LMs encode useful knowledge for agents
**Language models** (LMs) can exhibit impressive abilities when manipulating text, such as question-answering or even step-by-step reasoning. Additionally, their training on massive text corpora allows them to **encode various kinds of knowledge, including abstract knowledge about the physical rules of our world** (for instance, what one can do with an object, or what happens when one rotates it…).
A natural question recently studied is whether such knowledge could benefit agents such as robots when trying to solve everyday tasks. And while these works showed interesting results, the proposed agents lacked any learning method. **This limitation prevents these agents from adapting to the environment (e.g. fixing wrong knowledge) or learning new skills.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/language.png" alt="Language">
<figcaption>Source: <a href="https://ai.googleblog.com/2022/08/towards-helpful-robots-grounding.html">Towards Helpful Robots: Grounding Language in Robotic Affordances</a></figcaption>
</figure>
## LMs and RL
There is therefore a potential synergy between LMs, which can bring knowledge about the world, and RL, which can align and correct this knowledge by interacting with an environment. It is especially interesting from an RL point of view, as the field mostly relies on the **tabula rasa** setup, where everything is learned from scratch by the agent, leading to:
1) Sample inefficiency
2) Behaviors that look unexpected to humans
As a first attempt, the paper [“Grounding Large Language Models with Online Reinforcement Learning”](https://arxiv.org/abs/2302.02662v1) tackled the problem of **adapting or aligning an LM to a textual environment using PPO**. They showed that the knowledge encoded in the LM led to fast adaptation to the environment (opening an avenue for sample-efficient RL agents), and also that such knowledge allowed the LM to better generalize to new tasks once aligned.
<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/papier_v4.mp4" type="video/mp4" controls />
Another direction, studied in [“Guiding Pretraining in Reinforcement Learning with Large Language Models”](https://arxiv.org/abs/2302.06692), is to keep the LM frozen but leverage its knowledge to **guide an RL agent's exploration**. Such a method allows the RL agent to be guided towards human-meaningful and plausibly useful behaviors without requiring a human in the loop during training.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/language2.png" alt="Language">
<figcaption> Source: <a href="https://ai.googleblog.com/2022/08/towards-helpful-robots-grounding.html"> Towards Helpful Robots: Grounding Language in Robotic Affordances</a> </figcaption>
</figure>
Several limitations make these works still very preliminary, such as the need to convert the agent's observation to text before giving it to an LM, as well as the compute cost of interacting with very large LMs.
## Further reading
For more information, we recommend you check out the following resources:
- [Google Research, 2022 & beyond: Robotics](https://ai.googleblog.com/2023/02/google-research-2022-beyond-robotics.html)
- [Pre-Trained Language Models for Interactive Decision-Making](https://arxiv.org/abs/2202.01771)
- [Grounding Large Language Models with Online Reinforcement Learning](https://arxiv.org/abs/2302.02662v1)
- [Guiding Pretraining in Reinforcement Learning with Large Language Models](https://arxiv.org/abs/2302.06692)
## Author
This section was written by <a href="https://twitter.com/ClementRomac"> Clément Romac </a>

View File

@@ -0,0 +1,32 @@
# Model Based Reinforcement Learning (MBRL)
Model-based reinforcement learning only differs from its model-free counterpart in learning a *dynamics model*, but that has substantial downstream effects on how the decisions are made.
The dynamics models usually model the environment transition dynamics, \\( s_{t+1} = f_\theta (s_t, a_t) \\), but things like inverse dynamics models (mapping from states to actions) or reward models (predicting rewards) can be used in this framework.
## Simple definition
- There is an agent that repeatedly tries to solve a problem, **accumulating state and action data**.
- With that data, the agent creates a structured learning tool, *a dynamics model*, to reason about the world.
- With the dynamics model, the agent **decides how to act by predicting the future**.
- With those actions, **the agent collects more data, improves said model, and hopefully improves future actions**.
## Academic definition
Model-based reinforcement learning (MBRL) follows the framework of an agent interacting in an environment, **learning a model of said environment**, and then **leveraging the model for control (making decisions)**.
Specifically, the agent acts in a Markov Decision Process (MDP) governed by a transition function \\( s_{t+1} = f (s_t , a_t) \\) and returns a reward at each step \\( r(s_t, a_t) \\). With a collected dataset \\( D := \{ (s_i, a_i, s_{i+1}, r_i) \} \\), the agent learns a model, \\( s_{t+1} = f_\theta (s_t , a_t) \\), **to minimize the negative log-likelihood of the transitions**.
We employ sample-based model-predictive control (MPC) using the learned dynamics model, which optimizes the expected reward over a finite, recursively predicted horizon, \\( \tau \\), from a set of actions sampled from a uniform distribution \\( U(a) \\), (see [paper](https://arxiv.org/pdf/2002.04523) or [paper](https://arxiv.org/pdf/2012.09156.pdf) or [paper](https://arxiv.org/pdf/2009.01221.pdf)).
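Under toy assumptions (a 1-D system with unknown dynamics \\( s_{t+1} = s_t + 0.5 a_t \\), a linear model fit by least squares, and a reward equal to the negative distance to a goal state), the full MBRL loop of collecting data, fitting \\( f_\theta \\), and planning with sample-based MPC can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 1. Collect a dataset D = {(s_i, a_i, s_{i+1})} from a toy 1-D system.
def true_step(s, a):
    # The "unknown" environment dynamics the agent interacts with
    return s + 0.5 * a

S = rng.uniform(-1, 1, size=200)
A = rng.uniform(-1, 1, size=200)
S_next = true_step(S, A)

# --- 2. Fit a dynamics model s_{t+1} = f_theta(s_t, a_t).
# Under Gaussian noise with fixed variance, minimizing the negative
# log-likelihood reduces to least squares; f_theta is linear here.
X = np.stack([S, A], axis=1)
theta, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def model_step(s, a):
    return theta[0] * s + theta[1] * a

# --- 3. Sample-based MPC: sample action sequences from U(-1, 1), roll
# them out through the learned model over a finite horizon, and execute
# the first action of the best sequence (reward = -|s - goal|).
def mpc_action(s, goal, horizon=5, n_samples=256):
    candidates = rng.uniform(-1, 1, size=(n_samples, horizon))
    returns = np.zeros(n_samples)
    for i, seq in enumerate(candidates):
        sim_s = s
        for a in seq:
            sim_s = model_step(sim_s, a)
            returns[i] += -abs(sim_s - goal)
    return candidates[np.argmax(returns)][0]

s, goal = 0.0, 1.0
for _ in range(10):
    s = true_step(s, mpc_action(s, goal))
```

After a handful of MPC steps the state ends up near the goal, even though the planner only ever queried the learned model, not the true dynamics.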
## Further reading
For more information on MBRL, we recommend you check out the following resources:
- A [blog post on debugging MBRL](https://www.natolambert.com/writing/debugging-mbrl).
- A [recent review paper on MBRL](https://arxiv.org/abs/2006.16712).
## Author
This section was written by <a href="https://twitter.com/natolambert"> Nathan Lambert </a>

View File

@@ -0,0 +1,37 @@
# Offline vs. Online Reinforcement Learning
Deep Reinforcement Learning (RL) is a framework **to build decision-making agents**. These agents aim to learn optimal behavior (policy) by interacting with the environment through **trial and error and receiving rewards as unique feedback**.
The agent's goal **is to maximize its cumulative reward**, called the return. This is because RL is based on the *reward hypothesis*: all goals can be described as the **maximization of the expected cumulative reward**.
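As a quick refresher, the return of an episode is typically the discounted sum of its rewards, which can be computed as:

```python
def discounted_return(rewards, gamma=0.99):
    """Return G = sum_t gamma^t * r_t for one episode's reward sequence."""
    g = 0.0
    # Accumulate backwards so each reward is discounted the right number of times
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```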
Deep Reinforcement Learning agents **learn with batches of experience**. The question is: how do they collect it?
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/offlinevsonlinerl.gif" alt="Unit bonus 3 thumbnail">
<figcaption>A comparison between Reinforcement Learning in an Online and Offline setting, figure taken from <a href="https://offline-rl.github.io/">this post</a></figcaption>
</figure>
- In *online reinforcement learning*, which is what we've learned during this course, the agent **gathers data directly**: it collects a batch of experience by **interacting with the environment**. Then, it uses this experience immediately (or via some replay buffer) to learn from it (update its policy).
But this implies that you either **train your agent directly in the real world or have a simulator**. If you don't have one, you need to build it, which can be very complex (how do you reflect the complex reality of the real world in an environment?), expensive, and unsafe (the agent will exploit any flaws in the simulator if they provide a competitive advantage).
- On the other hand, in *offline reinforcement learning*, the agent only **uses data collected from other agents or human demonstrations**. It does **not interact with the environment**.
The process is as follows:
- **Create a dataset** using one or more policies and/or human interactions.
- Run **offline RL on this dataset** to learn a policy
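As a minimal sketch of this setting, the snippet below learns a policy purely from a fixed dataset of logged (state, action) pairs, without ever calling the environment. For simplicity it uses behavior cloning (supervised imitation of the logged actions) rather than a full offline RL algorithm such as CQL; the dataset and its behavior policy are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Logged dataset from some unknown behavior policy: a(s) = 1 if s > 0 else 0.
# The learner only ever sees these arrays; it never interacts with the env.
states = rng.uniform(-1, 1, size=(500, 1))
actions = (states[:, 0] > 0).astype(int)

# Fit a logistic-regression policy to the logged (state, action) pairs
w, b, lr = 0.0, 0.0, 0.5
for _ in range(200):
    logits = w * states[:, 0] + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    w -= lr * np.mean((probs - actions) * states[:, 0])
    b -= lr * np.mean(probs - actions)

def policy(s):
    # Greedy action of the cloned policy
    return int(w * s + b > 0)
```

Note that this cloned policy can only be as good as the data: for states (or actions) absent from the dataset, its behavior is unconstrained, which is exactly the counterfactual issue offline RL methods try to address.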
This method has one drawback: the *counterfactual queries problem*. What do we do if our agent **decides to do something for which we don't have the data**? For instance, turning right at an intersection when we don't have this trajectory in the dataset.
Some solutions to this problem exist, but if you want to know more about offline reinforcement learning, you can [watch this video](https://www.youtube.com/watch?v=k08N5a0gG0A).
## Further reading
For more information, we recommend you check out the following resources:
- [Offline Reinforcement Learning, Talk by Sergey Levine](https://www.youtube.com/watch?v=qgZPZREor5I)
- [Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems](https://arxiv.org/abs/2005.01643)
## Author
This section was written by <a href="https://twitter.com/ThomasSimonini"> Thomas Simonini</a>

View File

@@ -0,0 +1,56 @@
# Brief introduction to RL documentation
In this advanced topic, we address the question: **how should we monitor and keep track of powerful reinforcement learning agents that we are training in the real world and
interfacing with humans?**
As machine learning systems have increasingly impacted modern life, **the call for documentation of these systems has grown**.
Such documentation can cover aspects such as the training data used — where it is stored, when it was collected, who was involved, etc.
— or the model optimization framework — the architecture, evaluation metrics, relevant papers, etc. — and more.
Today, model cards and datasheets are becoming increasingly available. For example, on the Hub
(see documentation [here](https://huggingface.co/docs/hub/model-cards)).
If you click on a [popular model on the Hub](https://huggingface.co/models), you can learn about its creation process.
These model- and data-specific logs are designed to be completed when the model or dataset is created, and tend to go un-updated when these models are built into evolving systems later on.
## Motivating Reward Reports
Reinforcement learning systems are fundamentally designed to optimize based on measurements of reward and time.
While the notion of a reward function can be mapped nicely to many well-understood fields of supervised learning (via a loss function),
understanding how machine learning systems evolve over time is limited.
To that end, the authors introduce [*Reward Reports for Reinforcement Learning*](https://www.notion.so/Brief-introduction-to-RL-documentation-b8cbda5a6f5242338e0756e6bef72af4) (the pithy naming is designed to mirror the popular papers *Model Cards for Model Reporting* and *Datasheets for Datasets*).
The goal is to propose a type of documentation focused on the **human factors of reward** and **time-varying feedback systems**.
Building on the documentation frameworks for [model cards](https://arxiv.org/abs/1810.03993) and [datasheets](https://arxiv.org/abs/1803.09010) proposed by Mitchell et al. and Gebru et al., the authors argue the need for Reward Reports for AI systems.
**Reward Reports** are living documents for proposed RL deployments that demarcate design choices.
However, many questions remain about the applicability of this framework to different RL applications, roadblocks to system interpretability,
and the resonances between deployed supervised machine learning systems and the sequential decision-making utilized in RL.
At a minimum, Reward Reports are an opportunity for RL practitioners to deliberate on these questions and begin the work of deciding how to resolve them in practice.
## Capturing temporal behavior with documentation
The core piece specific to documentation designed for RL and feedback-driven ML systems is a *change-log*. The change-log records updates from the designer (changed training parameters, data, etc.) along with changes noticed by the user (harmful behavior, unexpected responses, etc.).
The change log is accompanied by update triggers that encourage monitoring these effects.
## Contributing
Some of the most impactful RL-driven systems are multi-stakeholder in nature and behind closed doors of private corporations.
These corporations are largely without regulation, so the burden of documentation falls on the public.
If you are interested in contributing, we are building Reward Reports for popular machine learning systems on a public
record on [GitHub](https://github.com/RewardReports/reward-reports).
For further reading, you can visit the Reward Reports [paper](https://arxiv.org/abs/2204.10817)
or look at [an example report](https://github.com/RewardReports/reward-reports/tree/main/examples).
## Author
This section was written by <a href="https://twitter.com/natolambert"> Nathan Lambert </a>

View File

@@ -0,0 +1,50 @@
# RLHF
Reinforcement learning from human feedback (RLHF) is a **methodology for integrating human data labels into a RL-based optimization process**.
It is motivated by the **challenge of modeling human preferences**.
For many questions, even if you could try to write down an equation for one ideal, humans differ in their preferences.
Updating models **based on measured data is an avenue to try and alleviate these inherently human ML problems**.
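A core ingredient of RLHF is a reward model trained on pairwise human preference labels. Below is a minimal sketch of the pairwise (Bradley-Terry style) loss used for this; `reward_chosen` and `reward_rejected` stand in for the scalar scores a real reward model would produce from text:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).

    It is small when the reward model already scores the human-preferred
    completion higher than the rejected one, and large otherwise."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the margin between chosen and rejected grows:
losses = [preference_loss(m, 0.0) for m in (0.0, 0.5, 2.0)]
```

Minimizing this loss over a dataset of human comparisons yields a scalar reward signal that an RL algorithm (typically PPO) can then optimize against.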
## Start Learning about RLHF
To start learning about RLHF:
1. Read this introduction: [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf).
2. Watch the recorded live session we did some weeks ago, where Nathan covered the basics of Reinforcement Learning from Human Feedback (RLHF) and how this technology is being used to enable state-of-the-art ML tools like ChatGPT.
Most of the talk is an overview of the interconnected ML models. It covers the basics of Natural Language Processing and RL and how RLHF is used on large language models. We then conclude with open questions in RLHF.
<Youtube id="2MBJOuVq380" />
3. Read other blogs on this topic, such as [Closed-API vs Open-source continues: RLHF, ChatGPT, data moats](https://robotic.substack.com/p/rlhf-chatgpt-data-moats). Let us know if there are more you like!
## Additional readings
*Note, this is copied from the Illustrating RLHF blog post above*.
Here is a list of the most prevalent papers on RLHF to date. The field was recently popularized with the emergence of Deep RL (around 2017) and has grown into a broader study of the applications of LLMs at many large technology companies.
Here are some papers on RLHF that pre-date the LM focus:
- [TAMER: Training an Agent Manually via Evaluative Reinforcement](https://www.cs.utexas.edu/~pstone/Papers/bib2html-links/ICDL08-knox.pdf) (Knox and Stone 2008): Proposed a learned agent where humans provided scores on the actions taken iteratively to learn a reward model.
- [Interactive Learning from Policy-Dependent Human Feedback](http://proceedings.mlr.press/v70/macglashan17a/macglashan17a.pdf) (MacGlashan et al. 2017): Proposed an actor-critic algorithm, COACH, where human feedback (both positive and negative) is used to tune the advantage function.
- [Deep Reinforcement Learning from Human Preferences](https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html) (Christiano et al. 2017): RLHF applied on preferences between Atari trajectories.
- [Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces](https://ojs.aaai.org/index.php/AAAI/article/view/11485) (Warnell et al. 2018): Extends the TAMER framework where a deep neural network is used to model the reward prediction.
And here is a snapshot of the growing set of papers that show RLHF's performance for LMs:
- [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/abs/1909.08593) (Ziegler et al. 2019): An early paper that studies the impact of reward learning on four specific tasks.
- [Learning to summarize with human feedback](https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html) (Stiennon et al. 2020): RLHF applied to the task of summarizing text. See also [Recursively Summarizing Books with Human Feedback](https://arxiv.org/abs/2109.10862) (OpenAI Alignment Team 2021), follow-on work on summarizing books.
- [WebGPT: Browser-assisted question-answering with human feedback](https://arxiv.org/abs/2112.09332) (OpenAI, 2021): Using RLHF to train an agent to navigate the web.
- InstructGPT: [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155) (OpenAI Alignment Team 2022): RLHF applied to a general language model [[Blog post](https://openai.com/blog/instruction-following/) on InstructGPT].
- GopherCite: [Teaching language models to support answers with verified quotes](https://www.deepmind.com/publications/gophercite-teaching-language-models-to-support-answers-with-verified-quotes) (Menick et al. 2022): Trains an LM with RLHF to return answers with specific citations.
- Sparrow: [Improving alignment of dialogue agents via targeted human judgements](https://arxiv.org/abs/2209.14375) (Glaese et al. 2022): Fine-tunes a dialogue agent with RLHF.
- [ChatGPT: Optimizing Language Models for Dialogue](https://openai.com/blog/chatgpt/) (OpenAI 2022): Trains an LM with RLHF for suitable use as an all-purpose chatbot.
- [Scaling Laws for Reward Model Overoptimization](https://arxiv.org/abs/2210.10760) (Gao et al. 2022): Studies the scaling properties of the learned preference model in RLHF.
- [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862) (Anthropic, 2022): Detailed documentation of training an LM assistant with RLHF.
- [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://arxiv.org/abs/2209.07858) (Ganguli et al. 2022): Detailed documentation of efforts to “discover, measure, and attempt to reduce [language models] potentially harmful outputs.”
- [Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning](https://arxiv.org/abs/2208.02294) (Cohen et al. 2022): Using RL to enhance the conversational skills of an open-ended dialogue agent.
- [Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization](https://arxiv.org/abs/2210.01241) (Ramamurthy and Ammanabrolu et al. 2022): Discusses the design space of open-source tools in RLHF and proposes a new algorithm NLPO (Natural Language Policy Optimization) as an alternative to PPO.
## Author
This section was written by <a href="https://twitter.com/natolambert">Nathan Lambert</a>.