From 416ec655d0e9907e0d0caa6259ec2a4050002c73 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Wed, 10 May 2023 08:41:27 +0200 Subject: [PATCH 1/7] Update (gymnasium) --- notebooks/unit6/requirements-unit6.txt | 5 +- units/en/unit6/hands-on.mdx | 441 +------------------------ units/en/unit6/introduction.mdx | 3 +- 3 files changed, 6 insertions(+), 443 deletions(-) diff --git a/notebooks/unit6/requirements-unit6.txt b/notebooks/unit6/requirements-unit6.txt index 1c8ffaa..b92d196 100644 --- a/notebooks/unit6/requirements-unit6.txt +++ b/notebooks/unit6/requirements-unit6.txt @@ -1,4 +1,3 @@ -stable-baselines3[extra] +stable-baselines3==2.0.0a4 huggingface_sb3 -panda_gym==2.0.0 -pyglet==1.5.1 +panda-gym \ No newline at end of file diff --git a/units/en/unit6/hands-on.mdx b/units/en/unit6/hands-on.mdx index 9d34e59..4938ca1 100644 --- a/units/en/unit6/hands-on.mdx +++ b/units/en/unit6/hands-on.mdx @@ -8,14 +8,10 @@ askForHelpUrl="http://hf.co/join/discord" /> -Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. And train two robots: - -- A spider ๐Ÿ•ท๏ธ to learn to move. +Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in a robotic environment. And train a: - A robotic arm ๐Ÿฆพ to move to the correct position. -We're going to use two Robotics environments: - -- [PyBullet](https://github.com/bulletphysics/bullet3) +We're going to use - [panda-gym](https://github.com/qgallouedec/panda-gym) Environments @@ -23,444 +19,13 @@ We're going to use two Robotics environments: To validate this hands-on for the certification process, you need to push your two trained models to the Hub and get the following results: -- `AntBulletEnv-v0` get a result of >= 650. -- `PandaReachDense-v2` get a result of >= -3.5. +- `PandaReachDense-v3` get a result of >= -3.5. 
To find your result, [go to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward** -**If you don't find your model, go to the bottom of the page and click on the refresh button.** - For more information about the certification process, check this section ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process **To start the hands-on click on Open In Colab button** ๐Ÿ‘‡ : [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit6/unit6.ipynb) - -# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym ๐Ÿค– - -### ๐ŸŽฎ Environments: - -- [PyBullet](https://github.com/bulletphysics/bullet3) -- [Panda-Gym](https://github.com/qgallouedec/panda-gym) - -### ๐Ÿ“š RL-Library: - -- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/) - -We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues). - -## Objectives of this notebook ๐Ÿ† - -At the end of the notebook, you will: - -- Be able to use the environment librairies **PyBullet** and **Panda-Gym**. -- Be able to **train robots using A2C**. -- Understand why **we need to normalize the input**. -- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score ๐Ÿ”ฅ. - -## Prerequisites ๐Ÿ—๏ธ -Before diving into the notebook, you need to: - -๐Ÿ”ฒ ๐Ÿ“š Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) ๐Ÿค— - -# Let's train our first robots ๐Ÿค– - -## Set the GPU ๐Ÿ’ช - -- To **accelerate the agent's training, we'll use a GPU**. 
To do that, go to `Runtime > Change Runtime type`

GPU Step 1

- `Hardware Accelerator > GPU`

GPU Step 2

## Create a virtual display 🔽

During the notebook, we'll need to generate a replay video. To do so with Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).

The following cell will install the libraries and create and run a virtual screen 🖥

```python
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay
```

```python
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```

### Install dependencies 🔽
The first step is to install the dependencies. We'll install several:

- `pybullet`: Contains the walking-robot environments.
- `panda-gym`: Contains the robotic arm environments.
- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-Baselines3 to load and upload models from the Hugging Face 🤗 Hub.
- `huggingface_hub`: A library that allows anyone to work with the Hub repositories.
-

```bash
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt
```

## Import the packages 📦

```python
import pybullet_envs
import panda_gym
import gym

import os

from huggingface_sb3 import load_from_hub, package_to_hub

from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.env_util import make_vec_env

from huggingface_hub import notebook_login
```

## Environment 1: AntBulletEnv-v0 🕸

### Create the AntBulletEnv-v0
#### The environment 🎮

In this environment, the agent needs to use its different joints correctly in order to walk.
You can find a detailed explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet

```python
env_id = "AntBulletEnv-v0"
# Create the env
env = gym.make(env_id)

# Get the state space and action space
s_size = env.observation_space.shape[0]
a_size = env.action_space
```

```python
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample())  # Get a random observation
```

The observation space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
The difference is that our observation space is 28, not 29.

PyBullet Ant Obs space


```python
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample())  # Take a random action
```

The action space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):

PyBullet Ant Obs space


### Normalize observation and rewards

A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).
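Concretely, input normalization keeps running statistics (mean and variance) of each observation dimension and rescales incoming observations with them. Here is a toy, pure-Python sketch of that idea only; SB3's actual wrapper additionally clips values, can normalize rewards, and must be saved with the model. All names here (`RunningNormalizer`, etc.) are illustrative, not part of any library.

```python
import math

class RunningNormalizer:
    """Toy per-dimension running mean/variance normalizer (Welford's algorithm).
    Illustrates the idea behind observation normalization; not a library API."""

    def __init__(self, n_dims, clip=10.0, eps=1e-8):
        self.n = 0
        self.mean = [0.0] * n_dims
        self.m2 = [0.0] * n_dims  # running sum of squared deviations
        self.clip = clip
        self.eps = eps

    def update(self, obs):
        # Update running statistics with one observation
        self.n += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        # Standardize each dimension, then clip to keep outliers bounded
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.n if self.n > 1 else 1.0
            z = (x - self.mean[i]) / math.sqrt(var + self.eps)
            out.append(max(-self.clip, min(self.clip, z)))
        return out

# Two dimensions on very different scales end up comparable after normalization
norm = RunningNormalizer(n_dims=2)
for obs in [[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]]:
    norm.update(obs)
print(norm.normalize([2.0, 300.0]))  # the running mean -> [0.0, 0.0]
```

This is why a joint angle in radians and a velocity in m/s can share one network input layer: after normalization both live on roughly the same scale.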
- -For that purpose, there is a wrapper that will compute a running average and standard deviation of input features. - -We also normalize rewards with this same wrapper by adding `norm_reward = True` - -[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize) - -```python -env = make_vec_env(env_id, n_envs=4) - -# Adding this wrapper to normalize the observation and the reward -env = # TODO: Add the wrapper -``` - -#### Solution - -```python -env = make_vec_env(env_id, n_envs=4) - -env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0) -``` - -### Create the A2C Model ๐Ÿค– - -In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy. - -For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes - -To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3). - -```python -model = # Create the A2C model and try to find the best parameters -``` - -#### Solution - -```python -model = A2C( - policy="MlpPolicy", - env=env, - gae_lambda=0.9, - gamma=0.99, - learning_rate=0.00096, - max_grad_norm=0.5, - n_steps=8, - vf_coef=0.4, - ent_coef=0.0, - policy_kwargs=dict(log_std_init=-2, ortho_init=False), - normalize_advantage=False, - use_rms_prop=True, - use_sde=True, - verbose=1, -) -``` - -### Train the A2C agent ๐Ÿƒ - -- Let's train our agent for 2,000,000 timesteps. Don't forget to use GPU on Colab. It will take approximately ~25-40min - -```python -model.learn(2_000_000) -``` - -```python -# Save the model and VecNormalize statistics when saving the agent -model.save("a2c-AntBulletEnv-v0") -env.save("vec_normalize.pkl") -``` - -### Evaluate the agent ๐Ÿ“ˆ -- Now that our agent is trained, we need to **check its performance**. 
-
- Stable-Baselines3 provides a method to do that: `evaluate_policy`
- In my case, I got a mean reward of `2371.90 +/- 16.50`

```python
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("AntBulletEnv-v0")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

# do not update them at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load("a2c-AntBulletEnv-v0")

# Evaluate on the wrapped eval_env so the saved normalization statistics are used
mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```

### Publish your trained model on the Hub 🔥
Now that we've seen good results after training, we can publish our trained model on the Hub with one line of code.

📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20

Here's an example of a Model Card (with a PyBullet environment):

Model Card Pybullet

By using `package_to_hub`, as we already mentioned in previous units, **you evaluate, record a replay, generate a model card for your agent, and push it to the Hub**.

This way:
- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share an agent with the community that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard


To be able to share your model with the community, there are three more steps to follow:

1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join

2️⃣ Sign in, and then get your authentication token from the Hugging Face website.
-
- Create a new token (https://huggingface.co/settings/tokens) **with write role**

Create HF Token

- Copy the token
- Run the cell below and paste the token

```python
notebook_login()
!git config --global credential.helper store
```

If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`

3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `package_to_hub()` function

```python
package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id=f"ThomasSimonini/a2c-{env_id}", # Change the username
    commit_message="Initial commit",
)
```

## Take a coffee break ☕
- You've already trained your first robot that learned to move. Congratulations 🥳!
- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.


## Environment 2: PandaReachDense-v2 🦾

The agent we're going to train is a robotic arm that we control (by moving the arm and using the end-effector).

In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.

In `PandaReach`, the robot must place its end-effector at a target position (green ball).

We're going to use the dense version of this environment. This means we'll get a *dense reward function* that **provides a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This is in contrast to a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.

Also, we're going to use *end-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
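The dense/sparse distinction can be made concrete: in a reach task, the dense reward is essentially the negative Euclidean distance from the end-effector to the goal, while a sparse reward only signals success. The sketch below is a simplified illustration, not panda-gym's exact implementation, and the `threshold` value is an assumption for the example.

```python
import math

def distance(a, b):
    # Euclidean distance between two 3D points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dense_reward(ee_pos, goal):
    # Grows toward 0 as the end-effector approaches the goal,
    # so every timestep carries a learning signal
    return -distance(ee_pos, goal)

def sparse_reward(ee_pos, goal, threshold=0.05):  # threshold: illustrative value
    # No gradient of progress: -1 until the task is solved
    return 0.0 if distance(ee_pos, goal) < threshold else -1.0

goal = (0.1, 0.2, 0.3)
# As the end-effector gets closer, the dense reward improves at every step,
# while the sparse reward only changes once the goal is reached
for ee in [(0.5, 0.5, 0.5), (0.2, 0.3, 0.3), (0.11, 0.2, 0.3)]:
    print(round(dense_reward(ee, goal), 3), sparse_reward(ee, goal))
```

With the dense variant, even a random initial policy receives informative feedback, which is why `PandaReachDense` trains quickly with A2C.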
- -Robotics - - -This way **the training will be easier**. - - - -In `PandaReachDense-v2`, the robotic arm must place its end-effector at a target position (green ball). - - - -```python -import gym - -env_id = "PandaReachDense-v2" - -# Create the env -env = gym.make(env_id) - -# Get the state space and action space -s_size = env.observation_space.shape -a_size = env.action_space -``` - -```python -print("_____OBSERVATION SPACE_____ \n") -print("The State Space is: ", s_size) -print("Sample observation", env.observation_space.sample()) # Get a random observation -``` - -The observation space **is a dictionary with 3 different elements**: -- `achieved_goal`: (x,y,z) position of the goal. -- `desired_goal`: (x,y,z) distance between the goal position and the current object position. -- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz). - -Given it's a dictionary as observation, **we will need to use a MultiInputPolicy policy instead of MlpPolicy**. - -```python -print("\n _____ACTION SPACE_____ \n") -print("The Action Space is: ", a_size) -print("Action Space Sample", env.action_space.sample()) # Take a random action -``` - -The action space is a vector with 3 values: -- Control x, y, z movement - -Now it's your turn: - -1. Define the environment called "PandaReachDense-v2". -2. Make a vectorized environment. -3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize) -4. Create the A2C Model (don't forget verbose=1 to print the training logs). -5. Train it for 1M Timesteps. -6. Save the model and VecNormalize statistics when saving the agent. -7. Evaluate your agent. -8. Publish your trained model on the Hub ๐Ÿ”ฅ with `package_to_hub`. 
-

### Solution (fill the todo)

```python
# 1 - 2
env_id = "PandaReachDense-v2"
env = make_vec_env(env_id, n_envs=4)

# 3
env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)

# 4
model = A2C(policy="MultiInputPolicy", env=env, verbose=1)
# 5
model.learn(1_000_000)
```

```python
# 6
model_name = "a2c-PandaReachDense-v2"
model.save(model_name)
env.save("vec_normalize.pkl")

# 7
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v2")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

# do not update them at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load(model_name)

# Evaluate on the wrapped eval_env so the saved normalization statistics are used
mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

# 8
package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id=f"ThomasSimonini/a2c-{env_id}", # TODO: Change the username
    commit_message="Initial commit",
)
```

## Some additional challenges 🏆

The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?

If you want to try more advanced tasks with panda-gym, check what was done using **TQC or SAC** (more sample-efficient algorithms suited for robotics tasks). In real robotics you'll use a more sample-efficient algorithm for a simple reason: unlike in a simulation, **if you move your robotic arm too much, you risk breaking it**.
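The sample-efficiency point can be made concrete: off-policy methods like SAC and TQC store transitions in a replay buffer and reuse each one across many gradient updates, whereas on-policy A2C discards its rollouts after a single update. A toy sketch of that idea (the class and all sizes here are illustrative, not a library API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer: off-policy algorithms sample old transitions
    again and again, so each environment step -- expensive or risky on a
    real robot -- can feed many gradient updates."""

    def __init__(self, capacity=10_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniformly sample a minibatch of stored transitions
        return random.sample(list(self.storage), batch_size)

random.seed(0)
buffer = ReplayBuffer()
for step in range(100):                   # only 100 environment interactions
    buffer.add((f"s{step}", "a", -1.0, f"s{step + 1}"))

updates = 500                             # ...but many more gradient updates,
for _ in range(updates):                  # each reusing those same 100 steps
    batch = buffer.sample(batch_size=32)
print(len(buffer.storage), updates)  # -> 100 500
```

On a physical arm, every environment step has a real-world cost, so squeezing more learning out of each stored transition is exactly what you want.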
-

PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1

And don't hesitate to check the panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html

Here are some ideas to go further:
* Train more steps
* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=AntBulletEnv-v0
* **Push your newly trained model** on the Hub 🔥


See you in Unit 7! 🔥
## Keep learning, stay awesome 🤗

diff --git a/units/en/unit6/introduction.mdx b/units/en/unit6/introduction.mdx
index 4be735f..862c8c4 100644
--- a/units/en/unit6/introduction.mdx
+++ b/units/en/unit6/introduction.mdx
@@ -16,8 +16,7 @@ So today we'll study **Actor-Critic methods**, a hybrid architecture combining v
 - *A Critic* that measures **how good the taken action is** (Value-Based method)
 
-We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train two robots:
-- A spider 🕷️ to learn to move.
+We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train:
 - A robotic arm 🦾 to move to the correct position.
Environments From f6b96f3e46631f88022973dff10573e108b6ccf9 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Wed, 10 May 2023 08:49:02 +0200 Subject: [PATCH 2/7] Add huggingface_hub --- notebooks/unit6/requirements-unit6.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/notebooks/unit6/requirements-unit6.txt b/notebooks/unit6/requirements-unit6.txt index b92d196..cf5db2d 100644 --- a/notebooks/unit6/requirements-unit6.txt +++ b/notebooks/unit6/requirements-unit6.txt @@ -1,3 +1,4 @@ stable-baselines3==2.0.0a4 huggingface_sb3 -panda-gym \ No newline at end of file +panda-gym +huggingface_hub \ No newline at end of file From ca42ab49f8a45b88c10070fe021577e5467e03c1 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sun, 6 Aug 2023 18:10:42 +0200 Subject: [PATCH 3/7] Update introduction.mdx * Remove gif --- units/en/unit6/introduction.mdx | 2 -- 1 file changed, 2 deletions(-) diff --git a/units/en/unit6/introduction.mdx b/units/en/unit6/introduction.mdx index 862c8c4..9d4c4ad 100644 --- a/units/en/unit6/introduction.mdx +++ b/units/en/unit6/introduction.mdx @@ -19,6 +19,4 @@ So today we'll study **Actor-Critic methods**, a hybrid architecture combining v We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train: - A robotic arm ๐Ÿฆพ to move to the correct position. -Environments - Sound exciting? Let's get started! 
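The Actor/Critic split described above can be summarized numerically: the Critic's value estimate turns the Actor's update into an advantage-weighted one, where the one-step advantage A(s, a) = r + γV(s') − V(s) says whether the taken action was better than the Critic expected. A small sketch with made-up numbers:

```python
def advantage(reward, gamma, v_next, v_current):
    # One-step TD advantage: how much better the action turned out
    # than the critic's estimate of the current state's value
    return reward + gamma * v_next - v_current

# Illustrative values: the critic valued the state at 1.0; after acting
# we received reward 0.5 and landed in a state the critic values at 0.9
adv = advantage(reward=0.5, gamma=0.99, v_next=0.9, v_current=1.0)
print(round(adv, 3))  # 0.391 -> positive: reinforce this action
```

A positive advantage pushes the Actor's probability of that action up; a negative one pushes it down, which reduces the variance of the policy-gradient update compared to using raw returns.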
From e2ab2ee38f1c8b20ab60800d00996bb5607842b3 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sun, 6 Aug 2023 18:11:54 +0200 Subject: [PATCH 4/7] Update hands-on.mdx --- units/en/unit6/hands-on.mdx | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/units/en/unit6/hands-on.mdx b/units/en/unit6/hands-on.mdx index 4938ca1..e90b7ed 100644 --- a/units/en/unit6/hands-on.mdx +++ b/units/en/unit6/hands-on.mdx @@ -1,4 +1,4 @@ -# Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym ๐Ÿค– [[hands-on]] +# Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym ๐Ÿค– [[hands-on]] - - To validate this hands-on for the certification process, you need to push your two trained models to the Hub and get the following results: - `PandaReachDense-v3` get a result of >= -3.5. From d4920a4a080798806430803f780b80a9020a8c43 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sun, 6 Aug 2023 18:12:35 +0200 Subject: [PATCH 5/7] Delete unit4.ipynb --- unit4.ipynb | 1623 --------------------------------------------------- 1 file changed, 1623 deletions(-) delete mode 100644 unit4.ipynb diff --git a/unit4.ipynb b/unit4.ipynb deleted file mode 100644 index 0fb17e6..0000000 --- a/unit4.ipynb +++ /dev/null @@ -1,1623 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "view-in-github", - "colab_type": "text" - }, - "source": [ - "\"Open" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CjRWziAVU2lZ" - }, - "source": [ - "# Unit 4: Code your first Deep Reinforcement Learning Algorithm with PyTorch: Reinforce. 
And test its robustness ๐Ÿ’ช\n", - "\n", - "\"thumbnail\"/\n", - "\n", - "\n", - "In this notebook, you'll code your first Deep Reinforcement Learning algorithm from scratch: Reinforce (also called Monte Carlo Policy Gradient).\n", - "\n", - "Reinforce is a *Policy-based method*: a Deep Reinforcement Learning algorithm that tries **to optimize the policy directly without using an action-value function**.\n", - "\n", - "More precisely, Reinforce is a *Policy-gradient method*, a subclass of *Policy-based methods* that aims **to optimize the policy directly by estimating the weights of the optimal policy using gradient ascent**.\n", - "\n", - "To test its robustness, we're going to train it in 2 different simple environments:\n", - "- Cartpole-v1\n", - "- PixelcopterEnv\n", - "\n", - "โฌ‡๏ธ Here is an example of what **you will achieve at the end of this notebook.** โฌ‡๏ธ" - ] - }, - { - "cell_type": "markdown", - "source": [ - " \"Environments\"/\n" - ], - "metadata": { - "id": "s4rBom2sbo7S" - } - }, - { - "cell_type": "markdown", - "source": [ - "### ๐ŸŽฎ Environments: \n", - "\n", - "- [CartPole-v1](https://www.gymlibrary.dev/environments/classic_control/cart_pole/)\n", - "- [PixelCopter](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html)\n", - "\n", - "### ๐Ÿ“š RL-Library: \n", - "\n", - "- Python\n", - "- PyTorch\n", - "\n", - "\n", - "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)." 
- ], - "metadata": { - "id": "BPLwsPajb1f8" - } - }, - { - "cell_type": "markdown", - "metadata": { - "id": "L_WSo0VUV99t" - }, - "source": [ - "## Objectives of this notebook ๐Ÿ†\n", - "At the end of the notebook, you will:\n", - "- Be able to **code from scratch a Reinforce algorithm using PyTorch.**\n", - "- Be able to **test the robustness of your agent using simple environments.**\n", - "- Be able to **push your trained agent to the Hub** with a nice video replay and an evaluation score ๐Ÿ”ฅ." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lEPrZg2eWa4R" - }, - "source": [ - "## This notebook is from the Deep Reinforcement Learning Course\n", - "\"Deep" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6p5HnEefISCB" - }, - "source": [ - "In this free course, you will:\n", - "\n", - "- ๐Ÿ“– Study Deep Reinforcement Learning in **theory and practice**.\n", - "- ๐Ÿง‘โ€๐Ÿ’ป Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n", - "- ๐Ÿค– Train **agents in unique environments** \n", - "\n", - "And more check ๐Ÿ“š the syllabus ๐Ÿ‘‰ https://simoninithomas.github.io/deep-rl-course\n", - "\n", - "Donโ€™t forget to **sign up to the course** (we are collecting your email to be able toย **send you the links when each Unit is published and give you information about the challenges and updates).**\n", - "\n", - "\n", - "The best way to keep in touch is to join our discord server to exchange with the community and with us ๐Ÿ‘‰๐Ÿป https://discord.gg/ydHrjt3WP5" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mjY-eq3eWh9O" - }, - "source": [ - "## Prerequisites ๐Ÿ—๏ธ\n", - "Before diving into the notebook, you need to:\n", - "\n", - "๐Ÿ”ฒ ๐Ÿ“š [Study Policy Gradients by reading Unit 4](https://huggingface.co/deep-rl-course/unit4/introduction)" - ] - }, - { - "cell_type": "markdown", - "source": [ - "# Let's code Reinforce algorithm from scratch ๐Ÿ”ฅ\n", - 
"\n", - "\n", - "To validate this hands-on for the certification process, you need to push your trained models to the Hub.\n", - "\n", - "- Get a result of >= 350 for `Cartpole-v1`.\n", - "- Get a result of >= 5 for `PixelCopter`.\n", - "\n", - "To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go at the bottom of the leaderboard page and click on the refresh button**.\n", - "\n", - "For more information about the certification process, check this section ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process\n" - ], - "metadata": { - "id": "Bsh4ZAamchSl" - } - }, - { - "cell_type": "markdown", - "source": [ - "## An advice ๐Ÿ’ก\n", - "It's better to run this colab in a copy on your Google Drive, so that **if it timeouts** you still have the saved notebook on your Google Drive and do not need to fill everything from scratch.\n", - "\n", - "To do that you can either do `Ctrl + S` or `File > Save a copy in Google Drive.`" - ], - "metadata": { - "id": "JoTC9o2SczNn" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Set the GPU ๐Ÿ’ช\n", - "- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n", - "\n", - "\"GPU" - ], - "metadata": { - "id": "PU4FVzaoM6fC" - } - }, - { - "cell_type": "markdown", - "source": [ - "- `Hardware Accelerator > GPU`\n", - "\n", - "\"GPU" - ], - "metadata": { - "id": "KV0NyFdQM9ZG" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Create a virtual display ๐Ÿ–ฅ\n", - "\n", - "During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). 
\n", - "\n", - "Hence the following cell will install the librairies and create and run a virtual screen ๐Ÿ–ฅ" - ], - "metadata": { - "id": "bTpYcVZVMzUI" - } - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "jV6wjQ7Be7p5" - }, - "outputs": [], - "source": [ - "%%capture\n", - "!apt install python-opengl\n", - "!apt install ffmpeg\n", - "!apt install xvfb\n", - "!pip install pyvirtualdisplay\n", - "!pip install pyglet==1.5.1" - ] - }, - { - "cell_type": "code", - "source": [ - "# Virtual display\n", - "from pyvirtualdisplay import Display\n", - "\n", - "virtual_display = Display(visible=0, size=(1400, 900))\n", - "virtual_display.start()" - ], - "metadata": { - "id": "Sr-Nuyb1dBm0" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tjrLfPFIW8XK" - }, - "source": [ - "## Install the dependencies ๐Ÿ”ฝ\n", - "The first step is to install the dependencies. Weโ€™ll install multiple ones:\n", - "\n", - "- `gym`\n", - "- `gym-games`: Extra gym environments made with PyGame.\n", - "- `huggingface_hub`: ๐Ÿค— works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations, and other features that will allow you to easily collaborate with others.\n", - "\n", - "You may be wondering why we install gym and not gymnasium, a more recent version of gym? **Because the gym-games we are using are not updated yet with gymnasium**. 
\n", - "\n", - "The differences you'll encounter here:\n", - "- In `gym` we don't have `terminated` and `truncated` but only `done`.\n", - "- In `gym` using `env.step()` returns `state, reward, done, info`\n", - "\n", - "You can learn more about the differences between Gym and Gymnasium here ๐Ÿ‘‰ https://gymnasium.farama.org/content/migration-guide/\n", - "\n", - "\n", - "You can see here all the Reinforce models available ๐Ÿ‘‰ https://huggingface.co/models?other=reinforce\n", - "\n", - "And you can find all the Deep Reinforcement Learning models here ๐Ÿ‘‰ https://huggingface.co/models?pipeline_tag=reinforcement-learning\n" - ] - }, - { - "cell_type": "code", - "source": [ - "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt" - ], - "metadata": { - "id": "e8ZVi-uydpgL" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AAHAq6RZW3rn" - }, - "source": [ - "## Import the packages ๐Ÿ“ฆ\n", - "In addition to import the installed libraries, we also import:\n", - "\n", - "- `imageio`: A library that will help us to generate a replay video\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "V8oadoJSWp7C" - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "\n", - "from collections import deque\n", - "\n", - "import matplotlib.pyplot as plt\n", - "%matplotlib inline\n", - "\n", - "# PyTorch\n", - "import torch\n", - "import torch.nn as nn\n", - "import torch.nn.functional as F\n", - "import torch.optim as optim\n", - "from torch.distributions import Categorical\n", - "\n", - "# Gym\n", - "import gym\n", - "import gym_pygame\n", - "\n", - "# Hugging Face Hub\n", - "from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.\n", - "import imageio" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Check if we have a GPU\n", 
- "\n", - "- Let's check if we have a GPU\n", - "- If it's the case you should see `device:cuda0`" - ], - "metadata": { - "id": "RfxJYdMeeVgv" - } - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kaJu5FeZxXGY" - }, - "outputs": [], - "source": [ - "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "U5TNYa14aRav" - }, - "outputs": [], - "source": [ - "print(device)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "PBPecCtBL_pZ" - }, - "source": [ - "We're now ready to implement our Reinforce algorithm ๐Ÿ”ฅ" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8KEyKYo2ZSC-" - }, - "source": [ - "# First agent: Playing CartPole-v1 ๐Ÿค–" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "haLArKURMyuF" - }, - "source": [ - "## Create the CartPole environment and understand how it works\n", - "### [The environment ๐ŸŽฎ](https://www.gymlibrary.dev/environments/classic_control/cart_pole/)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AH_TaLKFXo_8" - }, - "source": [ - "### Why do we use a simple environment like CartPole-v1?\n", - "As explained in [Reinforcement Learning Tips and Tricks](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html), when you implement your agent from scratch you need **to be sure that it works correctly and find bugs with easy environments before going deeper**. Since finding bugs will be much easier in simple environments.\n", - "\n", - "\n", - "> Try to have some โ€œsign of lifeโ€ on toy problems\n", - "\n", - "\n", - "> Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo). 
You usually need to run hyperparameter optimization for that step.\n", - "___\n", - "### The CartPole-v1 environment\n", - "\n", - "> A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart.\n", - "\n", - "\n", - "\n", - "So, we start with CartPole-v1. The goal is to push the cart left or right **so that the pole stays in the equilibrium.**\n", - "\n", - "The episode ends if:\n", - "- The pole Angle is greater than ยฑ12ยฐ\n", - "- Cart Position is greater than ยฑ2.4\n", - "- Episode length is greater than 500\n", - "\n", - "We get a reward ๐Ÿ’ฐ of +1 every timestep the Pole stays in the equilibrium." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "POOOk15_K6KA" - }, - "outputs": [], - "source": [ - "env_id = \"CartPole-v1\"\n", - "# Create the env\n", - "env = gym.make(env_id)\n", - "\n", - "# Create the evaluation env\n", - "eval_env = gym.make(env_id)\n", - "\n", - "# Get the state space and action space\n", - "s_size = env.observation_space.shape[0]\n", - "a_size = env.action_space.n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "FMLFrjiBNLYJ" - }, - "outputs": [], - "source": [ - "print(\"_____OBSERVATION SPACE_____ \\n\")\n", - "print(\"The State Space is: \", s_size)\n", - "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Lu6t4sRNNWkN" - }, - "outputs": [], - "source": [ - "print(\"\\n _____ACTION SPACE_____ \\n\")\n", - "print(\"The Action Space is: \", a_size)\n", - "print(\"Action Space Sample\", env.action_space.sample()) # Take a random action" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7SJMJj3WaFOz" - }, - "source": [ - "## Let's build the 
Reinforce Architecture\n", - "This implementation is based on two implementations:\n", - "- [PyTorch official Reinforcement Learning example](https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py)\n", - "- [Udacity Reinforce](https://github.com/udacity/deep-reinforcement-learning/blob/master/reinforce/REINFORCE.ipynb)\n", - "- [Improvement of the integration by Chris1nexus](https://github.com/huggingface/deep-rl-class/pull/95)\n", - "\n", - "\"Reinforce\"/" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "49kogtxBODX8" - }, - "source": [ - "So we want:\n", - "- Two fully connected layers (fc1 and fc2).\n", - "- Using ReLU as activation function of fc1\n", - "- Using Softmax to output a probability distribution over actions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "w2LHcHhVZvPZ" - }, - "outputs": [], - "source": [ - "class Policy(nn.Module):\n", - " def __init__(self, s_size, a_size, h_size):\n", - " super(Policy, self).__init__()\n", - " # Create two fully connected layers\n", - "\n", - "\n", - "\n", - " def forward(self, x):\n", - " # Define the forward pass\n", - " # state goes to fc1 then we apply ReLU activation function\n", - "\n", - " # fc1 outputs goes to fc2\n", - "\n", - " # We output the softmax\n", - " \n", - " def act(self, state):\n", - " \"\"\"\n", - " Given a state, take action\n", - " \"\"\"\n", - " state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n", - " probs = self.forward(state).cpu()\n", - " m = Categorical(probs)\n", - " action = np.argmax(m)\n", - " return action.item(), m.log_prob(action)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rOMrdwSYOWSC" - }, - "source": [ - "### Solution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "jGdhRSVrOV4K" - }, - "outputs": [], - "source": [ - "class Policy(nn.Module):\n", - " def __init__(self, s_size, a_size, h_size):\n", - " super(Policy, 
self).__init__()\n",
- "        self.fc1 = nn.Linear(s_size, h_size)\n",
- "        self.fc2 = nn.Linear(h_size, a_size)\n",
- "\n",
- "    def forward(self, x):\n",
- "        x = F.relu(self.fc1(x))\n",
- "        x = self.fc2(x)\n",
- "        return F.softmax(x, dim=1)\n",
- "    \n",
- "    def act(self, state):\n",
- "        state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n",
- "        probs = self.forward(state).cpu()\n",
- "        m = Categorical(probs)\n",
- "        action = np.argmax(m)\n",
- "        return action.item(), m.log_prob(action)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ZTGWL4g2eM5B"
- },
- "source": [
- "I made a mistake, can you guess where?\n",
- "\n",
- "- To find out, let's make a forward pass:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "lwnqGBCNePor"
- },
- "outputs": [],
- "source": [
- "debug_policy = Policy(s_size, a_size, 64).to(device)\n",
- "debug_policy.act(env.reset())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "14UYkoxCPaor"
- },
- "source": [
- "- Here we see that the error says `ValueError: The value argument to log_prob must be a Tensor`\n",
- "\n",
- "- It means that `action` in `m.log_prob(action)` must be a Tensor **but it's not.**\n",
- "\n",
- "- Do you know why? Check the `act` function and try to see why it does not work. \n",
- "\n",
- "Advice 💡: Something is wrong in this implementation. 
Remember that in the `act` function **we want to sample an action from the probability distribution over actions**.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "gfGJNZBUP7Vn"
- },
- "source": [
- "### (Real) Solution"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Ho_UHf49N9i4"
- },
- "outputs": [],
- "source": [
- "class Policy(nn.Module):\n",
- "    def __init__(self, s_size, a_size, h_size):\n",
- "        super(Policy, self).__init__()\n",
- "        self.fc1 = nn.Linear(s_size, h_size)\n",
- "        self.fc2 = nn.Linear(h_size, a_size)\n",
- "\n",
- "    def forward(self, x):\n",
- "        x = F.relu(self.fc1(x))\n",
- "        x = self.fc2(x)\n",
- "        return F.softmax(x, dim=1)\n",
- "    \n",
- "    def act(self, state):\n",
- "        state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n",
- "        probs = self.forward(state).cpu()\n",
- "        m = Categorical(probs)\n",
- "        action = m.sample()\n",
- "        return action.item(), m.log_prob(action)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "rgJWQFU_eUYw"
- },
- "source": [
- "By using CartPole, it was easier to debug since **we know that the bug comes from our integration and not from our simple environment**."
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "- Since **we want to sample an action from the probability distribution over actions**, we can't use `action = np.argmax(m)` since it will always output the action that has the highest probability.\n",
- "\n",
- "- We need to replace it with `action = m.sample()`, which will sample an action from the probability distribution P(.|s)"
- ],
- "metadata": {
- "id": "c-20i7Pk0l1T"
- }
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "4MXoqetzfIoW"
- },
- "source": [
- "### Let's build the Reinforce Training Algorithm\n",
- "This is the Reinforce algorithm pseudocode:\n",
- "\n",
- "\"Policy\n",
- "  "
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "- When we calculate the return Gt (line 6), we see that we calculate the sum of discounted rewards **starting at timestep t**.\n",
- "\n",
- "- Why? Because our policy should only **reinforce actions on the basis of their consequences**: rewards obtained before taking an action are useless (since they were not caused by the action); **only the ones that come after the action matter**.\n",
- "\n",
- "- Before coding this, you should read the section [don't let the past distract you](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#don-t-let-the-past-distract-you), which explains why we use the reward-to-go policy gradient.\n",
- "\n",
- "We use an interesting technique coded by [Chris1nexus](https://github.com/Chris1nexus) to **compute the return at each timestep efficiently**. The comments explain the procedure. Don't hesitate to check [the PR explanation](https://github.com/huggingface/deep-rl-class/pull/95).\n",
- "The key idea is to work backward through the episode, reusing G_(t+1) to compute G_t instead of recomputing each sum from scratch."
- ],
- "metadata": {
- "id": "QmcXG-9i2Qu2"
- }
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "O554nUGPpcoq"
- },
- "source": [
- "The second question you may ask is **why do we minimize the loss**? 
We talked about Gradient Ascent, not Gradient Descent, right?\n",
- "\n",
- "- We want to maximize our utility function $J(\\theta)$, but in PyTorch, as in TensorFlow, it's better to **minimize an objective function.**\n",
- "    - So let's say we want to reinforce action 3 at a certain timestep. Before training, the probability P of this action is 0.25.\n",
- "    - So we want to modify $\\theta$ such that $\\pi_\\theta(a_3|s; \\theta) > 0.25$\n",
- "    - Because all probabilities must sum to 1, maximizing $\\pi_\\theta(a_3|s; \\theta)$ will **minimize the probabilities of the other actions.**\n",
- "    - So we should tell PyTorch **to minimize $1 - \\pi_\\theta(a_3|s; \\theta)$.**\n",
- "    - This loss function approaches 0 as $\\pi_\\theta(a_3|s; \\theta)$ nears 1.\n",
- "    - So we are encouraging the gradient to maximize $\\pi_\\theta(a_3|s; \\theta)$.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "iOdv8Q9NfLK7"
- },
- "outputs": [],
- "source": [
- "def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every):\n",
- "    # Help us to calculate the score during the training\n",
- "    scores_deque = deque(maxlen=100)\n",
- "    scores = []\n",
- "    # Line 3 of pseudocode\n",
- "    for i_episode in range(1, n_training_episodes+1):\n",
- "        saved_log_probs = []\n",
- "        rewards = []\n",
- "        state = # TODO: reset the environment\n",
- "        # Line 4 of pseudocode\n",
- "        for t in range(max_t):\n",
- "            action, log_prob = # TODO get the action\n",
- "            saved_log_probs.append(log_prob)\n",
- "            state, reward, done, _ = # TODO: take an env step\n",
- "            rewards.append(reward)\n",
- "            if done:\n",
- "                break \n",
- "        scores_deque.append(sum(rewards))\n",
- "        scores.append(sum(rewards))\n",
- "        \n",
- "        # Line 6 of pseudocode: calculate the return\n",
- "        returns = deque(maxlen=max_t) \n",
- "        n_steps = len(rewards) \n",
- "        # Compute the discounted returns at each timestep,\n",
- "        # as the sum of the gamma-discounted return at time t (G_t) + the reward at time t\n",
- "        \n",
- "        # In O(N) time, where N is the 
number of time steps\n", - " # (this definition of the discounted return G_t follows the definition of this quantity \n", - " # shown at page 44 of Sutton&Barto 2017 2nd draft)\n", - " # G_t = r_(t+1) + r_(t+2) + ...\n", - " \n", - " # Given this formulation, the returns at each timestep t can be computed \n", - " # by re-using the computed future returns G_(t+1) to compute the current return G_t\n", - " # G_t = r_(t+1) + gamma*G_(t+1)\n", - " # G_(t-1) = r_t + gamma* G_t\n", - " # (this follows a dynamic programming approach, with which we memorize solutions in order \n", - " # to avoid computing them multiple times)\n", - " \n", - " # This is correct since the above is equivalent to (see also page 46 of Sutton&Barto 2017 2nd draft)\n", - " # G_(t-1) = r_t + gamma*r_(t+1) + gamma*gamma*r_(t+2) + ...\n", - " \n", - " \n", - " ## Given the above, we calculate the returns at timestep t as: \n", - " # gamma[t] * return[t] + reward[t]\n", - " #\n", - " ## We compute this starting from the last timestep to the first, in order\n", - " ## to employ the formula presented above and avoid redundant computations that would be needed \n", - " ## if we were to do it from first to last.\n", - " \n", - " ## Hence, the queue \"returns\" will hold the returns in chronological order, from t=0 to t=n_steps\n", - " ## thanks to the appendleft() function which allows to append to the position 0 in constant time O(1)\n", - " ## a normal python list would instead require O(N) to do this.\n", - " for t in range(n_steps)[::-1]:\n", - " disc_return_t = (returns[0] if len(returns)>0 else 0)\n", - " returns.appendleft( ) # TODO: complete here \n", - " \n", - " ## standardization of the returns is employed to make training more stable\n", - " eps = np.finfo(np.float32).eps.item()\n", - " \n", - " ## eps is the smallest representable float, which is \n", - " # added to the standard deviation of the returns to avoid numerical instabilities\n", - " returns = torch.tensor(returns)\n", - " returns 
= (returns - returns.mean()) / (returns.std() + eps)\n", - " \n", - " # Line 7:\n", - " policy_loss = []\n", - " for log_prob, disc_return in zip(saved_log_probs, returns):\n", - " policy_loss.append(-log_prob * disc_return)\n", - " policy_loss = torch.cat(policy_loss).sum()\n", - " \n", - " # Line 8: PyTorch prefers gradient descent \n", - " optimizer.zero_grad()\n", - " policy_loss.backward()\n", - " optimizer.step()\n", - " \n", - " if i_episode % print_every == 0:\n", - " print('Episode {}\\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))\n", - " \n", - " return scores" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YB0Cxrw1StrP" - }, - "source": [ - "#### Solution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "NCNvyElRStWG" - }, - "outputs": [], - "source": [ - "def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every):\n", - " # Help us to calculate the score during the training\n", - " scores_deque = deque(maxlen=100)\n", - " scores = []\n", - " # Line 3 of pseudocode\n", - " for i_episode in range(1, n_training_episodes+1):\n", - " saved_log_probs = []\n", - " rewards = []\n", - " state = env.reset()\n", - " # Line 4 of pseudocode\n", - " for t in range(max_t):\n", - " action, log_prob = policy.act(state)\n", - " saved_log_probs.append(log_prob)\n", - " state, reward, done, _ = env.step(action)\n", - " rewards.append(reward)\n", - " if done:\n", - " break \n", - " scores_deque.append(sum(rewards))\n", - " scores.append(sum(rewards))\n", - " \n", - " # Line 6 of pseudocode: calculate the return\n", - " returns = deque(maxlen=max_t) \n", - " n_steps = len(rewards) \n", - " # Compute the discounted returns at each timestep,\n", - " # as \n", - " # the sum of the gamma-discounted return at time t (G_t) + the reward at time t\n", - " #\n", - " # In O(N) time, where N is the number of time steps\n", - " # (this definition of the discounted return G_t follows the 
definition of this quantity \n", - " # shown at page 44 of Sutton&Barto 2017 2nd draft)\n", - " # G_t = r_(t+1) + r_(t+2) + ...\n", - " \n", - " # Given this formulation, the returns at each timestep t can be computed \n", - " # by re-using the computed future returns G_(t+1) to compute the current return G_t\n", - " # G_t = r_(t+1) + gamma*G_(t+1)\n", - " # G_(t-1) = r_t + gamma* G_t\n", - " # (this follows a dynamic programming approach, with which we memorize solutions in order \n", - " # to avoid computing them multiple times)\n", - " \n", - " # This is correct since the above is equivalent to (see also page 46 of Sutton&Barto 2017 2nd draft)\n", - " # G_(t-1) = r_t + gamma*r_(t+1) + gamma*gamma*r_(t+2) + ...\n", - " \n", - " \n", - " ## Given the above, we calculate the returns at timestep t as: \n", - " # gamma[t] * return[t] + reward[t]\n", - " #\n", - " ## We compute this starting from the last timestep to the first, in order\n", - " ## to employ the formula presented above and avoid redundant computations that would be needed \n", - " ## if we were to do it from first to last.\n", - " \n", - " ## Hence, the queue \"returns\" will hold the returns in chronological order, from t=0 to t=n_steps\n", - " ## thanks to the appendleft() function which allows to append to the position 0 in constant time O(1)\n", - " ## a normal python list would instead require O(N) to do this.\n", - " for t in range(n_steps)[::-1]:\n", - " disc_return_t = (returns[0] if len(returns)>0 else 0)\n", - " returns.appendleft( gamma*disc_return_t + rewards[t] ) \n", - " \n", - " ## standardization of the returns is employed to make training more stable\n", - " eps = np.finfo(np.float32).eps.item()\n", - " ## eps is the smallest representable float, which is \n", - " # added to the standard deviation of the returns to avoid numerical instabilities \n", - " returns = torch.tensor(returns)\n", - " returns = (returns - returns.mean()) / (returns.std() + eps)\n", - " \n", - " # Line 7:\n", - 
" policy_loss = []\n", - " for log_prob, disc_return in zip(saved_log_probs, returns):\n", - " policy_loss.append(-log_prob * disc_return)\n", - " policy_loss = torch.cat(policy_loss).sum()\n", - " \n", - " # Line 8: PyTorch prefers gradient descent \n", - " optimizer.zero_grad()\n", - " policy_loss.backward()\n", - " optimizer.step()\n", - " \n", - " if i_episode % print_every == 0:\n", - " print('Episode {}\\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))\n", - " \n", - " return scores" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RIWhQyJjfpEt" - }, - "source": [ - "## Train it\n", - "- We're now ready to train our agent.\n", - "- But first, we define a variable containing all the training hyperparameters.\n", - "- You can change the training parameters (and should ๐Ÿ˜‰)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "utRe1NgtVBYF" - }, - "outputs": [], - "source": [ - "cartpole_hyperparameters = {\n", - " \"h_size\": 16,\n", - " \"n_training_episodes\": 1000,\n", - " \"n_evaluation_episodes\": 10,\n", - " \"max_t\": 1000,\n", - " \"gamma\": 1.0,\n", - " \"lr\": 1e-2,\n", - " \"env_id\": env_id,\n", - " \"state_space\": s_size,\n", - " \"action_space\": a_size,\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "D3lWyVXBVfl6" - }, - "outputs": [], - "source": [ - "# Create policy and place it to the device\n", - "cartpole_policy = Policy(cartpole_hyperparameters[\"state_space\"], cartpole_hyperparameters[\"action_space\"], cartpole_hyperparameters[\"h_size\"]).to(device)\n", - "cartpole_optimizer = optim.Adam(cartpole_policy.parameters(), lr=cartpole_hyperparameters[\"lr\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "uGf-hQCnfouB" - }, - "outputs": [], - "source": [ - "scores = reinforce(cartpole_policy,\n", - " cartpole_optimizer,\n", - " cartpole_hyperparameters[\"n_training_episodes\"], \n", 
- " cartpole_hyperparameters[\"max_t\"],\n", - " cartpole_hyperparameters[\"gamma\"], \n", - " 100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Qajj2kXqhB3g" - }, - "source": [ - "## Define evaluation method ๐Ÿ“\n", - "- Here we define the evaluation method that we're going to use to test our Reinforce agent." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "3FamHmxyhBEU" - }, - "outputs": [], - "source": [ - "def evaluate_agent(env, max_steps, n_eval_episodes, policy):\n", - " \"\"\"\n", - " Evaluate the agent for ``n_eval_episodes`` episodes and returns average reward and std of reward.\n", - " :param env: The evaluation environment\n", - " :param n_eval_episodes: Number of episode to evaluate the agent\n", - " :param policy: The Reinforce agent\n", - " \"\"\"\n", - " episode_rewards = []\n", - " for episode in range(n_eval_episodes):\n", - " state = env.reset()\n", - " step = 0\n", - " done = False\n", - " total_rewards_ep = 0\n", - " \n", - " for step in range(max_steps):\n", - " action, _ = policy.act(state)\n", - " new_state, reward, done, info = env.step(action)\n", - " total_rewards_ep += reward\n", - " \n", - " if done:\n", - " break\n", - " state = new_state\n", - " episode_rewards.append(total_rewards_ep)\n", - " mean_reward = np.mean(episode_rewards)\n", - " std_reward = np.std(episode_rewards)\n", - "\n", - " return mean_reward, std_reward" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xdH2QCrLTrlT" - }, - "source": [ - "## Evaluate our agent ๐Ÿ“ˆ" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ohGSXDyHh0xx" - }, - "outputs": [], - "source": [ - "evaluate_agent(eval_env, \n", - " cartpole_hyperparameters[\"max_t\"], \n", - " cartpole_hyperparameters[\"n_evaluation_episodes\"],\n", - " cartpole_policy)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7CoeLkQ7TpO8" - }, - "source": [ - "### Publish our trained model 
on the Hub 🔥\n",
- "Now that we've seen good results after training, we can publish our trained model on the Hub 🤗 with one line of code.\n",
- "\n",
- "Here's an example of a Model Card:\n",
- "\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Jmhs1k-cftIq"
- },
- "source": [
- "### Push to the Hub\n",
- "#### Do not modify this code"
- ]
- },
- {
- "cell_type": "code",
- "source": [
- "from huggingface_hub import HfApi, snapshot_download\n",
- "from huggingface_hub.repocard import metadata_eval_result, metadata_save\n",
- "\n",
- "from pathlib import Path\n",
- "import datetime\n",
- "import json\n",
- "import imageio\n",
- "\n",
- "import tempfile\n",
- "\n",
- "import os"
- ],
- "metadata": {
- "id": "LIVsvlW_8tcw"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "Lo4JH45if81z"
- },
- "outputs": [],
- "source": [
- "def record_video(env, policy, out_directory, fps=30):\n",
- "  \"\"\"\n",
- "  Generate a replay video of the agent\n",
- "  :param env\n",
- "  :param policy: the Reinforce policy of our agent\n",
- "  :param out_directory\n",
- "  :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)\n",
- "  \"\"\"\n",
- "  images = []  \n",
- "  done = False\n",
- "  state = env.reset()\n",
- "  img = env.render(mode='rgb_array')\n",
- "  images.append(img)\n",
- "  while not done:\n",
- "    # Sample an action from the policy given the current state\n",
- "    action, _ = policy.act(state)\n",
- "    state, reward, done, info = env.step(action) # We directly put next_state = state for recording logic\n",
- "    img = env.render(mode='rgb_array')\n",
- "    images.append(img)\n",
- "  imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)"
- ]
- },
- {
- "cell_type": "code",
- "source": [
- "def push_to_hub(repo_id, \n",
- "                  model,\n",
- "                  hyperparameters,\n",
- "                  eval_env,\n",
- "                  video_fps=30\n",
- " ):\n", - " \"\"\"\n", - " Evaluate, Generate a video and Upload a model to Hugging Face Hub.\n", - " This method does the complete pipeline:\n", - " - It evaluates the model\n", - " - It generates the model card\n", - " - It generates a replay video of the agent\n", - " - It pushes everything to the Hub\n", - "\n", - " :param repo_id: repo_id: id of the model repository from the Hugging Face Hub\n", - " :param model: the pytorch model we want to save\n", - " :param hyperparameters: training hyperparameters\n", - " :param eval_env: evaluation environment\n", - " :param video_fps: how many frame per seconds to record our video replay \n", - " \"\"\"\n", - "\n", - " _, repo_name = repo_id.split(\"/\")\n", - " api = HfApi()\n", - " \n", - " # Step 1: Create the repo\n", - " repo_url = api.create_repo(\n", - " repo_id=repo_id,\n", - " exist_ok=True,\n", - " )\n", - "\n", - " with tempfile.TemporaryDirectory() as tmpdirname:\n", - " local_directory = Path(tmpdirname)\n", - " \n", - " # Step 2: Save the model\n", - " torch.save(model, local_directory / \"model.pt\")\n", - "\n", - " # Step 3: Save the hyperparameters to JSON\n", - " with open(local_directory / \"hyperparameters.json\", \"w\") as outfile:\n", - " json.dump(hyperparameters, outfile)\n", - " \n", - " # Step 4: Evaluate the model and build JSON\n", - " mean_reward, std_reward = evaluate_agent(eval_env, \n", - " hyperparameters[\"max_t\"],\n", - " hyperparameters[\"n_evaluation_episodes\"], \n", - " model)\n", - " # Get datetime\n", - " eval_datetime = datetime.datetime.now()\n", - " eval_form_datetime = eval_datetime.isoformat()\n", - "\n", - " evaluate_data = {\n", - " \"env_id\": hyperparameters[\"env_id\"], \n", - " \"mean_reward\": mean_reward,\n", - " \"n_evaluation_episodes\": hyperparameters[\"n_evaluation_episodes\"],\n", - " \"eval_datetime\": eval_form_datetime,\n", - " }\n", - "\n", - " # Write a JSON file\n", - " with open(local_directory / \"results.json\", \"w\") as outfile:\n", - " 
json.dump(evaluate_data, outfile)\n", - "\n", - " # Step 5: Create the model card\n", - " env_name = hyperparameters[\"env_id\"]\n", - " \n", - " metadata = {}\n", - " metadata[\"tags\"] = [\n", - " env_name,\n", - " \"reinforce\",\n", - " \"reinforcement-learning\",\n", - " \"custom-implementation\",\n", - " \"deep-rl-class\"\n", - " ]\n", - "\n", - " # Add metrics\n", - " eval = metadata_eval_result(\n", - " model_pretty_name=repo_name,\n", - " task_pretty_name=\"reinforcement-learning\",\n", - " task_id=\"reinforcement-learning\",\n", - " metrics_pretty_name=\"mean_reward\",\n", - " metrics_id=\"mean_reward\",\n", - " metrics_value=f\"{mean_reward:.2f} +/- {std_reward:.2f}\",\n", - " dataset_pretty_name=env_name,\n", - " dataset_id=env_name,\n", - " )\n", - "\n", - " # Merges both dictionaries\n", - " metadata = {**metadata, **eval}\n", - "\n", - " model_card = f\"\"\"\n", - " # **Reinforce** Agent playing **{env_id}**\n", - " This is a trained model of a **Reinforce** agent playing **{env_id}** .\n", - " To learn to use this model and train yours check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction\n", - " \"\"\"\n", - "\n", - " readme_path = local_directory / \"README.md\"\n", - " readme = \"\"\n", - " if readme_path.exists():\n", - " with readme_path.open(\"r\", encoding=\"utf8\") as f:\n", - " readme = f.read()\n", - " else:\n", - " readme = model_card\n", - "\n", - " with readme_path.open(\"w\", encoding=\"utf-8\") as f:\n", - " f.write(readme)\n", - "\n", - " # Save our metrics to Readme metadata\n", - " metadata_save(readme_path, metadata)\n", - "\n", - " # Step 6: Record a video\n", - " video_path = local_directory / \"replay.mp4\"\n", - " record_video(env, model, video_path, video_fps)\n", - "\n", - " # Step 7. 
Push everything to the Hub\n",
- "  api.upload_folder(\n",
- "        repo_id=repo_id,\n",
- "        folder_path=local_directory,\n",
- "        path_in_repo=\".\",\n",
- "  )\n",
- "\n",
- "  print(f\"Your model is pushed to the Hub. You can view your model here: {repo_url}\")"
- ],
- "metadata": {
- "id": "_TPdq47D7_f_"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "w17w8CxzoURM"
- },
- "source": [
- "By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.\n",
- "\n",
- "This way:\n",
- "- You can **showcase your work** 🔥\n",
- "- You can **visualize your agent playing** 👀\n",
- "- You can **share with the community an agent that others can use** 💾\n",
- "- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "cWnFC0iZooTw"
- },
- "source": [
- "To be able to share your model with the community, there are three more steps to follow:\n",
- "\n",
- "1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join\n",
- "\n",
- "2️⃣ Sign in and store your authentication token from the Hugging Face website.\n",
- "- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
- "\n",
- "\n",
- "\"Create\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "QB5nIcxR8paT"
- },
- "outputs": [],
- "source": [
- "notebook_login()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "GyWc1x3-o3xG"
- },
- "source": [
- "If you don't want to use Google Colab or a Jupyter Notebook, use this command instead: `huggingface-cli login` (or `login`)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
"id": "F-D-zhbRoeOm" - }, - "source": [ - "3๏ธโƒฃ We're now ready to push our trained agent to the ๐Ÿค— Hub ๐Ÿ”ฅ using `package_to_hub()` function" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "UNwkTS65Uq3Q" - }, - "outputs": [], - "source": [ - "repo_id = \"\" #TODO Define your repo id {username/Reinforce-{model-id}}\n", - "push_to_hub(repo_id,\n", - " cartpole_policy, # The model we want to save\n", - " cartpole_hyperparameters, # Hyperparameters\n", - " eval_env, # Evaluation environment\n", - " video_fps=30\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jrnuKH1gYZSz" - }, - "source": [ - "Now that we try the robustness of our implementation, let's try a more complex environment: PixelCopter ๐Ÿš\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "source": [ - "## Second agent: PixelCopter ๐Ÿš\n", - "\n", - "### Study the PixelCopter environment ๐Ÿ‘€\n", - "- [The Environment documentation](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html)\n" - ], - "metadata": { - "id": "JNLVmKKVKA6j" - } - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "JBSc8mlfyin3" - }, - "outputs": [], - "source": [ - "env_id = \"Pixelcopter-PLE-v0\"\n", - "env = gym.make(env_id)\n", - "eval_env = gym.make(env_id)\n", - "s_size = env.observation_space.shape[0]\n", - "a_size = env.action_space.n" - ] - }, - { - "cell_type": "code", - "source": [ - "print(\"_____OBSERVATION SPACE_____ \\n\")\n", - "print(\"The State Space is: \", s_size)\n", - "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation" - ], - "metadata": { - "id": "L5u_zAHsKBy7" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "source": [ - "print(\"\\n _____ACTION SPACE_____ \\n\")\n", - "print(\"The Action Space is: \", a_size)\n", - "print(\"Action Space Sample\", env.action_space.sample()) # Take a random 
action"
- ],
- "metadata": {
- "id": "D7yJM9YXKNbq"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "NNWvlyvzalXr"
- },
- "source": [
- "The observation space (7) 👀:\n",
- "- player y position\n",
- "- player velocity\n",
- "- player distance to floor\n",
- "- player distance to ceiling\n",
- "- next block x distance to player\n",
- "- next block's top y location\n",
- "- next block's bottom y location\n",
- "\n",
- "The action space (2) 🎮:\n",
- "- Up\n",
- "- Down\n",
- "\n",
- "The reward function 💰: \n",
- "- For each vertical block it passes through, it gains a positive reward of +1. Each time it reaches a terminal state, it receives a negative reward of -1."
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "### Define the new Policy 🧠\n",
- "- We need a deeper neural network since the environment is more complex"
- ],
- "metadata": {
- "id": "aV1466QP8crz"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "I1eBkCiX2X_S"
- },
- "outputs": [],
- "source": [
- "class Policy(nn.Module):\n",
- "    def __init__(self, s_size, a_size, h_size):\n",
- "        super(Policy, self).__init__()\n",
- "        # Define the three layers here\n",
- "\n",
- "    def forward(self, x):\n",
- "        # Define the forward process here\n",
- "        return F.softmax(x, dim=1)\n",
- "    \n",
- "    def act(self, state):\n",
- "        state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n",
- "        probs = self.forward(state).cpu()\n",
- "        m = Categorical(probs)\n",
- "        action = m.sample()\n",
- "        return action.item(), m.log_prob(action)"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "#### Solution"
- ],
- "metadata": {
- "id": "47iuAFqV8Ws-"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "class Policy(nn.Module):\n",
- "    def __init__(self, s_size, a_size, h_size):\n",
- "        super(Policy, self).__init__()\n",
- "        self.fc1 = nn.Linear(s_size, h_size)\n",
- "        self.fc2 = nn.Linear(h_size, 
h_size*2)\n",
- "        self.fc3 = nn.Linear(h_size*2, a_size)\n",
- "\n",
- "    def forward(self, x):\n",
- "        x = F.relu(self.fc1(x))\n",
- "        x = F.relu(self.fc2(x))\n",
- "        x = self.fc3(x)\n",
- "        return F.softmax(x, dim=1)\n",
- "    \n",
- "    def act(self, state):\n",
- "        state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n",
- "        probs = self.forward(state).cpu()\n",
- "        m = Categorical(probs)\n",
- "        action = m.sample()\n",
- "        return action.item(), m.log_prob(action)"
- ],
- "metadata": {
- "id": "wrNuVcHC8Xu7"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "SM1QiGCSbBkM"
- },
- "source": [
- "### Define the hyperparameters ⚙️\n",
- "- Because this environment is more complex, we need to adjust the hyperparameters.\n",
- "- In particular, we need more neurons in the hidden layers."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "y0uujOR_ypB6"
- },
- "outputs": [],
- "source": [
- "pixelcopter_hyperparameters = {\n",
- "    \"h_size\": 64,\n",
- "    \"n_training_episodes\": 50000,\n",
- "    \"n_evaluation_episodes\": 10,\n",
- "    \"max_t\": 10000,\n",
- "    \"gamma\": 0.99,\n",
- "    \"lr\": 1e-4,\n",
- "    \"env_id\": env_id,\n",
- "    \"state_space\": s_size,\n",
- "    \"action_space\": a_size,\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "### Train it\n",
- "- We're now ready to train our agent 🔥."
- ], - "metadata": { - "id": "wyvXTJWm9GJG" - } - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "7mM2P_ckysFE" - }, - "outputs": [], - "source": [ - "# Create policy and place it to the device\n", - "# torch.manual_seed(50)\n", - "pixelcopter_policy = Policy(pixelcopter_hyperparameters[\"state_space\"], pixelcopter_hyperparameters[\"action_space\"], pixelcopter_hyperparameters[\"h_size\"]).to(device)\n", - "pixelcopter_optimizer = optim.Adam(pixelcopter_policy.parameters(), lr=pixelcopter_hyperparameters[\"lr\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "v1HEqP-fy-Rf" - }, - "outputs": [], - "source": [ - "scores = reinforce(pixelcopter_policy,\n", - " pixelcopter_optimizer,\n", - " pixelcopter_hyperparameters[\"n_training_episodes\"], \n", - " pixelcopter_hyperparameters[\"max_t\"],\n", - " pixelcopter_hyperparameters[\"gamma\"], \n", - " 1000)" - ] - }, - { - "cell_type": "markdown", - "source": [ - "### Publish our trained model on the Hub ๐Ÿ”ฅ" - ], - "metadata": { - "id": "8kwFQ-Ip85BE" - } - }, - { - "cell_type": "code", - "source": [ - "repo_id = \"\" #TODO Define your repo id {username/Reinforce-{model-id}}\n", - "push_to_hub(repo_id,\n", - " pixelcopter_policy, # The model we want to save\n", - " pixelcopter_hyperparameters, # Hyperparameters\n", - " eval_env, # Evaluation environment\n", - " video_fps=30\n", - " )" - ], - "metadata": { - "id": "6PtB7LRbTKWK" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7VDcJ29FcOyb" - }, - "source": [ - "## Some additional challenges ๐Ÿ†\n", - "The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. 
But also trying to find better parameters.\n", - "\n", - "In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n", - "\n", - "Here are some ideas to achieve so:\n", - "* Train more steps\n", - "* Try different hyperparameters by looking at what your classmates have done ๐Ÿ‘‰ https://huggingface.co/models?other=reinforce\n", - "* **Push your new trained model** on the Hub ๐Ÿ”ฅ\n", - "* **Improving the implementation for more complex environments** (for instance, what about changing the network to a Convolutional Neural Network to handle\n", - "frames as observation)?" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "x62pP0PHdA-y" - }, - "source": [ - "________________________________________________________________________\n", - "\n", - "**Congrats on finishing this unit**!ย There was a lot of information.\n", - "And congrats on finishing the tutorial. You've just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub ๐Ÿฅณ.\n", - "\n", - "Don't hesitate to iterate on this unit **by improving the implementation for more complex environments** (for instance, what about changing the network to a Convolutional Neural Network to handle\n", - "frames as observation)?\n", - "\n", - "In the next unit, **we're going to learn more about Unity MLAgents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents\n", - "to compete against other agents in a snowball fight and a soccer game.**\n", - "\n", - "Sounds fun? See you next time!\n", - "\n", - "Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please ๐Ÿ‘‰ [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)\n", - "\n", - "See you in Unit 5! 
๐Ÿ”ฅ\n", - "\n", - "### Keep Learning, stay awesome ๐Ÿค—\n", - "\n" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "private_outputs": true, - "provenance": [], - "collapsed_sections": [ - "BPLwsPajb1f8", - "L_WSo0VUV99t", - "mjY-eq3eWh9O", - "JoTC9o2SczNn", - "gfGJNZBUP7Vn", - "YB0Cxrw1StrP", - "47iuAFqV8Ws-", - "x62pP0PHdA-y" - ], - "include_colab_link": true - }, - "gpuClass": "standard", - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.10" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file From d32b77b376fb0920cdf58b955158f1583f892b88 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sun, 6 Aug 2023 18:13:18 +0200 Subject: [PATCH 6/7] Update _toctree.yml * Change Unit 6 hands on title --- units/en/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index 040e10a..6244fce 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -157,7 +157,7 @@ - local: unit6/advantage-actor-critic title: Advantage Actor Critic (A2C) - local: unit6/hands-on - title: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym ๐Ÿค– + title: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym ๐Ÿค– - local: unit6/conclusion title: Conclusion - local: unit6/additional-readings From d430db9ea3c94b3a20ce4c7fd2d96e1b4d8c5b9c Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sun, 6 Aug 2023 18:23:53 +0200 Subject: [PATCH 7/7] Update hands-on.mdx --- units/en/unit6/hands-on.mdx | 249 ++++++++++++++---------------------- 1 file changed, 93 insertions(+), 156 deletions(-) diff --git a/units/en/unit6/hands-on.mdx 
b/units/en/unit6/hands-on.mdx index 500d30c..d1035ae 100644 --- a/units/en/unit6/hands-on.mdx +++ b/units/en/unit6/hands-on.mdx @@ -27,11 +27,10 @@ For more information about the certification process, check this section ๐Ÿ‘‰ ht [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit6/unit6.ipynb) -# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym ๐Ÿค– +# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym ๐Ÿค– ### ๐ŸŽฎ Environments: -- [PyBullet](https://github.com/bulletphysics/bullet3) - [Panda-Gym](https://github.com/qgallouedec/panda-gym) ### ๐Ÿ“š RL-Library: @@ -44,12 +43,13 @@ We're constantly trying to improve our tutorials, so **if you find some issues i At the end of the notebook, you will: -- Be able to use the environment librairies **PyBullet** and **Panda-Gym**. +- Be able to use **Panda-Gym**, the environment library. - Be able to **train robots using A2C**. - Understand why **we need to normalize the input**. - Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score ๐Ÿ”ฅ. ## Prerequisites ๐Ÿ—๏ธ + Before diving into the notebook, you need to: ๐Ÿ”ฒ ๐Ÿ“š Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) ๐Ÿค— @@ -89,30 +89,31 @@ virtual_display.start() ``` ### Install dependencies ๐Ÿ”ฝ -The first step is to install the dependencies, weโ€™ll install multiple ones: -- `pybullet`: Contains the walking robots environments. +Weโ€™ll install multiple ones: + +- `gymnasium` - `panda-gym`: Contains the robotics arm environments. -- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library. +- `stable-baselines3`: The SB3 deep reinforcement learning library. 
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face ๐Ÿค— Hub.
- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.

```bash
-!pip install stable-baselines3[extra]==1.8.0
-!pip install huggingface_sb3
-!pip install panda_gym==2.0.0
-!pip install pyglet==1.5.1
+!pip install stable-baselines3[extra]
+!pip install gymnasium
+!pip install git+https://github.com/huggingface/huggingface_sb3@gymnasium-v2
+!pip install huggingface_hub
+!pip install panda_gym
```

## Import the packages ๐Ÿ“ฆ

```python
-import pybullet_envs
-import panda_gym
-import gym
-
import os

+import gymnasium as gym
+import panda_gym
+
from huggingface_sb3 import load_from_hub, package_to_hub

from stable_baselines3 import A2C
@@ -123,45 +124,61 @@ from stable_baselines3.common.env_util import make_vec_env

from huggingface_hub import notebook_login
```

-## Environment 1: AntBulletEnv-v0 ๐Ÿ•ธ
+## PandaReachDense-v3 ๐Ÿฆพ
+
+The agent we're going to train is a robotic arm that needs to learn to control its movements (moving the arm and using the end-effector).
+
+In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.
+
+In `PandaReach`, the robot must place its end-effector at a target position (green ball).
+
+We're going to use the dense version of this environment. This means we'll get a *dense reward function* that **provides a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This is in contrast to a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.
+
+We're also going to use *end-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
+
+Robotics
+
+This way **the training will be easier**.
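To make the dense-vs-sparse distinction concrete, here is a minimal sketch of the two reward styles. This is an illustration, not panda-gym's actual code; the `0.05` success threshold is an assumption made for the example.

```python
import math


def distance(achieved_goal, desired_goal):
    # Euclidean distance between the end-effector and the target
    return math.sqrt(sum((a - d) ** 2 for a, d in zip(achieved_goal, desired_goal)))


def dense_reward(achieved_goal, desired_goal):
    # A reward at EVERY timestep: the closer the end-effector is to the
    # target, the higher (less negative) the reward, so every small
    # improvement produces a learning signal.
    return -distance(achieved_goal, desired_goal)


def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    # A reward only for success: -1 until the end-effector is within
    # `threshold` of the target, 0 once it is (hypothetical threshold).
    return 0.0 if distance(achieved_goal, desired_goal) < threshold else -1.0


# The dense reward improves as soon as the arm gets closer...
print(dense_reward((0.0, 0.0, 3.0), (0.0, 0.0, 0.0)))    # -3.0
# ...while the sparse reward stays flat at -1 until the task is done.
print(sparse_reward((0.0, 0.0, 3.0), (0.0, 0.0, 0.0)))   # -1.0
print(sparse_reward((0.0, 0.0, 0.04), (0.0, 0.0, 0.0)))  # 0.0
```

With the dense version, moving from 2 m away to 1 m away immediately improves the reward, which is what makes the task easier for A2C to learn.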
+
+### Create the environment

-### Create the AntBulletEnv-v0
#### The environment ๐ŸŽฎ

-In this environment, the agent needs to use its different joints correctly in order to walk.
-You can find a detailled explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet
+In `PandaReachDense-v3`, the robotic arm must place its end-effector at a target position (green ball).

```python
-env_id = "AntBulletEnv-v0"
+env_id = "PandaReachDense-v3"
+
# Create the env
env = gym.make(env_id)

# Get the state space and action space
-s_size = env.observation_space.shape[0]
+s_size = env.observation_space.shape
a_size = env.action_space
```

```python
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
-print("Sample observation", env.observation_space.sample()) # Get a random observation
+print("Sample observation", env.observation_space.sample())  # Get a random observation
```

-The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
-The difference is that our observation space is 28 not 29.

-PyBullet Ant Obs space
+The observation space **is a dictionary with 3 different elements**:

+- `achieved_goal`: (x, y, z) current position of the end-effector.
+- `desired_goal`: (x, y, z) target position the end-effector must reach.
+- `observation`: position (x, y, z) and velocity (vx, vy, vz) of the end-effector.

+Since the observation is a dictionary, **we will need to use a MultiInputPolicy instead of MlpPolicy**.
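As a rough sketch of why a dictionary observation needs a multi-input policy: each key is encoded separately and the resulting features are concatenated before reaching the shared network. This is a simplification of what SB3's `MultiInputPolicy` does (its real extractor uses learned encoders per key); here the encoders are just the identity, and the shapes are illustrative assumptions.

```python
import numpy as np

# An illustrative PandaReach-style dict observation (shapes are assumptions).
obs = {
    "achieved_goal": np.zeros(3, dtype=np.float32),                # (x, y, z)
    "desired_goal": np.array([0.1, -0.2, 0.3], dtype=np.float32),  # (x, y, z)
    "observation": np.zeros(6, dtype=np.float32),                  # position + velocity
}


def combined_features(obs_dict):
    """Flatten each sub-observation and concatenate them into one vector.

    A MultiInputPolicy-style policy does this (with one encoder per key)
    before its shared MLP; an MlpPolicy expects a single flat vector up
    front, which is why it cannot consume a dict observation directly.
    """
    return np.concatenate([obs_dict[key].ravel() for key in sorted(obs_dict)])


features = combined_features(obs)
print(features.shape)  # (12,) — 3 + 3 + 6 values fed to the shared network
```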
```python
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
-print("Action Space Sample", env.action_space.sample()) # Take a random action
+print("Action Space Sample", env.action_space.sample())  # Take a random action
```

-The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
-
-PyBullet Ant Obs space
+The action space is a vector with 3 values:
+- Control of the x, y, z movement of the end-effector

### Normalize observation and rewards

@@ -186,13 +203,11 @@ env = # TODO: Add the wrapper
```

```python
env = make_vec_env(env_id, n_envs=4)
-env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
+env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
```

### Create the A2C Model ๐Ÿค–

-In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy.
-
For more information about the A2C implementation with Stable-Baselines3, check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes

To find the best parameters, I checked the [official trained agents by the Stable-Baselines3 team](https://huggingface.co/sb3).

```
model = # Create the A2C model and try to find the best parameters
```

#### Solution

```python
-model = A2C(
-    policy="MlpPolicy",
-    env=env,
-    gae_lambda=0.9,
-    gamma=0.99,
-    learning_rate=0.00096,
-    max_grad_norm=0.5,
-    n_steps=8,
-    vf_coef=0.4,
-    ent_coef=0.0,
-    policy_kwargs=dict(log_std_init=-2, ortho_init=False),
-    normalize_advantage=False,
-    use_rms_prop=True,
-    use_sde=True,
-    verbose=1,
-)
+model = A2C(policy="MultiInputPolicy",
+            env=env,
+            verbose=1)
```

### Train the A2C agent ๐Ÿƒ

-- Let's train our agent for 2,000,000 timesteps. Don't forget to use GPU on Colab. It will take approximately ~25-40min
+- Let's train our agent for 1,000,000 timesteps. Don't forget to use GPU on Colab.
It will take approximately 25-40 minutes.

```python
-model.learn(2_000_000)
+model.learn(1_000_000)
```

```python
# Save the model and VecNormalize statistics when saving the agent
-model.save("a2c-AntBulletEnv-v0")
+model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")
```

### Evaluate the agent ๐Ÿ“ˆ
+
-- Now that our agent is trained, we need to **check its performance**.
+- Now that our agent is trained, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`
-- In my case, I got a mean reward of `2371.90 +/- 16.50`

```python
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
-eval_env = DummyVecEnv([lambda: gym.make("AntBulletEnv-v0")])
+eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v3")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

+# We need to override the render_mode
+eval_env.render_mode = "rgb_array"
+
# do not update them at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
-model = A2C.load("a2c-AntBulletEnv-v0")
+model = A2C.load("a2c-PandaReachDense-v3")

-mean_reward, std_reward = evaluate_policy(model, env)
+mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```

-
### Publish your trained model on the Hub ๐Ÿ”ฅ
+
Now that we've seen we get good results after training, we can publish our trained model on the Hub with one line of code.

๐Ÿ“š The library's documentation ๐Ÿ‘‰ https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20

-Here's an example of a Model Card (with a PyBullet environment):
-
-Model Card Pybullet
-
By using `package_to_hub`, as we already mentioned in the previous units, **you evaluate, record a replay, generate a model card of your agent, and push it to the hub**.
This way:

- You can **showcase your work** ๐Ÿ”ฅ
- You can **visualize your agent playing** ๐Ÿ‘€
-- You can **share an agent with the community that others can use** ๐Ÿ’พ
+- You can **share an agent with the community that others can use** ๐Ÿ’พ
- You can **access a leaderboard ๐Ÿ† to see how well your agent is performing compared to your classmates** ๐Ÿ‘‰ https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard

To be able to share your model with the community, there are three more steps to follow:

1๏ธโƒฃ (If it's not already done) create an account on HF โžก https://huggingface.co/join

2๏ธโƒฃ Sign in, then get your authentication token from the Hugging Face website.

- Create a new token (https://huggingface.co/settings/tokens) **with write role**

Create HF Token

```python
notebook_login()
!git config --global credential.helper store
```
+If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`

-If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
-
-3๏ธโƒฃ We're now ready to push our trained agent to the ๐Ÿค— Hub ๐Ÿ”ฅ using `package_to_hub()` function
+3๏ธโƒฃ We're now ready to push our trained agent to the ๐Ÿค— Hub ๐Ÿ”ฅ using the `package_to_hub()` function.
+For this environment, **running this cell can take approximately 10 minutes**.

```python
+from huggingface_sb3 import package_to_hub
+
package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
-    repo_id=f"ThomasSimonini/a2c-{env_id}", # Change the username
+    repo_id=f"ThomasSimonini/a2c-{env_id}",  # Change the username
    commit_message="Initial commit",
)
```

-## Take a coffee break โ˜•
-- You already trained your first robot that learned to move congratutlations ๐Ÿฅณ!
-- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.
+## Some additional challenges ๐Ÿ†
+The best way to learn **is to try things on your own**! Why not try `PandaPickAndPlace-v3`?

-## Environment 2: PandaReachDense-v2 ๐Ÿฆพ
+If you want to try more advanced tasks for panda-gym, check what was done using **TQC or SAC** (more sample-efficient algorithms suited to robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: unlike in a simulation, **if you move your robotic arm too much, you risk breaking it**.

-The agent we're going to train is a robotic arm that needs to do controls (moving the arm and using the end-effector).
+PandaPickAndPlace-v1 (this model uses the v1 version of the environment): https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1

-In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.
+And don't hesitate to check the panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html

-In `PandaReach`, the robot must place its end-effector at a target position (green ball).
+We provide the steps to train another agent (optional):

-We're going to use the dense version of this environment.
This means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This is in contrast to a *sparse reward function* where the environment **return a reward if and only if the task is completed**. - -Also, we're going to use the *End-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control). - -Robotics - - -This way **the training will be easier**. - - - -In `PandaReachDense-v2`, the robotic arm must place its end-effector at a target position (green ball). - - - -```python -import gym - -env_id = "PandaReachDense-v2" - -# Create the env -env = gym.make(env_id) - -# Get the state space and action space -s_size = env.observation_space.shape -a_size = env.action_space -``` - -```python -print("_____OBSERVATION SPACE_____ \n") -print("The State Space is: ", s_size) -print("Sample observation", env.observation_space.sample()) # Get a random observation -``` - -The observation space **is a dictionary with 3 different elements**: -- `achieved_goal`: (x,y,z) position of the goal. -- `desired_goal`: (x,y,z) distance between the goal position and the current object position. -- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz). - -Given it's a dictionary as observation, **we will need to use a MultiInputPolicy policy instead of MlpPolicy**. - -```python -print("\n _____ACTION SPACE_____ \n") -print("The Action Space is: ", a_size) -print("Action Space Sample", env.action_space.sample()) # Take a random action -``` - -The action space is a vector with 3 values: -- Control x, y, z movement - -Now it's your turn: - -1. Define the environment called "PandaReachDense-v2". -2. Make a vectorized environment. +1. Define the environment called "PandaPickAndPlace-v3" +2. Make a vectorized environment 3. 
Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
4. Create the A2C Model (don't forget verbose=1 to print the training logs).
-5. Train it for 1M Timesteps.
-6. Save the model and VecNormalize statistics when saving the agent.
-7. Evaluate your agent.
-8. Publish your trained model on the Hub ๐Ÿ”ฅ with `package_to_hub`.
+5. Train it for 1M timesteps.
+6. Save the model and VecNormalize statistics when saving the agent.
+7. Evaluate your agent.
+8. Publish your trained model on the Hub ๐Ÿ”ฅ with `package_to_hub`.

-### Solution (fill the todo)
+
+### Solution (optional)

```python
# 1 - 2
-env_id = "PandaReachDense-v2"
+env_id = "PandaPickAndPlace-v3"
env = make_vec_env(env_id, n_envs=4)

# 3
-env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)
+env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)

# 4
-model = A2C(policy="MultiInputPolicy", env=env, verbose=1)
+model = A2C(policy="MultiInputPolicy",
+            env=env,
+            verbose=1)
# 5
model.learn(1_000_000)
```

```python
# 6
-model_name = "a2c-PandaReachDense-v2"
+model_name = "a2c-PandaPickAndPlace-v3"
model.save(model_name)
env.save("vec_normalize.pkl")

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
-eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v2")])
+eval_env = DummyVecEnv([lambda: gym.make("PandaPickAndPlace-v3")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

# do not update them at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load(model_name)

-mean_reward, std_reward = evaluate_policy(model, env)
+mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")


package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
-    repo_id=f"ThomasSimonini/a2c-{env_id}", # TODO: Change the username
+    repo_id=f"ThomasSimonini/a2c-{env_id}",  # TODO: Change the username
    commit_message="Initial commit",
)
```

-## Some additional challenges ๐Ÿ†
-
-The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?
-
-If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**.
-
-PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1
-
-And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html
-
-Here are some ideas to go further:
-* Train more steps
-* Try different hyperparameters by looking at what your classmates have done ๐Ÿ‘‰ https://huggingface.co/models?other=https://huggingface.co/models?other=AntBulletEnv-v0
-* **Push your new trained model** on the Hub ๐Ÿ”ฅ
-
-
See you in Unit 7! ๐Ÿ”ฅ
+
## Keep learning, stay awesome ๐Ÿค—