Merge pull request #174 from huggingface/ThomasSimonini/A2C

Add Unit A2C
This commit is contained in:
Thomas Simonini
2023-01-17 14:52:56 +01:00
committed by GitHub
9 changed files with 1553 additions and 0 deletions


@@ -0,0 +1,4 @@
stable-baselines3[extra]
huggingface_sb3
panda_gym==2.0.0
pyglet==1.5.1

notebooks/unit6/unit6.ipynb Normal file

@@ -0,0 +1,918 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"private_outputs": true,
"authorship_tag": "ABX9TyMm2AvQJHZiNbxotv6J/Rf+",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU",
"gpuClass": "standard"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/ThomasSimonini%2FA2C/notebooks/unit6/unit6.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png\" alt=\"Thumbnail\"/>\n",
"\n",
"In this notebook, you'll learn to use A2C with PyBullet and Panda-Gym, two sets of robotics environments.\n",
"\n",
"With [PyBullet](https://github.com/bulletphysics/bullet3), you're going to **train a robot to move**:\n",
"- `AntBulletEnv-v0` 🕸️ More precisely, a spider (they say Ant but come on... it's a spider 😆) 🕸️\n",
"\n",
"Then, with [Panda-Gym](https://github.com/qgallouedec/panda-gym), you're going **to train a robotic arm** (Franka Emika Panda robot) to perform a task:\n",
"- `Reach`: the robot must place its end-effector at a target position.\n",
"\n",
"After that, you'll be able **to train in other robotics environments**.\n"
],
"metadata": {
"id": "-PTReiOw-RAN"
}
},
{
"cell_type": "markdown",
"source": [
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif\" alt=\"Robotics environments\"/>"
],
"metadata": {
"id": "2VGL_0ncoAJI"
}
},
{
"cell_type": "markdown",
"source": [
"### 🎮 Environments: \n",
"\n",
"- [PyBullet](https://github.com/bulletphysics/bullet3)\n",
"- [Panda-Gym](https://github.com/qgallouedec/panda-gym)\n",
"\n",
"### 📚 RL-Library: \n",
"\n",
"- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)"
],
"metadata": {
"id": "QInFitfWno1Q"
}
},
{
"cell_type": "markdown",
"source": [
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
],
"metadata": {
"id": "2CcdX4g3oFlp"
}
},
{
"cell_type": "markdown",
"source": [
"## Objectives of this notebook 🏆\n",
"\n",
"At the end of the notebook, you will:\n",
"\n",
"- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries.\n",
"- Be able to **train robots using A2C**.\n",
"- Understand why **we need to normalize the input**.\n",
"- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
"\n",
"\n"
],
"metadata": {
"id": "MoubJX20oKaQ"
}
},
{
"cell_type": "markdown",
"source": [
"## This notebook is from the Deep Reinforcement Learning Course\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>\n",
"\n",
"In this free course, you will:\n",
"\n",
"- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
"- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
"- 🤖 Train **agents in unique environments**.\n",
"\n",
"And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course\n",
"\n",
"Don't forget to **<a href=\"http://eepurl.com/ic5ZUD\">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**\n",
"\n",
"\n",
"The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
],
"metadata": {
"id": "DoUNkTExoUED"
}
},
{
"cell_type": "markdown",
"source": [
"## Prerequisites 🏗️\n",
"Before diving into the notebook, you need to:\n",
"\n",
"🔲 📚 Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) 🤗 "
],
"metadata": {
"id": "BTuQAUAPoa5E"
}
},
{
"cell_type": "markdown",
"source": [
"# Let's train our first robots 🤖"
],
"metadata": {
"id": "iajHvVDWoo01"
}
},
{
"cell_type": "markdown",
"source": [
"To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your two trained models to the Hub and get the following results:\n",
"\n",
"- `AntBulletEnv-v0` get a result of >= 650.\n",
"- `PandaReachDense-v2` get a result of >= -3.5.\n",
"\n",
"To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
"\n",
"If you don't find your model, **go to the bottom of the page and click on the refresh button**\n",
"\n",
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
],
"metadata": {
"id": "zbOENTE2os_D"
}
},
{
"cell_type": "markdown",
"source": [
"## Set the GPU 💪\n",
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
],
"metadata": {
"id": "PU4FVzaoM6fC"
}
},
{
"cell_type": "markdown",
"source": [
"- `Hardware Accelerator > GPU`\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
],
"metadata": {
"id": "KV0NyFdQM9ZG"
}
},
{
"cell_type": "markdown",
"source": [
"## Create a virtual display 🔽\n",
"\n",
"During the notebook, we'll need to generate a replay video. In Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames). \n",
"\n",
"The following cell will install the libraries and create and run a virtual screen 🖥"
],
"metadata": {
"id": "bTpYcVZVMzUI"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jV6wjQ7Be7p5"
},
"outputs": [],
"source": [
"%%capture\n",
"!apt install python-opengl\n",
"!apt install ffmpeg\n",
"!apt install xvfb\n",
"!pip3 install pyvirtualdisplay"
]
},
{
"cell_type": "code",
"source": [
"# Virtual display\n",
"from pyvirtualdisplay import Display\n",
"\n",
"virtual_display = Display(visible=0, size=(1400, 900))\n",
"virtual_display.start()"
],
"metadata": {
"id": "ww5PQH1gNLI4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Install dependencies 🔽\n",
"The first step is to install the dependencies. We'll install several of them:\n",
"\n",
"- `pybullet`: Contains the walking robots environments.\n",
"- `panda-gym`: Contains the robotics arm environments.\n",
"- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.\n",
"- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.\n",
"- `huggingface_hub`: Library allowing anyone to work with the Hub repositories."
],
"metadata": {
"id": "e1obkbdJ_KnG"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2yZRi_0bQGPM"
},
"outputs": [],
"source": [
"!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt"
]
},
{
"cell_type": "markdown",
"source": [
"## Import the packages 📦"
],
"metadata": {
"id": "QTep3PQQABLr"
}
},
{
"cell_type": "code",
"source": [
"import pybullet_envs\n",
"import panda_gym\n",
"import gym\n",
"\n",
"import os\n",
"\n",
"from huggingface_sb3 import load_from_hub, package_to_hub\n",
"\n",
"from stable_baselines3 import A2C\n",
"from stable_baselines3.common.evaluation import evaluate_policy\n",
"from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
"from stable_baselines3.common.env_util import make_vec_env\n",
"\n",
"from huggingface_hub import notebook_login"
],
"metadata": {
"id": "HpiB8VdnQ7Bk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Environment 1: AntBulletEnv-v0 🕸\n",
"\n"
],
"metadata": {
"id": "lfBwIS_oAVXI"
}
},
{
"cell_type": "markdown",
"source": [
"### Create the AntBulletEnv-v0\n",
"#### The environment 🎮\n",
"In this environment, the agent needs to learn to use its different joints correctly in order to walk.\n",
"You can find a detailed explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet"
],
"metadata": {
"id": "frVXOrnlBerQ"
}
},
{
"cell_type": "code",
"source": [
"env_id = \"AntBulletEnv-v0\"\n",
"# Create the env\n",
"env = gym.make(env_id)\n",
"\n",
"# Get the state space and action space\n",
"s_size = env.observation_space.shape[0]\n",
"a_size = env.action_space"
],
"metadata": {
"id": "JpU-JCDQYYax"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(\"_____OBSERVATION SPACE_____ \\n\")\n",
"print(\"The State Space is: \", s_size)\n",
"print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
],
"metadata": {
"id": "2ZfvcCqEYgrg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):\n",
"\n",
"The only difference is that our observation space has 28 dimensions instead of 29.\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/obs_space.png\" alt=\"PyBullet Ant Obs space\"/>\n"
],
"metadata": {
"id": "QzMmsdMJS7jh"
}
},
{
"cell_type": "code",
"source": [
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
"print(\"The Action Space is: \", a_size)\n",
"print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
],
"metadata": {
"id": "Tc89eLTYYkK2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/action_space.png\" alt=\"PyBullet Ant Obs space\"/>\n"
],
"metadata": {
"id": "3RfsHhzZS9Pw"
}
},
{
"cell_type": "markdown",
"source": [
"### Normalize observation and rewards"
],
"metadata": {
"id": "S5sXcg469ysB"
}
},
{
"cell_type": "markdown",
"source": [
"A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html). \n",
"\n",
"For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.\n",
"\n",
"We can also normalize rewards with this same wrapper by adding `norm_reward = True`\n",
"\n",
"[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)"
],
"metadata": {
"id": "1ZyX6qf3Zva9"
}
},
{
"cell_type": "code",
"source": [
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"# Adding this wrapper to normalize the observation and the reward\n",
"env = # TODO: Add the wrapper"
],
"metadata": {
"id": "1RsDtHHAQ9Ie"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Solution"
],
"metadata": {
"id": "tF42HvI7-gs5"
}
},
{
"cell_type": "code",
"source": [
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)"
],
"metadata": {
"id": "2O67mqgC-hol"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Create the A2C Model 🤖\n",
"\n",
"In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy.\n",
"\n",
"For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes\n",
"\n",
"To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3)."
],
"metadata": {
"id": "4JmEVU6z1ZA-"
}
},
{
"cell_type": "code",
"source": [
"model = # Create the A2C model and try to find the best parameters"
],
"metadata": {
"id": "vR3T4qFt164I"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### Solution"
],
"metadata": {
"id": "nWAuOOLh-oQf"
}
},
{
"cell_type": "code",
"source": [
"model = A2C(policy = \"MlpPolicy\",\n",
" env = env,\n",
" gae_lambda = 0.9,\n",
" gamma = 0.99,\n",
" learning_rate = 0.00096,\n",
" max_grad_norm = 0.5,\n",
" n_steps = 8,\n",
" vf_coef = 0.4,\n",
" ent_coef = 0.0,\n",
" policy_kwargs=dict(\n",
" log_std_init=-2, ortho_init=False),\n",
" normalize_advantage=False,\n",
" use_rms_prop= True,\n",
" use_sde= True,\n",
" verbose=1)"
],
"metadata": {
"id": "FKFLY54T-pU1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Train the A2C agent 🏃\n",
"- Let's train our agent for 2,000,000 timesteps. Don't forget to use the GPU on Colab. It will take approximately 25-40 minutes."
],
"metadata": {
"id": "opyK3mpJ1-m9"
}
},
{
"cell_type": "code",
"source": [
"model.learn(2_000_000)"
],
"metadata": {
"id": "4TuGHZD7RF1G"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Save the model and VecNormalize statistics when saving the agent\n",
"model.save(\"a2c-AntBulletEnv-v0\")\n",
"env.save(\"vec_normalize.pkl\")"
],
"metadata": {
"id": "MfYtjj19cKFr"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Evaluate the agent 📈\n",
"- Now that our agent is trained, we need to **check its performance**.\n",
"- Stable-Baselines3 provides a method to do that: `evaluate_policy`\n",
"- In my case, I got a mean reward of `2371.90 +/- 16.50`"
],
"metadata": {
"id": "01M9GCd32Ig-"
}
},
{
"cell_type": "code",
"source": [
"from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
"\n",
"# Load the saved statistics\n",
"eval_env = DummyVecEnv([lambda: gym.make(\"AntBulletEnv-v0\")])\n",
"eval_env = VecNormalize.load(\"vec_normalize.pkl\", eval_env)\n",
"\n",
"# do not update them at test time\n",
"eval_env.training = False\n",
"# reward normalization is not needed at test time\n",
"eval_env.norm_reward = False\n",
"\n",
"# Load the agent\n",
"model = A2C.load(\"a2c-AntBulletEnv-v0\")\n",
"\n",
"mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
"\n",
"print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")"
],
"metadata": {
"id": "liirTVoDkHq3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Publish your trained model on the Hub 🔥\n",
"Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.\n",
"\n",
"📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n",
"\n",
"Here's an example of a Model Card (with a PyBullet environment):\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/modelcardpybullet.png\" alt=\"Model Card Pybullet\"/>"
],
"metadata": {
"id": "44L9LVQaavR8"
}
},
{
"cell_type": "markdown",
"source": [
"By using `package_to_hub`, as we already mentioned in the former units, **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.\n",
"\n",
"This way:\n",
"- You can **showcase your work** 🔥\n",
"- You can **visualize your agent playing** 👀\n",
"- You can **share with the community an agent that others can use** 💾\n",
"- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
],
"metadata": {
"id": "MkMk99m8bgaQ"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "JquRrWytA6eo"
},
"source": [
"To be able to share your model with the community there are three more steps to follow:\n",
"\n",
"1⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join\n",
"\n",
"2⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.\n",
"- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
"\n",
"- Copy the token \n",
"- Run the cell below and paste the token"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GZiFBBlzxzxY"
},
"outputs": [],
"source": [
"notebook_login()\n",
"!git config --global credential.helper store"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_tsf2uv0g_4p"
},
"source": [
"If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FGNh9VsZok0i"
},
"source": [
"3⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `package_to_hub()` function"
]
},
{
"cell_type": "code",
"source": [
"package_to_hub(\n",
" model=model,\n",
" model_name=f\"a2c-{env_id}\",\n",
" model_architecture=\"A2C\",\n",
" env_id=env_id,\n",
" eval_env=eval_env,\n",
" repo_id=f\"ThomasSimonini/a2c-{env_id}\", # Change the username\n",
" commit_message=\"Initial commit\",\n",
")"
],
"metadata": {
"id": "ueuzWVCUTkfS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Take a coffee break ☕\n",
"- You've just trained your first robot and it learned to move, congratulations 🥳!\n",
"- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.\n"
],
"metadata": {
"id": "Qk9ykOk9D6Qh"
}
},
{
"cell_type": "markdown",
"source": [
"## Environment 2: PandaReachDense-v2 🦾\n",
"\n",
"The agent we're going to train is a robotic arm that we'll need to control (by moving the arm and using the end-effector).\n",
"\n",
"In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.\n",
"\n",
"In `PandaReach`, the robot must place its end-effector at a target position (green ball).\n",
"\n",
"We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward), contrary to a *sparse reward function* where the environment **returns a reward if and only if the task is completed**.\n",
"\n",
"Also, we're going to use the *End-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg\" alt=\"Robotics\"/>\n",
"\n",
"\n",
"This way **the training will be easier**.\n",
"\n"
],
"metadata": {
"id": "5VWfwAA7EJg7"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"In `PandaReachDense-v2` the robotic arm must place its end-effector at a target position (green ball).\n",
"\n"
],
"metadata": {
"id": "oZ7FyDEi7G3T"
}
},
{
"cell_type": "code",
"source": [
"import gym\n",
"\n",
"env_id = \"PandaReachDense-v2\"\n",
"\n",
"# Create the env\n",
"env = gym.make(env_id)\n",
"\n",
"# Get the state space and action space\n",
"s_size = env.observation_space.shape\n",
"a_size = env.action_space"
],
"metadata": {
"id": "zXzAu3HYF1WD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(\"_____OBSERVATION SPACE_____ \\n\")\n",
"print(\"The State Space is: \", s_size)\n",
"print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
],
"metadata": {
"id": "E-U9dexcF-FB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The observation space **is a dictionary with 3 different elements**:\n",
"- `achieved_goal`: the current (x,y,z) position of the end-effector.\n",
"- `desired_goal`: the (x,y,z) target position the end-effector must reach.\n",
"- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).\n",
"\n",
"Since the observation is a dictionary, **we will need to use a MultiInputPolicy instead of an MlpPolicy**."
],
"metadata": {
"id": "g_JClfElGFnF"
}
},
{
"cell_type": "code",
"source": [
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
"print(\"The Action Space is: \", a_size)\n",
"print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
],
"metadata": {
"id": "ib1Kxy4AF-FC"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The action space is a vector with 3 values:\n",
"- Control x, y, z movement"
],
"metadata": {
"id": "5MHTHEHZS4yp"
}
},
{
"cell_type": "markdown",
"source": [
"Now it's your turn:\n",
"\n",
"1. Define the environment called \"PandaReachDense-v2\"\n",
"2. Make a vectorized environment\n",
"3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)\n",
"4. Create the A2C Model (don't forget verbose=1 to print the training logs).\n",
"5. Train it for 1M Timesteps\n",
"6. Save the model and VecNormalize statistics when saving the agent\n",
"7. Evaluate your agent\n",
"8. Publish your trained model on the Hub 🔥 with `package_to_hub`"
],
"metadata": {
"id": "nIhPoc5t9HjG"
}
},
{
"cell_type": "markdown",
"source": [
"### Solution (fill the todo)"
],
"metadata": {
"id": "sKGbFXZq9ikN"
}
},
{
"cell_type": "code",
"source": [
"# 1 - 2\n",
"env_id = \"PandaReachDense-v2\"\n",
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"# 3\n",
"env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)\n",
"\n",
"# 4\n",
"model = A2C(policy = \"MultiInputPolicy\",\n",
" env = env,\n",
" verbose=1)\n",
"# 5\n",
"model.learn(1_000_000)"
],
"metadata": {
"id": "J-cC-Feg9iMm"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# 6\n",
"model_name = \"a2c-PandaReachDense-v2\"\n",
"model.save(model_name)\n",
"env.save(\"vec_normalize.pkl\")\n",
"\n",
"# 7\n",
"from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
"\n",
"# Load the saved statistics\n",
"eval_env = DummyVecEnv([lambda: gym.make(\"PandaReachDense-v2\")])\n",
"eval_env = VecNormalize.load(\"vec_normalize.pkl\", eval_env)\n",
"\n",
"# do not update them at test time\n",
"eval_env.training = False\n",
"# reward normalization is not needed at test time\n",
"eval_env.norm_reward = False\n",
"\n",
"# Load the agent\n",
"model = A2C.load(model_name)\n",
"\n",
"mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
"\n",
"print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")\n",
"\n",
"# 8\n",
"package_to_hub(\n",
" model=model,\n",
" model_name=f\"a2c-{env_id}\",\n",
" model_architecture=\"A2C\",\n",
" env_id=env_id,\n",
" eval_env=eval_env,\n",
" repo_id=f\"ThomasSimonini/a2c-{env_id}\", # TODO: Change the username\n",
" commit_message=\"Initial commit\",\n",
")"
],
"metadata": {
"id": "-UnlKLmpg80p"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Some additional challenges 🏆\n",
"The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?\n",
"\n",
"If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**.\n",
"\n",
"PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1\n",
"\n",
"And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html\n",
"\n",
"Here are some ideas to get there:\n",
"* Train for more steps\n",
"* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=AntBulletEnv-v0\n",
"* **Push your new trained model** on the Hub 🔥\n"
],
"metadata": {
"id": "G3xy3Nf3c2O1"
}
},
{
"cell_type": "markdown",
"source": [
"See you on Unit 7! 🔥\n",
"## Keep learning, stay awesome 🤗"
],
"metadata": {
"id": "usatLaZ8dM4P"
}
}
]
}


@@ -148,6 +148,20 @@
title: Bonus. Learn to create your own environments with Unity and MLAgents
- local: unit5/conclusion
title: Conclusion
- title: Unit 6. Actor Critic methods with Robotics environments
sections:
- local: unit6/introduction
title: Introduction
- local: unit6/variance-problem
title: The Problem of Variance in Reinforce
- local: unit6/advantage-actor-critic
title: Advantage Actor Critic (A2C)
- local: unit6/hands-on
title: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖
- local: unit6/conclusion
title: Conclusion
- local: unit6/additional-readings
title: Additional Readings
- title: What's next? New Units Publishing Schedule
sections:
- local: communication/publishing-schedule


@@ -0,0 +1,17 @@
# Additional Readings [[additional-readings]]
## Bias-variance tradeoff in Reinforcement Learning
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
## Advantage Functions
- [Advantage Functions, SpinningUp RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html?highlight=advantage%20functio#advantage-functions)
## Actor Critic
- [Foundations of Deep RL Series, L3 Policy Gradients and Advantage Estimation by Pieter Abbeel](https://www.youtube.com/watch?v=AKbX1Zvo7r8)
- [Asynchronous Methods for Deep Reinforcement Learning (the paper that introduced A3C; A2C is its synchronous variant)](https://arxiv.org/abs/1602.01783v2)


@@ -0,0 +1,70 @@
# Advantage Actor-Critic (A2C) [[advantage-actor-critic]]
## Reducing variance with Actor-Critic methods
The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*.
To understand the Actor-Critic, imagine you play a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/ac.jpg" alt="Actor Critic"/>
You don't know how to play at the beginning, **so you try some actions randomly**. The Critic observes your action and **provides feedback**.
Learning from this feedback, **you'll update your policy and be better at playing that game.**
On the other hand, your friend (the Critic) will also update their way of providing feedback so it can be better next time.
This is the idea behind Actor-Critic. We learn two function approximations:
- *A policy* that **controls how our agent acts**: \\( \pi_{\theta}(s,a) \\)
- *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\)
## The Actor-Critic Process
Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how Actor and Critic improve together during the training.
As we saw, with Actor-Critic methods, there are two function approximations (two neural networks):
- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s,a) \\)
- *Critic*, a **value function** parameterized by w: \\( \hat{q}_{w}(s,a) \\)
Let's see the training process to understand how Actor and Critic are optimized:
- At each timestep, t, we get the current state \\( S_t\\) from the environment and **pass it as input through our Actor and Critic**.
- Our Policy takes the state and **outputs an action** \\( A_t \\).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step1.jpg" alt="Step 1 Actor Critic"/>
- The Critic takes that action also as input and, using \\( S_t\\) and \\( A_t \\), **computes the value of taking that action at that state: the Q-value**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step2.jpg" alt="Step 2 Actor Critic"/>
- The action \\( A_t\\) performed in the environment outputs a new state \\( S_{t+1}\\) and a reward \\( R_{t+1} \\) .
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step3.jpg" alt="Step 3 Actor Critic"/>
- The Actor updates its policy parameters using the Q value.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step4.jpg" alt="Step 4 Actor Critic"/>
- Thanks to its updated parameters, the Actor produces the next action to take at \\( A_{t+1} \\) given the new state \\( S_{t+1} \\).
- The Critic then updates its value parameters.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step5.jpg" alt="Step 5 Actor Critic"/>
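To make these steps concrete, here is a minimal sketch of the full loop on a hypothetical one-state, one-step environment, with a softmax policy as Actor and a single scalar value estimate as Critic. This toy example is only an illustration of the update rules; in the hands-on, Stable-Baselines3 implements all of this for you:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step environment: action 1 pays +1, action 0 pays 0,
# and the episode ends immediately after one action.
def step(action):
    return 1.0 if action == 1 else 0.0

n_actions = 2
theta = np.zeros(n_actions)      # Actor: softmax policy parameters
v = 0.0                          # Critic: value estimate of the single state
alpha_actor, alpha_critic = 0.1, 0.1

def policy(theta):
    # Softmax over the action preferences (shifted for numerical stability)
    z = np.exp(theta - theta.max())
    return z / z.sum()

for t in range(2000):
    probs = policy(theta)
    a = rng.choice(n_actions, p=probs)   # the Actor outputs an action A_t
    r = step(a)                          # the environment returns R_{t+1}
    td_error = r - v                     # one-step episode: no bootstrapped V(S_{t+1})
    # Actor update: push the log-probability of the taken action
    # in the direction of the TD error
    grad_log = -probs
    grad_log[a] += 1.0
    theta += alpha_actor * td_error * grad_log
    # Critic update: move V(s) toward the TD target
    v += alpha_critic * td_error

print(policy(theta))  # the Actor should now strongly prefer action 1
```

Note that the Critic's TD error drives both updates: it tells the Actor which actions did better than expected, and it tells the Critic how wrong its own estimate was.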
## Adding Advantage in Actor-Critic (A2C)
We can stabilize learning further by **using the Advantage function as Critic instead of the Action value function**.
The idea is that the Advantage function calculates the relative advantage of an action compared to the other actions possible at that state: **how much better taking that action at that state is compared to the average value of the state**. It subtracts the mean value of the state from the state-action pair value:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/advantage1.jpg" alt="Advantage Function"/>
In other words, this function calculates **the extra reward we get if we take this action at that state compared to the mean reward we get at that state**.
The extra reward is what's beyond the expected value of that state.
- If A(s,a) > 0: our gradient is **pushed in that direction**.
- If A(s,a) < 0 (our action does worse than the average value of that state), **our gradient is pushed in the opposite direction**.
The problem with implementing this advantage function is that it requires two value functions — \\( Q(s,a)\\) and \\( V(s)\\). Fortunately, **we can use the TD error as a good estimator of the advantage function.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/advantage2.jpg" alt="Advantage Function"/>
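As a quick sketch (with made-up numbers, not course code), estimating the advantage with the TD error \\( r + \gamma V(s') - V(s) \\) looks like this:

```python
def td_error_advantage(reward, gamma, v_s, v_s_next, done):
    """Estimate A(s, a) with the TD error: r + gamma * V(s') - V(s).

    If the episode ends at s', there is no next value to bootstrap from.
    """
    bootstrap = 0.0 if done else gamma * v_s_next
    return reward + bootstrap - v_s

# Toy numbers: the Critic estimates V(s) = 1.0 and V(s') = 1.5,
# and the transition returned a reward of 0.2.
adv = td_error_advantage(reward=0.2, gamma=0.99, v_s=1.0, v_s_next=1.5, done=False)
print(round(adv, 3))  # 0.2 + 0.99 * 1.5 - 1.0 = 0.685
```

Since the advantage here is positive, taking that action was better than the Critic expected, and the gradient would push the policy toward it.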


@@ -0,0 +1,11 @@
# Conclusion [[conclusion]]
Congrats on finishing this unit and the tutorial. You've just trained your first virtual robots 🥳.
**Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section.
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have any feedback, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
See you in the next unit,
### Keep learning, stay awesome 🤗,
units/en/unit6/hands-on.mdx Normal file
@@ -0,0 +1,464 @@
# Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖 [[hands-on]]
<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit6/unit6.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />
Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. You'll train two robots:
- A spider 🕷️ to learn to move.
- A robotic arm 🦾 to move in the correct position.
We're going to use two Robotics environments:
- [PyBullet](https://github.com/bulletphysics/bullet3)
- [panda-gym](https://github.com/qgallouedec/panda-gym)
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif" alt="Environments"/>
To validate this hands-on for the certification process, you need to push your two trained models to the Hub and get the following results:
- `AntBulletEnv-v0` get a result of >= 650.
- `PandaReachDense-v2` get a result of >= -3.5.
To find your result, [go to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model; **the result = mean_reward - std of reward**.
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
**To start the hands-on click on Open In Colab button** 👇 :
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit6/unit6.ipynb)
# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖
### 🎮 Environments:
- [PyBullet](https://github.com/bulletphysics/bullet3)
- [Panda-Gym](https://github.com/qgallouedec/panda-gym)
### 📚 RL-Library:
- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
## Objectives of this notebook 🏆
At the end of the notebook, you will:
- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries.
- Be able to **train robots using A2C**.
- Understand why **we need to normalize the input**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
## Prerequisites 🏗️
Before diving into the notebook, you need to:
🔲 📚 Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) 🤗
# Let's train our first robots 🤖
## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">
- `Hardware Accelerator > GPU`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
## Create a virtual display 🔽
During the notebook, we'll need to generate a replay video. To do so, in Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).
Hence, the following cell will install the libraries and create and run a virtual screen 🖥
```python
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay
```
```python
# Virtual display
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```
### Install dependencies 🔽
The first step is to install the dependencies. We'll install multiple ones:
- `pybullet`: Contains the walking robot environments.
- `panda-gym`: Contains the robotic arm environments.
- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.
- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.
```bash
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt
```
## Import the packages 📦
```python
import pybullet_envs
import panda_gym
import gym
import os
from huggingface_sb3 import load_from_hub, package_to_hub
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.env_util import make_vec_env
from huggingface_hub import notebook_login
```
## Environment 1: AntBulletEnv-v0 🕸
### Create the AntBulletEnv-v0
#### The environment 🎮
In this environment, the agent needs to use its different joints correctly in order to walk.
You can find a detailed explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet
```python
env_id = "AntBulletEnv-v0"
# Create the env
env = gym.make(env_id)
# Get the state space and action space
s_size = env.observation_space.shape[0]
a_size = env.action_space
```
```python
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation
```
The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
The difference is that our observation space has 28 dimensions, not 29.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/obs_space.png" alt="PyBullet Ant Obs space"/>
```python
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action
```
The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/action_space.png" alt="PyBullet Ant Obs space"/>
### Normalize observation and rewards
A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).
For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.
You can also normalize rewards with this same wrapper by adding `norm_reward=True`.
[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
```python
env = make_vec_env(env_id, n_envs=4)
# Adding this wrapper to normalize the observation and the reward
env = # TODO: Add the wrapper
```
#### Solution
```python
env = make_vec_env(env_id, n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)
```
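Under the hood, `VecNormalize` maintains a running mean and standard deviation of the observations and standardizes each incoming one. A minimal sketch of that idea (not SB3's actual implementation; the class name and update formula are illustrative):

```python
import numpy as np

class RunningNormalizer:
    """Sketch of observation normalization: track a running mean/std
    and standardize (then clip) each incoming observation."""

    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.clip = clip

    def update(self, batch):
        # Merge the batch statistics into the running statistics
        batch_mean, batch_var, n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + batch_var * n
                    + delta**2 * self.count * n / total) / total
        self.count = total

    def normalize(self, obs):
        # Standardize, then clip (like clip_obs=10.0 in VecNormalize)
        return np.clip((obs - self.mean) / np.sqrt(self.var + 1e-8),
                       -self.clip, self.clip)

norm = RunningNormalizer(shape=(3,))
norm.update(np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))
print(norm.normalize(np.array([2.0, 3.0, 4.0])))  # ~[0, 0, 0]: the running mean
```

This is why we must save and reload the `VecNormalize` statistics with the model: the policy was trained on normalized observations, so evaluation needs the same statistics.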
### Create the A2C Model 🤖
In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as our policy.
For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes
To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3).
```python
model = # Create the A2C model and try to find the best parameters
```
#### Solution
```python
model = A2C(
policy="MlpPolicy",
env=env,
gae_lambda=0.9,
gamma=0.99,
learning_rate=0.00096,
max_grad_norm=0.5,
n_steps=8,
vf_coef=0.4,
ent_coef=0.0,
policy_kwargs=dict(log_std_init=-2, ortho_init=False),
normalize_advantage=False,
use_rms_prop=True,
use_sde=True,
verbose=1,
)
```
### Train the A2C agent 🏃
- Let's train our agent for 2,000,000 timesteps. Don't forget to use the GPU on Colab; training will take approximately 25-40 minutes.
```python
model.learn(2_000_000)
```
```python
# Save the model and VecNormalize statistics when saving the agent
model.save("a2c-AntBulletEnv-v0")
env.save("vec_normalize.pkl")
```
### Evaluate the agent 📈
- Now that our agent is trained, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`
- In my case, I got a mean reward of `2371.90 +/- 16.50`
```python
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("AntBulletEnv-v0")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
# do not update them at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False
# Load the agent
model = A2C.load("a2c-AntBulletEnv-v0")
mean_reward, std_reward = evaluate_policy(model, eval_env)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```
### Publish your trained model on the Hub 🔥
Now that we've seen that the training produced good results, we can publish our trained model on the Hub with one line of code.
📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20
Here's an example of a Model Card (with a PyBullet environment):
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/modelcardpybullet.png" alt="Model Card Pybullet"/>
By using `package_to_hub`, as we already mentioned in the former units, **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.
This way:
- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share with the community an agent that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
To be able to share your model with the community there are three more steps to follow:
1⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
2⃣ Sign in, and then store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
- Copy the token
- Run the cell below and paste the token
```python
notebook_login()
!git config --global credential.helper store
```
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
3⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function
```python
package_to_hub(
model=model,
model_name=f"a2c-{env_id}",
model_architecture="A2C",
env_id=env_id,
eval_env=eval_env,
repo_id=f"ThomasSimonini/a2c-{env_id}", # Change the username
commit_message="Initial commit",
)
```
## Take a coffee break ☕
- You've just trained your first robot that learned to move, congratulations 🥳!
- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.
## Environment 2: PandaReachDense-v2 🦾
The agent we're going to train is a robotic arm that needs to learn to control its movements (moving the arm and using the end-effector).
In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.
In `PandaReach`, the robot must place its end-effector at a target position (green ball).
We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This is in contrast to a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.
Also, we're going to use *end-effector displacement control*: the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg" alt="Robotics"/>
This way **the training will be easier**.
In `PandaReachDense-v2`, the robotic arm must place its end-effector at a target position (green ball).
```python
import gym
env_id = "PandaReachDense-v2"
# Create the env
env = gym.make(env_id)
# Get the state space and action space
s_size = env.observation_space.shape
a_size = env.action_space
```
```python
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation
```
The observation space **is a dictionary with 3 different elements**:
- `achieved_goal`: (x,y,z) current position of the end-effector.
- `desired_goal`: (x,y,z) target position the end-effector must reach.
- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).
Given it's a dictionary as observation, **we will need to use a MultiInputPolicy policy instead of MlpPolicy**.
```python
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action
```
The action space is a vector with 3 values:
- Control x, y, z movement
Now it's your turn:
1. Define the environment called "PandaReachDense-v2"
2. Make a vectorized environment
3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
4. Create the A2C Model (don't forget verbose=1 to print the training logs).
5. Train it for 1M Timesteps
6. Save the model and VecNormalize statistics when saving the agent
7. Evaluate your agent
8. Publish your trained model on the Hub 🔥 with `package_to_hub`
### Solution (fill the todo)
```python
# 1 - 2
env_id = "PandaReachDense-v2"
env = make_vec_env(env_id, n_envs=4)
# 3
env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)
# 4
model = A2C(policy="MultiInputPolicy", env=env, verbose=1)
# 5
model.learn(1_000_000)
```
```python
# 6
model_name = "a2c-PandaReachDense-v2"
model.save(model_name)
env.save("vec_normalize.pkl")
# 7
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v2")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
# do not update them at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False
# Load the agent
model = A2C.load(model_name)
mean_reward, std_reward = evaluate_policy(model, eval_env)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
# 8
package_to_hub(
model=model,
model_name=f"a2c-{env_id}",
model_architecture="A2C",
env_id=env_id,
eval_env=eval_env,
repo_id=f"ThomasSimonini/a2c-{env_id}", # TODO: Change the username
commit_message="Initial commit",
)
```
## Some additional challenges 🏆
The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?
If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (more sample-efficient algorithms suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: unlike in simulation, **if you move your robotic arm too much, you risk breaking it**.
PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1
And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html
Here are some ideas to go further:
* Train for more steps
* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=AntBulletEnv-v0
* **Push your new trained model** on the Hub 🔥
See you on Unit 7! 🔥
## Keep learning, stay awesome 🤗
@@ -0,0 +1,25 @@
# Introduction [[introduction]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png" alt="Thumbnail"/>
In unit 4, we learned about our first Policy-Based algorithm called **Reinforce**.
In Policy-Based methods, **we aim to optimize the policy directly without using a value function**. More precisely, Reinforce is part of a subclass of *Policy-Based Methods* called *Policy-Gradient methods*. This subclass optimizes the policy directly by **estimating the weights of the optimal policy using Gradient Ascent**.
We saw that Reinforce worked well. However, because we use Monte-Carlo sampling to estimate return (we use an entire episode to calculate the return), **we have significant variance in policy gradient estimation**.
Remember that the policy gradient estimation is **the direction of the steepest increase in return**. In other words, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, **leads to slower training since we need a lot of samples to mitigate it**.
So, today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and Policy-Based methods that help to stabilize the training by reducing the variance:
- *An Actor* that controls **how our agent behaves** (Policy-Based method)
- *A Critic* that measures **how good the taken action is** (Value-Based method)
We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train two robots:
- A spider 🕷️ to learn to move.
- A robotic arm 🦾 to move in the correct position.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif" alt="Environments"/>
Sounds exciting? Let's get started!
@@ -0,0 +1,30 @@
# The Problem of Variance in Reinforce [[the-problem-of-variance-in-reinforce]]
In Reinforce, we want to **increase the probability of actions in a trajectory proportional to how high the return is**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/pg.jpg" alt="Reinforce"/>
- If the **return is high**, we will **push up** the probabilities of the (state, action) combinations.
- Else, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations.
This return \\(R(\tau)\\) is calculated using a *Monte-Carlo sampling*. We collect a trajectory and calculate the discounted return, **and use this score to increase or decrease the probability of every action taken in that trajectory**. If the return is good, all actions will be “reinforced” by increasing their likelihood of being taken.
\\(R(\tau) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...\\)
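The discounted return above can be computed by folding the reward sequence from the back, a minimal sketch (the function name `discounted_return` is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R(tau) = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    # Walk the trajectory backwards so each step discounts everything after it
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```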
The advantage of this method is that **it's unbiased. Since we're not estimating the return**, we use only the true return we obtain.
Given the stochasticity of the environment (random events during an episode) and stochasticity of the policy, **trajectories can lead to different returns, which can lead to high variance**. Consequently, the same starting state can lead to very different returns.
Because of this, **the return starting at the same state can vary significantly across episodes**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/variance.jpg" alt="variance"/>
The solution is to mitigate the variance by **using a large number of trajectories, hoping that the variance introduced in any one trajectory will be reduced in aggregate and provide a "true" estimation of the return.**
However, increasing the batch size significantly **reduces sample efficiency**. So we need to find additional mechanisms to reduce the variance.
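We can see both effects on a toy example: averaging over more trajectories shrinks the variance of the return estimate, but each estimate then costs many more samples. A minimal sketch with made-up Gaussian returns (all numbers here are illustrative, not from any environment):

```python
import random
import statistics

random.seed(0)

def sample_return():
    # Toy stochastic return: same start state, noisy outcomes
    return 10.0 + random.gauss(0.0, 5.0)

def variance_of_estimate(batch_size, n_estimates=2000):
    # Variance of the mean-return estimate built from `batch_size` trajectories
    estimates = [
        statistics.mean(sample_return() for _ in range(batch_size))
        for _ in range(n_estimates)
    ]
    return statistics.variance(estimates)

small_batch = variance_of_estimate(batch_size=1)
large_batch = variance_of_estimate(batch_size=64)
print(small_batch > large_batch)  # True: more trajectories, lower variance
```

The cost is sample efficiency: the `batch_size=64` estimate consumed 64 times more trajectories per update, which motivates the Actor-Critic approach studied in this unit.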
---
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
---