{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "private_outputs": true,
      "collapsed_sections": [
        "tF42HvI7-gs5"
      ],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU",
    "gpuClass": "standard"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit6/unit6.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Unit 6: Advantage Actor-Critic (A2C) using Robotics Simulations with Panda-Gym 🤖\n",
        "\n",
        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png\" alt=\"Thumbnail\"/>\n",
        "\n",
        "In this notebook, you'll learn to use A2C with [Panda-Gym](https://github.com/qgallouedec/panda-gym). You're going **to train a robotic arm** (a Franka Emika Panda robot) to perform a task:\n",
        "\n",
        "- `Reach`: the robot must place its end-effector at a target position.\n",
        "\n",
        "After that, you'll be able **to train on other robotics tasks**.\n"
      ],
      "metadata": {
        "id": "-PTReiOw-RAN"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 🎮 Environments:\n",
        "\n",
        "- [Panda-Gym](https://github.com/qgallouedec/panda-gym)\n",
        "\n",
        "### 📚 RL-Library:\n",
        "\n",
        "- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)"
      ],
      "metadata": {
        "id": "QInFitfWno1Q"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
      ],
      "metadata": {
        "id": "2CcdX4g3oFlp"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Objectives of this notebook 🏆\n",
        "\n",
        "At the end of the notebook, you will:\n",
        "\n",
        "- Be able to use **Panda-Gym**, the environment library.\n",
        "- Be able to **train robots using A2C**.\n",
        "- Understand why **we need to normalize the input**.\n",
        "- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n"
      ],
      "metadata": {
        "id": "MoubJX20oKaQ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## This notebook is from the Deep Reinforcement Learning Course\n",
        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>\n",
        "\n",
        "In this free course, you will:\n",
        "\n",
        "- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
        "- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
        "- 🤖 Train **agents in unique environments**.\n",
        "\n",
        "And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course\n",
        "\n",
        "Don’t forget to **<a href=\"http://eepurl.com/ic5ZUD\">sign up for the course</a>** (we collect your email so we can **send you the links when each Unit is published and give you information about the challenges and updates**).\n",
        "\n",
        "The best way to keep in touch is to join our Discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
      ],
      "metadata": {
        "id": "DoUNkTExoUED"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Prerequisites 🏗️\n",
        "Before diving into the notebook, you need to:\n",
        "\n",
        "🔲 📚 Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) 🤗"
      ],
      "metadata": {
        "id": "BTuQAUAPoa5E"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Let's train our first robots 🤖"
      ],
      "metadata": {
        "id": "iajHvVDWoo01"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and get the following results:\n",
        "\n",
        "- `PandaReachDense-v3`: get a result of >= -3.5.\n",
        "\n",
        "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model; **the result = mean_reward - std of reward**.\n",
        "\n",
        "For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
      ],
      "metadata": {
        "id": "zbOENTE2os_D"
      }
    },
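    {
      "cell_type": "markdown",
      "source": [
        "In code terms, the score is computed like this (an illustrative sketch with made-up numbers, not the leaderboard's actual code):"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Illustrative only: how the leaderboard result is derived from an evaluation\n",
        "mean_reward, std_reward = -1.80, 1.20  # hypothetical evaluation output\n",
        "result = mean_reward - std_reward  # must be >= -3.5 for PandaReachDense-v3\n",
        "print(f\"result = {result:.2f}\")"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },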
    {
      "cell_type": "markdown",
      "source": [
        "## Set the GPU 💪\n",
        "- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`.\n",
        "\n",
        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
      ],
      "metadata": {
        "id": "PU4FVzaoM6fC"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "- `Hardware Accelerator > GPU`\n",
        "\n",
        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
      ],
      "metadata": {
        "id": "KV0NyFdQM9ZG"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Create a virtual display 🔽\n",
        "\n",
        "During the notebook, we'll need to generate a replay video. To do so, on Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).\n",
        "\n",
        "Hence, the following cell will install the libraries and create and run a virtual screen 🖥"
      ],
      "metadata": {
        "id": "bTpYcVZVMzUI"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jV6wjQ7Be7p5"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!apt install python-opengl\n",
        "!apt install ffmpeg\n",
        "!apt install xvfb\n",
        "!pip3 install pyvirtualdisplay"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Virtual display\n",
        "from pyvirtualdisplay import Display\n",
        "\n",
        "virtual_display = Display(visible=0, size=(1400, 900))\n",
        "virtual_display.start()"
      ],
      "metadata": {
        "id": "ww5PQH1gNLI4"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Install dependencies 🔽\n",
        "\n",
        "The first step is to install the dependencies; we’ll install several of them:\n",
        "- `gymnasium`\n",
        "- `panda-gym`: Contains the robotic arm environments.\n",
        "- `stable-baselines3`: The SB3 deep reinforcement learning library.\n",
        "- `huggingface_sb3`: Additional code for Stable-Baselines3 to load and upload models from the Hugging Face 🤗 Hub.\n",
        "- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.\n",
        "\n",
        "⏲ The installation can **take 10 minutes**."
      ],
      "metadata": {
        "id": "e1obkbdJ_KnG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install stable-baselines3[extra]\n",
        "!pip install gymnasium"
      ],
      "metadata": {
        "id": "TgZUkjKYSgvn"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install huggingface_sb3\n",
        "!pip install huggingface_hub\n",
        "!pip install panda_gym"
      ],
      "metadata": {
        "id": "ABneW6tOSpyU"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Import the packages 📦"
      ],
      "metadata": {
        "id": "QTep3PQQABLr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "\n",
        "import gymnasium as gym\n",
        "import panda_gym\n",
        "\n",
        "from huggingface_sb3 import load_from_hub, package_to_hub\n",
        "\n",
        "from stable_baselines3 import A2C\n",
        "from stable_baselines3.common.evaluation import evaluate_policy\n",
        "from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
        "from stable_baselines3.common.env_util import make_vec_env\n",
        "\n",
        "from huggingface_hub import notebook_login"
      ],
      "metadata": {
        "id": "HpiB8VdnQ7Bk"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## PandaReachDense-v3 🦾\n",
        "\n",
        "The agent we're going to train is a robotic arm that needs to be controlled (by moving the arm and using the end-effector).\n",
        "\n",
        "In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.\n",
        "\n",
        "In `PandaReach`, the robot must place its end-effector at a target position (green ball).\n",
        "\n",
        "We're going to use the dense version of this environment. This means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This contrasts with a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.\n",
        "\n",
        "Also, we're going to use *end-effector displacement control*: the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).\n",
        "\n",
        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg\" alt=\"Robotics\"/>\n",
        "\n",
        "This way **the training will be easier**.\n"
      ],
      "metadata": {
        "id": "lfBwIS_oAVXI"
      }
    },
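    {
      "cell_type": "markdown",
      "source": [
        "To make the dense vs. sparse distinction concrete, here is a minimal sketch of how such reward functions are typically computed from the distance between `achieved_goal` and `desired_goal`. This is illustrative only (the threshold value is an assumption), not panda-gym's actual code:"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "def dense_reward(achieved_goal, desired_goal):\n",
        "    # Dense: a reward at every timestep, higher (less negative) as we get closer\n",
        "    return -np.linalg.norm(achieved_goal - desired_goal)\n",
        "\n",
        "def sparse_reward(achieved_goal, desired_goal, threshold=0.05):\n",
        "    # Sparse: only distinguishes success (0) from failure (-1)\n",
        "    distance = np.linalg.norm(achieved_goal - desired_goal)\n",
        "    return 0.0 if distance < threshold else -1.0\n",
        "\n",
        "goal = np.array([0.1, 0.2, 0.3])\n",
        "position = np.array([0.0, 0.2, 0.3])\n",
        "print(dense_reward(position, goal), sparse_reward(position, goal))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },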
    {
      "cell_type": "markdown",
      "source": [
        "### Create the environment\n",
        "\n",
        "#### The environment 🎮\n",
        "\n",
        "In `PandaReachDense-v3`, the robotic arm must place its end-effector at a target position (green ball)."
      ],
      "metadata": {
        "id": "frVXOrnlBerQ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "env_id = \"PandaReachDense-v3\"\n",
        "\n",
        "# Create the env\n",
        "env = gym.make(env_id)\n",
        "\n",
        "# Get the state space and action space\n",
        "s_size = env.observation_space.shape\n",
        "a_size = env.action_space"
      ],
      "metadata": {
        "id": "zXzAu3HYF1WD"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"_____OBSERVATION SPACE_____ \\n\")\n",
        "print(\"The State Space is: \", s_size)\n",
        "print(\"Sample observation\", env.observation_space.sample())  # Get a random observation"
      ],
      "metadata": {
        "id": "E-U9dexcF-FB"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The observation space **is a dictionary with 3 different elements**:\n",
        "- `achieved_goal`: (x,y,z) the current position of the end-effector.\n",
        "- `desired_goal`: (x,y,z) the target position for the end-effector.\n",
        "- `observation`: the position (x,y,z) and velocity (vx, vy, vz) of the end-effector.\n",
        "\n",
        "Since the observation is a dictionary, **we will need to use a MultiInputPolicy instead of an MlpPolicy**."
      ],
      "metadata": {
        "id": "g_JClfElGFnF"
      }
    },
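    {
      "cell_type": "markdown",
      "source": [
        "As an optional sanity check, we can reset the environment and look at the dictionary observation directly (the keys match the printout above):"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Optional sanity check: the observation returned by the env is a dictionary\n",
        "obs, info = env.reset()\n",
        "for key, value in obs.items():\n",
        "    print(key, value.shape, value)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },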
    {
      "cell_type": "code",
      "source": [
        "print(\"\\n _____ACTION SPACE_____ \\n\")\n",
        "print(\"The Action Space is: \", a_size)\n",
        "print(\"Action Space Sample\", env.action_space.sample())  # Take a random action"
      ],
      "metadata": {
        "id": "ib1Kxy4AF-FC"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The action space is a vector with 3 values:\n",
        "- Control x, y, z movement"
      ],
      "metadata": {
        "id": "5MHTHEHZS4yp"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Normalize observation and rewards"
      ],
      "metadata": {
        "id": "S5sXcg469ysB"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).\n",
        "\n",
        "For that purpose, there is a wrapper that will compute a running average and standard deviation of the input features.\n",
        "\n",
        "We also normalize rewards with this same wrapper by adding `norm_reward=True`.\n",
        "\n",
        "[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)"
      ],
      "metadata": {
        "id": "1ZyX6qf3Zva9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "env = make_vec_env(env_id, n_envs=4)\n",
        "\n",
        "# Adding this wrapper to normalize the observation and the reward\n",
        "env = # TODO: Add the wrapper"
      ],
      "metadata": {
        "id": "1RsDtHHAQ9Ie"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Solution"
      ],
      "metadata": {
        "id": "tF42HvI7-gs5"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "env = make_vec_env(env_id, n_envs=4)\n",
        "\n",
        "env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)"
      ],
      "metadata": {
        "id": "2O67mqgC-hol"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Create the A2C Model 🤖\n",
        "\n",
        "For more information about the A2C implementation in Stable-Baselines3, check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes\n",
        "\n",
        "To find the best parameters, I checked the [official agents trained by the Stable-Baselines3 team](https://huggingface.co/sb3)."
      ],
      "metadata": {
        "id": "4JmEVU6z1ZA-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "model = # Create the A2C model and try to find the best parameters"
      ],
      "metadata": {
        "id": "vR3T4qFt164I"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Solution"
      ],
      "metadata": {
        "id": "nWAuOOLh-oQf"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "model = A2C(policy=\"MultiInputPolicy\",\n",
        "            env=env,\n",
        "            verbose=1)"
      ],
      "metadata": {
        "id": "FKFLY54T-pU1"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Train the A2C agent 🏃\n",
        "- Let's train our agent for 1,000,000 timesteps; don't forget to use the GPU on Colab. It will take approximately 25-40 minutes."
      ],
      "metadata": {
        "id": "opyK3mpJ1-m9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "model.learn(1_000_000)"
      ],
      "metadata": {
        "id": "4TuGHZD7RF1G"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Save the model and the VecNormalize statistics when saving the agent\n",
        "model.save(\"a2c-PandaReachDense-v3\")\n",
        "env.save(\"vec_normalize.pkl\")"
      ],
      "metadata": {
        "id": "MfYtjj19cKFr"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Evaluate the agent 📈\n",
        "- Now that our agent is trained, we need to **check its performance**.\n",
        "- Stable-Baselines3 provides a method to do that: `evaluate_policy`"
      ],
      "metadata": {
        "id": "01M9GCd32Ig-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
        "\n",
        "# Load the saved statistics\n",
        "eval_env = DummyVecEnv([lambda: gym.make(\"PandaReachDense-v3\")])\n",
        "eval_env = VecNormalize.load(\"vec_normalize.pkl\", eval_env)\n",
        "\n",
        "# We need to override the render_mode\n",
        "eval_env.render_mode = \"rgb_array\"\n",
        "\n",
        "# Do not update the normalization statistics at test time\n",
        "eval_env.training = False\n",
        "# Reward normalization is not needed at test time\n",
        "eval_env.norm_reward = False\n",
        "\n",
        "# Load the agent\n",
        "model = A2C.load(\"a2c-PandaReachDense-v3\")\n",
        "\n",
        "mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
        "\n",
        "print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")"
      ],
      "metadata": {
        "id": "liirTVoDkHq3"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Publish your trained model on the Hub 🔥\n",
        "Now that we've seen good results after training, we can publish our trained model on the Hub with one line of code.\n",
        "\n",
        "📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n"
      ],
      "metadata": {
        "id": "44L9LVQaavR8"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "By using `package_to_hub`, as we already mentioned in previous units, **you evaluate the agent, record a replay, generate a model card, and push it to the Hub**.\n",
        "\n",
        "This way:\n",
        "- You can **showcase your work** 🔥\n",
        "- You can **visualize your agent playing** 👀\n",
        "- You can **share with the community an agent that others can use** 💾\n",
        "- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
      ],
      "metadata": {
        "id": "MkMk99m8bgaQ"
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JquRrWytA6eo"
      },
      "source": [
        "To be able to share your model with the community, there are three more steps to follow:\n",
        "\n",
        "1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join\n",
        "\n",
        "2️⃣ Sign in, then store your authentication token from the Hugging Face website.\n",
        "- Create a new token (https://huggingface.co/settings/tokens) **with the write role**\n",
        "\n",
        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
        "\n",
        "- Copy the token\n",
        "- Run the cell below and paste the token"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GZiFBBlzxzxY"
      },
      "outputs": [],
      "source": [
        "notebook_login()\n",
        "!git config --global credential.helper store"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_tsf2uv0g_4p"
      },
      "source": [
        "If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FGNh9VsZok0i"
      },
      "source": [
        "3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `package_to_hub()` function"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "For this environment, **running this cell can take approximately 10 minutes**"
      ],
      "metadata": {
        "id": "juxItTNf1W74"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from huggingface_sb3 import package_to_hub\n",
        "\n",
        "package_to_hub(\n",
        "    model=model,\n",
        "    model_name=f\"a2c-{env_id}\",\n",
        "    model_architecture=\"A2C\",\n",
        "    env_id=env_id,\n",
        "    eval_env=eval_env,\n",
        "    repo_id=f\"ThomasSimonini/a2c-{env_id}\",  # TODO: Change the username\n",
        "    commit_message=\"Initial commit\",\n",
        ")"
      ],
      "metadata": {
        "id": "V1N8r8QVwcCE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Some additional challenges 🏆\n",
        "The best way to learn **is to try things on your own**! Why not try `PandaPickAndPlace-v3`?\n",
        "\n",
        "If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (more sample-efficient algorithms suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation, **if you move your robotic arm too much, you risk breaking it**.\n",
        "\n",
        "PandaPickAndPlace-v1 (this model uses the v1 version of the environment): https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1\n",
        "\n",
        "And don't hesitate to check the panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html\n",
        "\n",
        "Here are the steps to train another agent (optional):\n",
        "\n",
        "1. Define the environment called \"PandaPickAndPlace-v3\"\n",
        "2. Make a vectorized environment\n",
        "3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)\n",
        "4. Create the A2C Model (don't forget verbose=1 to print the training logs).\n",
        "5. Train it for 1M timesteps\n",
        "6. Save the model and the VecNormalize statistics when saving the agent\n",
        "7. Evaluate your agent\n",
        "8. Publish your trained model on the Hub 🔥 with `package_to_hub`\n"
      ],
      "metadata": {
        "id": "G3xy3Nf3c2O1"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Solution (optional)"
      ],
      "metadata": {
        "id": "sKGbFXZq9ikN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 1 - 2\n",
        "env_id = \"PandaPickAndPlace-v3\"\n",
        "env = make_vec_env(env_id, n_envs=4)\n",
        "\n",
        "# 3\n",
        "env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)\n",
        "\n",
        "# 4\n",
        "model = A2C(policy=\"MultiInputPolicy\",\n",
        "            env=env,\n",
        "            verbose=1)\n",
        "\n",
        "# 5\n",
        "model.learn(1_000_000)"
      ],
      "metadata": {
        "id": "J-cC-Feg9iMm"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# 6\n",
        "model_name = \"a2c-PandaPickAndPlace-v3\"\n",
        "model.save(model_name)\n",
        "env.save(\"vec_normalize.pkl\")\n",
        "\n",
        "# 7\n",
        "from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
        "\n",
        "# Load the saved statistics\n",
        "eval_env = DummyVecEnv([lambda: gym.make(\"PandaPickAndPlace-v3\")])\n",
        "eval_env = VecNormalize.load(\"vec_normalize.pkl\", eval_env)\n",
        "\n",
        "# Do not update the normalization statistics at test time\n",
        "eval_env.training = False\n",
        "# Reward normalization is not needed at test time\n",
        "eval_env.norm_reward = False\n",
        "\n",
        "# Load the agent\n",
        "model = A2C.load(model_name)\n",
        "\n",
        "mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
        "\n",
        "print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")\n",
        "\n",
        "# 8\n",
        "package_to_hub(\n",
        "    model=model,\n",
        "    model_name=f\"a2c-{env_id}\",\n",
        "    model_architecture=\"A2C\",\n",
        "    env_id=env_id,\n",
        "    eval_env=eval_env,\n",
        "    repo_id=f\"ThomasSimonini/a2c-{env_id}\",  # TODO: Change the username\n",
        "    commit_message=\"Initial commit\",\n",
        ")"
      ],
      "metadata": {
        "id": "-UnlKLmpg80p"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "See you in Unit 7! 🔥\n",
        "## Keep learning, stay awesome 🤗"
      ],
      "metadata": {
        "id": "usatLaZ8dM4P"
      }
    }
  ]
}