From 5aeaf3b5c4dc293a77ae797c17e057cfd3fa0c0e Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Fri, 30 Dec 2022 19:01:28 +0100 Subject: [PATCH 01/21] Adding updated A2C Unit --- units/en/_toctree.yml | 15 ++++- units/en/unit6/additional-readings.mdx | 9 +++ units/en/unit6/advantage-actor-critic.mdx | 68 +++++++++++++++++++++++ units/en/unit6/conclusion.mdx | 15 +++++ units/en/unit6/hands-on.mdx | 35 ++++++++++++ units/en/unit6/introduction.mdx | 25 +++++++++ units/en/unit6/variance-problem.mdx | 30 ++++++++++ 7 files changed, 196 insertions(+), 1 deletion(-) create mode 100644 units/en/unit6/additional-readings.mdx create mode 100644 units/en/unit6/advantage-actor-critic.mdx create mode 100644 units/en/unit6/conclusion.mdx create mode 100644 units/en/unit6/hands-on.mdx create mode 100644 units/en/unit6/introduction.mdx create mode 100644 units/en/unit6/variance-problem.mdx diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index a46425e..2843096 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -104,8 +104,21 @@ title: Optuna - local: unitbonus2/hands-on title: Hands-on +- title: Unit 6. Actor Critic methods with Robotics environments + sections: + - local: unit6/introduction + title: Introduction + - local: unit6/variance-problem + title: The Problem of Variance in Reinforce + - local: unit6/advantage-actor-critic + title: Advantage Actor-Critic (A2C) + - local: unit6/hands-on + title: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖 + - local: unit6/conclusion + title: Conclusion + - local: unit6/additional-readings + title: Additional Readings - title: What's next? 
New Units Publishing Schedule sections: - local: communication/publishing-schedule title: Publishing Schedule - diff --git a/units/en/unit6/additional-readings.mdx b/units/en/unit6/additional-readings.mdx new file mode 100644 index 0000000..4361839 --- /dev/null +++ b/units/en/unit6/additional-readings.mdx @@ -0,0 +1,9 @@ +# Additional Readings [[additional-readings]] + +## Bias-variance tradeoff in Reinforcement Learning +If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles: +- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565) +- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/) + +## Advantage Functions +- [Advantage Functions, SpinningUp RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html?highlight=advantage%20functio#advantage-functions) diff --git a/units/en/unit6/advantage-actor-critic.mdx b/units/en/unit6/advantage-actor-critic.mdx new file mode 100644 index 0000000..d0731f0 --- /dev/null +++ b/units/en/unit6/advantage-actor-critic.mdx @@ -0,0 +1,68 @@ +# Advantage Actor-Critic (A2C) [[advantage-actor-critic-a2c]] +## Reducing variance with Actor-Critic methods +The solution to reducing the variance of Reinforce algorithm and training our agent faster and better is to use a combination of policy-based and value-based methods: *the Actor-Critic method*. + +To understand the Actor-Critic, imagine you play a video game. You can play with a friend that will provide you with some feedback. You're the Actor, and your friend is the Critic. + +Actor Critic + +You don't know how to play at the beginning, **so you try some actions randomly**. The Critic observes your action and **provides feedback**. 
+ +Learning from this feedback, **you'll update your policy and be better at playing that game.** + +On the other hand, your friend (Critic) will also update their way to provide feedback so it can be better next time. + +This is the idea behind Actor-Critic. We learn two function approximations: + +- *A policy* that **controls how our agent acts**: \\( \pi_{\theta}(s,a) \\) + +- *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\) + +## The Actor-Critic Process +Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how Actor and Critic improve together during training. + +As we saw, with Actor-Critic methods, there are two function approximations (two neural networks): +- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s,a) \\) +- *Critic*, a **value function** parameterized by w: \\( \hat{q}_{w}(s,a) \\) + +Let's see the training process to understand how Actor and Critic are optimized: +- At each timestep, t, we get the current state \\( S_t\\) from the environment and **pass it as input through our Actor and Critic**. + +- Our Policy takes the state and **outputs an action** \\( A_t \\). + +Step 1 Actor Critic + +- The Critic also takes that action as input and, using \\( S_t\\) and \\( A_t \\), **computes the value of taking that action at that state: the Q-value**. + +Step 2 Actor Critic + +- The action \\( A_t\\) performed in the environment outputs a new state \\( S_{t+1}\\) and a reward \\( R_{t+1} \\). + +Step 3 Actor Critic + +- The Actor updates its policy parameters using the Q value. + +Step 4 Actor Critic + +- Thanks to its updated parameters, the Actor produces the next action to take, \\( A_{t+1} \\), given the new state \\( S_{t+1} \\). + +- The Critic then updates its value parameters. 
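+The update loop above can be sketched in a few lines of code. The snippet below is a minimal *tabular* one-step actor-critic — a toy sketch for intuition only, not the Stable-Baselines3 implementation used later in this unit. As an assumption for simplicity, it uses a state-value critic with the TD error as the learning signal (rather than the Q-value critic pictured above); all names and step sizes are illustrative:

```python
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # Actor: tabular logits for pi_theta(s, a)
v = np.zeros(n_states)                   # Critic: tabular state values V(s)
alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.99

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def actor_critic_step(s, a, r, s_next, done):
    """One online update: the Critic moves V(s) toward the TD target, and the
    Actor is pushed along grad log pi(a|s) weighted by the TD error."""
    td_target = r + (0.0 if done else gamma * v[s_next])
    td_error = td_target - v[s]                # delta = r + gamma*V(s') - V(s)
    v[s] += alpha_critic * td_error            # Critic update
    grad_log_pi = -softmax(theta[s])           # grad of log softmax policy...
    grad_log_pi[a] += 1.0                      # ...is one-hot(a) - pi(.|s)
    theta[s] += alpha_actor * td_error * grad_log_pi  # Actor update
    return td_error

# A single illustrative transition: state 0, action 1, reward 1.0, next state 2
delta = actor_critic_step(s=0, a=1, r=1.0, s_next=2, done=False)
```

Because `theta` and `v` start at zero, this first transition yields a TD error of 1.0, which raises V(0) and makes action 1 more probable in state 0 — the same push-and-correct dynamic between Actor and Critic described in the steps above.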
+ +Step 5 Actor Critic + +## Adding "Advantage" in Actor Critic (A2C) +We can stabilize learning further by **using the Advantage function as the Critic instead of the Action value function**. + +The idea is that the Advantage function calculates the relative advantage of an action compared to the other actions possible at that state: **how much better taking that action at a state is compared to the average value of the state**. It subtracts the mean value of the state from the state-action value: + +Advantage Function + +In other words, this function calculates **the extra reward we get if we take this action at that state compared to the mean reward we get at that state**. + +The extra reward is what's beyond the expected value of that state. +- If A(s,a) > 0: our gradient is **pushed in that direction**. +- If A(s,a) < 0 (our action does worse than the average value of that state), **our gradient is pushed in the opposite direction**. + +The problem with implementing this advantage function is that it requires two value functions — \\( Q(s,a)\\) and \\( V(s)\\). Fortunately, **we can use the TD error as a good estimator of the advantage function.** + +Advantage Function diff --git a/units/en/unit6/conclusion.mdx b/units/en/unit6/conclusion.mdx new file mode 100644 index 0000000..5502e31 --- /dev/null +++ b/units/en/unit6/conclusion.mdx @@ -0,0 +1,15 @@ +# Conclusion [[conclusion]] + +Congrats on finishing this unit and the tutorial. You've just trained your first virtual robots 🥳. + +**Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section. + +Feel free to train your agent in other environments. The **best way to learn is to try things on your own!** For instance, what about teaching your robot [to stack objects](https://panda-gym.readthedocs.io/en/latest/usage/environments.html#sparce-reward-end-effector-control-default-setting)? 
+ +In the next unit, we will learn to improve Actor-Critic Methods with Proximal Policy Optimization using the [CleanRL library](https://github.com/vwxyzjn/cleanrl). Then we'll study how to speed up the process with the [Sample Factory library](https://samplefactory.dev/). You'll train your PPO agents in these environments: VizDoom, Racing Car, and a 3D FPS. + +TODO: IMAGE of the environment Vizdoom + ED + +Finally, with your feedback, we want **to improve and update the course iteratively**. If you have some, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9) + +### Keep learning, stay awesome 🤗, diff --git a/units/en/unit6/hands-on.mdx b/units/en/unit6/hands-on.mdx new file mode 100644 index 0000000..28ca5c7 --- /dev/null +++ b/units/en/unit6/hands-on.mdx @@ -0,0 +1,35 @@ +# Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖 [[hands-on]] + + + + + +Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments, where you'll train three robots: + +- A bipedal walker 🚶 to learn to walk. +- A spider 🕷️ to learn to move. +- A robotic arm 🦾 to move objects in the correct position. 
+ +We're going to use two Robotics environments: + +- [PyBullet](https://github.com/bulletphysics/bullet3) +- [panda-gym](https://github.com/qgallouedec/panda-gym) + +TODO: ADD IMAGE OF THREE + + +To validate this hands-on for the certification process, you need to push your three trained models to the Hub and get: + +TODO ADD CERTIFICATION ELEMENTS + +To find your result, [go to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward** + +For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process + +**To start the hands-on, click on the Open In Colab button** 👇: + +[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit6/unit6.ipynb) diff --git a/units/en/unit6/introduction.mdx b/units/en/unit6/introduction.mdx new file mode 100644 index 0000000..8d3e6a6 --- /dev/null +++ b/units/en/unit6/introduction.mdx @@ -0,0 +1,25 @@ +# Introduction [[introduction]] + +TODO: ADD THUMBNAIL + +In Unit 4, we learned about our first Policy-Based algorithm called **Reinforce**. + +In Policy-Based methods, **we aim to optimize the policy directly without using a value function**. More precisely, Reinforce is part of a subclass of *Policy-Based Methods* called *Policy-Gradient methods*. This subclass optimizes the policy directly by **estimating the weights of the optimal policy using Gradient Ascent**. + +We saw that Reinforce worked well. However, because we use Monte-Carlo sampling to estimate the return (we use an entire episode to calculate the return), **we have significant variance in policy gradient estimation**. + +Remember that the policy gradient estimation is **the direction of the steepest increase in return**. 
In other words, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, **leads to slower training since we need a lot of samples to mitigate it**. + +So, today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and policy-based methods that helps stabilize training by reducing the variance: +- *An Actor* that controls **how our agent behaves** (policy-based method) +- *A Critic* that measures **how good the action taken is** (value-based method) + + +We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**, where we'll train three robots: +- A bipedal walker 🚶 to learn to walk. +- A spider 🕷️ to learn to move. +- A robotic arm 🦾 to move objects in the correct position. + +TODO: ADD IMAGE OF THREE + +Sounds exciting? Let's get started! diff --git a/units/en/unit6/variance-problem.mdx b/units/en/unit6/variance-problem.mdx new file mode 100644 index 0000000..bb8df6a --- /dev/null +++ b/units/en/unit6/variance-problem.mdx @@ -0,0 +1,30 @@ +# The Problem of Variance in Reinforce [[the-problem-of-variance-in-reinforce]] + +In Reinforce, we want to **increase the probability of actions in a trajectory proportional to how high the return is**. + + +Reinforce + +- If the **return is high**, we will **push up** the probabilities of the (state, action) combinations. +- Else, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations. + +This return \\(R(\tau)\\) is calculated using *Monte-Carlo sampling*. Indeed, we collect a trajectory and calculate the discounted return, **and use this score to increase or decrease the probability of every action taken in that trajectory**. If the return is good, all actions will be “reinforced” by increasing their likelihood of being taken. 
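+As a toy illustration of why this Monte-Carlo estimate is noisy, the sketch below simulates an assumed coin-flip reward process (not a real RL environment — the episode length, rewards, and seed are all illustrative assumptions) and measures how much the discounted return varies across episodes that start from the same state:

```python
import numpy as np

rng = np.random.default_rng(42)
gamma = 0.99

def discounted_return(rewards, gamma):
    """Compute R(tau) = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    ret = 0.0
    for r in reversed(rewards):  # accumulate backwards: ret = r + gamma * ret
        ret = r + gamma * ret
    return ret

# 1000 episodes from the same start state; each step's reward is +1 or -1 at random
returns = np.array([
    discounted_return(rng.choice([-1.0, 1.0], size=10), gamma)
    for _ in range(1000)
])
print(f"mean return: {returns.mean():.2f}, std across episodes: {returns.std():.2f}")

# Averaging over batches of trajectories shrinks the noise of the estimate
batch_estimates = returns.reshape(10, 100).mean(axis=1)
print(f"std of 100-episode batch estimates: {batch_estimates.std():.2f}")
```

Individual returns swing widely around their mean, while the batch averages are far tighter — exactly the "use many trajectories" mitigation discussed below, and exactly why it costs so many samples.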
+ +\\(R(\tau) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...\\) + +The advantage of this method is that **it’s unbiased**. Since we’re not estimating the return, we use only the true return we obtain. + +But the problem is that **the variance is high, since trajectories can lead to different returns** due to the stochasticity of the environment (random events during the episode) and the stochasticity of the policy. Consequently, the same starting state can lead to very different returns. +Because of this, **the return starting at the same state can vary significantly across episodes**. + +variance + +The solution is to mitigate the variance by **using a large number of trajectories, hoping that the variance introduced in any one trajectory will be reduced in aggregate and provide a "true" estimation of the return.** + +However, increasing the batch size significantly **reduces sample efficiency**. So we need to find additional mechanisms to reduce the variance. + +--- +If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles: +- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565) +- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/) +--- From 143f169a654875f52044d409c4e2c13f726139d3 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Fri, 30 Dec 2022 19:05:40 +0100 Subject: [PATCH 02/21] Adding reading resources --- units/en/unit6/additional-readings.mdx | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/units/en/unit6/additional-readings.mdx b/units/en/unit6/additional-readings.mdx index 4361839..5e7f386 100644 --- a/units/en/unit6/additional-readings.mdx +++ b/units/en/unit6/additional-readings.mdx @@ -1,9 +1,16 @@ # Additional Readings 
[[additional-readings]] ## Bias-variance tradeoff in Reinforcement Learning + If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles: - [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565) - [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/) ## Advantage Functions + - [Advantage Functions, SpinningUp RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html?highlight=advantage%20functio#advantage-functions) + +## Actor Critic + +- [Foundations of Deep RL Series, L3 Policy Gradients and Advantage Estimation by Pieter Abbeel](https://www.youtube.com/watch?v=AKbX1Zvo7r8) +- [A2C Paper: Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783v2) From 526d5fd48c64fc54d451af6cd4c1d234e3f7ab8e Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sat, 31 Dec 2022 11:13:23 +0100 Subject: [PATCH 03/21] Create requirements-unit6.txt --- notebooks/unit6/requirements-unit6.txt | 4 ++++ 1 file changed, 4 insertions(+) create mode 100644 notebooks/unit6/requirements-unit6.txt diff --git a/notebooks/unit6/requirements-unit6.txt b/notebooks/unit6/requirements-unit6.txt new file mode 100644 index 0000000..4ac4ded --- /dev/null +++ b/notebooks/unit6/requirements-unit6.txt @@ -0,0 +1,4 @@ +gymnasium +panda_gym==2.0.0 +stable-baselines3[extra] +huggingface_sb3 From d733a98e390210e3d070ff02a63666a0fa34b332 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sat, 31 Dec 2022 11:17:49 +0100 Subject: [PATCH 04/21] Update requirements-unit6.txt --- notebooks/unit6/requirements-unit6.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/notebooks/unit6/requirements-unit6.txt b/notebooks/unit6/requirements-unit6.txt index 4ac4ded..0cfebcb 100644 --- 
a/notebooks/unit6/requirements-unit6.txt +++ b/notebooks/unit6/requirements-unit6.txt @@ -2,3 +2,4 @@ gymnasium panda_gym==2.0.0 stable-baselines3[extra] huggingface_sb3 +pyglet==1.5.1 From be34a485d0dfb6d6483f3e3983d5f5556456ddf9 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sat, 31 Dec 2022 11:19:23 +0100 Subject: [PATCH 05/21] Update requirements-unit6.txt --- notebooks/unit6/requirements-unit6.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/unit6/requirements-unit6.txt b/notebooks/unit6/requirements-unit6.txt index 0cfebcb..a346f80 100644 --- a/notebooks/unit6/requirements-unit6.txt +++ b/notebooks/unit6/requirements-unit6.txt @@ -1,5 +1,5 @@ gymnasium -panda_gym==2.0.0 stable-baselines3[extra] huggingface_sb3 +panda_gym==2.0.0 pyglet==1.5.1 From b835b898fce7e034f189140fd88263e9f6b6ae37 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sat, 31 Dec 2022 20:36:44 +0100 Subject: [PATCH 06/21] Update conclusion.mdx --- units/en/unit6/conclusion.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit6/conclusion.mdx b/units/en/unit6/conclusion.mdx index 5502e31..68393d7 100644 --- a/units/en/unit6/conclusion.mdx +++ b/units/en/unit6/conclusion.mdx @@ -10,6 +10,6 @@ In the next unit, we will learn to improve Actor-Critic Methods with Proximal Po TODO: IMAGE of the environment Vizdoom + ED -Finally, with your feedback, we want **to improve and update the course iteratively**. If you have some, please ๐Ÿ‘‰ [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9) +Finally, we would love **to hear what you think of the course and how we can improve it**. 
If you have some feedback then, please ๐Ÿ‘‰ [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9) ### Keep learning, stay awesome ๐Ÿค—, From 14bd94d5745d8f9d83f74c834959b6b7f4c5a455 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Sun, 1 Jan 2023 17:29:07 +0100 Subject: [PATCH 07/21] Update conclusion --- units/en/unit6/conclusion.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit6/conclusion.mdx b/units/en/unit6/conclusion.mdx index 68393d7..3da4332 100644 --- a/units/en/unit6/conclusion.mdx +++ b/units/en/unit6/conclusion.mdx @@ -4,7 +4,7 @@ Congrats on finishing this unit and the tutorial. You've just trained your first **Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section. -Feel free to train your agent in other environments. The **best way to learn is to try things on your own!** For instance, what about teaching your robot [to stack objects](https://panda-gym.readthedocs.io/en/latest/usage/environments.html#sparce-reward-end-effector-control-default-setting)? +Feel free to train your agent in other environments. The **best way to learn is to try things on your own!** For instance, what about teaching your robotic arm [to stack objects](https://panda-gym.readthedocs.io/en/latest/usage/environments.html#sparce-reward-end-effector-control-default-setting) or slide objects? In the next unit, we will learn to improve Actor-Critic Methods with Proximal Policy Optimization using the [CleanRL library](https://github.com/vwxyzjn/cleanrl). Then we'll study how to speed up the process with the [Sample Factory library](https://samplefactory.dev/). You'll train your PPO agents in these environments: VizDoom, Racing Car, and a 3D FPS. 
From 1680476a040ce0e4c0a6fce86b94cf2e5d9aff7b Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Sun, 1 Jan 2023 17:30:34 +0100 Subject: [PATCH 08/21] Add unit6 WIP --- notebooks/unit6/unit6.ipynb | 771 ++++++++++++++++++++++++++++++++++++ 1 file changed, 771 insertions(+) create mode 100644 notebooks/unit6/unit6.ipynb diff --git a/notebooks/unit6/unit6.ipynb b/notebooks/unit6/unit6.ipynb new file mode 100644 index 0000000..8ecae3c --- /dev/null +++ b/notebooks/unit6/unit6.ipynb @@ -0,0 +1,771 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "private_outputs": true, + "authorship_tag": "ABX9TyM4Z04oGTU1B2rRuxHfuNly", + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + }, + "accelerator": "GPU", + "gpuClass": "standard" + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym ๐Ÿค–\n", + "\n", + "TODO: ADD THUMBNAIL\n", + "\n", + "In this small notebook you'll learn to use A2C with PyBullet and Panda-Gym two set of robotics environments. \n", + "\n", + "With [PyBullet](https://github.com/bulletphysics/bullet3), you're going to **train robots to walk and run**:\n", + "- `AntBulletEnv-v0` ๐Ÿ•ธ๏ธ More precisely a spider (they say Ant but come on... 
it's a spider ๐Ÿ˜†) ๐Ÿ•ธ๏ธ\n", + "- `HalfCheetahBulletEnv-v0`\n", + "\n", + "Then, with [Panda-Gym](https://github.com/qgallouedec/panda-gym), you're going **to train a robotic arm** (Franka Emika Panda robot) to perform some tasks:\n", + "- `Reach`: the robot must place its end-effector at a target position.\n", + "- `Slide`: the robot has to slide an object to a target position.\n", + "\n", + "After that, you'll be able to train other robotics environments." + ], + "metadata": { + "id": "-PTReiOw-RAN" + } + }, + { + "cell_type": "markdown", + "source": [ + "TODO: ADD VIDEO OF WHAT IT LOOKS LIKE" + ], + "metadata": { + "id": "2VGL_0ncoAJI" + } + }, + { + "cell_type": "markdown", + "source": [ + "### ๐ŸŽฎ Environments: \n", + "\n", + "- [PyBullet](https://github.com/bulletphysics/bullet3)\n", + "- [Panda-Gym](https://github.com/qgallouedec/panda-gym)\n", + "\n", + "###๐Ÿ“š RL-Library: \n", + "\n", + "- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)" + ], + "metadata": { + "id": "QInFitfWno1Q" + } + }, + { + "cell_type": "markdown", + "source": [ + "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)." 
+ ], + "metadata": { + "id": "2CcdX4g3oFlp" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Objectives of this notebook ๐Ÿ†\n", + "\n", + "At the end of the notebook, you will:\n", + "\n", + "- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries.\n", + "- Be able to **train robots using A2C**.\n", + "- Understand why **we need to normalize the input**.\n", + "- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score ๐Ÿ”ฅ.\n", + "\n", + "\n" + ], + "metadata": { + "id": "MoubJX20oKaQ" + } + }, + { + "cell_type": "markdown", + "source": [ + "## This notebook is from the Deep Reinforcement Learning Course\n", + "\"Deep\n", + "\n", + "In this free course, you will:\n", + "\n", + "- ๐Ÿ“– Study Deep Reinforcement Learning in **theory and practice**.\n", + "- ๐Ÿง‘โ€๐Ÿ’ป Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n", + "- ๐Ÿค– Train **agents in unique environments** \n", + "\n", + "And more check ๐Ÿ“š the syllabus ๐Ÿ‘‰ https://simoninithomas.github.io/deep-rl-course\n", + "\n", + "Donโ€™t forget to **sign up to the course** (we are collecting your email to be able toย **send you the links when each Unit is published and give you information about the challenges and updates).**\n", + "\n", + "\n", + "The best way to keep in touch is to join our discord server to exchange with the community and with us ๐Ÿ‘‰๐Ÿป https://discord.gg/ydHrjt3WP5" + ], + "metadata": { + "id": "DoUNkTExoUED" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Prerequisites ๐Ÿ—๏ธ\n", + "Before diving into the notebook, you need to:\n", + "\n", + "๐Ÿ”ฒ ๐Ÿ“š Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) ๐Ÿค— " + ], + "metadata": { + "id": "BTuQAUAPoa5E" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Let's train our first robots ๐Ÿค–" + ], + "metadata": { + 
"id": "iajHvVDWoo01" + } + }, + { + "cell_type": "markdown", + "source": [ + "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to:\n", + "\n", + "TODO ADD CERTIFICATION RECOMMENDATION\n", + "\n", + "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n", + "\n", + "For more information about the certification process, check this section ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process" + ], + "metadata": { + "id": "zbOENTE2os_D" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Set the GPU ๐Ÿ’ช\n", + "- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n", + "\n", + "\"GPU" + ], + "metadata": { + "id": "PU4FVzaoM6fC" + } + }, + { + "cell_type": "markdown", + "source": [ + "- `Hardware Accelerator > GPU`\n", + "\n", + "\"GPU" + ], + "metadata": { + "id": "KV0NyFdQM9ZG" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Create a virtual display ๐Ÿ”ฝ\n", + "\n", + "During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). 
\n", + "\n", + "Hence the following cell will install the librairies and create and run a virtual screen ๐Ÿ–ฅ" + ], + "metadata": { + "id": "bTpYcVZVMzUI" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jV6wjQ7Be7p5" + }, + "outputs": [], + "source": [ + "%%capture\n", + "!apt install python-opengl\n", + "!apt install ffmpeg\n", + "!apt install xvfb\n", + "!pip3 install pyvirtualdisplay" + ] + }, + { + "cell_type": "code", + "source": [ + "# Additional dependencies for RL Baselines3 Zoo\n", + "!apt-get install swig cmake freeglut3-dev " + ], + "metadata": { + "id": "fWyKJCy_NJBX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# Virtual display\n", + "from pyvirtualdisplay import Display\n", + "\n", + "virtual_display = Display(visible=0, size=(1400, 900))\n", + "virtual_display.start()" + ], + "metadata": { + "id": "ww5PQH1gNLI4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Install dependencies ๐Ÿ”ฝ\n", + "The first step is to install the dependencies, weโ€™ll install multiple ones:\n", + "\n", + "- `pybullet`: Contains the walking robots environments.\n", + "- `panda-gym`: Contains the robotics arm environments.\n", + "- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.\n", + "- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face ๐Ÿค— Hub.\n", + "- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.\n", + "\n", + "We're going to install **two versions of gym**:\n", + "- `gym==0.21`: The classical version of gym for PyBullet environments.\n", + "- `gymnasium`: [The new Gym library by Farama Foundation](https://github.com/Farama-Foundation/Gymnasium) for Panda Gym environments." 
+ ], + "metadata": { + "id": "e1obkbdJ_KnG" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit6.txt" + ], + "metadata": { + "id": "69jUeXrLryos" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2yZRi_0bQGPM" + }, + "outputs": [], + "source": [ + "TODO: CHANGE TO THE ONE COMMENTED#!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Import the packages ๐Ÿ“ฆ" + ], + "metadata": { + "id": "QTep3PQQABLr" + } + }, + { + "cell_type": "code", + "source": [ + "import gymnasium as gymnasium\n", + "import panda_gym\n", + "\n", + "import gym\n", + "import pybullet_envs\n", + "\n", + "import os\n", + "\n", + "from huggingface_sb3 import load_from_hub, package_to_hub\n", + "\n", + "from stable_baselines3 import A2C\n", + "from stable_baselines3.common.evaluation import evaluate_policy\n", + "from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n", + "from stable_baselines3.common.env_util import make_vec_env\n", + "\n", + "from huggingface_hub import notebook_login" + ], + "metadata": { + "id": "HpiB8VdnQ7Bk" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Part 1: PyBullet Environments\n" + ], + "metadata": { + "id": "KIqf-N-otczo" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Environment 1: AntBulletEnv-v0 ๐Ÿ•ธ\n", + "\n" + ], + "metadata": { + "id": "lfBwIS_oAVXI" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Create the AntBulletEnv-v0\n", + "#### The environment ๐ŸŽฎ\n", + "In this environment, the agent needs to use correctly its different joints to walk correctly." 
+ ], + "metadata": { + "id": "frVXOrnlBerQ" + } + }, + { + "cell_type": "code", + "source": [ + "import gym # As mentionned we use gym for PyBullet and gymnasium for panda-gym" + ], + "metadata": { + "id": "RJ0XJccTt9FX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "env_id = \"AntBulletEnv-v0\"\n", + "# Create the env\n", + "env = gym.make(env_id)\n", + "\n", + "# Get the state space and action space\n", + "s_size = env.observation_space.shape[0]\n", + "a_size = env.action_space" + ], + "metadata": { + "id": "JpU-JCDQYYax" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "print(\"_____OBSERVATION SPACE_____ \\n\")\n", + "print(\"The State Space is: \", s_size)\n", + "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation" + ], + "metadata": { + "id": "2ZfvcCqEYgrg" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "print(\"\\n _____ACTION SPACE_____ \\n\")\n", + "print(\"The Action Space is: \", a_size)\n", + "print(\"Action Space Sample\", env.action_space.sample()) # Take a random action" + ], + "metadata": { + "id": "Tc89eLTYYkK2" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Normalize observation and rewards" + ], + "metadata": { + "id": "S5sXcg469ysB" + } + }, + { + "cell_type": "markdown", + "source": [ + "A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html). 
For that, a wrapper exists and will compute a running average and standard deviation of input features.\n", + "\n", + "We also normalize rewards with this same wrapper by adding `norm_reward = True`\n", + "\n", + "[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)" + ], + "metadata": { + "id": "1ZyX6qf3Zva9" + } + }, + { + "cell_type": "code", + "source": [ + "env = make_vec_env(env_id, n_envs=4)\n", + "\n", + "# Adding this wrapper to normalize the observation and the reward\n", + "env = # TODO: Add the wrapper" + ], + "metadata": { + "id": "1RsDtHHAQ9Ie" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "#### Solution" + ], + "metadata": { + "id": "tF42HvI7-gs5" + } + }, + { + "cell_type": "code", + "source": [ + "env = make_vec_env(env_id, n_envs=4)\n", + "\n", + "env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)" + ], + "metadata": { + "id": "2O67mqgC-hol" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Create the A2C Model ๐Ÿค–\n", + "\n", + "In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy.\n", + "\n", + "To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3)." 
+ ], + "metadata": { + "id": "4JmEVU6z1ZA-" + } + }, + { + "cell_type": "code", + "source": [ + "model = # Create the A2C model and try to find the best parameters" + ], + "metadata": { + "id": "vR3T4qFt164I" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "#### Solution" + ], + "metadata": { + "id": "nWAuOOLh-oQf" + } + }, + { + "cell_type": "code", + "source": [ + "model = A2C(policy = \"MlpPolicy\",\n", + " env = env,\n", + " gae_lambda = 0.9,\n", + " gamma = 0.99,\n", + " learning_rate = 0.00096,\n", + " max_grad_norm = 0.5,\n", + " n_steps = 8,\n", + " vf_coef = 0.4,\n", + " ent_coef = 0.0,\n", + " tensorboard_log = \"./tensorboard\",\n", + " policy_kwargs=dict(\n", + " log_std_init=-2, ortho_init=False),\n", + " normalize_advantage=False,\n", + " use_rms_prop= True,\n", + " use_sde= True,\n", + " verbose=1)" + ], + "metadata": { + "id": "FKFLY54T-pU1" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Train the A2C agent ๐Ÿƒ\n", + "- Let's train our agent for 2,000,000 timesteps, don't forget to use GPU on Colab. 
It will take approximately ~25-40min"
+ ],
+ "metadata": {
+ "id": "opyK3mpJ1-m9"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "model.learn(2_000_000)"
+ ],
+ "metadata": {
+ "id": "4TuGHZD7RF1G"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Save the model and VecNormalize statistics when saving the agent\n",
+ "model.save(\"a2c-AntBulletEnv-v0\")\n",
+ "env.save(\"vec_normalize.pkl\")"
+ ],
+ "metadata": {
+ "id": "MfYtjj19cKFr"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Evaluate the agent 📈\n",
+ "- Now that our agent is trained, we need to **check its performance**.\n",
+ "- Stable-Baselines3 provides a method to do that: `evaluate_policy`.\n",
+ "- In my case, I got a mean reward of `2371.90 +/- 16.50`"
+ ],
+ "metadata": {
+ "id": "01M9GCd32Ig-"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
+ "\n",
+ "# Load the saved statistics\n",
+ "eval_env = DummyVecEnv([lambda: gym.make(\"AntBulletEnv-v0\")])\n",
+ "eval_env = VecNormalize.load(\"vec_normalize.pkl\", eval_env)\n",
+ "\n",
+ "# do not update the normalization statistics at test time\n",
+ "eval_env.training = False\n",
+ "# reward normalization is not needed at test time\n",
+ "eval_env.norm_reward = False\n",
+ "\n",
+ "# Load the agent\n",
+ "model = A2C.load(\"a2c-AntBulletEnv-v0\")\n",
+ "\n",
+ "# Evaluate on the normalized eval_env (not the training env)\n",
+ "mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
+ "\n",
+ "print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")"
+ ],
+ "metadata": {
+ "id": "liirTVoDkHq3"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Publish your trained model on the Hub 🔥\n",
+ "Now that we've seen that we got good results after training, we can publish our trained model on the Hub 🤗 with one line of code.\n",
+ "\n",
+ "📚 The libraries documentation 👉 
https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n",
+ "\n",
+ "Here's an example of a Model Card (with a PyBullet environment):\n",
+ "\n",
+ "\"Model"
+ ],
+ "metadata": {
+ "id": "44L9LVQaavR8"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "By using `package_to_hub`, as we already mentioned in the previous units, **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.\n",
+ "\n",
+ "This way:\n",
+ "- You can **showcase your work** 🔥\n",
+ "- You can **visualize your agent playing** 👀\n",
+ "- You can **share with the community an agent that others can use** 💾\n",
+ "- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
+ ],
+ "metadata": {
+ "id": "MkMk99m8bgaQ"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JquRrWytA6eo"
+ },
+ "source": [
+ "To be able to share your model with the community, there are three more steps to follow:\n",
+ "\n",
+ "1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join\n",
+ "\n",
+ "2️⃣ Sign in, then store your authentication token from the Hugging Face website.\n",
+ "- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
+ "\n",
+ "\"Create\n",
+ "\n",
+ "- Copy the token \n",
+ "- Run the cell below and paste the token"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "GZiFBBlzxzxY"
+ },
+ "outputs": [],
+ "source": [
+ "notebook_login()\n",
+ "!git config --global credential.helper store"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_tsf2uv0g_4p"
+ },
+ "source": [
+ "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
+ ]
+ },
+ {
+ 
"cell_type": "markdown", + "metadata": { + "id": "FGNh9VsZok0i" + }, + "source": [ + "3๏ธโƒฃ We're now ready to push our trained agent to the ๐Ÿค— Hub ๐Ÿ”ฅ using `package_to_hub()` function" + ] + }, + { + "cell_type": "code", + "source": [ + "package_to_hub(\n", + " model=model,\n", + " model_name=f\"a2c-{env_id}\",\n", + " model_architecture=\"A2C\",\n", + " env_id=env_id,\n", + " eval_env=eval_env,\n", + " repo_id=f\"ThomasSimonini/a2c-{env_id}\", # Change the username\n", + " commit_message=\"Initial commit\",\n", + ")" + ], + "metadata": { + "id": "ueuzWVCUTkfS" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Environment 2: HalfCheetahBulletEnv-v0\n", + "\n", + "For this environment, you need to follow the same process that the first one. **Don't hesitate here to save this notebook to your Google Drive** since timeout can happen. You may also want to **complete this notebook in two times**.\n", + "\n", + "In order to see that you understood the complete process from environment definition to `package_to_hub` why not trying to do **it yourself first without solution?**\n", + "\n", + "1. Define the enviroment called HalfCheetahBulletEnv-v0\n", + "2. Make a vectorized environment\n", + "3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)\n", + "4. Create the A2C Model\n", + "5. Train it for 2M Timesteps\n", + "6. Save the model and VecNormalize statistics when saving the agent\n", + "7. Evaluate your agent\n", + "8. Publish your trained model on the Hub ๐Ÿ”ฅ with `package_to_hub`" + ], + "metadata": { + "id": "-voECBK3An9j" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Take a coffee break โ˜•\n", + "- You already trained two robotics environments that learned to move congratutlations ๐Ÿฅณ!\n", + "- It's **time to take a break**. 
Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.\n"
+ ],
+ "metadata": {
+ "id": "Qk9ykOk9D6Qh"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Part 2: Robotic Arm Environments with `panda-gym`\n"
+ ],
+ "metadata": {
+ "id": "5VWfwAA7EJg7"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [],
+ "metadata": {
+ "id": "fW_CdlUsEVP2"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Some additional challenges 🏆\n",
+ "The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0`?\n",
+ "\n",
+ "In the [Leaderboard](https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard), you will find your agents. Can you get to the top?\n",
+ "\n",
+ "Here are some ideas to do so:\n",
+ "* Train for more steps\n",
+ "* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=AntBulletEnv-v0\n",
+ "* **Push your newly trained model** on the Hub 🔥\n"
+ ],
+ "metadata": {
+ "id": "G3xy3Nf3c2O1"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "See you on Unit 8! 🔥\n",
+ "## Keep learning, stay awesome 🤗"
+ ],
+ "metadata": {
+ "id": "usatLaZ8dM4P"
+ }
+ }
+ ]
+}
\ No newline at end of file

From f937f8c7db9be926287f68df78189ffc36518215 Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Mon, 2 Jan 2023 10:26:55 +0100
Subject: [PATCH 09/21] Update introduction.mdx

---
 units/en/unit6/introduction.mdx | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/units/en/unit6/introduction.mdx b/units/en/unit6/introduction.mdx
index 8d3e6a6..64b8605 100644
--- a/units/en/unit6/introduction.mdx
+++ b/units/en/unit6/introduction.mdx
@@ -1,6 +1,7 @@
 # Introduction [[introduction]]
 
-TODO: ADD THUMBNAIL
+
+Thumbnail
 
 In unit 4, we learned about our first Policy-Based algorithm called **Reinforce**. 
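To make the Monte-Carlo return used by Reinforce concrete, here is a small plain-Python sketch (illustrative only, not course code): the discounted return is accumulated backwards over a trajectory, and sampling many stochastic episodes from the same start state shows the spread of those returns — the variance problem this unit addresses:

```python
import random

def discounted_return(rewards, gamma=0.99):
    """R(tau) = r_1 + gamma*r_2 + gamma^2*r_3 + ..., computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Deterministic check: the same trajectory always gives the same return
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75

# Stochastic episodes from the same start state give different returns;
# this spread is exactly the variance that slows Reinforce down.
random.seed(0)
returns = [discounted_return([random.choice([0.0, 1.0]) for _ in range(20)])
           for _ in range(1000)]
mean = sum(returns) / len(returns)
variance = sum((g - mean) ** 2 for g in returns) / len(returns)
print(round(mean, 2), round(variance, 2))  # empirical mean and (nonzero) variance
```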
From be7f8a34f0b4bd4b6a00be602a650cb4b9221e59 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Mon, 2 Jan 2023 12:44:57 +0100 Subject: [PATCH 10/21] Update notebook --- notebooks/unit6/unit6.ipynb | 161 ++++++++++++++++++++++++++++++++---- 1 file changed, 145 insertions(+), 16 deletions(-) diff --git a/notebooks/unit6/unit6.ipynb b/notebooks/unit6/unit6.ipynb index 8ecae3c..7358a72 100644 --- a/notebooks/unit6/unit6.ipynb +++ b/notebooks/unit6/unit6.ipynb @@ -5,7 +5,18 @@ "colab": { "provenance": [], "private_outputs": true, - "authorship_tag": "ABX9TyM4Z04oGTU1B2rRuxHfuNly", + "collapsed_sections": [ + "MoubJX20oKaQ", + "DoUNkTExoUED", + "BTuQAUAPoa5E", + "tF42HvI7-gs5", + "nWAuOOLh-oQf", + "-voECBK3An9j", + "Qk9ykOk9D6Qh", + "G3xy3Nf3c2O1", + "usatLaZ8dM4P" + ], + "authorship_tag": "ABX9TyPovbUwEqbQAH1J8OxiHKDm", "include_colab_link": true }, "kernelspec": { @@ -34,7 +45,7 @@ "source": [ "# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym ๐Ÿค–\n", "\n", - "TODO: ADD THUMBNAIL\n", + "\"Thumbnail\"/\n", "\n", "In this small notebook you'll learn to use A2C with PyBullet and Panda-Gym two set of robotics environments. \n", "\n", @@ -252,10 +263,7 @@ "- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.\n", "- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face ๐Ÿค— Hub.\n", "- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.\n", - "\n", - "We're going to install **two versions of gym**:\n", - "- `gym==0.21`: The classical version of gym for PyBullet environments.\n", - "- `gymnasium`: [The new Gym library by Farama Foundation](https://github.com/Farama-Foundation/Gymnasium) for Panda Gym environments." + "- `gym==0.21`: The classical version of gym." 
], "metadata": { "id": "e1obkbdJ_KnG" @@ -295,12 +303,12 @@ { "cell_type": "code", "source": [ - "import gymnasium as gymnasium\n", - "import panda_gym\n", - "\n", "import gym\n", "import pybullet_envs\n", "\n", + "import gymnasium\n", + "import panda_gym\n", + "\n", "import os\n", "\n", "from huggingface_sb3 import load_from_hub, package_to_hub\n", @@ -351,7 +359,7 @@ { "cell_type": "code", "source": [ - "import gym # As mentionned we use gym for PyBullet and gymnasium for panda-gym" + "import gym" ], "metadata": { "id": "RJ0XJccTt9FX" @@ -389,6 +397,15 @@ "execution_count": null, "outputs": [] }, + { + "cell_type": "markdown", + "source": [ + "TODO: Add explanation obs space" + ], + "metadata": { + "id": "QzMmsdMJS7jh" + } + }, { "cell_type": "code", "source": [ @@ -402,6 +419,15 @@ "execution_count": null, "outputs": [] }, + { + "cell_type": "markdown", + "source": [ + "Todo: Add explanation action space" + ], + "metadata": { + "id": "3RfsHhzZS9Pw" + } + }, { "cell_type": "markdown", "source": [ @@ -696,11 +722,11 @@ "source": [ "## Environment 2: HalfCheetahBulletEnv-v0\n", "\n", - "For this environment, you need to follow the same process that the first one. **Don't hesitate here to save this notebook to your Google Drive** since timeout can happen. You may also want to **complete this notebook in two times**.\n", + "For this environment, you must follow the same process as the first one. **Don't hesitate to save this notebook to your Google Drive** since timeout can happen. You may also want to **complete this notebook two times**.\n", "\n", - "In order to see that you understood the complete process from environment definition to `package_to_hub` why not trying to do **it yourself first without solution?**\n", + "To see that you understood the complete process from environment definition to `package_to_hub` why not try to do **it yourself first without the solution?**\n", "\n", - "1. Define the enviroment called HalfCheetahBulletEnv-v0\n", + "1. 
Define the environment called HalfCheetahBulletEnv-v0\n", "2. Make a vectorized environment\n", "3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)\n", "4. Create the A2C Model\n", @@ -727,18 +753,121 @@ { "cell_type": "markdown", "source": [ - "# Part 2: Robotic Arm Environments with `panda-gym`\n" + "# Part 2: Robotic Arm Environments with `panda-gym`\n", + "\n", + "The second set of robotics environments we're going to train are a robotic arm that needs to do controls (moving the arm and using the end-effector).\n", + "\n", + "In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.\n", + "\n", + "1. In the first environment, `PandaReach`, the robot must place its end-effector at a target position (green ball).\n", + "2. In the second environment, `PandaSlide`, the robot has to slide an object to a target position.\n", + "\n", + "We're going to use the dense version of the environments. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to complete the task, the higher the reward). Contrary to a *sparse reward function* where the environment **return a reward if and only if the task is completed**.\n", + "\n", + "Also, we're going to use the *End-effector displacement control*, it means the **action corresponds to the displacement of the end-effector**. 
We don't control the individual motion of each joint (joint control).\n", + "\n", + "\"Robotics\"/\n", + "\n", + "\n", + "This way, **the training will be easier**.\n", + "\n" ], "metadata": { "id": "5VWfwAA7EJg7" } }, + { + "cell_type": "code", + "source": [ + "env_id = \"PandaReachDense-v2\"\n", + "\n", + "# Create the env\n", + "env = gym.make(env_id)\n", + "\n", + "# Get the state space and action space\n", + "s_size = env.observation_space.shape\n", + "a_size = env.action_space" + ], + "metadata": { + "id": "zXzAu3HYF1WD" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "print(\"_____OBSERVATION SPACE_____ \\n\")\n", + "print(\"The State Space is: \", s_size)\n", + "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation" + ], + "metadata": { + "id": "E-U9dexcF-FB" + }, + "execution_count": null, + "outputs": [] + }, { "cell_type": "markdown", + "source": [ + "The observation space is a dictionary with 3 different element:\n", + "- `achieved_goal`: (x,y,z) position of the goal.\n", + "- `desired_goal`: (x,y,z) distance between the goal position and the current object position.\n", + "- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).\n", + "\n" + ], + "metadata": { + "id": "g_JClfElGFnF" + } + }, + { + "cell_type": "code", + "source": [ + "print(\"\\n _____ACTION SPACE_____ \\n\")\n", + "print(\"The Action Space is: \", a_size)\n", + "print(\"Action Space Sample\", env.action_space.sample()) # Take a random action" + ], + "metadata": { + "id": "ib1Kxy4AF-FC" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "TODO: ADd action space" + ], + "metadata": { + "id": "5MHTHEHZS4yp" + } + }, + { + "cell_type": "code", + "source": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "model = A2C(\"MultiInputPolicy\", env)\n", + "model.learn(total_timesteps=100000)" + ], + "metadata": { + "id": "C-3SfbJr0N7I" + }, 
+ "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", "source": [], "metadata": { - "id": "fW_CdlUsEVP2" - } + "id": "16pttUsKFyZY" + }, + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", From 2a35c66ec5b789901b0397a3e1bda76e8e6d57db Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Mon, 16 Jan 2023 18:08:36 +0100 Subject: [PATCH 11/21] Apply suggestions from code review Co-authored-by: Omar Sanseviero --- units/en/unit6/advantage-actor-critic.mdx | 10 +++++----- units/en/unit6/introduction.mdx | 10 +++++----- units/en/unit6/variance-problem.mdx | 4 ++-- 3 files changed, 12 insertions(+), 12 deletions(-) diff --git a/units/en/unit6/advantage-actor-critic.mdx b/units/en/unit6/advantage-actor-critic.mdx index d0731f0..6544eb3 100644 --- a/units/en/unit6/advantage-actor-critic.mdx +++ b/units/en/unit6/advantage-actor-critic.mdx @@ -1,8 +1,8 @@ # Advantage Actor-Critic (A2C) [[advantage-actor-critic-a2c]] ## Reducing variance with Actor-Critic methods -The solution to reducing the variance of Reinforce algorithm and training our agent faster and better is to use a combination of policy-based and value-based methods: *the Actor-Critic method*. +The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*. -To understand the Actor-Critic, imagine you play a video game. You can play with a friend that will provide you with some feedback. You're the Actor, and your friend is the Critic. +To understand the Actor-Critic, imagine you play a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic. Actor Critic @@ -19,7 +19,7 @@ This is the idea behind Actor-Critic. 
We learn two function approximations: - *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\) ## The Actor-Critic Process -Now that we have seen the Actor Critic's big picture let's dive deeper to understand how Actor and Critic improve together during the training. +Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how Actor and Critic improve together during the training. As we saw, with Actor-Critic methods, there are two function approximations (two neural networks): - *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s,a) \\) @@ -50,10 +50,10 @@ Let's see the training process to understand how Actor and Critic are optimized: Step 5 Actor Critic -## Adding "Advantage" in Actor Critic (A2C) +## Adding "Advantage" in Actor-Critic (A2C) We can stabilize learning further by **using the Advantage function as Critic instead of the Action value function**. -The idea is that the Advantage function calculates the relative advantage of an action compared to the others possible at a state: **how better taking that action at a state is compared to the average value of the state**. It's subtracting the mean value of the state from the state action pair: +The idea is that the Advantage function calculates the relative advantage of an action compared to the others possible at a state: **how taking that action at a state is better compared to the average value of the state**. It's subtracting the mean value of the state from the state action pair: Advantage Function diff --git a/units/en/unit6/introduction.mdx b/units/en/unit6/introduction.mdx index 64b8605..b96ba39 100644 --- a/units/en/unit6/introduction.mdx +++ b/units/en/unit6/introduction.mdx @@ -9,14 +9,14 @@ In Policy-Based methods, **we aim to optimize the policy directly without using We saw that Reinforce worked well. 
However, because we use Monte-Carlo sampling to estimate return (we use an entire episode to calculate the return), **we have significant variance in policy gradient estimation**. -Remember that the policy gradient estimation is **the direction of the steepest increase in return**. Aka, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, **leads to slower training since we need a lot of samples to mitigate it**. +Remember that the policy gradient estimation is **the direction of the steepest increase in return**. In other words, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, **leads to slower training since we need a lot of samples to mitigate it**. -So, today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and policy-based methods that help to stabilize the training by reducing the variance: -- *An Actor* that controls **how our agent behaves** (policy-based method) -- *A Critic* that measures **how good the action taken is** (value-based method) +So, today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and Policy-Based methods that help to stabilize the training by reducing the variance: +- *An Actor* that controls **how our agent behaves** (Policy-Based method) +- *A Critic* that measures **how good the taken action is** (Value-Based method) -We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. Where we'll train three robots: +We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train three robots: - A bipedal walker ๐Ÿšถ to learn to walk. 
- A spider ๐Ÿ•ท๏ธ to learn to move. - A robotic arm ๐Ÿฆพ to move objects in the correct position. diff --git a/units/en/unit6/variance-problem.mdx b/units/en/unit6/variance-problem.mdx index bb8df6a..9eb1888 100644 --- a/units/en/unit6/variance-problem.mdx +++ b/units/en/unit6/variance-problem.mdx @@ -8,13 +8,13 @@ In Reinforce, we want to **increase the probability of actions in a trajectory p - If the **return is high**, we will **push up** the probabilities of the (state, action) combinations. - Else, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations. -This return \\(R(\tau)\\) is calculated using a *Monte-Carlo sampling*. Indeed, we collect a trajectory and calculate the discounted return, **and use this score to increase or decrease the probability of every action taken in that trajectory**. If the return is good, all actions will be โ€œreinforcedโ€ by increasing their likelihood of being taken. +This return \\(R(\tau)\\) is calculated using a *Monte-Carlo sampling*. We collect a trajectory and calculate the discounted return, **and use this score to increase or decrease the probability of every action taken in that trajectory**. If the return is good, all actions will be โ€œreinforcedโ€ by increasing their likelihood of being taken. \\(R(\tau) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...\\) The advantage of this method is that **itโ€™s unbiased. Since weโ€™re not estimating the return**, we use only the true return we obtain. -But the problem is that **the variance is high, since trajectories can lead to different returns** due to stochasticity of the environment (random events during episode) and stochasticity of the policy. Consequently, the same starting state can lead to very different returns. +Given the stochasticity of the environment (random events during an episode) and stochasticity of the policy, **trajectories can lead to different returns, which can lead to high variance**. 
Consequently, the same starting state can lead to very different returns. Because of this, **the return starting at the same state can vary significantly across episodes**. variance From 196b80e15b66d5b308ee5b5bbe1b7095c88fffbe Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Tue, 17 Jan 2023 07:16:47 +0100 Subject: [PATCH 12/21] Update requirements-unit6.txt --- notebooks/unit6/requirements-unit6.txt | 1 - 1 file changed, 1 deletion(-) diff --git a/notebooks/unit6/requirements-unit6.txt b/notebooks/unit6/requirements-unit6.txt index a346f80..1c8ffaa 100644 --- a/notebooks/unit6/requirements-unit6.txt +++ b/notebooks/unit6/requirements-unit6.txt @@ -1,4 +1,3 @@ -gymnasium stable-baselines3[extra] huggingface_sb3 panda_gym==2.0.0 From 368b54970f6ee7721a3b298252b8341790efec88 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Tue, 17 Jan 2023 07:34:01 +0100 Subject: [PATCH 13/21] Update _toctree.yml --- units/en/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index 2843096..9562baf 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -104,7 +104,7 @@ title: Optuna - local: unitbonus2/hands-on title: Hands-on -- title: Unit 6. Actor Crtic methods with Robotics environments +- title: Unit 6. 
Actor Critic methods with Robotics environments sections: - local: unit6/introduction title: Introduction From d406e5bb08923e0804cfceac7cb91069e4703c17 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Tue, 17 Jan 2023 07:34:49 +0100 Subject: [PATCH 14/21] =?UTF-8?q?Cr=C3=A9=C3=A9=20avec=20Colaboratory?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- notebooks/unit6/unit6.ipynb | 260 +++++++++++++++++++----------------- 1 file changed, 138 insertions(+), 122 deletions(-) diff --git a/notebooks/unit6/unit6.ipynb b/notebooks/unit6/unit6.ipynb index 7358a72..ceee2b1 100644 --- a/notebooks/unit6/unit6.ipynb +++ b/notebooks/unit6/unit6.ipynb @@ -5,18 +5,7 @@ "colab": { "provenance": [], "private_outputs": true, - "collapsed_sections": [ - "MoubJX20oKaQ", - "DoUNkTExoUED", - "BTuQAUAPoa5E", - "tF42HvI7-gs5", - "nWAuOOLh-oQf", - "-voECBK3An9j", - "Qk9ykOk9D6Qh", - "G3xy3Nf3c2O1", - "usatLaZ8dM4P" - ], - "authorship_tag": "ABX9TyPovbUwEqbQAH1J8OxiHKDm", + "authorship_tag": "ABX9TyNTCZRW9WsSED/roRBW2oQ5", "include_colab_link": true }, "kernelspec": { @@ -47,17 +36,15 @@ "\n", "\"Thumbnail\"/\n", "\n", - "In this small notebook you'll learn to use A2C with PyBullet and Panda-Gym two set of robotics environments. \n", + "In this notebook, you'll learn to use A2C with PyBullet and Panda-Gym, two set of robotics environments. \n", "\n", - "With [PyBullet](https://github.com/bulletphysics/bullet3), you're going to **train robots to walk and run**:\n", - "- `AntBulletEnv-v0` ๐Ÿ•ธ๏ธ More precisely a spider (they say Ant but come on... it's a spider ๐Ÿ˜†) ๐Ÿ•ธ๏ธ\n", - "- `HalfCheetahBulletEnv-v0`\n", + "With [PyBullet](https://github.com/bulletphysics/bullet3), you're going to **train a robot to move**:\n", + "- `AntBulletEnv-v0` ๐Ÿ•ธ๏ธ More precisely, a spider (they say Ant but come on... 
it's a spider ๐Ÿ˜†) ๐Ÿ•ธ๏ธ\n", "\n", - "Then, with [Panda-Gym](https://github.com/qgallouedec/panda-gym), you're going **to train a robotic arm** (Franka Emika Panda robot) to perform some tasks:\n", + "Then, with [Panda-Gym](https://github.com/qgallouedec/panda-gym), you're going **to train a robotic arm** (Franka Emika Panda robot) to perform a task:\n", "- `Reach`: the robot must place its end-effector at a target position.\n", - "- `Slide`: the robot has to slide an object to a target position.\n", "\n", - "After that, you'll be able to train other robotics environments." + "After that, you'll be able **to train in other robotics environments**.\n" ], "metadata": { "id": "-PTReiOw-RAN" @@ -66,7 +53,7 @@ { "cell_type": "markdown", "source": [ - "TODO: ADD VIDEO OF WHAT IT LOOKS LIKE" + "\"Robotics" ], "metadata": { "id": "2VGL_0ncoAJI" @@ -162,12 +149,15 @@ { "cell_type": "markdown", "source": [ - "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to:\n", + "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push three models:\n", "\n", - "TODO ADD CERTIFICATION RECOMMENDATION\n", + "- `AntBulletEnv-v0` get a result of >= 650.\n", + "- `PandaReachDense-v2` get a result of >= -3.5.\n", "\n", "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n", "\n", + "If you don't find your model, **go to the bottom of the page and click on the refresh button**\n", + "\n", "For more information about the certification process, check this section ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process" ], "metadata": { @@ -225,18 +215,6 @@ "!pip3 install pyvirtualdisplay" ] }, - { - "cell_type": 
"code", - "source": [ - "# Additional dependencies for RL Baselines3 Zoo\n", - "!apt-get install swig cmake freeglut3-dev " - ], - "metadata": { - "id": "fWyKJCy_NJBX" - }, - "execution_count": null, - "outputs": [] - }, { "cell_type": "code", "source": [ @@ -262,24 +240,12 @@ "- `panda-gym`: Contains the robotics arm environments.\n", "- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.\n", "- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face ๐Ÿค— Hub.\n", - "- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.\n", - "- `gym==0.21`: The classical version of gym." + "- `huggingface_hub`: Library allowing anyone to work with the Hub repositories." ], "metadata": { "id": "e1obkbdJ_KnG" } }, - { - "cell_type": "code", - "source": [ - "!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit6.txt" - ], - "metadata": { - "id": "69jUeXrLryos" - }, - "execution_count": null, - "outputs": [] - }, { "cell_type": "code", "execution_count": null, @@ -288,7 +254,7 @@ }, "outputs": [], "source": [ - "TODO: CHANGE TO THE ONE COMMENTED#!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt" + "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt" ] }, { @@ -303,11 +269,9 @@ { "cell_type": "code", "source": [ - "import gym\n", "import pybullet_envs\n", - "\n", - "import gymnasium\n", "import panda_gym\n", + "import gym\n", "\n", "import os\n", "\n", @@ -326,15 +290,6 @@ "execution_count": null, "outputs": [] }, - { - "cell_type": "markdown", - "source": [ - "# Part 1: PyBullet Environments\n" - ], - "metadata": { - "id": "KIqf-N-otczo" - } - }, { "cell_type": "markdown", "source": [ @@ -350,23 +305,13 @@ "source": [ "### Create the AntBulletEnv-v0\n", "#### The environment ๐ŸŽฎ\n", 
- "In this environment, the agent needs to use correctly its different joints to walk correctly." + "In this environment, the agent needs to use correctly its different joints to walk correctly.\n", + "You can find a detailled explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet" ], "metadata": { "id": "frVXOrnlBerQ" } }, - { - "cell_type": "code", - "source": [ - "import gym" - ], - "metadata": { - "id": "RJ0XJccTt9FX" - }, - "execution_count": null, - "outputs": [] - }, { "cell_type": "code", "source": [ @@ -400,7 +345,9 @@ { "cell_type": "markdown", "source": [ - "TODO: Add explanation obs space" + "The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):\n", + "\n", + "\"PyBullet\n" ], "metadata": { "id": "QzMmsdMJS7jh" @@ -422,7 +369,9 @@ { "cell_type": "markdown", "source": [ - "Todo: Add explanation action space" + "The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):\n", + "\n", + "\"PyBullet\n" ], "metadata": { "id": "3RfsHhzZS9Pw" @@ -440,7 +389,9 @@ { "cell_type": "markdown", "source": [ - "A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html). For that, a wrapper exists and will compute a running average and standard deviation of input features.\n", + "A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html). 
\n", + "\n", + "For that, a wrapper exists and will compute a running average and standard deviation of input features.\n", "\n", "We also normalize rewards with this same wrapper by adding `norm_reward = True`\n", "\n", @@ -493,6 +444,8 @@ "\n", "In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy.\n", "\n", + "For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes\n", + "\n", "To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3)." ], "metadata": { @@ -531,7 +484,6 @@ " n_steps = 8,\n", " vf_coef = 0.4,\n", " ent_coef = 0.0,\n", - " tensorboard_log = \"./tensorboard\",\n", " policy_kwargs=dict(\n", " log_std_init=-2, ortho_init=False),\n", " normalize_advantage=False,\n", @@ -717,33 +669,11 @@ "execution_count": null, "outputs": [] }, - { - "cell_type": "markdown", - "source": [ - "## Environment 2: HalfCheetahBulletEnv-v0\n", - "\n", - "For this environment, you must follow the same process as the first one. **Don't hesitate to save this notebook to your Google Drive** since timeout can happen. You may also want to **complete this notebook two times**.\n", - "\n", - "To see that you understood the complete process from environment definition to `package_to_hub` why not try to do **it yourself first without the solution?**\n", - "\n", - "1. Define the environment called HalfCheetahBulletEnv-v0\n", - "2. Make a vectorized environment\n", - "3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)\n", - "4. Create the A2C Model\n", - "5. Train it for 2M Timesteps\n", - "6. Save the model and VecNormalize statistics when saving the agent\n", - "7. Evaluate your agent\n", - "8. 
Publish your trained model on the Hub ๐Ÿ”ฅ with `package_to_hub`" - ], - "metadata": { - "id": "-voECBK3An9j" - } - }, { "cell_type": "markdown", "source": [ "## Take a coffee break โ˜•\n", - "- You already trained two robotics environments that learned to move congratutlations ๐Ÿฅณ!\n", + "- You already trained your first robot that learned to move congratutlations ๐Ÿฅณ!\n", "- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.\n" ], "metadata": { @@ -753,16 +683,15 @@ { "cell_type": "markdown", "source": [ - "# Part 2: Robotic Arm Environments with `panda-gym`\n", + "## Environment 2: PandaReachDense-v2 ๐Ÿฆพ\n", "\n", - "The second set of robotics environments we're going to train are a robotic arm that needs to do controls (moving the arm and using the end-effector).\n", + "The agent we're going to train is a robotic arm that needs to do controls (moving the arm and using the end-effector).\n", "\n", "In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.\n", "\n", - "1. In the first environment, `PandaReach`, the robot must place its end-effector at a target position (green ball).\n", - "2. In the second environment, `PandaSlide`, the robot has to slide an object to a target position.\n", + "In `PandaReach`, the robot must place its end-effector at a target position (green ball).\n", "\n", - "We're going to use the dense version of the environments. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to complete the task, the higher the reward). Contrary to a *sparse reward function* where the environment **return a reward if and only if the task is completed**.\n", + "We're going to use the dense version of this environment. 
It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). Contrary to a *sparse reward function* where the environment **returns a reward if and only if the task is completed**.\n", "\n", "Also, we're going to use the *End-effector displacement control*, it means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).\n", "\n", @@ -776,10 +705,24 @@ "id": "5VWfwAA7EJg7" } }, + { + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "In `PandaReachDense-v2` the robotic arm must place its end-effector at a target position (green ball).\n", + "\n" + ], + "metadata": { + "id": "oZ7FyDEi7G3T" + } + }, { "cell_type": "code", "source": [ - "env_id = \"PandaReachDense-v2\"\n", + "import gym\n", + "\n", + "env_id = \"PandaReachDense-v2\"\n", + "\n", + "# Create the env\n", "env = gym.make(env_id)\n", @@ -810,11 +753,12 @@ { "cell_type": "markdown", "source": [ - "The observation space is a dictionary with 3 different element:\n", + "The observation space **is a dictionary with 3 different elements**:\n", "- `achieved_goal`: (x,y,z) position of the goal.\n", "- `desired_goal`: (x,y,z) distance between the goal position and the current object position.\n", "- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).\n", - "\n" + "\n", + "Since the observation is a dictionary, **we will need to use a MultiInputPolicy instead of an MlpPolicy**."
], "metadata": { "id": "g_JClfElGFnF" @@ -836,35 +780,103 @@ { "cell_type": "markdown", "source": [ - "TODO: ADd action space" + "The action space is a vector with 3 values:\n", + "- Control x, y, z movement" ], "metadata": { "id": "5MHTHEHZS4yp" } }, { - "cell_type": "code", + "cell_type": "markdown", "source": [ + "Now it's your turn:\n", "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "model = A2C(\"MultiInputPolicy\", env)\n", - "model.learn(total_timesteps=100000)" + "1. Define the environment called \"PandaReachDense-v2\"\n", + "2. Make a vectorized environment\n", + "3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)\n", + "4. Create the A2C Model (don't forget verbose=1 to print the training logs).\n", + "5. Train it for 2M Timesteps\n", + "6. Save the model and VecNormalize statistics when saving the agent\n", + "7. Evaluate your agent\n", + "8. Publish your trained model on the Hub ๐Ÿ”ฅ with `package_to_hub`" ], "metadata": { - "id": "C-3SfbJr0N7I" + "id": "nIhPoc5t9HjG" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Solution (fill the todo)" + ], + "metadata": { + "id": "sKGbFXZq9ikN" + } + }, + { + "cell_type": "code", + "source": [ + "# 1 - 2\n", + "env_id = \"PandaReachDense-v2\"\n", + "env = make_vec_env(env_id, n_envs=4)\n", + "\n", + "# 3\n", + "env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)\n", + "\n", + "# 4\n", + "model = A2C(policy = \"MultiInputPolicy\",\n", + " env = env,\n", + " verbose=1)\n", + "# 5\n", + "model.learn(1_000_000)" + ], + "metadata": { + "id": "J-cC-Feg9iMm" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", - "source": [], + "source": [ + "# 6\n", + "model_name = \"a2c-PandaReachDense-v2\"; \n", + "model.save(model_name)\n", + "env.save(\"vec_normalize.pkl\")\n", + "\n", + "# 7\n", + "from stable_baselines3.common.vec_env import DummyVecEnv, 
VecNormalize\n", + "\n", + "# Load the saved statistics\n", + "eval_env = DummyVecEnv([lambda: gym.make(\"PandaReachDense-v2\")])\n", + "eval_env = VecNormalize.load(\"vec_normalize.pkl\", eval_env)\n", + "\n", + "# do not update them at test time\n", + "eval_env.training = False\n", + "# reward normalization is not needed at test time\n", + "eval_env.norm_reward = False\n", + "\n", + "# Load the agent\n", + "model = A2C.load(model_name)\n", + "\n", + "mean_reward, std_reward = evaluate_policy(model, env)\n", + "\n", + "print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")\n", + "\n", + "# 8\n", + "package_to_hub(\n", + " model=model,\n", + " model_name=f\"a2c-{env_id}\",\n", + " model_architecture=\"A2C\",\n", + " env_id=env_id,\n", + " eval_env=eval_env,\n", + " repo_id=f\"ThomasSimonini/a2c-{env_id}\", # TODO: Change the username\n", + " commit_message=\"Initial commit\",\n", + ")" + ], "metadata": { - "id": "16pttUsKFyZY" + "id": "-UnlKLmpg80p" }, "execution_count": null, "outputs": [] @@ -873,9 +885,13 @@ "cell_type": "markdown", "source": [ "## Some additional challenges ๐Ÿ†\n", - "The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0`?\n", + "The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0` for PyBullet?\n", "\n", - "In the [Leaderboard](https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n", + "If you want to try more advanced tasks for panda-gym you need to check what was done using **TQC or SAC** (a more sample efficient algorithm suited for robotics tasks). 
In real robotics, you'll use more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much you have a risk to break it**.\n", + "\n", + "PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1\n", + "\n", + "And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html\n", "\n", "Here are some ideas to achieve so:\n", "* Train more steps\n", @@ -889,7 +905,7 @@ { "cell_type": "markdown", "source": [ - "See you on Unit 8! ๐Ÿ”ฅ\n", + "See you on Unit 7! ๐Ÿ”ฅ\n", "## Keep learning, stay awesome ๐Ÿค—" ], "metadata": { From 28ef99046d10028215012a17d76ec8957cec2559 Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Tue, 17 Jan 2023 07:47:05 +0100 Subject: [PATCH 15/21] Finalize A2C --- units/en/unit6/additional-readings.mdx | 1 + units/en/unit6/advantage-actor-critic.mdx | 3 +- units/en/unit6/conclusion.mdx | 8 +- units/en/unit6/hands-on.mdx | 438 +++++++++++++++++++++- units/en/unit6/introduction.mdx | 7 +- 5 files changed, 441 insertions(+), 16 deletions(-) diff --git a/units/en/unit6/additional-readings.mdx b/units/en/unit6/additional-readings.mdx index 5e7f386..07d80fb 100644 --- a/units/en/unit6/additional-readings.mdx +++ b/units/en/unit6/additional-readings.mdx @@ -3,6 +3,7 @@ ## Bias-variance tradeoff in Reinforcement Learning If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles: + - [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565) - [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/) diff --git a/units/en/unit6/advantage-actor-critic.mdx b/units/en/unit6/advantage-actor-critic.mdx index 6544eb3..f3ed336 100644 --- 
a/units/en/unit6/advantage-actor-critic.mdx +++ b/units/en/unit6/advantage-actor-critic.mdx @@ -1,5 +1,6 @@ -# Advantage Actor-Critic (A2C) [[advantage-actor-critic-a2c]] +# Advantage Actor-Critic (A2C) ## Reducing variance with Actor-Critic methods + The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*. To understand the Actor-Critic, imagine you play a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic. diff --git a/units/en/unit6/conclusion.mdx b/units/en/unit6/conclusion.mdx index 3da4332..85d0229 100644 --- a/units/en/unit6/conclusion.mdx +++ b/units/en/unit6/conclusion.mdx @@ -4,12 +4,8 @@ Congrats on finishing this unit and the tutorial. You've just trained your first **Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section. -Feel free to train your agent in other environments. The **best way to learn is to try things on your own!** For instance, what about teaching your robotic arm [to stack objects](https://panda-gym.readthedocs.io/en/latest/usage/environments.html#sparce-reward-end-effector-control-default-setting) or slide objects? - -In the next unit, we will learn to improve Actor-Critic Methods with Proximal Policy Optimization using the [CleanRL library](https://github.com/vwxyzjn/cleanrl). Then we'll study how to speed up the process with the [Sample Factory library](https://samplefactory.dev/). You'll train your PPO agents in these environments: VizDoom, Racing Car, and a 3D FPS. - -TODO: IMAGE of the environment Vizdoom + ED - Finally, we would love **to hear what you think of the course and how we can improve it**. 
If you have some feedback then, please ๐Ÿ‘‰ [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9) +See you in next unit, + ### Keep learning, stay awesome ๐Ÿค—, diff --git a/units/en/unit6/hands-on.mdx b/units/en/unit6/hands-on.mdx index 28ca5c7..244ce11 100644 --- a/units/en/unit6/hands-on.mdx +++ b/units/en/unit6/hands-on.mdx @@ -8,23 +8,23 @@ askForHelpUrl="http://hf.co/join/discord" /> -Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. And train three robots: +Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. And train two robots: -- A bipedal walker ๐Ÿšถ to learn to walk. - A spider ๐Ÿ•ท๏ธ to learn to move. -- A robotic arm ๐Ÿฆพ to move objects in the correct position. +- A robotic arm ๐Ÿฆพ to move in the correct position. We're going to use two Robotics environments: - [PyBullet](https://github.com/bulletphysics/bullet3) - [panda-gym](https://github.com/qgallouedec/panda-gym) -TODO: ADD IMAGE OF THREE +Environments To validate this hands-on for the certification process, you need to push your three trained model to the Hub and get: -TODO ADD CERTIFICATION ELEMENTS +- `AntBulletEnv-v0` get a result of >= 650. +- `PandaReachDense-v2` get a result of >= -3.5. 
To find your result, [go to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward** @@ -33,3 +33,431 @@ For more information about the certification process, check this section ๐Ÿ‘‰ ht **To start the hands-on click on Open In Colab button** ๐Ÿ‘‡ : [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit6/unit6.ipynb) + + +# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym ๐Ÿค– + +### ๐ŸŽฎ Environments: + +- [PyBullet](https://github.com/bulletphysics/bullet3) +- [Panda-Gym](https://github.com/qgallouedec/panda-gym) + +### ๐Ÿ“š RL-Library: + +- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/) + +We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues). + +## Objectives of this notebook ๐Ÿ† + +At the end of the notebook, you will: + +- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries. +- Be able to **train robots using A2C**. +- Understand why **we need to normalize the input**. +- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score ๐Ÿ”ฅ. + +## Prerequisites ๐Ÿ—๏ธ +Before diving into the notebook, you need to: + +๐Ÿ”ฒ ๐Ÿ“š Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) ๐Ÿค— + +# Let's train our first robots ๐Ÿค– + +## Set the GPU ๐Ÿ’ช + +- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type` + +GPU Step 1 + +- `Hardware Accelerator > GPU` + +GPU Step 2 + +## Create a virtual display ๐Ÿ”ฝ + +During the notebook, we'll need to generate a replay video. 
To do so, with Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).
+
+Hence, the following cell will install the libraries and create and run a virtual screen 🖥
+
+```python
+%%capture
+!apt install python-opengl
+!apt install ffmpeg
+!apt install xvfb
+!pip3 install pyvirtualdisplay
+```
+
+```python
+# Virtual display
+from pyvirtualdisplay import Display
+
+virtual_display = Display(visible=0, size=(1400, 900))
+virtual_display.start()
+```
+
+### Install dependencies 🔽
+The first step is to install the dependencies; we'll install multiple ones:
+
+- `pybullet`: Contains the walking robot environments.
+- `panda-gym`: Contains the robotic arm environments.
+- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.
+- `huggingface_sb3`: Additional code for Stable-Baselines3 to load and upload models from the Hugging Face 🤗 Hub.
+- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.
+
+```bash
+!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt
+```
+
+## Import the packages 📦
+
+```python
+import pybullet_envs
+import panda_gym
+import gym
+
+import os
+
+from huggingface_sb3 import load_from_hub, package_to_hub
+
+from stable_baselines3 import A2C
+from stable_baselines3.common.evaluation import evaluate_policy
+from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
+from stable_baselines3.common.env_util import make_vec_env
+
+from huggingface_hub import notebook_login
+```
+
+## Environment 1: AntBulletEnv-v0 🕸
+
+### Create the AntBulletEnv-v0
+#### The environment 🎮
+
+In this environment, the agent needs to use its different joints correctly in order to walk.
+You can find a detailed explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet
+
+```python
+env_id = "AntBulletEnv-v0"
+# Create the env
+env = gym.make(env_id)
+
+# Get the state space and action space
+s_size = env.observation_space.shape[0]
+a_size = env.action_space
+```
+
+```python
+print("_____OBSERVATION SPACE_____ \n")
+print("The State Space is: ", s_size)
+print("Sample observation", env.observation_space.sample()) # Get a random observation
+```
+
+The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
+
+PyBullet Ant Obs space
+
+
+```python
+print("\n _____ACTION SPACE_____ \n")
+print("The Action Space is: ", a_size)
+print("Action Space Sample", env.action_space.sample()) # Take a random action
+```
+
+The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
+
+PyBullet Ant Obs space
+
+
+### Normalize observation and rewards
+
+A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).
+
+For that, a wrapper exists and will compute a running average and standard deviation of input features.
+
+We also normalize rewards with this same wrapper by adding `norm_reward = True`
+
+[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
+
+```python
+env = make_vec_env(env_id, n_envs=4)
+
+# Adding this wrapper to normalize the observation and the reward
+env = # TODO: Add the wrapper
+```
+
+#### Solution
+
+```python
+env = make_vec_env(env_id, n_envs=4)
+
+env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)
+```
+
+### Create the A2C Model 🤖
+
+In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy.
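As an aside on the normalization wrapper used above, here is a rough, self-contained sketch of the running mean/standard-deviation idea behind `VecNormalize` (the class name `RunningNormalizer` and the exact update rule are our simplified illustration, not SB3's actual implementation):

```python
import numpy as np

class RunningNormalizer:
    """Simplified sketch of VecNormalize's running observation normalization (illustrative only)."""

    def __init__(self, shape, clip=10.0):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-8  # avoids division by zero before the first update
        self.clip = clip

    def update(self, batch):
        # Merge the batch statistics into the running mean/variance
        batch_mean, batch_var = batch.mean(axis=0), batch.var(axis=0)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta**2 * self.count * batch_count / total)
        self.var = m2 / total
        self.count = total

    def normalize(self, obs):
        # Center, scale, and clip -- like clip_obs=10. in the wrapper above
        return np.clip((obs - self.mean) / np.sqrt(self.var + 1e-8),
                       -self.clip, self.clip)

norm = RunningNormalizer(shape=(3,))
norm.update(np.array([[0.0, 10.0, -5.0], [2.0, 12.0, -3.0]]))
print(norm.normalize(np.array([1.0, 11.0, -4.0])))  # close to [0. 0. 0.]
```

Observations far from the running mean get clipped to ±10, which keeps extreme input values from destabilizing the policy network.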
+
+For more information about the A2C implementation with Stable-Baselines3, check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes
+
+To find the best parameters, I checked the [official trained agents by the Stable-Baselines3 team](https://huggingface.co/sb3).
+
+```python
+model = # Create the A2C model and try to find the best parameters
+```
+
+#### Solution
+
+```python
+model = A2C(
+    policy="MlpPolicy",
+    env=env,
+    gae_lambda=0.9,
+    gamma=0.99,
+    learning_rate=0.00096,
+    max_grad_norm=0.5,
+    n_steps=8,
+    vf_coef=0.4,
+    ent_coef=0.0,
+    policy_kwargs=dict(log_std_init=-2, ortho_init=False),
+    normalize_advantage=False,
+    use_rms_prop=True,
+    use_sde=True,
+    verbose=1,
+)
+```
+
+### Train the A2C agent 🏃
+
+- Let's train our agent for 2,000,000 timesteps. Don't forget to use the GPU on Colab; it will take approximately 25-40 minutes.
+
+```python
+model.learn(2_000_000)
+```
+
+```python
+# Save the model and VecNormalize statistics when saving the agent
+model.save("a2c-AntBulletEnv-v0")
+env.save("vec_normalize.pkl")
+```
+
+### Evaluate the agent 📈
+- Now that our agent is trained, we need to **check its performance**.
+
+- Stable-Baselines3 provides a method to do that `evaluate_policy`
+- In my case, I've got a mean reward of `2371.90 +/- 16.50`
+
+```python
+from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
+
+# Load the saved statistics
+eval_env = DummyVecEnv([lambda: gym.make("AntBulletEnv-v0")])
+eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
+
+# do not update them at test time
+eval_env.training = False
+# reward normalization is not needed at test time
+eval_env.norm_reward = False
+
+# Load the agent
+model = A2C.load("a2c-AntBulletEnv-v0")
+
+mean_reward, std_reward = evaluate_policy(model, eval_env)
+
+print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
+```
+
+### Publish your trained model on the Hub 🔥
+Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.
+
+📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20
+
+Here's an example of a Model Card (with a PyBullet environment):
+
+Model Card Pybullet
+
+By using `package_to_hub`, as we already mentioned in the former units, **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.
+
+This way:
+- You can **showcase your work** 🔥
+- You can **visualize your agent playing** 👀
+- You can **share with the community an agent that others can use** 💾
+- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+
+
+To be able to share your model with the community, there are three more steps to follow:
+
+1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
+
+2️⃣ Sign in, and then you need to store your authentication token from the Hugging Face website.
+- Create a new token (https://huggingface.co/settings/tokens) **with write role**
+
+Create HF Token
+
+- Copy the token
+- Run the cell below and paste the token
+
+```python
+notebook_login()
+!git config --global credential.helper store
+```
+
+If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
+
+3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `package_to_hub()` function
+
+```python
+package_to_hub(
+    model=model,
+    model_name=f"a2c-{env_id}",
+    model_architecture="A2C",
+    env_id=env_id,
+    eval_env=eval_env,
+    repo_id=f"ThomasSimonini/a2c-{env_id}", # Change the username
+    commit_message="Initial commit",
+)
+```
+
+## Take a coffee break ☕
+- You already trained your first robot that learned to move, congratulations 🥳!
+- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.
+
+
+## Environment 2: PandaReachDense-v2 🦾
+
+The agent we're going to train is a robotic arm that we need to control (moving the arm and using the end-effector).
+
+In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.
+
+In `PandaReach`, the robot must place its end-effector at a target position (green ball).
+
+We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). Contrary to a *sparse reward function* where the environment **returns a reward if and only if the task is completed**.
+
+Also, we're going to use the *End-effector displacement control*, it means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
+
+Robotics
+
+
+This way, **the training will be easier**.
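To make the dense vs. sparse distinction concrete, here is a minimal sketch of the two reward styles for a reach task (our own illustration — the threshold value and function names are assumptions, not panda-gym's exact reward code):

```python
import numpy as np

GOAL_THRESHOLD = 0.05  # assumed success radius, for illustration only

def dense_reward(ee_pos, goal_pos):
    # A reward at every timestep: the closer to the goal, the higher (less negative)
    return -float(np.linalg.norm(np.asarray(ee_pos) - np.asarray(goal_pos)))

def sparse_reward(ee_pos, goal_pos):
    # A reward signal only when the task is completed
    dist = np.linalg.norm(np.asarray(ee_pos) - np.asarray(goal_pos))
    return 0.0 if dist < GOAL_THRESHOLD else -1.0

far, near, goal = [0.5, 0.0, 0.2], [0.1, 0.0, 0.2], [0.0, 0.0, 0.2]
print(dense_reward(far, goal), dense_reward(near, goal))    # the signal improves as we approach
print(sparse_reward(far, goal), sparse_reward(near, goal))  # no feedback until success
```

With the dense variant, every small improvement in the policy changes the return, which is what makes `PandaReachDense` easier to learn than the sparse `PandaReach`.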
+
+
+In `PandaReachDense-v2` the robotic arm must place its end-effector at a target position (green ball).
+
+
+
+```python
+import gym
+
+env_id = "PandaReachDense-v2"
+
+# Create the env
+env = gym.make(env_id)
+
+# Get the state space and action space
+s_size = env.observation_space.shape
+a_size = env.action_space
+```
+
+```python
+print("_____OBSERVATION SPACE_____ \n")
+print("The State Space is: ", s_size)
+print("Sample observation", env.observation_space.sample()) # Get a random observation
+```
+
+The observation space **is a dictionary with 3 different elements**:
+- `achieved_goal`: (x,y,z) the current position of the end-effector.
+- `desired_goal`: (x,y,z) the target position the end-effector must reach.
+- `observation`: the position (x,y,z) and velocity (vx, vy, vz) of the end-effector.
+
+Since the observation is a dictionary, **we will need to use a MultiInputPolicy instead of an MlpPolicy**.
+
+```python
+print("\n _____ACTION SPACE_____ \n")
+print("The Action Space is: ", a_size)
+print("Action Space Sample", env.action_space.sample()) # Take a random action
+```
+
+The action space is a vector with 3 values:
+- Control of the x, y, z movement of the end-effector
+
+Now it's your turn:
+
+1. Define the environment called "PandaReachDense-v2"
+2. Make a vectorized environment
+3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
+4. Create the A2C Model (don't forget verbose=1 to print the training logs).
+5. Train it for 1M timesteps
+6. Save the model and VecNormalize statistics when saving the agent
+7. Evaluate your agent
+8.
Publish your trained model on the Hub 🔥 with `package_to_hub`
+
+### Solution (fill the todo)
+
+```python
+# 1 - 2
+env_id = "PandaReachDense-v2"
+env = make_vec_env(env_id, n_envs=4)
+
+# 3
+env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)
+
+# 4
+model = A2C(policy="MultiInputPolicy", env=env, verbose=1)
+# 5
+model.learn(1_000_000)
+```
+
+```python
+# 6
+model_name = "a2c-PandaReachDense-v2"
+model.save(model_name)
+env.save("vec_normalize.pkl")
+
+# 7
+from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
+
+# Load the saved statistics
+eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v2")])
+eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
+
+# do not update them at test time
+eval_env.training = False
+# reward normalization is not needed at test time
+eval_env.norm_reward = False
+
+# Load the agent
+model = A2C.load(model_name)
+
+mean_reward, std_reward = evaluate_policy(model, eval_env)
+
+print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
+
+# 8
+package_to_hub(
+    model=model,
+    model_name=f"a2c-{env_id}",
+    model_architecture="A2C",
+    env_id=env_id,
+    eval_env=eval_env,
+    repo_id=f"ThomasSimonini/a2c-{env_id}", # TODO: Change the username
+    commit_message="Initial commit",
+)
+```
+
+## Some additional challenges 🏆
+
+The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet?
+
+If you want to try more advanced tasks for panda-gym, check what was done using **TQC or SAC** (more sample-efficient algorithms suited for robotics tasks). In real robotics, you'll use more sample-efficient algorithms for a simple reason: contrary to a simulation, **if you move your robotic arm too much, you risk breaking it**.
+
+PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1
+
+And don't hesitate to check the panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html
+
+Here are some ideas to go further:
+* Train more steps
+* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=AntBulletEnv-v0
+* **Push your new trained model** on the Hub 🔥
+
+
+See you on Unit 7! 🔥
+## Keep learning, stay awesome 🤗
diff --git a/units/en/unit6/introduction.mdx b/units/en/unit6/introduction.mdx
index b96ba39..d85281d 100644
--- a/units/en/unit6/introduction.mdx
+++ b/units/en/unit6/introduction.mdx
@@ -16,11 +16,10 @@
 - *A Critic* that measures **how good the taken action is** (Value-Based method)
 
-We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train three robots:
-- A bipedal walker 🚶 to learn to walk.
+We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train two robots:
 - A spider 🕷️ to learn to move.
-- A robotic arm 🦾 to move objects in the correct position.
+- A robotic arm 🦾 to move to the correct position.
 
-TODO: ADD IMAGE OF THREE
+Environments
 
 Sounds exciting? Let's get started!
From ae37a884ed6e621fec11fb637b528cccc6b5c74b Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Tue, 17 Jan 2023 08:08:58 +0100 Subject: [PATCH 16/21] Update advantage-actor-critic.mdx --- units/en/unit6/advantage-actor-critic.mdx | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/units/en/unit6/advantage-actor-critic.mdx b/units/en/unit6/advantage-actor-critic.mdx index f3ed336..398e46f 100644 --- a/units/en/unit6/advantage-actor-critic.mdx +++ b/units/en/unit6/advantage-actor-critic.mdx @@ -1,4 +1,5 @@ -# Advantage Actor-Critic (A2C) +# Advantage Actor-Critic (A2C) [[advantage-actor-critic]] + ## Reducing variance with Actor-Critic methods The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*. From b4aae36314e436a08c30579bfa00e1528edbe46a Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Tue, 17 Jan 2023 09:06:08 +0100 Subject: [PATCH 17/21] Update _toctree.yml --- units/en/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index 4eb75cf..3b6f440 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -155,7 +155,7 @@ - local: unit6/variance-problem title: The Problem of Variance in Reinforce - local: unit6/advantage-actor-critic - title: Advantage Actor-Critic (A2C) + title: Advantage Actor Critic (A2C) - local: unit6/hands-on title: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym ๐Ÿค– - local: unit6/conclusion From 87c33d790bc33ed8e212888dfa71487081e8010d Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Tue, 17 Jan 2023 14:23:14 +0100 Subject: [PATCH 18/21] Update advantage-actor-critic.mdx --- units/en/unit6/advantage-actor-critic.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit6/advantage-actor-critic.mdx 
b/units/en/unit6/advantage-actor-critic.mdx index 398e46f..8b7863c 100644 --- a/units/en/unit6/advantage-actor-critic.mdx +++ b/units/en/unit6/advantage-actor-critic.mdx @@ -52,7 +52,7 @@ Let's see the training process to understand how Actor and Critic are optimized: Step 5 Actor Critic -## Adding "Advantage" in Actor-Critic (A2C) +## Adding Advantage in Actor-Critic (A2C) We can stabilize learning further by **using the Advantage function as Critic instead of the Action value function**. The idea is that the Advantage function calculates the relative advantage of an action compared to the others possible at a state: **how taking that action at a state is better compared to the average value of the state**. It's subtracting the mean value of the state from the state action pair: From 770adfdd2bda937268fecacca42fdf5e1eb540e7 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Tue, 17 Jan 2023 14:31:28 +0100 Subject: [PATCH 19/21] Apply suggestions from code review Co-authored-by: Omar Sanseviero --- units/en/unit6/hands-on.mdx | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/units/en/unit6/hands-on.mdx b/units/en/unit6/hands-on.mdx index 244ce11..7a043a4 100644 --- a/units/en/unit6/hands-on.mdx +++ b/units/en/unit6/hands-on.mdx @@ -21,7 +21,7 @@ We're going to use two Robotics environments: Environments -To validate this hands-on for the certification process, you need to push your three trained model to the Hub and get: +To validate this hands-on for the certification process, you need to push your two trained models to the Hub and get the following results: - `AntBulletEnv-v0` get a result of >= 650. - `PandaReachDense-v2` get a result of >= -3.5. @@ -172,7 +172,7 @@ The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyB A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html). 
-For that, a wrapper exists and will compute a running average and standard deviation of input features.
+For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.
 
 We also normalize rewards with this same wrapper by adding `norm_reward = True`
 
@@ -242,8 +242,8 @@ env.save("vec_normalize.pkl")
 
 ### Evaluate the agent 📈
 - Now that our agent is trained, we need to **check its performance**.
-- Stable-Baselines3 provides a method to do that `evaluate_policy`
-- In my case, I've got a mean reward of `2371.90 +/- 16.50`
+- Stable-Baselines3 provides a method to do that: `evaluate_policy`
+- In my case, I got a mean reward of `2371.90 +/- 16.50`
 
 ```python
 from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
 
@@ -266,7 +266,7 @@ print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
 ```
 
 ### Publish your trained model on the Hub 🔥
-Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.
+Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.
 
 📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20
 
@@ -336,11 +336,11 @@ Also, we're going to use the *End-effector displacement control*, it means the *
 
 Robotics
 
-This way, **the training will be easier**.
+This way **the training will be easier**.
 
-In `PandaReachDense-v2` the robotic arm must place its end-effector at a target position (green ball).
+In `PandaReachDense-v2`, the robotic arm must place its end-effector at a target position (green ball).
@@ -363,7 +363,7 @@ print("The State Space is: ", s_size) print("Sample observation", env.observation_space.sample()) # Get a random observation ``` -The observation space **is a dictionary with 3 different element**: +The observation space **is a dictionary with 3 different elements**: - `achieved_goal`: (x,y,z) position of the goal. - `desired_goal`: (x,y,z) distance between the goal position and the current object position. - `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz). @@ -447,7 +447,7 @@ package_to_hub( The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0` for PyBullet? -If you want to try more advanced tasks for panda-gym you need to check what was done using **TQC or SAC** (a more sample efficient algorithm suited for robotics tasks). In real robotics, you'll use more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much you have a risk to break it**. +If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**. 
PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1

From 9caf7e27593c2b082ac41e223166760acd9e9557 Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 17 Jan 2023 14:44:13 +0100
Subject: [PATCH 20/21] Update hands-on.mdx

---
 units/en/unit6/hands-on.mdx | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/units/en/unit6/hands-on.mdx b/units/en/unit6/hands-on.mdx
index 7a043a4..37a0d93 100644
--- a/units/en/unit6/hands-on.mdx
+++ b/units/en/unit6/hands-on.mdx
@@ -153,6 +153,7 @@ print("Sample observation", env.observation_space.sample()) # Get a random obse
 ```
 The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
+The difference is that our observation space is 28, not 29.
 
 PyBullet Ant Obs space
 
@@ -385,7 +386,7 @@ Now it's your turn:
 2. Make a vectorized environment
 3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
 4. Create the A2C Model (don't forget verbose=1 to print the training logs).
-5. Train it for 2M Timesteps
+5. Train it for 1M Timesteps
 6. Save the model and VecNormalize statistics when saving the agent
 7. Evaluate your agent
 8. Publish your trained model on the Hub 🔥 with `package_to_hub`
@@ -445,7 +446,7 @@
 ## Some additional challenges 🏆
 
-The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0` for PyBullet?
+The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?
 
 If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks).
In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**. From 59c10769af12124dba9c295d8c22b5404ab1defa Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Tue, 17 Jan 2023 14:46:18 +0100 Subject: [PATCH 21/21] Update --- notebooks/unit6/unit6.ipynb | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/notebooks/unit6/unit6.ipynb b/notebooks/unit6/unit6.ipynb index ceee2b1..95056b5 100644 --- a/notebooks/unit6/unit6.ipynb +++ b/notebooks/unit6/unit6.ipynb @@ -5,7 +5,7 @@ "colab": { "provenance": [], "private_outputs": true, - "authorship_tag": "ABX9TyNTCZRW9WsSED/roRBW2oQ5", + "authorship_tag": "ABX9TyMm2AvQJHZiNbxotv6J/Rf+", "include_colab_link": true }, "kernelspec": { @@ -149,7 +149,7 @@ { "cell_type": "markdown", "source": [ - "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push three models:\n", + "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your two trained models to the Hub and get the following results:\n", "\n", "- `AntBulletEnv-v0` get a result of >= 650.\n", "- `PandaReachDense-v2` get a result of >= -3.5.\n", @@ -347,6 +347,8 @@ "source": [ "The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):\n", "\n", + "The difference is that our observation space is 28 not 29.\n", + "\n", "\"PyBullet\n" ], "metadata": { @@ -391,7 +393,7 @@ "source": [ "A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html). 
\n", "\n", - "For that, a wrapper exists and will compute a running average and standard deviation of input features.\n", + "For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.\n", "\n", "We also normalize rewards with this same wrapper by adding `norm_reward = True`\n", "\n", @@ -536,8 +538,8 @@ "source": [ "### Evaluate the agent ๐Ÿ“ˆ\n", "- Now that's our agent is trained, we need to **check its performance**.\n", - "- Stable-Baselines3 provides a method to do that `evaluate_policy`\n", - "- In my case, I've got a mean reward of `2371.90 +/- 16.50`" + "- Stable-Baselines3 provides a method to do that: `evaluate_policy`\n", + "- In my case, I got a mean reward of `2371.90 +/- 16.50`" ], "metadata": { "id": "01M9GCd32Ig-" @@ -574,7 +576,7 @@ "cell_type": "markdown", "source": [ "### Publish your trained model on the Hub ๐Ÿ”ฅ\n", - "Now that we saw we got good results after the training, we can publish our trained model on the hub ๐Ÿค— with one line of code.\n", + "Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.\n", "\n", "๐Ÿ“š The libraries documentation ๐Ÿ‘‰ https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n", "\n", @@ -698,7 +700,7 @@ "\"Robotics\"/\n", "\n", "\n", - "This way, **the training will be easier**.\n", + "This way **the training will be easier**.\n", "\n" ], "metadata": { @@ -753,7 +755,7 @@ { "cell_type": "markdown", "source": [ - "The observation space **is a dictionary with 3 different element**:\n", + "The observation space **is a dictionary with 3 different elements**:\n", "- `achieved_goal`: (x,y,z) position of the goal.\n", "- `desired_goal`: (x,y,z) distance between the goal position and the current object position.\n", "- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).\n", @@ -796,7 +798,7 @@ "2. 
Make a vectorized environment\n", "3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)\n", "4. Create the A2C Model (don't forget verbose=1 to print the training logs).\n", - "5. Train it for 2M Timesteps\n", + "5. Train it for 1M Timesteps\n", "6. Save the model and VecNormalize statistics when saving the agent\n", "7. Evaluate your agent\n", "8. Publish your trained model on the Hub ๐Ÿ”ฅ with `package_to_hub`" @@ -885,9 +887,9 @@ "cell_type": "markdown", "source": [ "## Some additional challenges ๐Ÿ†\n", - "The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0` for PyBullet?\n", + "The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?\n", "\n", - "If you want to try more advanced tasks for panda-gym you need to check what was done using **TQC or SAC** (a more sample efficient algorithm suited for robotics tasks). In real robotics, you'll use more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much you have a risk to break it**.\n", + "If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**.\n", "\n", "PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1\n", "\n",