From 28ef99046d10028215012a17d76ec8957cec2559 Mon Sep 17 00:00:00 2001
From: simoninithomas <simonini_thomas@outlook.fr>
Date: Tue, 17 Jan 2023 07:47:05 +0100
Subject: [PATCH] Finalize A2C

---
 units/en/unit6/additional-readings.mdx    |   1 +
 units/en/unit6/advantage-actor-critic.mdx |   3 +-
 units/en/unit6/conclusion.mdx             |   8 +-
 units/en/unit6/hands-on.mdx               | 438 +++++++++++++++++++++-
 units/en/unit6/introduction.mdx           |   7 +-
 5 files changed, 441 insertions(+), 16 deletions(-)

diff --git a/units/en/unit6/additional-readings.mdx b/units/en/unit6/additional-readings.mdx
index 5e7f386..07d80fb 100644
--- a/units/en/unit6/additional-readings.mdx
+++ b/units/en/unit6/additional-readings.mdx
@@ -3,6 +3,7 @@
 ## Bias-variance tradeoff in Reinforcement Learning
 
 If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
+
 - [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
 - [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
 
diff --git a/units/en/unit6/advantage-actor-critic.mdx b/units/en/unit6/advantage-actor-critic.mdx
index 6544eb3..f3ed336 100644
--- a/units/en/unit6/advantage-actor-critic.mdx
+++ b/units/en/unit6/advantage-actor-critic.mdx
@@ -1,5 +1,6 @@
-# Advantage Actor-Critic (A2C) [[advantage-actor-critic-a2c]]
+# Advantage Actor-Critic (A2C)
 ## Reducing variance with Actor-Critic methods
+
 The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*.
 
 To understand the Actor-Critic, imagine you play a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic.
diff --git a/units/en/unit6/conclusion.mdx b/units/en/unit6/conclusion.mdx
index 3da4332..85d0229 100644
--- a/units/en/unit6/conclusion.mdx
+++ b/units/en/unit6/conclusion.mdx
@@ -4,12 +4,8 @@ Congrats on finishing this unit and the tutorial. You've just trained your first
 
 **Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section.
 
-Feel free to train your agent in other environments. The **best way to learn is to try things on your own!** For instance, what about teaching your robotic arm [to stack objects](https://panda-gym.readthedocs.io/en/latest/usage/environments.html#sparce-reward-end-effector-control-default-setting) or slide objects?
-
-In the next unit, we will learn to improve Actor-Critic Methods with Proximal Policy Optimization using the [CleanRL library](https://github.com/vwxyzjn/cleanrl). Then we'll study how to speed up the process with the [Sample Factory library](https://samplefactory.dev/). You'll train your PPO agents in these environments: VizDoom, Racing Car, and a 3D FPS.
-
-TODO: IMAGE of the environment Vizdoom + ED
-
 Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉  [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
 
+See you in next unit,
+
 ### Keep learning, stay awesome 🤗,
diff --git a/units/en/unit6/hands-on.mdx b/units/en/unit6/hands-on.mdx
index 28ca5c7..244ce11 100644
--- a/units/en/unit6/hands-on.mdx
+++ b/units/en/unit6/hands-on.mdx
@@ -8,23 +8,23 @@
         askForHelpUrl="http://hf.co/join/discord" />
 
 
-Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. And train three robots:
+Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. And train two robots:
 
-- A bipedal walker 🚶 to learn to walk.
 - A spider 🕷️ to learn to move.
-- A robotic arm 🦾 to move objects in the correct position.
+- A robotic arm 🦾 to move in the correct position.
 
 We're going to use two Robotics environments:
 
 - [PyBullet](https://github.com/bulletphysics/bullet3)
 - [panda-gym](https://github.com/qgallouedec/panda-gym)
 
-TODO: ADD IMAGE OF THREE
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif" alt="Environments"/>
 
 
 To validate this hands-on for the certification process, you need to push your three trained model to the Hub and get:
 
-TODO ADD CERTIFICATION ELEMENTS
+- `AntBulletEnv-v0` get a result of >= 650.
+- `PandaReachDense-v2` get a result of >= -3.5.
 
 To find your result, [go to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
 
@@ -33,3 +33,431 @@ For more information about the certification process, check this section 👉 ht
 **To start the hands-on click on Open In Colab button** 👇 :
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit6/unit6.ipynb)
+
+
+# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖
+
+### 🎮 Environments:
+
+- [PyBullet](https://github.com/bulletphysics/bullet3)
+- [Panda-Gym](https://github.com/qgallouedec/panda-gym)
+
+### 📚 RL-Library:
+
+- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)
+
+We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
+
+## Objectives of this notebook 🏆
+
+At the end of the notebook, you will:
+
+- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries.
+- Be able to **train robots using A2C**.
+- Understand why **we need to normalize the input**.
+- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
+
+## Prerequisites 🏗️
+Before diving into the notebook, you need to:
+
+🔲 📚 Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) 🤗
+
+# Let's train our first robots 🤖
+
+## Set the GPU 💪
+
+- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">
+
+- `Hardware Accelerator > GPU`
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
+
+## Create a virtual display 🔽
+
+During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
+
+Hence the following cell will install the librairies and create and run a virtual screen 🖥
+
+```python
+%%capture
+!apt install python-opengl
+!apt install ffmpeg
+!apt install xvfb
+!pip3 install pyvirtualdisplay
+```
+
+```python
+# Virtual display
+from pyvirtualdisplay import Display
+
+virtual_display = Display(visible=0, size=(1400, 900))
+virtual_display.start()
+```
+
+### Install dependencies 🔽
+The first step is to install the dependencies, we’ll install multiple ones:
+
+- `pybullet`: Contains the walking robots environments.
+- `panda-gym`: Contains the robotics arm environments.
+- `stable-baselines3[extra]`: The SB3 deep reinforcement learning library.
+- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.
+- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.
+
+```bash
+!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit6/requirements-unit6.txt
+```
+
+## Import the packages 📦
+
+```python
+import pybullet_envs
+import panda_gym
+import gym
+
+import os
+
+from huggingface_sb3 import load_from_hub, package_to_hub
+
+from stable_baselines3 import A2C
+from stable_baselines3.common.evaluation import evaluate_policy
+from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
+from stable_baselines3.common.env_util import make_vec_env
+
+from huggingface_hub import notebook_login
+```
+
+## Environment 1: AntBulletEnv-v0 🕸
+
+### Create the AntBulletEnv-v0
+#### The environment 🎮
+
+In this environment, the agent needs to use correctly its different joints to walk correctly.
+You can find a detailled explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet
+
+```python
+env_id = "AntBulletEnv-v0"
+# Create the env
+env = gym.make(env_id)
+
+# Get the state space and action space
+s_size = env.observation_space.shape[0]
+a_size = env.action_space
+```
+
+```python
+print("_____OBSERVATION SPACE_____ \n")
+print("The State Space is: ", s_size)
+print("Sample observation", env.observation_space.sample())  # Get a random observation
+```
+
+The observation Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/obs_space.png" alt="PyBullet Ant Obs space"/>
+
+
+```python
+print("\n _____ACTION SPACE_____ \n")
+print("The Action Space is: ", a_size)
+print("Action Space Sample", env.action_space.sample())  # Take a random action
+```
+
+The action Space (from [Jeffrey Y Mo](https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet)):
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/action_space.png" alt="PyBullet Ant Obs space"/>
+
+
+### Normalize observation and rewards
+
+A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).
+
+For that, a wrapper exists and will compute a running average and standard deviation of input features.
+
+We also normalize rewards with this same wrapper by adding `norm_reward = True`
+
+[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
+
+```python
+env = make_vec_env(env_id, n_envs=4)
+
+# Adding this wrapper to normalize the observation and the reward
+env = # TODO: Add the wrapper
+```
+
+#### Solution
+
+```python
+env = make_vec_env(env_id, n_envs=4)
+
+env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)
+```
+
+### Create the A2C Model 🤖
+
+In this case, because we have a vector of 28 values as input, we'll use an MLP (multi-layer perceptron) as policy.
+
+For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes
+
+To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3).
+
+```python
+model = # Create the A2C model and try to find the best parameters
+```
+
+#### Solution
+
+```python
+model = A2C(
+    policy="MlpPolicy",
+    env=env,
+    gae_lambda=0.9,
+    gamma=0.99,
+    learning_rate=0.00096,
+    max_grad_norm=0.5,
+    n_steps=8,
+    vf_coef=0.4,
+    ent_coef=0.0,
+    policy_kwargs=dict(log_std_init=-2, ortho_init=False),
+    normalize_advantage=False,
+    use_rms_prop=True,
+    use_sde=True,
+    verbose=1,
+)
+```
+
+### Train the A2C agent 🏃
+
+- Let's train our agent for 2,000,000 timesteps, don't forget to use GPU on Colab. It will take approximately ~25-40min
+
+```python
+model.learn(2_000_000)
+```
+
+```python
+# Save the model and  VecNormalize statistics when saving the agent
+model.save("a2c-AntBulletEnv-v0")
+env.save("vec_normalize.pkl")
+```
+
+### Evaluate the agent 📈
+- Now that's our  agent is trained, we need to **check its performance**.
+- Stable-Baselines3 provides a method to do that `evaluate_policy`
+- In my case, I've got a mean reward of `2371.90 +/- 16.50`
+
+```python
+from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
+
+# Load the saved statistics
+eval_env = DummyVecEnv([lambda: gym.make("AntBulletEnv-v0")])
+eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
+
+#  do not update them at test time
+eval_env.training = False
+# reward normalization is not needed at test time
+eval_env.norm_reward = False
+
+# Load the agent
+model = A2C.load("a2c-AntBulletEnv-v0")
+
+mean_reward, std_reward = evaluate_policy(model, env)
+
+print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
+```
+
+### Publish your trained model on the Hub 🔥
+Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.
+
+📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20
+
+Here's an example of a Model Card (with a PyBullet environment):
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/modelcardpybullet.png" alt="Model Card Pybullet"/>
+
+By using `package_to_hub`, as we already mentionned in the former units, **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.
+
+This way:
+- You can **showcase our work** 🔥
+- You can **visualize your agent playing** 👀
+- You can **share with the community an agent that others can use** 💾
+- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+
+
+To be able to share your model with the community there are three more steps to follow:
+
+1️⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join
+
+2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
+- Create a new token (https://huggingface.co/settings/tokens) **with write role**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
+
+- Copy the token
+- Run the cell below and paste the token
+
+```python
+notebook_login()
+!git config --global credential.helper store
+```
+
+If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
+
+3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function
+
+```python
+package_to_hub(
+    model=model,
+    model_name=f"a2c-{env_id}",
+    model_architecture="A2C",
+    env_id=env_id,
+    eval_env=eval_env,
+    repo_id=f"ThomasSimonini/a2c-{env_id}",  # Change the username
+    commit_message="Initial commit",
+)
+```
+
+## Take a coffee break ☕
+- You already trained your first robot that learned to move congratutlations 🥳!
+- It's **time to take a break**. Don't hesitate to **save this notebook** `File > Save a copy to Drive` to work on this second part later.
+
+
+## Environment 2: PandaReachDense-v2 🦾
+
+The agent we're going to train is a robotic arm that needs to do controls (moving the arm and using the end-effector).
+
+In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.
+
+In `PandaReach`, the robot must place its end-effector at a target position (green ball).
+
+We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). Contrary to a *sparse reward function* where the environment **return a reward if and only if the task is completed**.
+
+Also, we're going to use the *End-effector displacement control*, it means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg"  alt="Robotics"/>
+
+
+This way, **the training will be easier**.
+
+
+
+In `PandaReachDense-v2` the robotic arm must place its end-effector at a target position (green ball).
+
+
+
+```python
+import gym
+
+env_id = "PandaPushDense-v2"
+
+# Create the env
+env = gym.make(env_id)
+
+# Get the state space and action space
+s_size = env.observation_space.shape
+a_size = env.action_space
+```
+
+```python
+print("_____OBSERVATION SPACE_____ \n")
+print("The State Space is: ", s_size)
+print("Sample observation", env.observation_space.sample())  # Get a random observation
+```
+
+The observation space **is a dictionary with 3 different element**:
+- `achieved_goal`: (x,y,z) position of the goal.
+- `desired_goal`: (x,y,z) distance between the goal position and the current object position.
+- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).
+
+Given it's a dictionary as observation, **we will need to use a MultiInputPolicy policy instead of MlpPolicy**.
+
+```python
+print("\n _____ACTION SPACE_____ \n")
+print("The Action Space is: ", a_size)
+print("Action Space Sample", env.action_space.sample())  # Take a random action
+```
+
+The action space is a vector with 3 values:
+- Control x, y, z movement
+
+Now it's your turn:
+
+1. Define the environment called "PandaReachDense-v2"
+2. Make a vectorized environment
+3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
+4. Create the A2C Model (don't forget verbose=1 to print the training logs).
+5. Train it for 2M Timesteps
+6. Save the model and  VecNormalize statistics when saving the agent
+7. Evaluate your agent
+8. Publish your trained model on the Hub 🔥 with `package_to_hub`
+
+### Solution (fill the todo)
+
+```python
+# 1 - 2
+env_id = "PandaReachDense-v2"
+env = make_vec_env(env_id, n_envs=4)
+
+# 3
+env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.0)
+
+# 4
+model = A2C(policy="MultiInputPolicy", env=env, verbose=1)
+# 5
+model.learn(1_000_000)
+```
+
+```python
+# 6
+model_name = "a2c-PandaReachDense-v2"
+model.save(model_name)
+env.save("vec_normalize.pkl")
+
+# 7
+from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
+
+# Load the saved statistics
+eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v2")])
+eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
+
+#  do not update them at test time
+eval_env.training = False
+# reward normalization is not needed at test time
+eval_env.norm_reward = False
+
+# Load the agent
+model = A2C.load(model_name)
+
+mean_reward, std_reward = evaluate_policy(model, env)
+
+print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
+
+# 8
+package_to_hub(
+    model=model,
+    model_name=f"a2c-{env_id}",
+    model_architecture="A2C",
+    env_id=env_id,
+    eval_env=eval_env,
+    repo_id=f"ThomasSimonini/a2c-{env_id}",  # TODO: Change the username
+    commit_message="Initial commit",
+)
+```
+
+## Some additional challenges 🏆
+
+The best way to learn **is to try things by your own**! Why not trying  `HalfCheetahBulletEnv-v0` for PyBullet?
+
+If you want to try more advanced tasks for panda-gym you need to check what was done using **TQC or SAC** (a more sample efficient algorithm suited for robotics tasks). In real robotics, you'll use more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much you have a risk to break it**.
+
+PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1
+
+And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html
+
+Here are some ideas to achieve so:
+* Train more steps
+* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=https://huggingface.co/models?other=AntBulletEnv-v0
+* **Push your new trained model** on the Hub 🔥
+
+
+See you on Unit 7! 🔥
+## Keep learning, stay awesome 🤗
diff --git a/units/en/unit6/introduction.mdx b/units/en/unit6/introduction.mdx
index b96ba39..d85281d 100644
--- a/units/en/unit6/introduction.mdx
+++ b/units/en/unit6/introduction.mdx
@@ -16,11 +16,10 @@ So, today we'll study **Actor-Critic methods**, a hybrid architecture combining
 - *A Critic* that measures **how good the taken action is** (Value-Based method)
 
 
-We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train three robots:
-- A bipedal walker 🚶 to learn to walk.
+We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train two robots:
 - A spider 🕷️ to learn to move.
-- A robotic arm 🦾 to move objects in the correct position.
+- A robotic arm 🦾 to move in the correct position.
 
-TODO: ADD IMAGE OF THREE
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif" alt="Environments"/>
 
 Sounds exciting? Let's get started!