Typos Unit6

Dylan Wilson
2023-04-19 10:21:13 -05:00
parent 2eaf7c7428
commit 4b9cb12bdc
6 changed files with 32 additions and 32 deletions

View File

@@ -2,7 +2,7 @@
## Bias-variance tradeoff in Reinforcement Learning
-If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
+If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check out these two articles:
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)

View File

@@ -4,7 +4,7 @@
The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*.
-To understand the Actor-Critic, imagine you play a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic.
+To understand the Actor-Critic, imagine you're playing a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/ac.jpg" alt="Actor Critic"/>
@@ -21,13 +21,13 @@ This is the idea behind Actor-Critic. We learn two function approximations:
- *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\)
## The Actor-Critic Process
-Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how Actor and Critic improve together during the training.
+Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how the Actor and Critic improve together during the training.
As we saw, with Actor-Critic methods, there are two function approximations (two neural networks):
- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s) \\)
- *Critic*, a **value function** parameterized by w: \\( \hat{q}_{w}(s,a) \\)
-Let's see the training process to understand how Actor and Critic are optimized:
+Let's see the training process to understand how the Actor and Critic are optimized:
- At each timestep, t, we get the current state \\( S_t\\) from the environment and **pass it as input through our Actor and Critic**.
- Our Policy takes the state and **outputs an action** \\( A_t \\).
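To make the two approximators concrete, here is a minimal, illustrative PyTorch sketch (toy sizes and names; this is not the course's code, and the course itself uses Stable-Baselines3 rather than hand-written networks):

```python
# Illustrative only: the two function approximations of Actor-Critic as two
# small PyTorch networks, for a toy environment with a discrete action space.
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4  # assumed toy sizes

# Actor: the policy pi_theta(s), mapping a state to action probabilities.
actor = nn.Sequential(
    nn.Linear(state_dim, 64), nn.Tanh(),
    nn.Linear(64, n_actions), nn.Softmax(dim=-1),
)

# Critic: the action-value function q_w(s, a), scoring a (state, action) pair.
critic = nn.Sequential(
    nn.Linear(state_dim + n_actions, 64), nn.Tanh(),
    nn.Linear(64, 1),
)

state = torch.randn(1, state_dim)                          # S_t from the environment
probs = actor(state)                                       # pi_theta(. | S_t)
action = torch.distributions.Categorical(probs).sample()   # A_t
action_one_hot = nn.functional.one_hot(action, n_actions).float()
q_value = critic(torch.cat([state, action_one_hot], dim=-1))  # q_hat_w(S_t, A_t)
```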

View File

@@ -4,8 +4,8 @@ Congrats on finishing this unit and the tutorial. You've just trained your first
**Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section.
-Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
+Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)
-See you in next unit,
+See you in the next unit!
-### Keep learning, stay awesome 🤗,
+### Keep learning, stay awesome 🤗

View File

@@ -11,7 +11,7 @@
Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. And train two robots:
- A spider 🕷️ to learn to move.
-- A robotic arm 🦾 to move in the correct position.
+- A robotic arm 🦾 to move to the correct position.
We're going to use two Robotics environments:
@@ -54,7 +54,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
At the end of the notebook, you will:
-- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries.
+- Be able to use the environment libraries **PyBullet** and **Panda-Gym**.
- Be able to **train robots using A2C**.
- Understand why **we need to normalize the input**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
@@ -80,7 +80,7 @@ Before diving into the notebook, you need to:
During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
-Hence the following cell will install the librairies and create and run a virtual screen 🖥
+The following cell will install the libraries and create and run a virtual screen 🖥
```python
%%capture
@@ -135,7 +135,7 @@ from huggingface_hub import notebook_login
### Create the AntBulletEnv-v0
#### The environment 🎮
-In this environment, the agent needs to use correctly its different joints to walk correctly.
+In this environment, the agent needs to use its different joints correctly in order to walk.
You can find a detailed explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet
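As a rough sketch of what such a setup cell does (assuming `gym` and `pybullet_envs` are installed; this is not the notebook's exact code):

```python
# Minimal sketch (assumed setup, not the exact notebook cell).
import gym
import pybullet_envs  # importing this package registers the Bullet environments with gym

env = gym.make("AntBulletEnv-v0")
print(env.observation_space)  # continuous observations describing the spider's body and joints
print(env.action_space)       # continuous controls, one per joint
```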
```python
@@ -231,7 +231,7 @@ model = A2C(
### Train the A2C agent 🏃
-- Let's train our agent for 2,000,000 timesteps, don't forget to use GPU on Colab. It will take approximately ~25-40min
+- Let's train our agent for 2,000,000 timesteps. Don't forget to use the GPU on Colab. It will take approximately 25-40 minutes.
```python
model.learn(2_000_000)
@@ -244,7 +244,7 @@ env.save("vec_normalize.pkl")
```
### Evaluate the agent 📈
-- Now that's our agent is trained, we need to **check its performance**.
+- Now that our agent is trained, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`
- In my case, I got a mean reward of `2371.90 +/- 16.50`
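A rough sketch of that evaluation step (assuming the training cell saved the normalization statistics to `vec_normalize.pkl` as above; the model path is illustrative):

```python
import gym
import pybullet_envs

from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Rebuild an evaluation environment and load the saved normalization statistics.
eval_env = DummyVecEnv([lambda: gym.make("AntBulletEnv-v0")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)
eval_env.training = False     # don't update the running statistics at evaluation time
eval_env.norm_reward = False  # report the raw (unnormalized) reward

model = A2C.load("a2c-AntBulletEnv-v0")  # illustrative path to the saved model
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```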
@@ -282,7 +282,7 @@ By using `package_to_hub`, as we already mentionned in the former units, **you e
This way:
- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
-- You can **share with the community an agent that others can use** 💾
+- You can **share an agent with the community that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
@@ -290,7 +290,7 @@ To be able to share your model with the community there are three more steps to
1⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
-2⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
+2⃣ Sign in and then you need to get your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
@@ -303,7 +303,7 @@ notebook_login()
!git config --global credential.helper store
```
-If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
+If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
3⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `package_to_hub()` function
@@ -332,9 +332,9 @@ In robotics, the *end-effector* is the device at the end of a robotic arm design
In `PandaReach`, the robot must place its end-effector at a target position (green ball).
-We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). Contrary to a *sparse reward function* where the environment **return a reward if and only if the task is completed**.
+We're going to use the dense version of this environment. This means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This is in contrast to a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.
-Also, we're going to use the *End-effector displacement control*, it means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
+Also, we're going to use *end-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg" alt="Robotics"/>
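To make the distinction concrete, here is a toy illustration of the two reward styles (illustrative only, not panda-gym's actual reward code):

```python
import numpy as np

def sparse_reward(ee_pos, target_pos, threshold=0.05):
    # Signal only when the task is completed: end-effector close enough to the target.
    return 0.0 if np.linalg.norm(ee_pos - target_pos) < threshold else -1.0

def dense_reward(ee_pos, target_pos):
    # Signal at every timestep: the closer the end-effector, the higher (less negative).
    return -float(np.linalg.norm(ee_pos - target_pos))

ee, target = np.array([0.1, 0.0, 0.2]), np.array([0.0, 0.0, 0.2])
print(sparse_reward(ee, target), dense_reward(ee, target))  # roughly -1.0 and -0.1
```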
@@ -384,14 +384,14 @@ The action space is a vector with 3 values:
Now it's your turn:
-1. Define the environment called "PandaReachDense-v2"
-2. Make a vectorized environment
+1. Define the environment called "PandaReachDense-v2".
+2. Make a vectorized environment.
3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
4. Create the A2C Model (don't forget verbose=1 to print the training logs).
-5. Train it for 1M Timesteps
-6. Save the model and VecNormalize statistics when saving the agent
-7. Evaluate your agent
-8. Publish your trained model on the Hub 🔥 with `package_to_hub`
+5. Train it for 1M timesteps.
+6. Save the model and VecNormalize statistics when saving the agent.
+7. Evaluate your agent.
+8. Publish your trained model on the Hub 🔥 with `package_to_hub`.
### Solution (fill the todo)
@@ -448,7 +448,7 @@ package_to_hub(
## Some additional challenges 🏆
-The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?
+The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?
If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: unlike in a simulation, **if you move your robotic arm too much, you risk breaking it**.
@@ -456,7 +456,7 @@ PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1
And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html
-Here are some ideas to achieve so:
+Here are some ideas to go further:
* Train more steps
* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=AntBulletEnv-v0
* **Push your new trained model** on the Hub 🔥

View File

@@ -11,15 +11,15 @@ We saw that Reinforce worked well. However, because we use Monte-Carlo sampling
Remember that the policy gradient estimation is **the direction of the steepest increase in return**. In other words, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, **leads to slower training since we need a lot of samples to mitigate it**.
-So, today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and Policy-Based methods that help to stabilize the training by reducing the variance:
+So today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and Policy-Based methods that helps to stabilize the training by reducing the variance using:
- *An Actor* that controls **how our agent behaves** (Policy-Based method)
- *A Critic* that measures **how good the taken action is** (Value-Based method)
We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train two robots:
- A spider 🕷️ to learn to move.
-- A robotic arm 🦾 to move in the correct position.
+- A robotic arm 🦾 to move to the correct position.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif" alt="Environments"/>
-Sounds exciting? Let's get started!
+Sound exciting? Let's get started!

View File

@@ -1,12 +1,12 @@
# The Problem of Variance in Reinforce [[the-problem-of-variance-in-reinforce]]
-In Reinforce, we want to **increase the probability of actions in a trajectory proportional to how high the return is**.
+In Reinforce, we want to **increase the probability of actions in a trajectory proportionally to how high the return is**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/pg.jpg" alt="Reinforce"/>
- If the **return is high**, we will **push up** the probabilities of the (state, action) combinations.
-- Else, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations.
+- Otherwise, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations.
This return \\(R(\tau)\\) is calculated using a *Monte-Carlo sampling*. We collect a trajectory and calculate the discounted return, **and use this score to increase or decrease the probability of every action taken in that trajectory**. If the return is good, all actions will be “reinforced” by increasing their likelihood of being taken.
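Concretely, this is the standard Monte Carlo policy-gradient estimate from the previous unit, where each log-probability gradient is weighted by the sampled return:

\\[ \nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t \mid S_t) \, R(\tau) \\]

Because \\(R(\tau)\\) comes from a single sampled trajectory, this weight can swing widely from one rollout to the next, which is exactly the variance problem discussed here.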
@@ -24,7 +24,7 @@ The solution is to mitigate the variance by **using a large number of trajectori
However, increasing the batch size significantly **reduces sample efficiency**. So we need to find additional mechanisms to reduce the variance.
---
-If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
+If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check out these two articles:
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
---