Typos Unit8

This commit is contained in:
Dylan Wilson
2023-04-19 11:13:01 -05:00
parent 87e65269dd
commit 8df53602e7
9 changed files with 37 additions and 37 deletions

View File

@@ -1,7 +1,7 @@
# Introducing the Clipped Surrogate Objective Function
## Recap: The Policy Objective Function
Lets remember what is the objective to optimize in Reinforce:
Lets remember what the objective is to optimize in Reinforce:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/lpg.jpg" alt="Reinforce"/>
The idea was that by taking a gradient ascent step on this function (equivalent to taking gradient descent of the negative of this function), we would **push our agent to take actions that lead to higher rewards and avoid harmful actions.**
@@ -10,9 +10,9 @@ However, the problem comes from the step size:
- Too small, **the training process was too slow**
- Too high, **there was too much variability in the training**
Here with PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change in a small range using a clip.**
With PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change in a small range using a clip.**
This new function **is designed to avoid destructive large weights updates** :
This new function **is designed to avoid destructively large weights updates** :
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ppo-surrogate.jpg" alt="PPO surrogate function"/>
@@ -21,11 +21,11 @@ Lets study each part to understand how it works.
## The Ratio Function
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ratio1.jpg" alt="Ratio"/>
This ratio is calculated this way:
This ratio is calculated as follows:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ratio2.jpg" alt="Ratio"/>
Its the probability of taking action \\( a_t \\) at state \\( s_t \\) in the current policy divided by the previous one.
Its the probability of taking action \\( a_t \\) at state \\( s_t \\) in the current policy, divided by the same for the previous policy.
As we can see, \\( r_t(\theta) \\) denotes the probability ratio between the current and old policy:
@@ -49,7 +49,7 @@ However, without a constraint, if the action taken is much more probable in our
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/clipped.jpg" alt="PPO"/>
Consequently, we need to constrain this objective function by penalizing changes that lead to a ratio away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).
Consequently, we need to constrain this objective function by penalizing changes that lead to a ratio far away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).
**By clipping the ratio, we ensure that we do not have a too large policy update because the current policy can't be too different from the older one.**
@@ -62,7 +62,7 @@ To do that, we have two solutions:
This clipped part is a version where rt(theta) is clipped between \\( [1 - \epsilon, 1 + \epsilon] \\).
With the Clipped Surrogate Objective function, we have two probability ratios, one non-clipped and one clipped in a range (between \\( [1 - \epsilon, 1 + \epsilon] \\), epsilon is a hyperparameter that helps us to define this clip range (in the paper \\( \epsilon = 0.2 \\).).
With the Clipped Surrogate Objective function, we have two probability ratios, one non-clipped and one clipped in a range between \\( [1 - \epsilon, 1 + \epsilon] \\), epsilon is a hyperparameter that helps us to define this clip range (in the paper \\( \epsilon = 0.2 \\).).
Then, we take the minimum of the clipped and non-clipped objective, **so the final objective is a lower bound (pessimistic bound) of the unclipped objective.**

View File

@@ -6,8 +6,8 @@ Now that you've successfully trained your Doom agent, why not try deathmatch? Re
If you do it, don't hesitate to share your model in the `#rl-i-made-this` channel in our [discord server](https://www.hf.co/join/discord).
This concludes the last unit, but we are not finished yet! 🤗 The following **bonus unit includes some of the most interesting, advanced and cutting edge work in Deep Reinforcement Learning**.
This concludes the last unit, but we are not finished yet! 🤗 The following **bonus unit includes some of the most interesting, advanced, and cutting edge work in Deep Reinforcement Learning**.
See you next time 🔥,
See you next time 🔥
## Keep Learning, Stay awesome 🤗

View File

@@ -2,8 +2,8 @@
Thats all for today. Congrats on finishing this unit and the tutorial!
The best way to learn is to practice and try stuff. **Why not improving the implementation to handle frames as input?**.
The best way to learn is to practice and try stuff. **Why not improve the implementation to handle frames as input?**.
See you on second part of this Unit 🔥,
See you on second part of this Unit 🔥
## Keep Learning, Stay awesome 🤗

View File

@@ -31,9 +31,9 @@ Then, to test its robustness, we're going to train it in:
</video>
</figure>
And finally, we will be push the trained model to the Hub to evaluate and visualize your agent playing.
And finally, we will push the trained model to the Hub to evaluate and visualize your agent playing.
LunarLander-v2 is the first environment you used when you started this course. At that time, you didn't know how it worked, and now, you can code it from scratch and train it. **How incredible is that 🤩.**
LunarLander-v2 is the first environment you used when you started this course. At that time, you didn't know how it worked, and now you can code it from scratch and train it. **How incredible is that 🤩.**
<iframe src="https://giphy.com/embed/pynZagVcYxVUk" width="480" height="480" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/the-office-michael-heartbreak-pynZagVcYxVUk">via GIPHY</a></p>
@@ -118,7 +118,7 @@ pip install huggingface_hub
pip install box2d
```
## Let's code PPO from scratch with Costa Huang tutorial
## Let's code PPO from scratch with Costa Huang's tutorial
- For the core implementation of PPO we're going to use the excellent [Costa Huang](https://costa.sh/) tutorial.
- In addition to the tutorial, to go deeper you can read the 37 core implementation details: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
@@ -429,7 +429,7 @@ package_to_hub(
)
```
- Here's what look the ppo.py final file
- Here's what the final ppo.py file looks like:
```python
# docs and experiment results can be found at https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy
@@ -1034,7 +1034,7 @@ To be able to share your model with the community there are three more steps to
1⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join
2⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
2⃣ Sign in and get your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
@@ -1048,11 +1048,11 @@ notebook_login()
!git config --global credential.helper store
```
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
## Let's start the training 🔥
- Now that you've coded from scratch PPO and added the Hugging Face Integration, we're ready to start the training 🔥
- Now that you've coded PPO from scratch and added the Hugging Face Integration, we're ready to start the training 🔥
- First, you need to copy all your code to a file you create called `ppo.py`
@@ -1060,7 +1060,7 @@ If you don't want to use a Google Colab or a Jupyter Notebook, you need to use t
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/step2.png" alt="PPO"/>
- Now we just need to run this python script using `python <name-of-python-script>.py` with the additional parameters we defined with `argparse`
- Now we just need to run this python script using `python <name-of-python-script>.py` with the additional parameters we defined using `argparse`
- You should modify more hyperparameters otherwise the training will not be super stable.
@@ -1070,8 +1070,8 @@ If you don't want to use a Google Colab or a Jupyter Notebook, you need to use t
## Some additional challenges 🏆
The best way to learn **is to try things by your own**! Why not trying another environment?
The best way to learn **is to try things on your own**! Why not try another environment?
See you on Unit 8, part 2 where we going to train agents to play Doom 🔥
See you in Unit 8, part 2 where we're going to train agents to play Doom 🔥
## Keep learning, stay awesome 🤗

View File

@@ -214,9 +214,9 @@ Now that the setup if complete, we can train the agent. We have chosen here to l
The objective of this scenario is to **teach the agent how to survive without knowing what makes him survive**. Agent know only that **life is precious** and death is bad so **it must learn what prolongs his existence and that his health is connected with it**.
The objective of this scenario is to **teach the agent how to survive without knowing what makes it survive**. The Agent know only that **life is precious** and death is bad so **it must learn what prolongs its existence and that its health is connected with survival**.
Map is a rectangle containing walls and with a green, acidic floor which **hurts the player periodically**. Initially there are some medkits spread uniformly over the map. A new medkit falls from the skies every now and then. **Medkits heal some portions of player's health** - to survive agent needs to pick them up. Episode finishes after player's death or on timeout.
The map is a rectangle containing walls and with a green, acidic floor which **hurts the player periodically**. Initially there are some medkits spread uniformly over the map. A new medkit falls from the skies every now and then. **Medkits heal some portions of player's health** - to survive, the agent needs to pick them up. The episode finishes after the player's death or on timeout.
Further configuration:
- Living_reward = 1
@@ -232,7 +232,7 @@ There are also a number of more complex scenarios that have been create for ViZD
## Training the agent
- We're going to train the agent for 4000000 steps it will take approximately 20min
- We're going to train the agent for 4000000 steps. It will take approximately 20min
```python
## Start the training, this should take around 15 minutes
@@ -288,7 +288,7 @@ To be able to share your model with the community there are three more steps to
1⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join
2⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
2⃣ Sign in and get your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
@@ -296,7 +296,7 @@ To be able to share your model with the community there are three more steps to
- Copy the token
- Run the cell below and paste the token
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
```python
from huggingface_hub import notebook_login
@@ -330,7 +330,7 @@ status = enjoy(cfg)
This agent's performance was good, but can do better! Let's download and visualize an agent trained for 10B timesteps from the hub.
This agent's performance was good, but we can do better! Let's download and visualize an agent trained for 10B timesteps from the hub.
```bash
#download the agent from the hub
@@ -425,6 +425,6 @@ If you prefer an easier scenario, **why not try training in another ViZDoom scen
---
This concludes the last unit. But we are not finished yet! 🤗 The following **bonus section include some of the most interesting, advanced and cutting edge work in Deep Reinforcement Learning**.
This concludes the last unit. But we are not finished yet! 🤗 The following **bonus section include some of the most interesting, advanced, and cutting edge work in Deep Reinforcement Learning**.
## Keep learning, stay awesome 🤗

View File

@@ -2,12 +2,12 @@
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail2.png" alt="thumbnail"/>
In this second part of Unit 8, we'll get deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/), an **asynchronous implementation of the PPO algorithm**, to train our agent playing [vizdoom](https://vizdoom.cs.put.edu.pl/) (an open source version of Doom).
In this second part of Unit 8, we'll get deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/), an **asynchronous implementation of the PPO algorithm**, to train our agent to play [vizdoom](https://vizdoom.cs.put.edu.pl/) (an open source version of Doom).
In the notebook, **you'll train your agent to play the Health Gathering level**, where the agent must collect health packs to avoid dying. After that, you can **train your agent to play more complex levels, such as Deathmatch**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/environments.png" alt="Environment"/>
Sounds exciting? Let's get started! 🚀
Sound exciting? Let's get started! 🚀
The hands-on is made by [Edward Beeching](https://twitter.com/edwardbeeching), a Machine Learning Research Scientist at Hugging Face. He worked on Godot Reinforcement Learning Agents, an open-source interface for developing environments and agents in the Godot Game Engine.

View File

@@ -2,17 +2,17 @@
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail.png" alt="Unit 8"/>
In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that help to stabilize the training by reducing the variance with:
In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance with:
- *An Actor* that controls **how our agent behaves** (policy-based method).
- *A Critic* that measures **how good the action taken is** (value-based method).
Today we'll learn about Proximal Policy Optimization (PPO), an architecture that **improves our agent's training stability by avoiding too large policy updates**. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio from a specific range \\( [1 - \epsilon, 1 + \epsilon] \\) .
Today we'll learn about Proximal Policy Optimization (PPO), an architecture that **improves our agent's training stability by avoiding policy updates that are too large**. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio to a specific range \\( [1 - \epsilon, 1 + \epsilon] \\) .
Doing this will ensure **that our policy update will not be too large and that the training is more stable.**
This Unit is in two parts:
- In this first part, you'll learn the theory behind PPO and code your PPO agent from scratch using [CleanRL](https://github.com/vwxyzjn/cleanrl) implementation. To test its robustness with LunarLander-v2. LunarLander-v2 **is the first environment you used when you started this course**. At that time, you didn't know how PPO worked, and now, **you can code it from scratch and train it. How incredible is that 🤩**.
- In this first part, you'll learn the theory behind PPO and code your PPO agent from scratch using the [CleanRL](https://github.com/vwxyzjn/cleanrl) implementation. To test its robustness you'll use LunarLander-v2. LunarLander-v2 **is the first environment you used when you started this course**. At that time, you didn't know how PPO worked, and now, **you can code it from scratch and train it. How incredible is that 🤩**.
- In the second part, we'll get deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/) and train an agent playing vizdoom (an open source version of Doom).
<figure>
@@ -20,4 +20,4 @@ This Unit is in two parts:
<figcaption>These are the environments you're going to use to train your agents: VizDoom environments</figcaption>
</figure>
Sounds exciting? Let's get started! 🚀
Sound exciting? Let's get started! 🚀

View File

@@ -1,11 +1,11 @@
# The intuition behind PPO [[the-intuition-behind-ppo]]
The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change you make to the policy at each training epoch: **we want to avoid having too large policy updates.**
The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change you make to the policy at each training epoch: **we want to avoid having too large of a policy update.**
For two reasons:
- We know empirically that smaller policy updates during training are **more likely to converge to an optimal solution.**
- A too-big step in a policy update can result in falling “off the cliff” (getting a bad policy) **and having a long time or even no possibility to recover.**
- A too-big step in a policy update can result in falling “off the cliff” (getting a bad policy) **and taking a long time or even having no possibility to recover.**
<figure class="image table text-center m-0 w-full">
<img class="center" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/cliff.jpg" alt="Policy Update cliff"/>

View File

@@ -59,7 +59,7 @@ So we update our policy only if:
**You might wonder why, when the minimum is the clipped ratio, the gradient is 0.** When the ratio is clipped, the derivative in this case will not be the derivative of the \\( r_t(\theta) * A_t \\) but the derivative of either \\( (1 - \epsilon)* A_t\\) or the derivative of \\( (1 + \epsilon)* A_t\\) which both = 0.
To summarize, thanks to this clipped surrogate objective, **we restrict the range that the current policy can vary from the old one.** Because we remove the incentive for the probability ratio to move outside of the interval since, the clip have the effect to gradient. If the ratio is > \\( 1 + \epsilon \\) or < \\( 1 - \epsilon \\) the gradient will be equal to 0.
To summarize, thanks to this clipped surrogate objective, **we restrict the range that the current policy can vary from the old one.** Because we remove the incentive for the probability ratio to move outside of the interval since the clip forces the gradient to be zero. If the ratio is > \\( 1 + \epsilon \\) or < \\( 1 - \epsilon \\) the gradient will be equal to 0.
The final Clipped Surrogate Objective Loss for PPO Actor-Critic style looks like this, it's a combination of Clipped Surrogate Objective function, Value Loss Function and Entropy bonus: