Mirror of https://github.com/huggingface/deep-rl-class.git
Synced 2026-04-05 19:48:04 +08:00

Commit: Merge branch 'main' into ThomasSimonini/AdvancedTopics
1357	notebooks/unit8/unit8_part1.ipynb (new file)
File diff suppressed because it is too large
1068	notebooks/unit8/unit8_part1.mdx (new file)
File diff suppressed because it is too large
@@ -178,6 +178,22 @@
     title: Conclusion
   - local: unit7/additional-readings
     title: Additional Readings
+- title: Unit 8. Part 1 Proximal Policy Optimization (PPO)
+  sections:
+  - local: unit8/introduction
+    title: Introduction
+  - local: unit8/intuition-behind-ppo
+    title: The intuition behind PPO
+  - local: unit8/clipped-surrogate-objective
+    title: Introducing the Clipped Surrogate Objective Function
+  - local: unit8/visualize
+    title: Visualize the Clipped Surrogate Objective Function
+  - local: unit8/hands-on-cleanrl
+    title: PPO with CleanRL
+  - local: unit8/conclusion
+    title: Conclusion
+  - local: unit8/additional-readings
+    title: Additional Readings
 - title: Bonus Unit 3. Advanced Topics in Reinforcement Learning
   sections:
   - local: unitbonus3/introduction
@@ -369,7 +369,7 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
 ```

 ## Define the hyperparameters ⚙️
-The exploration related hyperparamters are some of the most important ones.
+The exploration related hyperparameters are some of the most important ones.

 - We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
 - If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
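The decay trade-off described in this hunk can be sketched with a common exponential schedule (a hedged illustration; the names `max_epsilon`, `min_epsilon`, and `decay_rate` are illustrative, not taken from this diff):

```python
import math

def decayed_epsilon(episode, max_epsilon=1.0, min_epsilon=0.05, decay_rate=0.005):
    """Exponentially decay epsilon from max_epsilon toward min_epsilon.

    Early episodes: epsilon close to max_epsilon, so the agent explores a lot.
    Late episodes: epsilon close to min_epsilon, so the agent mostly exploits.
    """
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)
```

A larger `decay_rate` shrinks epsilon faster, which is exactly the "agent gets stuck" risk mentioned above.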
21	units/en/unit8/additional-readings.mdx (new file)
@@ -0,0 +1,21 @@
# Additional Readings [[additional-readings]]

These are **optional readings** if you want to go deeper.

## PPO Explained

- [Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization by Daniel Bick](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf)
- [What is the way to understand Proximal Policy Optimization Algorithm in RL?](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl)
- [Foundations of Deep RL Series, L4 TRPO and PPO by Pieter Abbeel](https://youtu.be/KjWF8VIMGiY)
- [OpenAI PPO Blogpost](https://openai.com/blog/openai-baselines-ppo/)
- [Spinning Up RL PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
- [Paper Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)

## PPO Implementation details

- [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)
- [Part 1 of 3 — Proximal Policy Optimization Implementation: 11 Core Implementation Details](https://www.youtube.com/watch?v=MEt6rrxH8W4)

## Importance Sampling

- [Importance Sampling Explained](https://youtu.be/C3p2wI4RAi8)
69	units/en/unit8/clipped-surrogate-objective.mdx (new file)
@@ -0,0 +1,69 @@
# Introducing the Clipped Surrogate Objective Function

## Recap: The Policy Objective Function

Let's remember the objective that we optimize in Reinforce:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/lpg.jpg" alt="Reinforce"/>

The idea was that by taking a gradient ascent step on this function (equivalent to taking gradient descent on the negative of this function), we would **push our agent to take actions that lead to higher rewards and avoid harmful actions.**

However, the problem comes from the step size:
- Too small, and **the training process is too slow**
- Too high, and **there is too much variability in the training**

With PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change to a small range using a clip.**

This new function **is designed to avoid destructively large weight updates:**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ppo-surrogate.jpg" alt="PPO surrogate function"/>

Let's study each part to understand how it works.

## The Ratio Function

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ratio1.jpg" alt="Ratio"/>

This ratio is calculated this way:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ratio2.jpg" alt="Ratio"/>

It's the probability of taking action \\( a_t \\) at state \\( s_t \\) under the current policy, divided by the same probability under the previous policy.

As we can see, \\( r_t(\theta) \\) denotes the probability ratio between the current and old policy:

- If \\( r_t(\theta) > 1 \\), the **action \\( a_t \\) at state \\( s_t \\) is more likely in the current policy than in the old policy.**
- If \\( r_t(\theta) \\) is between 0 and 1, the **action is less likely for the current policy than for the old one**.

So this probability ratio is an **easy way to estimate the divergence between the old and current policy.**
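In practice, implementations usually compute this ratio from log-probabilities for numerical stability, as \\( r_t(\theta) = \exp(\log \pi_\theta(a_t|s_t) - \log \pi_{\theta_{old}}(a_t|s_t)) \\). A minimal sketch (the probabilities below are made-up numbers for illustration):

```python
import math

def prob_ratio(new_logprob, old_logprob):
    """r_t(theta): probability of the action under the current policy
    divided by its probability under the old policy, computed in log space."""
    return math.exp(new_logprob - old_logprob)

# The action became more likely under the current policy -> ratio > 1
more_likely = prob_ratio(math.log(0.6), math.log(0.4))

# The action became less likely -> ratio between 0 and 1
less_likely = prob_ratio(math.log(0.2), math.log(0.4))
```

A ratio close to 1 means the two policies barely differ for this action, which is what makes it a cheap divergence estimate.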
## The unclipped part of the Clipped Surrogate Objective function

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/unclipped1.jpg" alt="PPO"/>

This ratio **can replace the log probability we used in the policy objective function**. This gives us the left part of the new objective function: multiplying the ratio by the advantage.

<figure class="image table text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/unclipped2.jpg" alt="PPO"/>
  <figcaption><a href="https://arxiv.org/pdf/1707.06347.pdf">Proximal Policy Optimization Algorithms</a></figcaption>
</figure>

However, without a constraint, if the action taken is much more probable under our current policy than under our former one, **this would lead to a significant policy gradient step** and, therefore, an **excessive policy update.**

## The clipped part of the Clipped Surrogate Objective function

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/clipped.jpg" alt="PPO"/>

Consequently, we need to constrain this objective function by penalizing changes that push the ratio away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).

**By clipping the ratio, we ensure that we do not have a too-large policy update, because the current policy can't be too different from the old one.**

To do that, we have two solutions:

- *TRPO (Trust Region Policy Optimization)* uses KL divergence constraints outside the objective function to constrain the policy update. But this method **is complicated to implement and takes more computation time.**
- *PPO* clips the probability ratio directly in the objective function with its **Clipped surrogate objective function.**

The clipped part is a version of the objective where \\( r_t(\theta) \\) is clipped to \\( [1 - \epsilon, 1 + \epsilon] \\).

With the Clipped Surrogate Objective function, we have two probability ratios: one unclipped and one clipped to the range \\( [1 - \epsilon, 1 + \epsilon] \\), where epsilon is a hyperparameter that defines the clip range (in the paper, \\( \epsilon = 0.2 \\)).

Then, we take the minimum of the clipped and unclipped objectives, **so the final objective is a lower (pessimistic) bound of the unclipped objective.**

Taking the minimum of the clipped and unclipped objectives means **we'll select either the clipped or the unclipped objective based on the ratio and advantage situation**.
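As a minimal sketch (not the course's CleanRL code), the per-timestep objective described above can be written as:

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """L_CLIP = min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    unclipped = ratio * advantage
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(unclipped, clipped_ratio * advantage)

# Ratio inside the range: clipping changes nothing
inside = clipped_surrogate(1.1, advantage=1.0)   # same as 1.1 * 1.0

# Ratio far above the range, positive advantage: the objective is capped
capped = clipped_surrogate(2.0, advantage=1.0)   # capped at 1.2 * 1.0
```

Note that the minimum makes this a pessimistic bound: with a negative advantage, the clipped term can be the smaller one, which is what removes the incentive to drift outside the range.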
9	units/en/unit8/conclusion.mdx (new file)
@@ -0,0 +1,9 @@
# Conclusion [[Conclusion]]

That's all for today. Congrats on finishing this unit and the tutorial!

The best way to learn is to practice and try things out. **Why not improve the implementation to handle frames as input?**

See you in the second part of this Unit 🔥,

## Keep Learning, Stay awesome 🤗
1076	units/en/unit8/hands-on-cleanrl.mdx (new file)
File diff suppressed because it is too large
23	units/en/unit8/introduction.mdx (new file)
@@ -0,0 +1,23 @@
# Introduction [[introduction]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail.png" alt="Unit 8"/>

In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance with:

- *An Actor* that controls **how our agent behaves** (policy-based method).
- *A Critic* that measures **how good the action taken is** (value-based method).

Today we'll learn about Proximal Policy Optimization (PPO), an architecture that **improves our agent's training stability by avoiding policy updates that are too large**. To do that, we use a ratio that indicates the difference between our current and old policy, and clip this ratio to a specific range \\( [1 - \epsilon, 1 + \epsilon] \\).

Doing this ensures **that our policy update will not be too large and that the training is more stable.**

This Unit is in two parts:
- In this first part, you'll learn the theory behind PPO and code your PPO agent from scratch using the [CleanRL](https://github.com/vwxyzjn/cleanrl) implementation, then test its robustness with LunarLander-v2. LunarLander-v2 **is the first environment you used when you started this course**. At that time, you didn't know how PPO worked, and now **you can code it from scratch and train it. How incredible is that 🤩**.
- In the second part, we'll go deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/) and train an agent playing vizdoom (an open source version of Doom).

<figure>
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/environments.png" alt="Environment"/>
  <figcaption>These are the environments you're going to use to train your agents: VizDoom and GodotRL environments</figcaption>
</figure>

Sounds exciting? Let's get started! 🚀
16	units/en/unit8/intuition-behind-ppo.mdx (new file)
@@ -0,0 +1,16 @@
# The intuition behind PPO [[the-intuition-behind-ppo]]

The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change we make to the policy at each training epoch: **we want to avoid having too large policy updates.**

This is for two reasons:
- We know empirically that smaller policy updates during training are **more likely to converge to an optimal solution.**
- A too-big step in a policy update can result in falling "off the cliff" (getting a bad policy), **and it can take a long time, or even be impossible, to recover.**

<figure class="image table text-center m-0 w-full">
  <img class="center" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/cliff.jpg" alt="Policy Update cliff"/>
  <figcaption>Taking smaller policy updates to improve the training stability</figcaption>
  <figcaption>Modified version from RL — Proximal Policy Optimization (PPO) <a href="https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12">Explained by Jonathan Hui</a></figcaption>
</figure>

**So with PPO, we update the policy conservatively**. To do so, we need to measure how much the current policy changed compared to the former one, using a ratio between the current and former policy. And we clip this ratio to a range \\( [1 - \epsilon, 1 + \epsilon] \\), meaning that we **remove the incentive for the current policy to go too far from the old one (hence the term proximal policy).**
68	units/en/unit8/visualize.mdx (new file)
@@ -0,0 +1,68 @@
# Visualize the Clipped Surrogate Objective Function

Don't worry. **It's normal if this seems complex to handle right now**. But we're going to see what this Clipped Surrogate Objective Function looks like, and this will help you visualize better what's going on.

<figure class="image table text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
  <figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
</figure>

We have six different situations. Remember first that we take the minimum between the clipped and unclipped objectives.

## Case 1 and 2: the ratio is in the range

In situations 1 and 2, **the clipping does not apply, since the ratio is in the range** \\( [1 - \epsilon, 1 + \epsilon] \\).

In situation 1, we have a positive advantage: the **action is better than the average** of all the actions in that state. Therefore, we should encourage our current policy to increase the probability of taking that action in that state.

Since the ratio is in the interval, **we can increase our policy's probability of taking that action at that state.**

In situation 2, we have a negative advantage: the action is worse than the average of all actions at that state. Therefore, we should discourage our current policy from taking that action in that state.

Since the ratio is in the interval, **we can decrease the probability that our policy takes that action at that state.**

## Case 3 and 4: the ratio is below the range

<figure class="image table text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
  <figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
</figure>

If the probability ratio is lower than \\( 1 - \epsilon \\), the probability of taking that action at that state is much lower than with the old policy.

If, as in situation 3, the advantage estimate is positive (A > 0), then **you want to increase the probability of taking that action at that state.**

But if, as in situation 4, the advantage estimate is negative, **we don't want to decrease** the probability of taking that action at that state any further. Therefore, the gradient is 0 (since we're on a flat line), so we don't update our weights.

## Case 5 and 6: the ratio is above the range

<figure class="image table text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/recap.jpg" alt="PPO"/>
  <figcaption><a href="https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf">Table from "Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization" by Daniel Bick</a></figcaption>
</figure>

If the probability ratio is higher than \\( 1 + \epsilon \\), the probability of taking that action at that state in the current policy is **much higher than in the former policy.**

If, as in situation 5, the advantage is positive, **we don't want to get too greedy**. We already have a higher probability of taking that action at that state than the former policy had. Therefore, the gradient is 0 (since we're on a flat line), so we don't update our weights.

If, as in situation 6, the advantage is negative, we want to decrease the probability of taking that action at that state.

To recap, **we only update the policy with the unclipped objective part**. When the minimum is the clipped objective part, we don't update our policy weights, since the gradient will equal 0.

So we update our policy only if:
- Our ratio is in the range \\( [1 - \epsilon, 1 + \epsilon] \\)
- Our ratio is outside the range, but **the advantage leads to getting closer to the range**
  - The ratio is below the range, but the advantage is > 0
  - The ratio is above the range, but the advantage is < 0
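These six situations can be checked numerically. Here is a small sketch (with made-up ratio/advantage pairs and \\( \epsilon = 0.2 \\)) asking whether the minimum selects the ratio-dependent (unclipped) term, i.e. whether a gradient update happens:

```python
def update_happens(ratio, advantage, epsilon=0.2):
    """True when the min selects the unclipped (ratio-dependent) term,
    i.e. when the gradient with respect to the policy is non-zero."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    unclipped = ratio * advantage
    clipped = clipped_ratio * advantage
    return min(unclipped, clipped) == unclipped

# Cases 1-2: ratio in range           -> always update
# Case 3: ratio below range, A > 0    -> update (push ratio back toward the range)
# Case 4: ratio below range, A < 0    -> no update (clipped term selected, gradient 0)
# Case 5: ratio above range, A > 0    -> no update (clipped term selected, gradient 0)
# Case 6: ratio above range, A < 0    -> update (push ratio back toward the range)
```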
**You might wonder why, when the minimum is the clipped ratio, the gradient is 0.** When the ratio is clipped, the derivative in this case is not the derivative of \\( r_t(\theta) * A_t \\), but the derivative of either \\( (1 - \epsilon) * A_t \\) or \\( (1 + \epsilon) * A_t \\), which both equal 0.

To summarize, thanks to this clipped surrogate objective, **we restrict the range in which the current policy can vary from the old one**, because we remove the incentive for the probability ratio to move outside the interval: the clip forces the gradient there to zero. If the ratio is > \\( 1 + \epsilon \\) or < \\( 1 - \epsilon \\), the gradient will be equal to 0.

The final Clipped Surrogate Objective Loss for PPO Actor-Critic style looks like this: it's a combination of the Clipped Surrogate Objective function, the Value Loss Function and an Entropy bonus:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ppo-objective.jpg" alt="PPO objective"/>
That was quite complex. Take time to understand these situations by looking at the table and the graph. **You must understand why this makes sense.** If you want to go deeper, the best resource is the article [Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization by Daniel Bick, especially part 3.4](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf).
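As a rough scalar sketch of how the final PPO Actor-Critic loss is typically combined (the coefficients \\( c_1 = 0.5 \\) and \\( c_2 = 0.01 \\) below are common defaults in implementations, not values taken from this page):

```python
def ppo_loss(clip_objective, value_loss, entropy, c1=0.5, c2=0.01):
    """Loss to minimize: maximize the clipped surrogate objective and the
    entropy bonus, while minimizing the value (critic) loss."""
    return -clip_objective + c1 * value_loss - c2 * entropy
```

The entropy bonus encourages exploration by penalizing a prematurely deterministic policy.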