diff --git a/units/en/unit8/clipped-surrogate-objective.mdx b/units/en/unit8/clipped-surrogate-objective.mdx
index 9319b3e..b2179db 100644
--- a/units/en/unit8/clipped-surrogate-objective.mdx
+++ b/units/en/unit8/clipped-surrogate-objective.mdx
@@ -1,7 +1,7 @@
# Introducing the Clipped Surrogate Objective Function
## Recap: The Policy Objective Function
-Let’s remember what is the objective to optimize in Reinforce:
+Let’s recall the objective we optimize in Reinforce:
The idea was that by taking a gradient ascent step on this function (equivalent to taking a gradient descent step on the negative of this function), we would **push our agent to take actions that lead to higher rewards and avoid harmful actions.**
@@ -10,9 +10,9 @@ However, the problem comes from the step size:
- Too small, **the training process was too slow**
- Too high, **there was too much variability in the training**
-Here with PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change in a small range using a clip.**
+With PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change in a small range using a clip.**
-This new function **is designed to avoid destructive large weights updates** :
+This new function **is designed to avoid destructively large weight updates**:
@@ -21,11 +21,11 @@ Let’s study each part to understand how it works.
## The Ratio Function
-This ratio is calculated this way:
+This ratio is calculated as follows:
-It’s the probability of taking action \\( a_t \\) at state \\( s_t \\) in the current policy divided by the previous one.
+It’s the probability of taking action \\( a_t \\) at state \\( s_t \\) in the current policy, divided by the probability of the same action at the same state under the previous policy.
As we can see, \\( r_t(\theta) \\) denotes the probability ratio between the current and old policy:
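As an illustrative sketch (the function name and the use of log-probabilities are our own choices, not from the course), the ratio is typically computed from log-probabilities for numerical stability:

```python
import math

def prob_ratio(logprob_new, logprob_old):
    # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t),
    # computed as exp(log pi_new - log pi_old) to avoid dividing tiny probabilities
    return math.exp(logprob_new - logprob_old)

# identical policies give a ratio of exactly 1
print(prob_ratio(math.log(0.5), math.log(0.5)))  # → 1.0
```

A ratio greater than 1 means the action became more probable under the current policy; less than 1 means it became less probable.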
@@ -49,7 +49,7 @@ However, without a constraint, if the action taken is much more probable in our
-Consequently, we need to constrain this objective function by penalizing changes that lead to a ratio away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).
+Consequently, we need to constrain this objective function by penalizing changes that lead to a ratio far away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).
**By clipping the ratio, we ensure that we do not have too large a policy update, because the current policy can't be too different from the older one.**
@@ -62,7 +62,7 @@ To do that, we have two solutions:
This clipped part is a version where \\( r_t(\theta) \\) is clipped to the range \\( [1 - \epsilon, 1 + \epsilon] \\).
-With the Clipped Surrogate Objective function, we have two probability ratios, one non-clipped and one clipped in a range (between \\( [1 - \epsilon, 1 + \epsilon] \\), epsilon is a hyperparameter that helps us to define this clip range (in the paper \\( \epsilon = 0.2 \\).).
+With the Clipped Surrogate Objective function, we have two probability ratios: one non-clipped and one clipped to the range \\( [1 - \epsilon, 1 + \epsilon] \\), where epsilon is a hyperparameter that defines the clip range (in the paper, \\( \epsilon = 0.2 \\)).
Then, we take the minimum of the clipped and non-clipped objective, **so the final objective is a lower bound (pessimistic bound) of the unclipped objective.**
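The computation above can be sketched in a few lines (an illustrative sketch with our own function and argument names, not course code):

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    # unclipped objective: r_t(theta) * A_t
    unclipped = ratio * advantage
    # clipped objective: clip(r_t(theta), 1 - eps, 1 + eps) * A_t
    clipped = max(1 - epsilon, min(ratio, 1 + epsilon)) * advantage
    # take the minimum: a pessimistic (lower) bound of the unclipped objective
    return min(unclipped, clipped)

# positive advantage with ratio above 1 + epsilon: the clipped term wins
print(clipped_surrogate(ratio=1.5, advantage=2.0))  # 1.2 * 2.0 = 2.4
```

With a ratio inside \\( [1 - \epsilon, 1 + \epsilon] \\), both terms coincide and the objective is just the usual \\( r_t(\theta) A_t \\).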
diff --git a/units/en/unit8/conclusion-sf.mdx b/units/en/unit8/conclusion-sf.mdx
index 34c85df..7b82f91 100644
--- a/units/en/unit8/conclusion-sf.mdx
+++ b/units/en/unit8/conclusion-sf.mdx
@@ -6,8 +6,8 @@ Now that you've successfully trained your Doom agent, why not try deathmatch? Re
If you do it, don't hesitate to share your model in the `#rl-i-made-this` channel in our [discord server](https://www.hf.co/join/discord).
-This concludes the last unit, but we are not finished yet! 🤗 The following **bonus unit includes some of the most interesting, advanced and cutting edge work in Deep Reinforcement Learning**.
+This concludes the last unit, but we are not finished yet! 🤗 The following **bonus unit includes some of the most interesting, advanced, and cutting-edge work in Deep Reinforcement Learning**.
-See you next time 🔥,
+See you next time 🔥
## Keep Learning, Stay awesome 🤗
diff --git a/units/en/unit8/conclusion.mdx b/units/en/unit8/conclusion.mdx
index 7dc56e6..dd99c18 100644
--- a/units/en/unit8/conclusion.mdx
+++ b/units/en/unit8/conclusion.mdx
@@ -2,8 +2,8 @@
That’s all for today. Congrats on finishing this unit and the tutorial!
-The best way to learn is to practice and try stuff. **Why not improving the implementation to handle frames as input?**.
+The best way to learn is to practice and try stuff. **Why not improve the implementation to handle frames as input?**
-See you on second part of this Unit 🔥,
+See you in the second part of this Unit 🔥
## Keep Learning, Stay awesome 🤗
diff --git a/units/en/unit8/hands-on-cleanrl.mdx b/units/en/unit8/hands-on-cleanrl.mdx
index 88d1033..10d8426 100644
--- a/units/en/unit8/hands-on-cleanrl.mdx
+++ b/units/en/unit8/hands-on-cleanrl.mdx
@@ -31,9 +31,9 @@ Then, to test its robustness, we're going to train it in:
-And finally, we will be push the trained model to the Hub to evaluate and visualize your agent playing.
+And finally, we will push the trained model to the Hub to evaluate and visualize your agent playing.
-LunarLander-v2 is the first environment you used when you started this course. At that time, you didn't know how it worked, and now, you can code it from scratch and train it. **How incredible is that 🤩.**
+LunarLander-v2 is the first environment you used when you started this course. At that time, you didn't know how it worked, and now you can code it from scratch and train it. **How incredible is that 🤩.**
@@ -1048,11 +1048,11 @@ notebook_login()
!git config --global credential.helper store
```
-If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
+If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
## Let's start the training 🔥
-- Now that you've coded from scratch PPO and added the Hugging Face Integration, we're ready to start the training 🔥
+- Now that you've coded PPO from scratch and added the Hugging Face integration, we're ready to start the training 🔥
- First, you need to copy all your code to a file you create called `ppo.py`
@@ -1060,7 +1060,7 @@ If you don't want to use a Google Colab or a Jupyter Notebook, you need to use t
-- Now we just need to run this python script using `python
@@ -296,7 +296,7 @@ To be able to share your model with the community there are three more steps to
- Copy the token
- Run the cell below and paste the token
-If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
+If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
```python
from huggingface_hub import notebook_login
@@ -330,7 +330,7 @@ status = enjoy(cfg)
-This agent's performance was good, but can do better! Let's download and visualize an agent trained for 10B timesteps from the hub.
+This agent's performance was good, but we can do better! Let's download and visualize an agent trained for 10B timesteps from the hub.
```bash
#download the agent from the hub
@@ -425,6 +425,6 @@ If you prefer an easier scenario, **why not try training in another ViZDoom scen
---
-This concludes the last unit. But we are not finished yet! 🤗 The following **bonus section include some of the most interesting, advanced and cutting edge work in Deep Reinforcement Learning**.
+This concludes the last unit. But we are not finished yet! 🤗 The following **bonus section includes some of the most interesting, advanced, and cutting-edge work in Deep Reinforcement Learning**.
## Keep learning, stay awesome 🤗
diff --git a/units/en/unit8/introduction-sf.mdx b/units/en/unit8/introduction-sf.mdx
index 9250cf4..a5dda98 100644
--- a/units/en/unit8/introduction-sf.mdx
+++ b/units/en/unit8/introduction-sf.mdx
@@ -2,12 +2,12 @@
-Sounds exciting? Let's get started! 🚀
+Sound exciting? Let's get started! 🚀
The hands-on is made by [Edward Beeching](https://twitter.com/edwardbeeching), a Machine Learning Research Scientist at Hugging Face. He worked on Godot Reinforcement Learning Agents, an open-source interface for developing environments and agents in the Godot Game Engine.
diff --git a/units/en/unit8/introduction.mdx b/units/en/unit8/introduction.mdx
index 6e8645d..bb6a45c 100644
--- a/units/en/unit8/introduction.mdx
+++ b/units/en/unit8/introduction.mdx
@@ -2,17 +2,17 @@
diff --git a/units/en/unit8/visualize.mdx b/units/en/unit8/visualize.mdx
index 958b61c..af05a57 100644
--- a/units/en/unit8/visualize.mdx
+++ b/units/en/unit8/visualize.mdx
@@ -59,7 +59,7 @@ So we update our policy only if:
**You might wonder why, when the minimum is the clipped ratio, the gradient is 0.** When the ratio is clipped, the derivative in this case will not be the derivative of \\( r_t(\theta) * A_t \\) but the derivative of either \\( (1 - \epsilon)* A_t\\) or \\( (1 + \epsilon)* A_t\\), both of which are constant with respect to \\( \theta \\), so their derivative is 0.
-To summarize, thanks to this clipped surrogate objective, **we restrict the range that the current policy can vary from the old one.** Because we remove the incentive for the probability ratio to move outside of the interval since, the clip have the effect to gradient. If the ratio is > \\( 1 + \epsilon \\) or < \\( 1 - \epsilon \\) the gradient will be equal to 0.
+To summarize, thanks to this clipped surrogate objective, **we restrict the range in which the current policy can vary from the old one.** We remove the incentive for the probability ratio to move outside the interval, since the clip forces the gradient to zero there: if the ratio is > \\( 1 + \epsilon \\) or < \\( 1 - \epsilon \\), the gradient will be equal to 0.
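We can check that claim numerically (an illustrative sketch, not course code): when the ratio sits outside the clip range and the clipped term is the minimum, a small nudge to the ratio leaves the objective unchanged, i.e. the gradient with respect to the ratio is 0:

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    # min of the unclipped objective and the clipped objective
    unclipped = ratio * advantage
    clipped = max(1 - epsilon, min(ratio, 1 + epsilon)) * advantage
    return min(unclipped, clipped)

# ratio > 1 + epsilon with a positive advantage: the objective is flat,
# so perturbing the ratio does not change the objective at all
print(clipped_surrogate(1.5, 2.0) == clipped_surrogate(1.5 + 1e-4, 2.0))  # True
```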
The final Clipped Surrogate Objective Loss for PPO Actor-Critic style looks like this; it's a combination of the Clipped Surrogate Objective function, the Value Loss Function, and an Entropy bonus:
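As a minimal sketch of that combination (the coefficient names follow the PPO paper's \\( c_1 \\) and \\( c_2 \\); the concrete values and function names here are illustrative assumptions, not course code):

```python
def ppo_loss(clip_objective, value_loss, entropy, c1=0.5, c2=0.01):
    # We want to maximize the clipped surrogate objective and the entropy bonus
    # while minimizing the value loss, so the loss to minimize negates the
    # objective and the entropy term.
    return -clip_objective + c1 * value_loss - c2 * entropy

# example values for the three components
print(ppo_loss(clip_objective=2.4, value_loss=1.0, entropy=0.7))
```

The entropy bonus encourages exploration by penalizing policies that collapse too quickly to deterministic action choices.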