diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 35bfa4e..87e5fa0 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -110,12 +110,12 @@
title: Optuna
- local: unitbonus2/hands-on
title: Hands-on
-- title: Unit 4. Policy Gradient with Robotics
+- title: Unit 4. Policy Gradient with PyTorch
sections:
- local: unit4/introduction
title: Introduction
- local: unit4/what-are-policy-based-methods
- title: What are the Policy Based methods?
+ title: What are the policy-based methods?
- local: unit4/advantages-disadvantages
title: The advantages and disadvantages of Policy-based methods
- local: unit4/policy-gradient
diff --git a/units/en/unit4/introduction.mdx b/units/en/unit4/introduction.mdx
index aa96750..f45a6b8 100644
--- a/units/en/unit4/introduction.mdx
+++ b/units/en/unit4/introduction.mdx
@@ -8,13 +8,14 @@ Indeed, since the beginning of the course, we only studied value-based methods,
-Because, in value-based, ** \\(π\\) exists only because of the action value estimates, since policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
+Because, in value-based methods, the policy **\\(π\\) exists only because of the action value estimates, since the policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
-So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. Before testing its robustness using CartPole-v1, and PixelCopter.
+So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called policy-gradient methods**. Then we'll implement our first policy-gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
+After that, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
-You'll be then able to iterate and improve this implementation for more advanced environments.
+You'll then be able to iterate and improve this implementation for more advanced environments.
diff --git a/units/en/unit4/what-are-policy-based-methods.mdx b/units/en/unit4/what-are-policy-based-methods.mdx
index b86e647..5f0fa4e 100644
--- a/units/en/unit4/what-are-policy-based-methods.mdx
+++ b/units/en/unit4/what-are-policy-based-methods.mdx
@@ -1,44 +1,44 @@
# What are the policy-based methods?
-The main goal of Reinforcement learning is to **find the optimal policy \\(\pi*\\) that will maximize the expected cumulative reward**.
-Because Reinforcement Learning is based on the *reward hypothesis* that is **all goals can be described as the maximization of the expected cumulative reward.**
+The main goal of Reinforcement Learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**.
+Because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
-For instance, in a soccer game (that you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
-maximizing the number of goal scored (when the ball cross the goal line) into your opponent soccer goals. And minimize the number of goals into yours soccer goals.
+For instance, in a soccer game (in which you're going to train the agents two units from now), the goal is to win the game. We can describe this goal in reinforcement learning as
+**maximizing the number of goals scored** (when the ball crosses the goal line) in your opponent's goal, and **minimizing the number of goals scored in your own goal**.
-## Value-based, Policy-based and Actor-critic methods
+## Value-based, Policy-based, and Actor-critic methods
-We studied in the Unit 1, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi*\\).
+We studied in the first unit that we have two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
-- In *value-Based methods*, we learn a value function.
+- In *value-based methods*, we learn a value function.
- - The idea then is that an optimal value function leads to an optimal policy \\(\pi*\\).
- - Our objective is to **minimize the loss between, predicted and target value to match the true action-value function.
- - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning we defined an epsilon-greedy policy.
+ - The idea then is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
+ - Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
+ - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
-- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi*\\) without having to learn a value function.
- - The idea then is to parameterize policy, for instance using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
+- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
+ - The idea then is **to parameterize the policy**. For instance, we can use a neural network \\(\pi_\theta\\) that will output a probability distribution over actions (a stochastic policy).
- - Our objective then is *to maximize the performance of the parameterized policy using gradient ascent*.
- - To do that we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
+ - Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
+ - To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
-- Finally, we'll study next time *actor-critic* which is a combination of value-based and policy-based methods.
+- Finally, we'll study *actor-critic* methods next time, which are a combination of value-based and policy-based methods.
-Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
+Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
-To do that we define an objective function \\(J(\theta)\\), that is the expected cumulative reward and we **want to find \\(\theta\\) that maximize this objective function**.
+To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.
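As a minimal sketch of what such a parameterized policy \\(\pi_\theta\\) could look like in PyTorch (the `PolicyNetwork` name, layer sizes, and sizes for CartPole-v1 are illustrative assumptions, not the unit's actual implementation):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Parameterized policy pi_theta: maps a state to a probability
    distribution over actions (a stochastic policy)."""
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
            nn.Softmax(dim=-1),  # outputs pi_theta(a|s), a distribution summing to 1
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Illustrative sizes: CartPole-v1 has a 4-dimensional state and 2 actions
policy = PolicyNetwork(state_dim=4, n_actions=2)
probs = policy(torch.zeros(4))  # probability of each action in this state
```

The parameters \\(\theta\\) here are the network's weights; adjusting them changes the action distribution the policy outputs for each state.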
## The difference between policy-based and policy-gradient methods
-Policy-gradient methods, what we're going to study in this unit, is a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
+Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since, for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
-The difference between these two methods **lies on how we optimize the parameter** //(/theta//):
+The difference between these two methods **lies in how we optimize the parameter** \\(\theta\\):
-- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter //(/theta//) **indirectly** by maximize the local approximation of the objective function with techniques like hill climbing, simulated annealing or evolution strategies.
+- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
-- In *policy-gradient methods*, because we're a subclass of the policy-based methods, we search directly for the optimal policy. But we optimize the parameter //(/theta//) **directly** by performing the gradient ascent on the performance of the objective function \\(J(\theta)\\).
+- In *policy-gradient methods*, because they are a subclass of the policy-based methods, we also search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly** by performing gradient ascent on the performance of the objective function \\(J(\theta)\\).
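A toy sketch of one such gradient-ascent step (illustrative assumptions: the raw logits `theta` standing in for the policy parameters, a fixed return `G`, and a REINFORCE-style surrogate loss; PyTorch optimizers minimize, so ascending on an estimate of \\(J(\theta)\\) is done by descending on its negative):

```python
import torch

theta = torch.zeros(2, requires_grad=True)   # toy policy parameters (action logits)
optimizer = torch.optim.SGD([theta], lr=0.1)

probs = torch.softmax(theta, dim=-1)         # pi_theta(a|s) for a single toy state
action = torch.multinomial(probs, num_samples=1).item()  # sample an action
log_prob = torch.log(probs[action])
G = 1.0                                      # toy return of the sampled trajectory

loss = -log_prob * G                         # descending on -J estimate = ascending on J
optimizer.zero_grad()
loss.backward()
optimizer.step()                             # one gradient-ascent step on J(theta)
```

After the step, the parameters have moved so that the sampled action (which here received a positive return) becomes more probable.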
-Before diving more into how works policy-gradient methods (the objective function, policy gradient theorem, gradient ascent etc.) let's study the advantages and disadvantages of policy-based methods.
+Before diving deeper into how policy-gradient methods work (the objective function, the policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.