From 88fded6cf35c37d2bb3dd723ec89ff8fb072877d Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Mon, 2 Jan 2023 22:05:36 +0100 Subject: [PATCH] Update intro and what are policy based mtd --- units/en/_toctree.yml | 4 +-- units/en/unit4/introduction.mdx | 7 ++-- .../unit4/what-are-policy-based-methods.mdx | 36 +++++++++---------- 3 files changed, 24 insertions(+), 23 deletions(-) diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index 35bfa4e..87e5fa0 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -110,12 +110,12 @@ title: Optuna - local: unitbonus2/hands-on title: Hands-on -- title: Unit 4. Policy Gradient with Robotics +- title: Unit 4. Policy Gradient with PyTorch sections: - local: unit4/introduction title: Introduction - local: unit4/what-are-policy-based-methods - title: What are the Policy Based methods? + title: What are the policy-based methods? - local: unit4/advantages-disadvantages title: The advantages and disadvantages of Policy-based methods - local: unit4/policy-gradient diff --git a/units/en/unit4/introduction.mdx b/units/en/unit4/introduction.mdx index aa96750..f45a6b8 100644 --- a/units/en/unit4/introduction.mdx +++ b/units/en/unit4/introduction.mdx @@ -8,13 +8,14 @@ Indeed, since the beginning of the course, we only studied value-based methods, Link value policy -Because, in value-based, ** \\(π\\) exists only because of the action value estimates, since policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state. +This is because, in value-based methods, the policy **\\(π\\) exists only because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that will select the action with the highest value given a state. 
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.** -So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. Before testing its robustness using CartPole-v1, and PixelCopter. +So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. +After that, we'll test its robustness using the CartPole-v1 and PixelCopter environments. -You'll be then able to iterate and improve this implementation for more advanced environments. +You'll then be able to iterate and improve this implementation for more advanced environments.
Environments diff --git a/units/en/unit4/what-are-policy-based-methods.mdx b/units/en/unit4/what-are-policy-based-methods.mdx index b86e647..5f0fa4e 100644 --- a/units/en/unit4/what-are-policy-based-methods.mdx +++ b/units/en/unit4/what-are-policy-based-methods.mdx @@ -1,44 +1,44 @@ # What are the policy-based methods? -The main goal of Reinforcement learning is to **find the optimal policy \\(\pi*\\) that will maximize the expected cumulative reward**. -Because Reinforcement Learning is based on the *reward hypothesis* that is **all goals can be described as the maximization of the expected cumulative reward.** +The main goal of Reinforcement Learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**. +This is because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.** -For instance, in a soccer game (that you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as -maximizing the number of goal scored (when the ball cross the goal line) into your opponent soccer goals. And minimize the number of goals into yours soccer goals. +For instance, in a soccer game (where you're going to train the agents two units from now), the goal is to win the game. We can describe this goal in reinforcement learning as +**maximizing the number of goals scored** (when the ball crosses the goal line) into your opponent's goal and **minimizing the number of goals scored in your own goal**. Soccer -## Value-based, Policy-based and Actor-critic methods +## Value-based, Policy-based, and Actor-critic methods -We studied in the Unit 1, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi*\\). +In the first unit, we studied two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\). 
- In *value-Based methods*, we learn a value function. - - The idea then is that an optimal value function leads to an optimal policy \\(\pi*\\). - - Our objective is to **minimize the loss between, predicted and target value to match the true action-value function. - - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning we defined an epsilon-greedy policy. + - The idea then is that an optimal value function leads to an optimal policy \\(\pi^{*}\\). + - Our objective is to **minimize the loss between the predicted and target values** to approximate the true action-value function. + - We have a policy, but it's implicit since it **is generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy. -- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi*\\) without having to learn a value function. - - The idea then is to parameterize policy, for instance using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy). +- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function. + - The idea then is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy). stochastic policy - - Our objective then is *to maximize the performance of the parameterized policy using gradient ascent*. - - To do that we control the parameter \\(\theta\\) that will affect the distribution of actions over a state. + - Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**. + - To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state. 
-- Finally, we'll study next time *actor-critic* which is a combination of value-based and policy-based methods. +- Finally, next time we'll study *actor-critic*, which is a combination of value-based and policy-based methods. Policy based Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return. -To do that we define an objective function \\(J(\theta)\\), that is the expected cumulative reward and we **want to find \\(\theta\\) that maximize this objective function**. +To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**. ## The difference between policy-based and policy-gradient methods -Policy-gradient methods, what we're going to study in this unit, is a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\). +Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since, for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\). The difference between these two methods **lies on how we optimize the parameter** //(/theta//): -- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter //(/theta//) **indirectly** by maximize the local approximation of the objective function with techniques like hill climbing, simulated annealing or evolution strategies. +- In *policy-based methods*, we search directly for the optimal policy. 
We can optimize the parameter \\(\theta\\) **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies. - In *policy-gradient methods*, because we're a subclass of the policy-based methods, we search directly for the optimal policy. But we optimize the parameter //(/theta//) **directly** by performing the gradient ascent on the performance of the objective function \\(J(\theta)\\). -Before diving more into how works policy-gradient methods (the objective function, policy gradient theorem, gradient ascent etc.) let's study the advantages and disadvantages of policy-based methods. +Before diving more into how policy-gradient methods work (the objective function, policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.
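A note for reviewers, outside the patch itself: the edited section describes parameterizing a policy \\(\pi_\theta\\) that outputs a probability distribution over actions, defining the objective \\(J(\theta)\\) as the expected cumulative reward, and improving \\(\theta\\) by gradient ascent. The snippet below is a minimal, hypothetical sketch of that idea, not part of the course files: a softmax policy over two actions in an invented one-step bandit, updated with the score-function (REINFORCE-style) estimate \\(r \, \nabla_\theta \log \pi_\theta(a)\\). The rewards, learning rate, and step count are all made up for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical one-step bandit (invented for this example):
# action 1 pays reward 1.0, action 0 pays 0.0.
REWARDS = [0.0, 1.0]

def softmax(logits):
    """Turn logits into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

theta = [0.0, 0.0]   # policy parameters: one logit per action
alpha = 0.1          # learning rate for gradient ascent on J(theta)

for _ in range(500):
    probs = softmax(theta)                          # pi_theta(a)
    action = 0 if random.random() < probs[0] else 1  # sample from the stochastic policy
    reward = REWARDS[action]
    # Gradient ascent with the score-function estimator.
    # For a softmax policy: d log pi(a) / d theta_k = 1{k == a} - pi(k).
    for k in range(len(theta)):
        grad_log = (1.0 if k == action else 0.0) - probs[k]
        theta[k] += alpha * reward * grad_log

print(softmax(theta))  # most probability mass ends up on the rewarding action
```

Because the policy is stochastic, exploration comes for free: both actions are tried early on, and the updates gradually shift the distribution toward the action with higher return, which is the gradient-ascent picture the section sketches in words.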