Update intro and what are policy based mtd
@@ -110,12 +110,12 @@
     title: Optuna
   - local: unitbonus2/hands-on
     title: Hands-on
-- title: Unit 4. Policy Gradient with Robotics
+- title: Unit 4. Policy Gradient with PyTorch
   sections:
   - local: unit4/introduction
     title: Introduction
   - local: unit4/what-are-policy-based-methods
-    title: What are the Policy Based methods?
+    title: What are the policy-based methods?
   - local: unit4/advantages-disadvantages
     title: The advantages and disadvantages of Policy-based methods
   - local: unit4/policy-gradient
@@ -8,13 +8,14 @@ Indeed, since the beginning of the course, we only studied value-based methods,
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
 
-Because, in value-based, ** \\(π\\) exists only because of the action value estimates, since policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
+Because, in value-based methods, the policy **\\(π\\) exists only because of the action-value estimates, since the policy is just a function** (for instance, a greedy policy) that selects the action with the highest value given a state.
 
 But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
 
-So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. Before testing its robustness using CartPole-v1, and PixelCopter.
+So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
+Then we'll test its robustness using the CartPole-v1 and PixelCopter environments.
 
-You'll be then able to iterate and improve this implementation for more advanced environments.
+You'll then be able to iterate and improve this implementation for more advanced environments.
 
 <figure class="image table text-center m-0 w-full">
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
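To make the CartPole-v1 target mentioned above concrete, here is a minimal sketch of a random-policy rollout, assuming the `gymnasium` package (the maintained fork of `gym`; the hands-on may pin a different API version). The Reinforce agent built in this unit has to beat this baseline:

```python
# Minimal sanity-check rollout in CartPole-v1 with a random policy.
# Assumes `gymnasium` is installed (`pip install gymnasium`).
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # random action: the baseline Reinforce must beat
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Random-policy episode return: {total_reward}")
env.close()
```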
@@ -1,44 +1,44 @@
 # What are the policy-based methods?
 
-The main goal of Reinforcement learning is to **find the optimal policy \\(\pi*\\) that will maximize the expected cumulative reward**.
-Because Reinforcement Learning is based on the *reward hypothesis* that is **all goals can be described as the maximization of the expected cumulative reward.**
+The main goal of Reinforcement Learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**.
+Because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
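In symbols, using the discounted-return notation from Unit 1, the goal above can be sketched as searching for

\\[\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}\Big]\\]

where \\(\gamma \in [0, 1)\\) is the discount factor and \\(r_{t+1}\\) is the reward collected at each step (a sketch of the objective, not a formal statement).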
 
-For instance, in a soccer game (that you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
-maximizing the number of goal scored (when the ball cross the goal line) into your opponent soccer goals. And minimize the number of goals into yours soccer goals.
+For instance, in a soccer game (where you're going to train agents two units from now), the goal is to win the game. We can describe this goal in reinforcement learning as
+**maximizing the number of goals scored** (when the ball crosses the goal line) in your opponent's goal, and **minimizing the number of goals scored in your own goal**.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/soccer.jpg" alt="Soccer" />
 
-## Value-based, Policy-based and Actor-critic methods
+## Value-based, Policy-based, and Actor-critic methods
 
-We studied in the Unit 1, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi*\\).
+We studied in the first unit that we have two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
 
 - In *value-based methods*, we learn a value function.
-  - The idea then is that an optimal value function leads to an optimal policy \\(\pi*\\).
-  - Our objective is to **minimize the loss between, predicted and target value to match the true action-value function.
-  - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning we defined an epsilon-greedy policy.
+  - The idea then is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
+  - Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
+  - We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
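As a reminder of how the implicit policy mentioned above looks in code, here is a minimal, illustrative epsilon-greedy rule read off a Q-table (the names `q_table` and `epsilon_greedy_policy` are hypothetical, not from the course hands-on):

```python
# Hypothetical sketch: an epsilon-greedy policy derived from a Q-table.
# The policy is implicit: it is just a rule applied to the value estimates.
import numpy as np

def epsilon_greedy_policy(q_table: np.ndarray, state: int, epsilon: float) -> int:
    if np.random.random() < epsilon:
        return np.random.randint(q_table.shape[1])  # explore: random action
    return int(np.argmax(q_table[state]))           # exploit: highest-valued action
```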
 
-- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi*\\) without having to learn a value function.
-  - The idea then is to parameterize policy, for instance using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
+- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
+  - The idea then is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\); this policy will output a probability distribution over actions (a stochastic policy).
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
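As a sketch of what such a parameterized stochastic policy can look like in PyTorch (sizes assume CartPole-v1: a 4-dimensional state and 2 discrete actions; the class and method names are illustrative, not the unit's final implementation):

```python
# Illustrative parameterized stochastic policy pi_theta in PyTorch.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, state_size: int = 4, action_size: int = 2, hidden_size: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1),  # turn scores into a probability distribution over actions
        )

    def act(self, state):
        probs = self.net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)  # stochastic policy: sample instead of taking the argmax
        action = dist.sample()
        return action.item(), dist.log_prob(action)  # the log-prob is what gradient ascent will use
```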
 
-  - Our objective then is *to maximize the performance of the parameterized policy using gradient ascent*.
-  - To do that we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
+  - Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
+  - To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
 
-- Finally, we'll study next time *actor-critic* which is a combination of value-based and policy-based methods.
+- Finally, we'll study *actor-critic* methods next time, which are a combination of value-based and policy-based methods.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy-based.png" alt="Policy based" />
 
 Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
-To do that we define an objective function \\(J(\theta)\\), that is the expected cumulative reward and we **want to find \\(\theta\\) that maximize this objective function**.
+To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find the \\(\theta\\) that maximizes this objective function**.
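A worked form of this objective and of the update it implies (a sketch; the precise definition and the policy gradient theorem come later in the unit) is

\\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r_{t+1}\Big], \qquad \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)\\]

where \\(\tau\\) is a trajectory sampled by following \\(\pi_\theta\\) and \\(\alpha\\) is the learning rate of the gradient-ascent step.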
 
 ## The difference between policy-based and policy-gradient methods
 
-Policy-gradient methods, what we're going to study in this unit, is a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
+Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since, for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
 
 The difference between these two methods **lies in how we optimize the parameter** \\(\theta\\):
 
-- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter //(/theta//) **indirectly** by maximize the local approximation of the objective function with techniques like hill climbing, simulated annealing or evolution strategies.
+- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing a local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
 - In *policy-gradient methods*, because they are a subclass of policy-based methods, we search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly** by performing gradient ascent on the performance of the objective function \\(J(\theta)\\).
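To make the contrast in these two bullets concrete, here is an illustrative sketch of the *indirect* option: random-perturbation hill climbing on a linear deterministic policy for CartPole-v1 (assumes `gymnasium`; not the course's implementation). A policy-gradient method would replace the random search below with the gradient-ascent step on \\(J(\theta)\\):

```python
# Illustrative hill climbing: optimize theta *indirectly*, no gradients involved.
import gymnasium as gym
import numpy as np

def episode_return(env, theta: np.ndarray) -> float:
    """Run one episode with a deterministic linear policy and return its score."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        action = int(obs @ theta > 0)  # linear policy over the 4-dim state, 2 actions
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    return total

env = gym.make("CartPole-v1")
theta = np.zeros(4)
best_return = episode_return(env, theta)
for _ in range(200):
    candidate = theta + 0.1 * np.random.randn(4)   # randomly perturb the parameters
    candidate_return = episode_return(env, candidate)
    if candidate_return >= best_return:            # keep the perturbation only if it helps
        theta, best_return = candidate, candidate_return

print(f"Best return found by hill climbing: {best_return}")
env.close()
```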
 
-Before diving more into how works policy-gradient methods (the objective function, policy gradient theorem, gradient ascent etc.) let's study the advantages and disadvantages of policy-based methods.
+Before diving deeper into how policy-gradient methods work (the objective function, the policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.