mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-13 18:00:45 +08:00
Modifications based on Omar feedback + cleanup
File diff suppressed because one or more lines are too long
@@ -117,9 +117,9 @@
   - local: unit4/what-are-policy-based-methods
     title: What are the policy-based methods?
   - local: unit4/advantages-disadvantages
-    title: The advantages and disadvantages of Policy-based methods
+    title: The advantages and disadvantages of policy-gradient methods
   - local: unit4/policy-gradient
-    title: Diving deeper into Policy-gradient
+    title: Diving deeper into policy-gradient
   - local: unit4/pg-theorem
     title: (Optional) the Policy Gradient Theorem
   - local: unit4/hands-on
@@ -21,7 +21,7 @@ You'll then be able to iterate and improve this implementation for more advanced

 To validate this hands-on for the certification process, you need to push your trained models to the Hub.

-- Get a result of >= 450 for `Cartpole-v1`.
+- Get a result of >= 350 for `Cartpole-v1`.
 - Get a result of >= 5 for `PixelCopter`.

 To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go to the bottom of the leaderboard page and click on the refresh button**.
@@ -4,7 +4,7 @@

 In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**

-Indeed, since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
+Since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**

 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
@@ -13,7 +13,7 @@ In value-based methods, the policy ** \\(π\\) only exists because of the actio
 But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**

 So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
-Before testing its robustness using CartPole-v1, and PixelCopter environments.
+Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.

 You'll then be able to iterate and improve this implementation for more advanced environments.
@@ -9,7 +9,7 @@ Let's first recap our different formulas:

 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>

-2. The probability of a trajectory (given that action comes from //(/pi_/theta//)):
+2. The probability of a trajectory (given that action comes from \\(\pi_\theta\\)):

 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
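As a quick numerical illustration of the trajectory-probability formula \\(P(\tau;\theta) = \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}|s_{t}, a_{t}) \pi_\theta(a_{t}|s_{t})\\) referenced in the hunk above, here is a minimal sketch on a toy two-state MDP (all numbers are made-up illustrative values, not from the course):

```python
import numpy as np

# Toy 2-state, 2-action MDP with made-up numbers (illustrative only).
mu = np.array([0.7, 0.3])                # initial state distribution mu(s0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],  # P[s, a, s']: transition dynamics
              [[0.5, 0.5], [0.6, 0.4]]])
pi = np.array([[0.6, 0.4],               # pi[s, a]: policy pi_theta(a|s)
               [0.1, 0.9]])

# A short trajectory tau = (s0, a0, s1, a1, s2).
states, actions = [0, 1, 0], [1, 0]

# P(tau; theta) = mu(s0) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t, a_t)
prob = mu[states[0]]
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    prob *= pi[s, a] * P[s, a, s_next]

print(prob)  # 0.7 * (0.4*0.8) * (0.1*0.5) ≈ 0.0112
```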
@@ -54,7 +54,7 @@ But we still have some mathematics work to do there: we need to simplify \\( \n

 We know that:

-\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})])\\
+\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})]\\)

 Where \\(\mu(s_0)\\) is the initial state distribution and \\( P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \\) is the state transition dynamics of the MDP.
@@ -69,7 +69,7 @@ We also know that the gradient of the sum is equal to the sum of gradients:
 Since neither the initial state distribution nor the state transition dynamics of the MDP depend on \\(\theta\\), the derivative of both terms is 0. So we can remove them:

 Since:
-\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and (\\ \nabla_\theta \mu(s_0) = 0\\)
+\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and \\( \nabla_\theta \mu(s_0) = 0\\)

 \\(\nabla_\theta log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\\)
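The cancellation in the hunk above (the transition and initial-state terms have zero gradient because they don't depend on \\(\theta\\)) can be checked numerically with finite differences. This sketch uses a made-up scalar-parameter softmax policy and made-up fixed dynamics:

```python
import numpy as np

# Softmax policy over 2 actions in each of 2 states, parameterized by a
# scalar theta (made-up feature values, illustrative only).
def log_pi(theta, s, a):
    logits = theta * np.array([[1.0, -1.0], [0.5, 2.0]])[s]
    return logits[a] - np.log(np.sum(np.exp(logits)))

# Fixed (theta-independent) initial distribution and transition dynamics.
log_mu = np.log(0.7)
log_P = np.log(np.array([[[0.9, 0.1], [0.2, 0.8]],
                         [[0.5, 0.5], [0.6, 0.4]]]))

states, actions = [0, 1, 0], [1, 0]

def log_prob_traj(theta):
    # log P(tau;theta) = log mu(s0) + sum_t [log P(s'|s,a) + log pi(a|s)]
    total = log_mu
    for t, a in enumerate(actions):
        total += log_P[states[t], a, states[t + 1]] + log_pi(theta, states[t], a)
    return total

def sum_log_pi(theta):
    # only the policy terms: sum_t log pi_theta(a_t|s_t)
    return sum(log_pi(theta, states[t], a) for t, a in enumerate(actions))

# Finite-difference gradients agree: the mu and P terms contribute zero
# gradient, so grad log P(tau;theta) == grad sum_t log pi_theta(a_t|s_t).
theta, eps = 0.3, 1e-6
g_full = (log_prob_traj(theta + eps) - log_prob_traj(theta - eps)) / (2 * eps)
g_pi = (sum_log_pi(theta + eps) - sum_log_pi(theta - eps)) / (2 * eps)
print(abs(g_full - g_pi))  # ~0
```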
@@ -2,7 +2,7 @@

 ## Getting the big picture

-We just learned that policy-gradient methods aim to find parameters (\\ \theta \\) that **maximize the expected return**.
+We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.

 The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
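A parameterized stochastic policy of the kind described in the hunk above can be sketched in a few lines of NumPy (a hypothetical linear-softmax policy with made-up sizes; the course's actual implementation is a PyTorch neural network):

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear layer + softmax mapping a state vector to a probability
# distribution over actions (the "action preferences" are the logits).
n_features, n_actions = 4, 2
theta = rng.normal(size=(n_actions, n_features))  # policy parameters

def policy(state):
    logits = theta @ state
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

state = rng.normal(size=n_features)
probs = policy(state)                             # distribution over actions
action = rng.choice(n_actions, p=probs)           # sampling makes it stochastic
print(probs, action)
```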
@@ -70,13 +70,15 @@ Our objective then is to maximize the expected cumulative rewards by finding \\(

 Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent*, since it gives the direction of the steepest increase of \\(J(\theta)\\).

 (If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).

 Our update step for gradient-ascent is:

-(\\ \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
+\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)

 We can repeatedly apply this update step in the hope that \\(\theta\\) converges to the value that maximizes \\(J(\theta)\\).

-However, we have two problems to derivate \\(J(\theta)\\):
+However, we have two problems to obtain the derivative of \\(J(\theta)\\):
 1. We can't calculate the true gradient of the objective function since it would imply calculating the probability of each possible trajectory, which is computationally super expensive.
 We want then to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
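The gradient-ascent update above, \\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\), can be illustrated on a toy concave objective (a made-up stand-in for the expected return, not the real RL objective):

```python
# Gradient ascent on J(theta) = -(theta - 3)^2, whose maximum is at theta = 3.
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(200):
    theta = theta + alpha * grad_J(theta)  # theta <- theta + alpha * grad J(theta)

print(theta)  # converges toward 3, the maximizer of J
```

Note the `+` sign: gradient descent would subtract the gradient to minimize; ascent adds it to climb toward the maximum.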
@@ -91,6 +93,7 @@ Fortunately we're going to use a solution called the Policy Gradient Theorem tha
 If you want to understand how we derive this formula that we will use to approximate the gradient, check the next (optional) section.

 ## The Reinforce algorithm (Monte Carlo Reinforce)

 The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):

 In a loop:
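To make the loop above concrete, here is a minimal Reinforce sketch on a toy two-armed bandit (one-step episodes with made-up rewards). It shows only the update rule \\(\theta \leftarrow \theta + \alpha * G * \nabla_\theta log \pi_\theta(a)\\), not the course's PyTorch / CartPole-v1 hands-on:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-armed bandit: each episode is a single action, so the episode
# return G is just that step's reward (made-up reward distributions).
true_means = np.array([0.2, 0.8])    # arm 1 pays more on average
theta = np.zeros(2)                  # softmax action preferences
alpha = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                # sample an action (an episode)
    G = rng.normal(true_means[a], 0.1)        # Monte-Carlo return of the episode
    grad_log_pi = -probs                      # grad log pi(a) for softmax:
    grad_log_pi[a] += 1.0                     #   one_hot(a) - probs
    theta += alpha * G * grad_log_pi          # Reinforce update

print(softmax(theta))  # the policy now strongly prefers arm 1
```

Since the higher-return arm receives larger pushes on average, the softmax probabilities drift toward it, which is the essence of "increase the probability of actions that led to high return".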
@@ -15,20 +15,18 @@ We studied in the first unit, that we had two methods to find (most of the time
 - In *value-based methods*, we learn a value function.
   - The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
   - Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
-  - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
+  - We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.

 - On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
   - The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).

-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
+- <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
 - Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
 - To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.

-- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.

 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />

+- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.

 Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
 To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.