Modifications based on Omar feedback + cleanup

simoninithomas
2023-01-04 08:48:30 +01:00
parent 1c93606aec
commit 5dbb460d90
7 changed files with 18 additions and 1607 deletions

File diff suppressed because one or more lines are too long

View File

@@ -117,9 +117,9 @@
- local: unit4/what-are-policy-based-methods
title: What are the policy-based methods?
- local: unit4/advantages-disadvantages
-title: The advantages and disadvantages of Policy-based methods
+title: The advantages and disadvantages of policy-gradient methods
- local: unit4/policy-gradient
-title: Diving deeper into Policy-gradient
+title: Diving deeper into policy-gradient
- local: unit4/pg-theorem
title: (Optional) the Policy Gradient Theorem
- local: unit4/hands-on

View File

@@ -21,7 +21,7 @@ You'll then be able to iterate and improve this implementation for more advanced
To validate this hands-on for the certification process, you need to push your trained models to the Hub.
-- Get a result of >= 450 for `Cartpole-v1`.
+- Get a result of >= 350 for `Cartpole-v1`.
- Get a result of >= 5 for `PixelCopter`.
To find your result, go to the leaderboard and find your model; **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go to the bottom of the leaderboard page and click on the refresh button**.
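For reference, a minimal sketch of how that leaderboard score can be reproduced locally from your own evaluation episodes; the `episode_rewards` values below are made-up placeholders, not course data:

```python
import numpy as np

# Hypothetical total rewards from a handful of evaluation episodes
episode_rewards = [500.0, 430.0, 460.0, 390.0, 475.0]

mean_reward = np.mean(episode_rewards)
std_reward = np.std(episode_rewards)

# Leaderboard score used for certification: mean_reward - std of reward
result = mean_reward - std_reward
print(f"mean={mean_reward:.2f}  std={std_reward:.2f}  result={result:.2f}")
```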

View File

@@ -4,7 +4,7 @@
In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**
-Indeed, since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
+Since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
@@ -13,7 +13,7 @@ In value-based methods, the policy ** \\(π\\) only exists because of the actio
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
-Before testing its robustness using CartPole-v1, and PixelCopter environments.
+Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
You'll then be able to iterate and improve this implementation for more advanced environments.

View File

@@ -9,7 +9,7 @@ Let's first recap our different formulas:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
-2. The probability of a trajectory (given that action comes from //(/pi_/theta//)):
+2. The probability of a trajectory (given that action comes from \\(\pi_\theta\\)):
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
@@ -54,7 +54,7 @@ But we still have some mathematics work to do there: we need to simplify \\( \n
We know that:
-\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})])\\
+\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})]\\)
Where \\(\mu(s_0)\\) is the initial state distribution and \\( P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \\) is the state transition dynamics of the MDP.
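For readers, the quantity whose log-gradient is taken in the corrected line above is the trajectory likelihood implied by these definitions:

\\( P(\tau^{(i)};\theta) = \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)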
@@ -69,7 +69,7 @@ We also know that the gradient of the sum is equal to the sum of gradient:
Since neither the initial state distribution nor the state transition dynamics of the MDP depend on \\(\theta\\), the derivatives of both terms are 0. So we can remove them:
Since:
-\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and (\\ \nabla_\theta \mu(s_0) = 0\\)
+\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and \\( \nabla_\theta \mu(s_0) = 0\\)
\\(\nabla_\theta log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\\)
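Spelling out the intermediate step that connects these formulas, in the same notation: the log of the product becomes a sum of logs,

\\(\nabla_\theta log P(\tau^{(i)};\theta) = \nabla_\theta [ log \mu(s_0) + \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) + \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) ]\\)

and the first two terms vanish as stated above, leaving only the policy term.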

View File

@@ -2,7 +2,7 @@
## Getting the big picture
-We just learned that policy-gradient methods aim to find parameters (\\ \theta \\) that **maximize the expected return**.
+We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
@@ -70,13 +70,15 @@ Our objective then is to maximize the expected cumulative rewards by finding \\(
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent* since it gives the direction of the steepest increase of \\(J(\theta)\\).
(If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).
Our update step for gradient-ascent is:
-(\\ \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
+\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
We can repeatedly apply this update step in the hope that \\(\theta\\) converges to the value that maximizes \\(J(\theta)\\).
-However, we have two problems to derivate \\(J(\theta)\\):
+However, we have two problems to obtain the derivative of \\(J(\theta)\\):
1. We can't calculate the true gradient of the objective function since it would imply calculating the probability of each possible trajectory, which is computationally super expensive.
We then want to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
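To make the update rule above concrete, here is a minimal, self-contained PyTorch sketch of one gradient-ascent step; the toy objective `J` is purely illustrative, since in practice \\(J(\theta)\\) is the expected return and its gradient has to be estimated from sampled trajectories:

```python
import torch

theta = torch.tensor([0.0, 0.0], requires_grad=True)  # stand-in for the policy parameters
alpha = 0.1                                            # learning rate

def J(theta):
    # Toy differentiable stand-in for the objective J(theta); NOT the real expected return
    return -(theta - 2.0).pow(2).sum()

objective = J(theta)
objective.backward()              # fills theta.grad with the gradient of J w.r.t. theta

with torch.no_grad():
    theta += alpha * theta.grad   # gradient ASCENT: theta <- theta + alpha * grad J(theta)
    theta.grad.zero_()
```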
@@ -91,6 +93,7 @@ Fortunately we're going to use a solution called the Policy Gradient Theorem tha
If you want to understand how we derive this formula that we will use to approximate the gradient, check the next (optional) section.
## The Reinforce algorithm (Monte Carlo Reinforce)
The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
In a loop:
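The loop body itself is not shown in this hunk. As a hedged sketch of what one Reinforce update can look like in PyTorch (not the course's reference implementation), assuming you have already collected, for one episode, the per-step log-probabilities `log_probs` emitted by the policy and the discounted returns `returns`:

```python
import torch

def reinforce_loss(log_probs, returns):
    """Monte-Carlo policy-gradient loss for a single episode.

    log_probs: list of scalar tensors log pi_theta(a_t | s_t), kept in the graph while acting
    returns:   list of floats, the discounted return G_t computed from the episode's rewards
    """
    returns = torch.tensor(returns, dtype=torch.float32)
    # Maximizing sum_t G_t * log pi_theta(a_t | s_t) == minimizing its negative
    return -(torch.stack(log_probs) * returns).sum()

# Typical use: loss = reinforce_loss(log_probs, returns)
#              optimizer.zero_grad(); loss.backward(); optimizer.step()
```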

View File

@@ -15,20 +15,18 @@ We studied in the first unit, that we had two methods to find (most of the time
- In *value-based methods*, we learn a value function.
- The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
- Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
-- We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
+- We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
- The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
- <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
- Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
- To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
-- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
+- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.
Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.
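To ground this recap, a minimal sketch of such a parameterized stochastic policy in PyTorch; the 4-dimensional state and 2 discrete actions are assumptions matching CartPole-v1, not something fixed by the text:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Policy(nn.Module):
    def __init__(self, state_dim=4, n_actions=2, hidden=16):
        super().__init__()
        # pi_theta: maps a state to a probability distribution over actions
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),
        )

    def act(self, state):
        probs = self.net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)          # stochastic policy pi_theta(a|s)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```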