mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-13 18:00:45 +08:00
Modifications based on Omar feedback + cleanup
File diff suppressed because one or more lines are too long
@@ -117,9 +117,9 @@
   - local: unit4/what-are-policy-based-methods
     title: What are the policy-based methods?
   - local: unit4/advantages-disadvantages
-    title: The advantages and disadvantages of Policy-based methods
+    title: The advantages and disadvantages of policy-gradient methods
   - local: unit4/policy-gradient
-    title: Diving deeper into Policy-gradient
+    title: Diving deeper into policy-gradient
   - local: unit4/pg-theorem
     title: (Optional) the Policy Gradient Theorem
   - local: unit4/hands-on
@@ -21,7 +21,7 @@ You'll then be able to iterate and improve this implementation for more advanced

 To validate this hands-on for the certification process, you need to push your trained models to the Hub.

-- Get a result of >= 450 for `Cartpole-v1`.
+- Get a result of >= 350 for `Cartpole-v1`.
 - Get a result of >= 5 for `PixelCopter`.

 To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go to the bottom of the leaderboard page and click on the refresh button**.
@@ -4,7 +4,7 @@

 In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**

-Indeed, since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
+Since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**

 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
@@ -13,7 +13,7 @@ In value-based methods, the policy ** \\(π\\) only exists because of the actio
 But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**

 So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
-Before testing its robustness using CartPole-v1, and PixelCopter environments.
+Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.

 You'll then be able to iterate and improve this implementation for more advanced environments.
@@ -9,7 +9,7 @@ Let's first recap our different formulas:

 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>

-2. The probability of a trajectory (given that action comes from //(/pi_/theta//)):
+2. The probability of a trajectory (given that action comes from \\(\pi_\theta\\)):

 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
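As a quick numerical illustration of the trajectory-probability formula \\(P(\tau;\theta) = \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}|s_{t}, a_{t}) \pi_\theta(a_{t}|s_{t})\\) referenced in the hunk above, here is a minimal sketch on a toy two-state MDP (all numbers are made-up illustrative values, not from the course):

```python
import numpy as np

# Toy 2-state, 2-action MDP with made-up numbers (illustrative only).
mu = np.array([0.7, 0.3])                # initial state distribution mu(s0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],  # P[s, a, s']: transition dynamics
              [[0.5, 0.5], [0.6, 0.4]]])
pi = np.array([[0.6, 0.4],               # pi[s, a]: policy pi_theta(a|s)
               [0.1, 0.9]])

# A short trajectory tau = (s0, a0, s1, a1, s2).
states, actions = [0, 1, 0], [1, 0]

# P(tau; theta) = mu(s0) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t, a_t)
prob = mu[states[0]]
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    prob *= pi[s, a] * P[s, a, s_next]

print(prob)  # 0.7 * (0.4*0.8) * (0.1*0.5) ≈ 0.0112
```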
@@ -54,7 +54,7 @@ But we still have some mathematics work to do there: we need to simplify \\( \n

 We know that:

-\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})])\\
+\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})]\\)

 Where \\(\mu(s_0)\\) is the initial state distribution and \\( P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \\) is the state transition dynamics of the MDP.
@@ -69,7 +69,7 @@ We also know that the gradient of the sum is equal to the sum of gradients:
 Since neither the initial state distribution nor the state transition dynamics of the MDP depend on \\(\theta\\), the derivative of both terms is 0. So we can remove them:

 Since:
-\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and (\\ \nabla_\theta \mu(s_0) = 0\\)
+\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and \\( \nabla_\theta \mu(s_0) = 0\\)

 \\(\nabla_\theta log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\\)
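The cancellation in the hunk above (the transition and initial-state terms have zero gradient because they don't depend on \\(\theta\\)) can be checked numerically with finite differences. This sketch uses a made-up scalar-parameter softmax policy and made-up fixed dynamics:

```python
import numpy as np

# Softmax policy over 2 actions in each of 2 states, parameterized by a
# scalar theta (made-up feature values, illustrative only).
def log_pi(theta, s, a):
    logits = theta * np.array([[1.0, -1.0], [0.5, 2.0]])[s]
    return logits[a] - np.log(np.sum(np.exp(logits)))

# Fixed (theta-independent) initial distribution and transition dynamics.
log_mu = np.log(0.7)
log_P = np.log(np.array([[[0.9, 0.1], [0.2, 0.8]],
                         [[0.5, 0.5], [0.6, 0.4]]]))

states, actions = [0, 1, 0], [1, 0]

def log_prob_traj(theta):
    # log P(tau;theta) = log mu(s0) + sum_t [log P(s'|s,a) + log pi(a|s)]
    total = log_mu
    for t, a in enumerate(actions):
        total += log_P[states[t], a, states[t + 1]] + log_pi(theta, states[t], a)
    return total

def sum_log_pi(theta):
    # only the policy terms: sum_t log pi_theta(a_t|s_t)
    return sum(log_pi(theta, states[t], a) for t, a in enumerate(actions))

# Finite-difference gradients agree: the mu and P terms contribute zero
# gradient, so grad log P(tau;theta) == grad sum_t log pi_theta(a_t|s_t).
theta, eps = 0.3, 1e-6
g_full = (log_prob_traj(theta + eps) - log_prob_traj(theta - eps)) / (2 * eps)
g_pi = (sum_log_pi(theta + eps) - sum_log_pi(theta - eps)) / (2 * eps)
print(abs(g_full - g_pi))  # ~0
```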
@@ -2,7 +2,7 @@

 ## Getting the big picture

-We just learned that policy-gradient methods aim to find parameters (\\ \theta \\) that **maximize the expected return**.
+We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.

 The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
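A parameterized stochastic policy of the kind described in the hunk above can be sketched in a few lines of NumPy (a hypothetical linear-softmax policy with made-up sizes; the course's actual implementation is a PyTorch neural network):

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear layer + softmax mapping a state vector to a probability
# distribution over actions (the "action preferences" are the logits).
n_features, n_actions = 4, 2
theta = rng.normal(size=(n_actions, n_features))  # policy parameters

def policy(state):
    logits = theta @ state
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

state = rng.normal(size=n_features)
probs = policy(state)                             # distribution over actions
action = rng.choice(n_actions, p=probs)           # sampling makes it stochastic
print(probs, action)
```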
@@ -70,13 +70,15 @@ Our objective then is to maximize the expected cumulative rewards by finding \\(

 Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent*, since it gives the direction of the steepest increase of \\(J(\theta)\\).

 (If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).

 Our update step for gradient-ascent is:

-(\\ \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
+\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)

 We can repeatedly apply this update step in the hope that \\(\theta\\) converges to the value that maximizes \\(J(\theta)\\).

-However, we have two problems to derivate \\(J(\theta)\\):
+However, we have two problems to obtain the derivative of \\(J(\theta)\\):
 1. We can't calculate the true gradient of the objective function since it would imply calculating the probability of each possible trajectory, which is computationally super expensive.
 We want then to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
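The gradient-ascent update above, \\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\), can be illustrated on a toy concave objective (a made-up stand-in for the expected return, not the real RL objective):

```python
# Gradient ascent on J(theta) = -(theta - 3)^2, whose maximum is at theta = 3.
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(200):
    theta = theta + alpha * grad_J(theta)  # theta <- theta + alpha * grad J(theta)

print(theta)  # converges toward 3, the maximizer of J
```

Note the `+` sign: gradient descent would subtract the gradient to minimize; ascent adds it to climb toward the maximum.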
@@ -91,6 +93,7 @@ Fortunately we're going to use a solution called the Policy Gradient Theorem tha
 If you want to understand how we derive this formula that we will use to approximate the gradient, check the next (optional) section.

 ## The Reinforce algorithm (Monte Carlo Reinforce)

 The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):

 In a loop:
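To make the loop above concrete, here is a minimal Reinforce sketch on a toy two-armed bandit (one-step episodes with made-up rewards). It shows only the update rule \\(\theta \leftarrow \theta + \alpha * G * \nabla_\theta log \pi_\theta(a)\\), not the course's PyTorch / CartPole-v1 hands-on:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-armed bandit: each episode is a single action, so the episode
# return G is just that step's reward (made-up reward distributions).
true_means = np.array([0.2, 0.8])    # arm 1 pays more on average
theta = np.zeros(2)                  # softmax action preferences
alpha = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                # sample an action (an episode)
    G = rng.normal(true_means[a], 0.1)        # Monte-Carlo return of the episode
    grad_log_pi = -probs                      # grad log pi(a) for softmax:
    grad_log_pi[a] += 1.0                     #   one_hot(a) - probs
    theta += alpha * G * grad_log_pi          # Reinforce update

print(softmax(theta))  # the policy now strongly prefers arm 1
```

Since the higher-return arm receives larger pushes on average, the softmax probabilities drift toward it, which is the essence of "increase the probability of actions that led to high return".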
@@ -15,20 +15,18 @@ We studied in the first unit, that we had two methods to find (most of the time
 - In *value-based methods*, we learn a value function.
   - The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
   - Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
-  - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
+  - We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.

 - On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
   - The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).

-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
+- <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
 - Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
 - To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.

-- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.

 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />

+- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.

 Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
 To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.