Apply suggestions from code review

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

commit 1c93606aec (parent 4cf68f25b3)
Author: Thomas Simonini
Date: 2023-01-04 08:22:31 +01:00
Committed by: GitHub
4 changed files with 12 additions and 12 deletions

View File

@@ -8,7 +8,7 @@ There are multiple advantages over value-based methods. Let's see some of them:
### The simplicity of integration
-Indeed, **we can estimate the policy directly without storing additional data (action values).**
+We can estimate the policy directly without storing additional data (action values).
### Policy-gradient methods can learn a stochastic policy
@@ -46,7 +46,7 @@ On the other hand, an optimal stochastic policy **will randomly move left or rig
### Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces
-Indeed, the problem with Deep Q-learning is that their **predictions assign a score (maximum expected future reward) for each possible action**, at each time step, given the current state.
+The problem with Deep Q-learning is that its **predictions assign a score (maximum expected future reward) for each possible action**, at each time step, given the current state.
But what if we have an infinite possibility of actions?
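To make the contrast concrete, here is a minimal PyTorch sketch (layer sizes, names, and the steering-angle example are illustrative assumptions, not the course's code): a Q-network needs one output head per discrete action, which breaks down for continuous actions, while a policy network can instead output the parameters of a distribution over a continuous action.

```python
import torch
import torch.nn as nn

# A Q-network needs one output per discrete action: this cannot enumerate
# a continuous action space (e.g. a steering angle in [-1, 1]).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # 2 discrete actions

# A policy network can instead output the parameters of a distribution
# over a continuous action (here: mean and log-std of a Gaussian).
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

policy = GaussianPolicy()
action = policy(torch.zeros(4)).sample()  # a continuous action, not an argmax over scores
```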
@@ -67,8 +67,8 @@ On the other hand, in policy-gradient methods, stochastic policy action preferen
Naturally, policy-gradient methods also have some disadvantages:
-- **Policy-gradient converges a lot of time on a local maximum instead of a global optimum.**
-- Policy-gradient goes less fast, **step by step: it can take longer to train (inefficient).**
+- **Frequently, policy-gradient converges on a local maximum instead of a global optimum.**
+- Policy-gradient progresses **step by step: it can take longer to train (inefficient).**
- Policy-gradient can have high variance. We'll see in the actor-critic unit why, and how we can solve this problem.
👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).

View File

@@ -8,11 +8,11 @@ Indeed, since the beginning of the course, we only studied value-based methods,
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
-Because, in value-based, policy ** \\(π\\) exists only because of the action value estimates, since the policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
+In value-based methods, the policy **\\(π\\) only exists because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that selects the action with the highest value given a state.
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
-So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
+So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
You'll then be able to iterate and improve this implementation for more advanced environments.

View File

@@ -15,7 +15,7 @@ If we take the example of CartPole-v1:
Our goal with policy-gradient is to **control the probability distribution of actions** by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.**
Each time the agent interacts with the environment, we tweak the parameters such that good actions are more likely to be sampled in the future.
-But **how we're going to optimize the weights using the expected return**?
+But **how are we going to optimize the weights using the expected return**?
The idea is that we're going to **let the agent interact during an episode**. If we win the episode, we consider that each action taken was good and must be sampled more often in the future, since it led to the win.
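The intuition above can be sketched in PyTorch (which this unit uses); the network sizes, the dummy episode data, and the win/loss return are illustrative assumptions, not the course's implementation. If the episode was won, we push up the log-probability of every action taken in it:

```python
import torch
import torch.nn as nn

# Sketch: reinforce all actions of a winning episode (illustrative data).
policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states = torch.randn(5, 4)           # states visited during the episode (dummy data)
actions = torch.randint(0, 2, (5,))  # actions taken at each step
episode_return = 1.0                 # e.g. +1 if we won, 0 otherwise

# Log-probability of each action actually taken, weighted by the outcome.
log_probs = torch.log(policy(states).gather(1, actions.unsqueeze(1)).squeeze(1))
loss = -(log_probs * episode_return).sum()  # minimize the negative = gradient ascent

optimizer.zero_grad()
loss.backward()
optimizer.step()
```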
@@ -80,7 +80,7 @@ However, we have two problems to derivate \\(J(\theta)\\):
1. We can't calculate the true gradient of the objective function, since it would imply calculating the probability of each possible trajectory, which is computationally super expensive.
We then want to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
-2. We have another problem that I detail in the optional next section. That is, to differentiate this objective function, we need to differentiate the state distribution, also called Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
+2. We have another problem that I detail in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
@@ -91,7 +91,7 @@ Fortunately we're going to use a solution called the Policy Gradient Theorem tha
If you want to understand how we derive this formula that we will use to approximate the gradient, check the next (optional) section.
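For reference, the sample-based gradient given by the Policy Gradient Theorem can be written in this unit's notation (with \\(R(\tau)\\) the return of a trajectory \\(\tau\\)); this is the standard statement of the result:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)
    \right]
```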
## The Reinforce algorithm (Monte Carlo Reinforce)
-The Reinforce algorithm also called Monte-Carlo policy-gradient is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
+The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
In a loop:
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
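The update step of this loop can be sketched in PyTorch (the course's framework); the network shapes, the `reinforce_update` helper, and the dummy episode are illustrative assumptions, not the course's notebook code:

```python
import torch
import torch.nn as nn

# Minimal Reinforce update sketch: one gradient step from one collected episode.
policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99  # discount factor

def reinforce_update(states, actions, rewards):
    # Discounted return G_t for every step, computed backwards from the end.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Log-probability of each action taken, weighted by its return.
    probs = policy(torch.stack(states))
    log_probs = torch.log(probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1))
    loss = -(log_probs * returns).sum()  # minimize the negative = gradient ascent

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy 3-step episode, just to show the call shape.
loss = reinforce_update([torch.randn(4) for _ in range(3)], [0, 1, 0], [1.0, 1.0, 1.0])
```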

View File

@@ -12,13 +12,13 @@ For instance, in a soccer game (where you're going to train the agents in two un
We studied in the first unit that we have two methods to find (most of the time, approximate) this optimal policy \\(\pi^{*}\\).
-- In *value-Based methods*, we learn a value function.
-- The idea then is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
+- In *value-based methods*, we learn a value function.
+- The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
- Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
- We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
-- The idea then is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
+- The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
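The implicit-vs-explicit contrast in the list above can be sketched in PyTorch; the layer sizes and the `epsilon_greedy` helper are illustrative assumptions, not the course's code:

```python
import torch
import torch.nn as nn

# Value-based: the policy is implicit, derived from Q-values (epsilon-greedy).
def epsilon_greedy(q_values, epsilon=0.1):
    if torch.rand(1).item() < epsilon:
        return torch.randint(len(q_values), (1,)).item()  # explore: random action
    return torch.argmax(q_values).item()                  # exploit: highest Q-value

# Policy-based: the parameterized network *is* the policy and outputs
# a probability distribution over actions (stochastic policy).
policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2), nn.Softmax(dim=-1))
state = torch.zeros(4)
dist = torch.distributions.Categorical(policy(state))
action = dist.sample().item()  # stochastic by construction
```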