From 1c93606aece5611fce8b3ff71a30e59b7b9133bb Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Wed, 4 Jan 2023 08:22:31 +0100
Subject: [PATCH] Apply suggestions from code review

Co-authored-by: Omar Sanseviero
---
 units/en/unit4/advantages-disadvantages.mdx      | 8 ++++----
 units/en/unit4/introduction.mdx                  | 4 ++--
 units/en/unit4/policy-gradient.mdx               | 6 +++---
 units/en/unit4/what-are-policy-based-methods.mdx | 6 +++---
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/units/en/unit4/advantages-disadvantages.mdx b/units/en/unit4/advantages-disadvantages.mdx
index 3c60933..3b483a9 100644
--- a/units/en/unit4/advantages-disadvantages.mdx
+++ b/units/en/unit4/advantages-disadvantages.mdx
@@ -8,7 +8,7 @@ There are multiple advantages over value-based methods. Let's see some of them:
 
 ### The simplicity of integration
 
-Indeed, **we can estimate the policy directly without storing additional data (action values).**
+We can estimate the policy directly without storing additional data (action values).
 
 ### Policy-gradient methods can learn a stochastic policy
 
@@ -46,7 +46,7 @@ On the other hand, an optimal stochastic policy **will randomly move left or rig
 
 ### Policy-gradient methods are more effective in high-dimensional action spaces and continuous actions spaces
 
-Indeed, the problem with Deep Q-learning is that their **predictions assign a score (maximum expected future reward) for each possible action**, at each time step, given the current state.
+The problem with Deep Q-learning is that its **predictions assign a score (maximum expected future reward) to each possible action**, at each time step, given the current state.
 
 But what if we have an infinite possibility of actions?
 
@@ -67,8 +67,8 @@ On the other hand, in policy-gradient methods, stochastic policy action preferen
 
 Naturally, policy-gradient methods also have some disadvantages:
 
-- **Policy-gradient converges a lot of time on a local maximum instead of a global optimum.**
-- Policy-gradient goes less fast, **step by step: it can take longer to train (inefficient).**
+- **Frequently, policy-gradient converges on a local maximum instead of a global optimum.**
+- Policy-gradient converges more slowly, **step by step: it can take longer to train (inefficient).**
 - Policy-gradient can have high variance. We'll see in actor-critic unit why and how we can solve this problem.
 
 👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).
diff --git a/units/en/unit4/introduction.mdx b/units/en/unit4/introduction.mdx
index f45a6b8..4d6c8cd 100644
--- a/units/en/unit4/introduction.mdx
+++ b/units/en/unit4/introduction.mdx
@@ -8,11 +8,11 @@ Indeed, since the beginning of the course, we only studied value-based methods,
 
 Link value policy
 
-Because, in value-based, policy ** \\(π\\) exists only because of the action value estimates, since the policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
+In value-based methods, the policy **\\(π\\) exists only because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that selects the action with the highest value given a state.
 
 But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
 
-So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
+So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm, called Monte Carlo **Reinforce**, from scratch using PyTorch.
 
 Before testing its robustness using CartPole-v1, and PixelCopter environments. You'll then be able to iterate and improve this implementation for more advanced environments.
diff --git a/units/en/unit4/policy-gradient.mdx b/units/en/unit4/policy-gradient.mdx
index 0ca624f..549d212 100644
--- a/units/en/unit4/policy-gradient.mdx
+++ b/units/en/unit4/policy-gradient.mdx
@@ -15,7 +15,7 @@ If we take the example of CartPole-v1:
 
 Our goal with policy-gradient is to **control the probability distribution of actions** by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.** Each time the agent interacts with the environment, we tweak the parameters such that good actions will be sampled more likely in the future.
 
-But **how we're going to optimize the weights using the expected return**?
+But **how are we going to optimize the weights using the expected return**?
 
 The idea is that we're going to **let the agent interact during an episode**. And if we win the episode, we consider that each action taken was good and must be more sampled in the future since they lead to win.
@@ -80,7 +80,7 @@ However, we have two problems to derivate \\(J(\theta)\\):
 1. We can't calculate the true gradient of the objective function since it would imply calculating the probability of each possible trajectory which is computationally super expensive.
 We want then to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
 
-2. We have another problem that I detail in the optional next section. That is, to differentiate this objective function, we need to differentiate the state distribution, also called Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
+2. We have another problem, which I detail in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called the Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
 
 Probability
@@ -91,7 +91,7 @@ Fortunately we're going to use a solution called the Policy Gradient Theorem tha
 If you want to understand how we derivate this formula that we will use to approximate the gradient, check the next (optional) section.
 
 ## The Reinforce algorithm (Monte Carlo Reinforce)
-The Reinforce algorithm also called Monte-Carlo policy-gradient is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
+The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
 
 In a loop:
 - Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
diff --git a/units/en/unit4/what-are-policy-based-methods.mdx b/units/en/unit4/what-are-policy-based-methods.mdx
index 2ebed11..1d30027 100644
--- a/units/en/unit4/what-are-policy-based-methods.mdx
+++ b/units/en/unit4/what-are-policy-based-methods.mdx
@@ -12,13 +12,13 @@ For instance, in a soccer game (where you're going to train the agents in two un
 
 We studied in the first unit, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi*\\).
 
-- In *value-Based methods*, we learn a value function.
-  - The idea then is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
+- In *value-based methods*, we learn a value function.
+  - The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
   - Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
   - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
 
 - On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
-  - The idea then is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
+  - The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
 
 stochastic policy
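The Reinforce loop described in the patched `policy-gradient.mdx` (use \\(\pi_\theta\\) to collect an episode \\(\tau\\), estimate the return, update \\(\theta\\) by gradient ascent) can be sketched in a few lines. This is a minimal NumPy sketch on a hypothetical 5-state corridor environment of my own invention, not the course's PyTorch implementation; it uses a tabular softmax policy so that \\(\nabla_\theta \log \pi_\theta(a|s)\\) can be written out explicitly.

```python
import numpy as np

# Hypothetical corridor used only for illustration (not a course environment):
# states 0..4, actions 0 = left / 1 = right, reward 1 for reaching state 4,
# which ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def env_step(state, action):
    next_state = min(GOAL, max(0, state + (1 if action == 1 else -1)))
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce(n_episodes=2000, alpha=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))  # tabular policy parameters
    for _ in range(n_episodes):
        # 1) Use the policy pi_theta to collect an episode tau
        state, episode, done = 0, [], False
        for _ in range(100):                 # cap the episode length
            probs = softmax(theta[state])
            action = int(rng.choice(N_ACTIONS, p=probs))
            next_state, reward, done = env_step(state, action)
            episode.append((state, action, reward))
            state = next_state
            if done:
                break
        # 2) Compute the discounted return G_t at every step of the episode
        returns, G = [], 0.0
        for _, _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        # 3) Gradient ascent on J(theta): for a tabular softmax policy,
        #    grad_theta log pi(a|s) = one_hot(a) - pi(.|s)
        for (s, a, _), G in zip(episode, returns):
            grad_log_pi = -softmax(theta[s])
            grad_log_pi[a] += 1.0
            theta[s] += alpha * G * grad_log_pi
    return theta

theta = reinforce()
greedy_actions = [int(np.argmax(theta[s])) for s in range(GOAL)]
print(greedy_actions)  # actions the trained policy prefers in states 0..3
```

In the course itself the policy is a neural network and the same update is expressed as a PyTorch loss of the form `-log_prob * G`, letting autograd compute \\(\nabla_\theta \log \pi_\theta\\) instead of writing it by hand as above.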