From 88fded6cf35c37d2bb3dd723ec89ff8fb072877d Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Mon, 2 Jan 2023 22:05:36 +0100 Subject: [PATCH] Update intro and what are policy based mtd --- units/en/_toctree.yml | 4 +-- units/en/unit4/introduction.mdx | 7 ++-- .../unit4/what-are-policy-based-methods.mdx | 36 +++++++++---------- 3 files changed, 24 insertions(+), 23 deletions(-) diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index 35bfa4e..87e5fa0 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -110,12 +110,12 @@ title: Optuna - local: unitbonus2/hands-on title: Hands-on -- title: Unit 4. Policy Gradient with Robotics +- title: Unit 4. Policy Gradient with PyTorch sections: - local: unit4/introduction title: Introduction - local: unit4/what-are-policy-based-methods - title: What are the Policy Based methods? + title: What are the policy-based methods? - local: unit4/advantages-disadvantages title: The advantages and disadvantages of Policy-based methods - local: unit4/policy-gradient diff --git a/units/en/unit4/introduction.mdx b/units/en/unit4/introduction.mdx index aa96750..f45a6b8 100644 --- a/units/en/unit4/introduction.mdx +++ b/units/en/unit4/introduction.mdx @@ -8,13 +8,14 @@ Indeed, since the beginning of the course, we only studied value-based methods, Link value policy -Because, in value-based, ** \\(π\\) exists only because of the action value estimates, since policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state. +This is because, in value-based methods, the policy **\\(π\\) exists only because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that will select the action with the highest value given a state. 
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.** -So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. Before testing its robustness using CartPole-v1, and PixelCopter. +So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. +After that, we'll test its robustness using the CartPole-v1 and PixelCopter environments. -You'll be then able to iterate and improve this implementation for more advanced environments. +You'll then be able to iterate and improve this implementation for more advanced environments.
Environments diff --git a/units/en/unit4/what-are-policy-based-methods.mdx b/units/en/unit4/what-are-policy-based-methods.mdx index b86e647..5f0fa4e 100644 --- a/units/en/unit4/what-are-policy-based-methods.mdx +++ b/units/en/unit4/what-are-policy-based-methods.mdx @@ -1,44 +1,44 @@ # What are the policy-based methods? -The main goal of Reinforcement learning is to **find the optimal policy \\(\pi*\\) that will maximize the expected cumulative reward**. -Because Reinforcement Learning is based on the *reward hypothesis* that is **all goals can be described as the maximization of the expected cumulative reward.** +The main goal of Reinforcement Learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**. +This is because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.** -For instance, in a soccer game (that you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as -maximizing the number of goal scored (when the ball cross the goal line) into your opponent soccer goals. And minimize the number of goals into yours soccer goals. +For instance, in a soccer game (where you're going to train the agents two units from now), the goal is to win the game. We can describe this goal in reinforcement learning as +**maximizing the number of goals scored** (when the ball crosses the goal line) into your opponent's goal and **minimizing the number of goals scored in your own goal**. Soccer -## Value-based, Policy-based and Actor-critic methods +## Value-based, Policy-based, and Actor-critic methods -We studied in the Unit 1, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi*\\). +In the first unit, we studied two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\). 
- In *value-Based methods*, we learn a value function. - - The idea then is that an optimal value function leads to an optimal policy \\(\pi*\\). - - Our objective is to **minimize the loss between, predicted and target value to match the true action-value function. - - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning we defined an epsilon-greedy policy. + - The idea then is that an optimal value function leads to an optimal policy \\(\pi^{*}\\). + - Our objective is to **minimize the loss between the predicted and target values** to approximate the true action-value function. + - We have a policy, but it's implicit since it **is generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy. -- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi*\\) without having to learn a value function. - - The idea then is to parameterize policy, for instance using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy). +- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function. + - The idea then is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy). stochastic policy - - Our objective then is *to maximize the performance of the parameterized policy using gradient ascent*. - - To do that we control the parameter \\(\theta\\) that will affect the distribution of actions over a state. + - Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**. + - To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state. 
-- Finally, we'll study next time *actor-critic* which is a combination of value-based and policy-based methods. +- Finally, next time we'll study *actor-critic*, which is a combination of value-based and policy-based methods. Policy based Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return. -To do that we define an objective function \\(J(\theta)\\), that is the expected cumulative reward and we **want to find \\(\theta\\) that maximize this objective function**. +To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**. ## The difference between policy-based and policy-gradient methods -Policy-gradient methods, what we're going to study in this unit, is a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\). +Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since, for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\). The difference between these two methods **lies on how we optimize the parameter** //(/theta//): -- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter //(/theta//) **indirectly** by maximize the local approximation of the objective function with techniques like hill climbing, simulated annealing or evolution strategies. +- In *policy-based methods*, we search directly for the optimal policy. 
We can optimize the parameter \\(\theta\\) **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies. - In *policy-gradient methods*, because we're a subclass of the policy-based methods, we search directly for the optimal policy. But we optimize the parameter //(/theta//) **directly** by performing the gradient ascent on the performance of the objective function \\(J(\theta)\\). -Before diving more into how works policy-gradient methods (the objective function, policy gradient theorem, gradient ascent etc.) let's study the advantages and disadvantages of policy-based methods. +Before diving more into how policy-gradient methods work (the objective function, policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.
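A note for reviewers, outside the patch itself: the edited section describes parameterizing a policy \\(\pi_\theta\\) that outputs a probability distribution over actions, defining the objective \\(J(\theta)\\) as the expected cumulative reward, and improving \\(\theta\\) by gradient ascent. The snippet below is a minimal, hypothetical sketch of that idea, not part of the course files: a softmax policy over two actions in an invented one-step bandit, updated with the score-function (REINFORCE-style) estimate \\(r \, \nabla_\theta \log \pi_\theta(a)\\). The rewards, learning rate, and step count are all made up for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical one-step bandit (invented for this example):
# action 1 pays reward 1.0, action 0 pays 0.0.
REWARDS = [0.0, 1.0]

def softmax(logits):
    """Turn logits into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

theta = [0.0, 0.0]   # policy parameters: one logit per action
alpha = 0.1          # learning rate for gradient ascent on J(theta)

for _ in range(500):
    probs = softmax(theta)                          # pi_theta(a)
    action = 0 if random.random() < probs[0] else 1  # sample from the stochastic policy
    reward = REWARDS[action]
    # Gradient ascent with the score-function estimator.
    # For a softmax policy: d log pi(a) / d theta_k = 1{k == a} - pi(k).
    for k in range(len(theta)):
        grad_log = (1.0 if k == action else 0.0) - probs[k]
        theta[k] += alpha * reward * grad_log

print(softmax(theta))  # most probability mass ends up on the rewarding action
```

Because the policy is stochastic, exploration comes for free: both actions are tried early on, and the updates gradually shift the distribution toward the action with higher return, which is the gradient-ascent picture the section sketches in words.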