Update intro and what are policy based mtd

This commit is contained in:
simoninithomas
2023-01-02 22:05:36 +01:00
parent c0c4f9b565
commit 88fded6cf3
3 changed files with 24 additions and 23 deletions

View File

@@ -110,12 +110,12 @@
title: Optuna
- local: unitbonus2/hands-on
title: Hands-on
- title: Unit 4. Policy Gradient with Robotics
- title: Unit 4. Policy Gradient with PyTorch
sections:
- local: unit4/introduction
title: Introduction
- local: unit4/what-are-policy-based-methods
title: What are the Policy Based methods?
title: What are the policy-based methods?
- local: unit4/advantages-disadvantages
title: The advantages and disadvantages of Policy-based methods
- local: unit4/policy-gradient

View File

@@ -8,13 +8,14 @@ Indeed, since the beginning of the course, we only studied value-based methods,
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
Because, in value-based, ** \\(π\\) exists only because of the action value estimates, since policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
Because, in value-based, policy ** \\(π\\) exists only because of the action value estimates, since the policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. Before testing its robustness using CartPole-v1, and PixelCopter.
So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
Before testing its robustness using CartPole-v1, and PixelCopter environments.
You'll be then able to iterate and improve this implementation for more advanced environments.
You'll then be able to iterate and improve this implementation for more advanced environments.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>

View File

@@ -1,44 +1,44 @@
# What are the policy-based methods?
The main goal of Reinforcement learning is to **find the optimal policy \\(\pi*\\) that will maximize the expected cumulative reward**.
Because Reinforcement Learning is based on the *reward hypothesis* that is **all goals can be described as the maximization of the expected cumulative reward.**
The main goal of Reinforcement learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**.
Because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
For instance, in a soccer game (that you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
maximizing the number of goal scored (when the ball cross the goal line) into your opponent soccer goals. And minimize the number of goals into yours soccer goals.
For instance, in a soccer game (where you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
**maximizing the number of goals scored** (when the ball crosses the goal line) into your opponent's soccer goals. And **minimizing the number of goals in your soccer goals**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/soccer.jpg" alt="Soccer" />
## Value-based, Policy-based and Actor-critic methods
## Value-based, Policy-based, and Actor-critic methods
We studied in the Unit 1, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi*\\).
We studied in the first unit, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi*\\).
- In *value-Based methods*, we learn a value function.
- The idea then is that an optimal value function leads to an optimal policy \\(\pi*\\).
- Our objective is to **minimize the loss between, predicted and target value to match the true action-value function.
- We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning we defined an epsilon-greedy policy.
- The idea then is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
- Our objective is to **minimize the loss between the predicted and target value to approximate the true action-value function.
- We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi*\\) without having to learn a value function.
- The idea then is to parameterize policy, for instance using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
- The idea then is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
- Our objective then is *to maximize the performance of the parameterized policy using gradient ascent*.
- To do that we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
- Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
- To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
- Finally, we'll study next time *actor-critic* which is a combination of value-based and policy-based methods.
- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy-based.png" alt="Policy based" />
Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
To do that we define an objective function \\(J(\theta)\\), that is the expected cumulative reward and we **want to find \\(\theta\\) that maximize this objective function**.
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.
## The difference between policy-based and policy-gradient methods
Policy-gradient methods, what we're going to study in this unit, is a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
Policy-gradient methods, what we're going to study in this unit, is a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
The difference between these two methods **lies on how we optimize the parameter** //(/theta//):
- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter //(/theta//) **indirectly** by maximize the local approximation of the objective function with techniques like hill climbing, simulated annealing or evolution strategies.
- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter //(/theta//) **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
- In *policy-gradient methods*, because we're a subclass of the policy-based methods, we search directly for the optimal policy. But we optimize the parameter //(/theta//) **directly** by performing the gradient ascent on the performance of the objective function \\(J(\theta)\\).
Before diving more into how works policy-gradient methods (the objective function, policy gradient theorem, gradient ascent etc.) let's study the advantages and disadvantages of policy-based methods.
Before diving more into how works policy-gradient methods (the objective function, policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.