Update intro and what are policy based mtd
@@ -110,12 +110,12 @@
     title: Optuna
   - local: unitbonus2/hands-on
     title: Hands-on
-- title: Unit 4. Policy Gradient with Robotics
+- title: Unit 4. Policy Gradient with PyTorch
   sections:
   - local: unit4/introduction
     title: Introduction
   - local: unit4/what-are-policy-based-methods
-    title: What are the Policy Based methods?
+    title: What are the policy-based methods?
   - local: unit4/advantages-disadvantages
     title: The advantages and disadvantages of Policy-based methods
   - local: unit4/policy-gradient
@@ -8,13 +8,14 @@ Indeed, since the beginning of the course, we only studied value-based methods,
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
 
-Because, in value-based, ** \\(π\\) exists only because of the action value estimates, since policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
+Because, in value-based methods, the policy **\\(π\\) exists only because of the action-value estimates, since the policy is just a function** (for instance, a greedy policy) that selects the action with the highest value given a state.
 
 But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
 
-So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. Before testing its robustness using CartPole-v1, and PixelCopter.
+So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
+Then we'll test its robustness using the CartPole-v1 and PixelCopter environments.
 
-You'll be then able to iterate and improve this implementation for more advanced environments.
+You'll then be able to iterate and improve this implementation for more advanced environments.
 
 <figure class="image table text-center m-0 w-full">
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
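To make the CartPole-v1 target mentioned above concrete, here is a minimal sketch of a random-policy rollout, assuming the `gymnasium` package (the maintained fork of `gym`; the hands-on may pin a different API version). The Reinforce agent built in this unit has to beat this baseline:

```python
# Minimal sanity-check rollout in CartPole-v1 with a random policy.
# Assumes `gymnasium` is installed (`pip install gymnasium`).
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # random action: the baseline Reinforce must beat
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Random-policy episode return: {total_reward}")
env.close()
```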
@@ -1,44 +1,44 @@
 # What are the policy-based methods?
 
-The main goal of Reinforcement learning is to **find the optimal policy \\(\pi*\\) that will maximize the expected cumulative reward**.
-Because Reinforcement Learning is based on the *reward hypothesis* that is **all goals can be described as the maximization of the expected cumulative reward.**
+The main goal of Reinforcement Learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**.
+Because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
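In symbols, using the discounted-return notation from Unit 1, the goal above can be sketched as searching for

\\[\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}\Big]\\]

where \\(\gamma \in [0, 1)\\) is the discount factor and \\(r_{t+1}\\) is the reward collected at each step (a sketch of the objective, not a formal statement).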
 
-For instance, in a soccer game (that you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
-maximizing the number of goal scored (when the ball cross the goal line) into your opponent soccer goals. And minimize the number of goals into yours soccer goals.
+For instance, in a soccer game (where you're going to train agents two units from now), the goal is to win the game. We can describe this goal in reinforcement learning as
+**maximizing the number of goals scored** (when the ball crosses the goal line) in your opponent's goal, and **minimizing the number of goals scored in your own goal**.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/soccer.jpg" alt="Soccer" />
 
-## Value-based, Policy-based and Actor-critic methods
+## Value-based, Policy-based, and Actor-critic methods
 
-We studied in the Unit 1, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi*\\).
+We studied in the first unit that we have two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
 
 - In *value-based methods*, we learn a value function.
-  - The idea then is that an optimal value function leads to an optimal policy \\(\pi*\\).
-  - Our objective is to **minimize the loss between, predicted and target value to match the true action-value function.
-  - We have a policy, but it's implicit since it **was generated directly from the Value function**. For instance, in Q-Learning we defined an epsilon-greedy policy.
+  - The idea then is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
+  - Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
+  - We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
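As a reminder of how the implicit policy mentioned above looks in code, here is a minimal, illustrative epsilon-greedy rule read off a Q-table (the names `q_table` and `epsilon_greedy_policy` are hypothetical, not from the course hands-on):

```python
# Hypothetical sketch: an epsilon-greedy policy derived from a Q-table.
# The policy is implicit: it is just a rule applied to the value estimates.
import numpy as np

def epsilon_greedy_policy(q_table: np.ndarray, state: int, epsilon: float) -> int:
    if np.random.random() < epsilon:
        return np.random.randint(q_table.shape[1])  # explore: random action
    return int(np.argmax(q_table[state]))           # exploit: highest-valued action
```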
 
-- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi*\\) without having to learn a value function.
-  - The idea then is to parameterize policy, for instance using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
+- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
+  - The idea then is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\); this policy will output a probability distribution over actions (a stochastic policy).
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
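As a sketch of what such a parameterized stochastic policy can look like in PyTorch (sizes assume CartPole-v1: a 4-dimensional state and 2 discrete actions; the class and method names are illustrative, not the unit's final implementation):

```python
# Illustrative parameterized stochastic policy pi_theta in PyTorch.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, state_size: int = 4, action_size: int = 2, hidden_size: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1),  # turn scores into a probability distribution over actions
        )

    def act(self, state):
        probs = self.net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)  # stochastic policy: sample instead of taking the argmax
        action = dist.sample()
        return action.item(), dist.log_prob(action)  # the log-prob is what gradient ascent will use
```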
 
-  - Our objective then is *to maximize the performance of the parameterized policy using gradient ascent*.
-  - To do that we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
+  - Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
+  - To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
 
-- Finally, we'll study next time *actor-critic* which is a combination of value-based and policy-based methods.
+- Finally, we'll study *actor-critic* methods next time, which are a combination of value-based and policy-based methods.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy-based.png" alt="Policy based" />
 
 Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
-To do that we define an objective function \\(J(\theta)\\), that is the expected cumulative reward and we **want to find \\(\theta\\) that maximize this objective function**.
+To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find the \\(\theta\\) that maximizes this objective function**.
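A worked form of this objective and of the update it implies (a sketch; the precise definition and the policy gradient theorem come later in the unit) is

\\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r_{t+1}\Big], \qquad \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)\\]

where \\(\tau\\) is a trajectory sampled by following \\(\pi_\theta\\) and \\(\alpha\\) is the learning rate of the gradient-ascent step.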
 
 ## The difference between policy-based and policy-gradient methods
 
-Policy-gradient methods, what we're going to study in this unit, is a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
+Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since, for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
 
 The difference between these two methods **lies in how we optimize the parameter** \\(\theta\\):
 
-- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter //(/theta//) **indirectly** by maximize the local approximation of the objective function with techniques like hill climbing, simulated annealing or evolution strategies.
+- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing a local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
 - In *policy-gradient methods*, because they are a subclass of policy-based methods, we search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly** by performing gradient ascent on the performance of the objective function \\(J(\theta)\\).
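To make the contrast in these two bullets concrete, here is an illustrative sketch of the *indirect* option: random-perturbation hill climbing on a linear deterministic policy for CartPole-v1 (assumes `gymnasium`; not the course's implementation). A policy-gradient method would replace the random search below with the gradient-ascent step on \\(J(\theta)\\):

```python
# Illustrative hill climbing: optimize theta *indirectly*, no gradients involved.
import gymnasium as gym
import numpy as np

def episode_return(env, theta: np.ndarray) -> float:
    """Run one episode with a deterministic linear policy and return its score."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        action = int(obs @ theta > 0)  # linear policy over the 4-dim state, 2 actions
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    return total

env = gym.make("CartPole-v1")
theta = np.zeros(4)
best_return = episode_return(env, theta)
for _ in range(200):
    candidate = theta + 0.1 * np.random.randn(4)   # randomly perturb the parameters
    candidate_return = episode_return(env, candidate)
    if candidate_return >= best_return:            # keep the perturbation only if it helps
        theta, best_return = candidate, candidate_return

print(f"Best return found by hill climbing: {best_return}")
env.close()
```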
 
-Before diving more into how works policy-gradient methods (the objective function, policy gradient theorem, gradient ascent etc.) let's study the advantages and disadvantages of policy-based methods.
+Before diving deeper into how policy-gradient methods work (the objective function, the policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.