mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-06-15 06:27:24 +08:00
Typos Unit4
@@ -34,7 +34,7 @@ The problem is that the **two rose cases are aliased states because the agent pe
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster2.jpg" alt="Hamster 1"/>
|
||||
</figure>
|
||||
|
||||
Under a deterministic policy, the policy either will move right when in a red state or move left. **Either case will cause our agent to get stuck and never suck the dust**.
|
||||
Under a deterministic policy, the policy will either always move right when in a red state or always move left. **Either case will cause our agent to get stuck and never suck the dust**.
|
||||
|
||||
Under a value-based Reinforcement learning algorithm, we learn a **quasi-deterministic policy** ("greedy epsilon strategy"). Consequently, our agent can **spend a lot of time before finding the dust**.
|
||||
|
||||
@@ -67,8 +67,8 @@ On the other hand, in policy-gradient methods, stochastic policy action preferen
|
||||
|
||||
Naturally, policy-gradient methods also have some disadvantages:
|
||||
|
||||
- **Frequently, policy-gradient converges on a local maximum instead of a global optimum.**
|
||||
- **Frequently, policy-gradient methods converge to a local maximum instead of a global optimum.**
|
||||
- Policy-gradient goes slower, **step by step: it can take longer to train (inefficient).**
|
||||
- Policy-gradient can have high variance. We'll see in actor-critic unit why and how we can solve this problem.
|
||||
- Policy-gradient can have high variance. We'll see in the actor-critic unit why, and how we can solve this problem.
|
||||
|
||||
👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).
|
||||
|
||||
@@ -10,8 +10,8 @@ frames as observation)?
|
||||
In the next unit, **we're going to learn more about Unity MLAgents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents
|
||||
to compete against other agents in a snowball fight and a soccer game.**
|
||||
|
||||
Sounds fun? See you next time!
|
||||
Sound fun? See you next time!
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
|
||||
|
||||
|
||||
Now that we studied the theory behind Reinforce, **you’re ready to code your Reinforce agent with PyTorch**. And you'll test its robustness using CartPole-v1 and PixelCopter,.
|
||||
Now that we've studied the theory behind Reinforce, **you’re ready to code your Reinforce agent with PyTorch**. And you'll test its robustness using CartPole-v1 and PixelCopter.
|
||||
|
||||
You'll then be able to iterate and improve this implementation for more advanced environments.
|
||||
|
||||
@@ -19,9 +19,9 @@ You'll then be able to iterate and improve this implementation for more advanced
|
||||
</figure>
|
||||
|
||||
|
||||
To validate this hands-on for the certification process, you need to push your trained models to the Hub.
|
||||
To validate this hands-on for the certification process, you need to push your trained models to the Hub and:
|
||||
|
||||
- Get a result of >= 350 for `Cartpole-v1`.
|
||||
- Get a result of >= 350 for `CartPole-v1`
|
||||
- Get a result of >= 5 for `PixelCopter`.
|
||||
|
||||
To find your result, go to the leaderboard and find your model: **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go to the bottom of the leaderboard page and click on the refresh button**.
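
As a rough illustration of the scoring rule above, this sketch computes `mean_reward - std` from a list of evaluation episode returns. The returns are made-up numbers, `leaderboard_result` is a hypothetical helper name, and the use of the population standard deviation is an assumption, since the text does not say which variant the leaderboard uses.

```python
from statistics import mean, pstdev


def leaderboard_result(episode_rewards):
    """Mean episode reward minus the (population) standard deviation."""
    return mean(episode_rewards) - pstdev(episode_rewards)


# Made-up evaluation returns, for illustration only.
rewards = [500.0, 480.0, 500.0, 460.0]
result = leaderboard_result(rewards)
print(f"result = {result:.2f}")
```

An agent with a high but erratic mean reward is penalized by the subtracted standard deviation, so consistency matters as much as peak performance.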
|
||||
@@ -75,7 +75,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
|
||||
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Be able to **code from scratch a Reinforce algorithm using PyTorch.**
|
||||
- Be able to **code a Reinforce algorithm from scratch using PyTorch.**
|
||||
- Be able to **test the robustness of your agent using simple environments.**
|
||||
- Be able to **push your trained agent to the Hub** with a nice video replay and an evaluation score 🔥.
|
||||
|
||||
@@ -87,9 +87,9 @@ Before diving into the notebook, you need to:
|
||||
|
||||
# Let's code Reinforce algorithm from scratch 🔥
|
||||
|
||||
## An advice 💡
|
||||
## Some advice 💡
|
||||
|
||||
It's better to run this colab in a copy on your Google Drive, so that **if it timeouts** you still have the saved notebook on your Google Drive and do not need to fill everything from scratch.
|
||||
It's better to run this colab in a copy on your Google Drive, so that **if it times out** you still have the saved notebook on your Google Drive and do not need to fill everything in from scratch.
|
||||
|
||||
To do that you can either do `Ctrl + S` or `File > Save a copy in Google Drive.`
|
||||
|
||||
@@ -107,7 +107,7 @@ To do that you can either do `Ctrl + S` or `File > Save a copy in Google Drive.`
|
||||
|
||||
During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
|
||||
|
||||
Hence the following cell will install the librairies and create and run a virtual screen 🖥
|
||||
The following cell will install the libraries and create and run a virtual screen 🖥
|
||||
|
||||
```python
|
||||
%%capture
|
||||
@@ -145,7 +145,7 @@ And you can find all the Deep Reinforcement Learning models here 👉 https://hu
|
||||
|
||||
## Import the packages 📦
|
||||
|
||||
In addition to import the installed libraries, we also import:
|
||||
In addition to importing the installed libraries, we also import:
|
||||
|
||||
- `imageio`: A library that will help us to generate a replay video
|
||||
|
||||
@@ -217,10 +217,10 @@ So, we start with CartPole-v1. The goal is to push the cart left or right **so t
|
||||
|
||||
The episode ends if:
|
||||
- The pole Angle is greater than ±12°
|
||||
- Cart Position is greater than ±2.4
|
||||
- Episode length is greater than 500
|
||||
- The Cart Position is greater than ±2.4
|
||||
- The episode length is greater than 500
|
||||
|
||||
We get a reward 💰 of +1 every timestep the Pole stays in the equilibrium.
|
||||
We get a reward 💰 of +1 every timestep that the Pole stays in the equilibrium.
|
||||
|
||||
```python
|
||||
env_id = "CartPole-v1"
|
||||
@@ -258,8 +258,8 @@ This implementation is based on three implementations:
|
||||
|
||||
So we want:
|
||||
- Two fully connected layers (fc1 and fc2).
|
||||
- Using ReLU as activation function of fc1
|
||||
- Using Softmax to output a probability distribution over actions
|
||||
- To use ReLU as the activation function of fc1
|
||||
- To use Softmax to output a probability distribution over actions
|
||||
|
||||
```python
|
||||
class Policy(nn.Module):
|
||||
@@ -310,7 +310,7 @@ class Policy(nn.Module):
|
||||
return action.item(), m.log_prob(action)
|
||||
```
|
||||
|
||||
I make a mistake, can you guess where?
|
||||
I made a mistake, can you guess where?
|
||||
|
||||
- To find out let's make a forward pass:
|
||||
|
||||
@@ -325,7 +325,7 @@ debug_policy.act(env.reset())
|
||||
|
||||
- Do you know why? Check the act function and try to see why it does not work.
|
||||
|
||||
Advice 💡: Something is wrong in this implementation. Remember that we act function **we want to sample an action from the probability distribution over actions**.
|
||||
Advice 💡: Something is wrong in this implementation. Remember that for the act function **we want to sample an action from the probability distribution over actions**.
|
||||
|
||||
|
||||
### (Real) Solution
|
||||
@@ -352,9 +352,9 @@ class Policy(nn.Module):
|
||||
|
||||
By using CartPole, it was easier to debug since **we know that the bug comes from our integration and not from our simple environment**.
|
||||
|
||||
- Since **we want to sample an action from the probability distribution over actions**, we can't use `action = np.argmax(m)` since it will always output the action that have the highest probability.
|
||||
- Since **we want to sample an action from the probability distribution over actions**, we can't use `action = np.argmax(m)` since it will always output the action that has the highest probability.
|
||||
|
||||
- We need to replace with `action = m.sample()` that will sample an action from the probability distribution P(.|s)
|
||||
- We need to replace this with `action = m.sample()` which will sample an action from the probability distribution P(.|s)
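
To make the difference concrete, here is a minimal sketch that contrasts the two choices. It assumes a made-up two-action distribution and uses Python's standard `random` module as a stand-in for the distribution object `m`, so it is not the notebook's PyTorch code; `greedy_action` and `sampled_action` are hypothetical helper names.

```python
import random

# Hypothetical P(.|s) for actions 0 and 1.
probs = [0.7, 0.3]


def greedy_action(probs):
    # argmax: always returns the most probable action.
    return max(range(len(probs)), key=lambda a: probs[a])


def sampled_action(probs, rng):
    # Sampling: returns action a with probability probs[a].
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = [greedy_action(probs) for _ in range(1000)]
sampled = [sampled_action(probs, rng) for _ in range(1000)]

print("actions chosen by argmax:", set(greedy))
print("fraction of action 1 when sampling:", sampled.count(1) / 1000)
```

With argmax, action 1 is never tried, so the agent never gathers evidence about it. Sampling keeps exploring it roughly 30% of the time.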
|
||||
|
||||
### Let's build the Reinforce Training Algorithm
|
||||
This is the Reinforce algorithm pseudocode:
|
||||
@@ -371,7 +371,7 @@ This is the Reinforce algorithm pseudocode:
|
||||
We use an interesting technique coded by [Chris1nexus](https://github.com/Chris1nexus) to **compute the return at each timestep efficiently**. The comments explain the procedure. Don't hesitate to also [check the PR explanation](https://github.com/huggingface/deep-rl-class/pull/95)
|
||||
But overall the idea is to **compute the return at each timestep efficiently**.
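
The general idea can be sketched as a single backward pass over the rewards that reuses the return of the next timestep, since G_t = r_t + gamma * G_(t+1). This is a simplified illustration and not necessarily the exact code from the PR; `compute_returns` is a hypothetical helper name and the rewards are made up.

```python
from collections import deque


def compute_returns(rewards, gamma):
    """Discounted return G_t for every timestep, computed in O(T)."""
    returns = deque()
    g = 0.0
    # Walk backwards so each G_t reuses the already computed G_(t+1).
    for r in reversed(rewards):
        g = r + gamma * g
        returns.appendleft(g)  # keep the returns in chronological order
    return list(returns)


returns = compute_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)
```

Recomputing each discounted sum from scratch would cost O(T^2) for an episode of length T, which is why the backward recursion is preferred.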
|
||||
|
||||
The second question you may ask is **why do we minimize the loss**? Did you talk about Gradient Ascent, not Gradient Descent?
|
||||
The second question you may ask is **why do we minimize the loss**? Didn't we talk about Gradient Ascent, not Gradient Descent earlier?
|
||||
|
||||
- We want to maximize our utility function $J(\theta)$, but in PyTorch and TensorFlow, it's better to **minimize an objective function.**
|
||||
- So let's say we want to reinforce action 3 at a certain timestep. Before training this action P is 0.25.
|
||||
@@ -797,12 +797,12 @@ def push_to_hub(repo_id,
|
||||
print(f"Your model is pushed to the Hub. You can view your model here: {repo_url}")
|
||||
```
|
||||
|
||||
By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.
|
||||
By using `push_to_hub`, **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.
|
||||
|
||||
This way:
|
||||
- You can **showcase your work** 🔥
|
||||
- You can **visualize your agent playing** 👀
|
||||
- You can **share with the community an agent that others can use** 💾
|
||||
- You can **share an agent with the community that others can use** 💾
|
||||
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
|
||||
@@ -821,7 +821,7 @@ To be able to share your model with the community there are three more steps to
|
||||
notebook_login()
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login` (or `login`)
|
||||
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login` (or `login`)
|
||||
|
||||
3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `push_to_hub()` function
|
||||
|
||||
@@ -836,7 +836,7 @@ push_to_hub(
|
||||
)
|
||||
```
|
||||
|
||||
Now that we try the robustness of our implementation, let's try a more complex environment: PixelCopter 🚁
|
||||
Now that we've tested the robustness of our implementation, let's try a more complex environment: PixelCopter 🚁
|
||||
|
||||
|
||||
|
||||
@@ -881,7 +881,7 @@ The action space(2) 🎮:
|
||||
- Down
|
||||
|
||||
The reward function 💰:
|
||||
- For each vertical block it passes through it gains a positive reward of +1. Each time a terminal state reached it receives a negative reward of -1.
|
||||
- For each vertical block it passes, it gains a positive reward of +1. Each time a terminal state is reached it receives a negative reward of -1.
|
||||
|
||||
### Define the new Policy 🧠
|
||||
- We need to have a deeper neural network since the environment is more complex
|
||||
@@ -986,11 +986,11 @@ push_to_hub(
|
||||
|
||||
## Some additional challenges 🏆
|
||||
|
||||
The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. But also trying to find better parameters.
|
||||
The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. But also try to find better parameters.
|
||||
|
||||
In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
|
||||
|
||||
Here are some ideas to achieve so:
|
||||
Here are some ideas to climb up the leaderboard:
|
||||
* Train more steps
|
||||
* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=reinforce
|
||||
* **Push your new trained model** on the Hub 🔥
|
||||
@@ -1008,9 +1008,9 @@ frames as observation)?
|
||||
In the next unit, **we're going to learn more about Unity MLAgents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents
|
||||
to compete against other agents in a snowball fight and a soccer game.**
|
||||
|
||||
Sounds fun? See you next time!
|
||||
Sound fun? See you next time!
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
See you in Unit 5! 🔥
|
||||
|
||||
|
||||
@@ -4,13 +4,13 @@
|
||||
|
||||
In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**
|
||||
|
||||
Since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
|
||||
Since the beginning of the course, we have only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
|
||||
|
||||
In value-based methods, the policy ** \\(π\\) only exists because of the action value estimates since the policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
|
||||
In value-based methods, the policy **\\(π\\) is derived from the action-value estimates through a function** (for instance, the greedy policy, which selects the action with the highest value given a state).
|
||||
|
||||
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
|
||||
With policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
|
||||
|
||||
So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
|
||||
Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
|
||||
@@ -21,4 +21,4 @@ You'll then be able to iterate and improve this implementation for more advanced
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
|
||||
</figure>
|
||||
|
||||
Let's get started,
|
||||
Let's get started!
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
|
||||
We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.
|
||||
|
||||
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
|
||||
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called the *action preference*.
|
||||
|
||||
If we take the example of CartPole-v1:
|
||||
- As input, we have a state.
|
||||
@@ -47,20 +47,20 @@ The *objective function* gives us the **performance of the agent** given a traje
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>
|
||||
|
||||
Let's detail a little bit more this formula:
|
||||
- The *expected return* (also called expected cumulative reward), is the weighted average (where the weights are given by \\(P(\tau;\theta)\\) of all possible values that the return \\(R(\tau)\\) can take.
|
||||
Let's give some more details on this formula:
|
||||
- The *expected return* (also called expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
|
||||
|
||||
|
||||
- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
|
||||
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\( \theta\\) since it defines the policy that it uses to select the actions of the trajectory which as an impact of the states visited).
|
||||
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\( \theta\\) since it defines the policy that it uses to select the actions of the trajectory, which has an impact on the states visited).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
|
||||
|
||||
- \\(J(\theta)\\) : Expected return, we calculate it by summing for all trajectories, the probability of taking that trajectory given \\(\theta \\), and the return of this trajectory.
|
||||
- \\(J(\theta)\\) : Expected return. We calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta \\) multiplied by the return of that trajectory.
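
Written out with the symbols from the list above, the sum reads:

\\( J(\theta) = \sum_{\tau} P(\tau;\theta) R(\tau) \\)

As a toy check with made-up numbers, suppose there are only two possible trajectories, with probabilities 0.25 and 0.75 and returns 10 and 2. Then:

\\( J(\theta) = 0.25 \times 10 + 0.75 \times 2 = 4 \\)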
|
||||
|
||||
Our objective then is to maximize the expected cumulative reward by finding \\(\theta \\) that will output the best action probability distributions:
|
||||
Our objective then is to maximize the expected cumulative reward by finding the \\(\theta \\) that will output the best action probability distributions:
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/max_objective.png" alt="Max objective"/>
|
||||
@@ -68,7 +68,7 @@ Our objective then is to maximize the expected cumulative reward by finding \\(\
|
||||
|
||||
## Gradient Ascent and the Policy-gradient Theorem
|
||||
|
||||
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), we need to use **gradient-ascent**. It's the inverse of *gradient-descent* since it gives the direction of the steepest increase of \\(J(\theta)\\).
|
||||
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent* since it gives the direction of the steepest increase of \\(J(\theta)\\).
|
||||
|
||||
(If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).
|
||||
|
||||
@@ -76,13 +76,13 @@ Our update step for gradient-ascent is:
|
||||
|
||||
\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
|
||||
|
||||
We can repeatedly apply this update state in the hope that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
|
||||
We can repeatedly apply this update in the hopes that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
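
Here is a minimal sketch of this update rule on a toy problem. The objective J(theta) = -(theta - 3)^2 is an assumption chosen only because its gradient is known exactly and its maximum is at theta = 3; in reinforcement learning we have to estimate the gradient instead.

```python
def grad_J(theta):
    # Exact derivative of the toy objective J(theta) = -(theta - 3)^2.
    return -2.0 * (theta - 3.0)


theta = 0.0  # initial parameter
alpha = 0.1  # learning rate

for _ in range(100):
    # Gradient ascent: step in the direction of the gradient.
    theta = theta + alpha * grad_J(theta)

print(theta)  # approaches 3, the maximizer of the toy objective
```

Flipping the sign of the step (`theta - alpha * grad_J(theta)`) would turn this into gradient descent and move away from the maximum.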
|
||||
|
||||
However, we have two problems to obtain the derivative of \\(J(\theta)\\):
|
||||
1. We can't calculate the true gradient of the objective function since it would imply calculating the probability of each possible trajectory which is computationally super expensive.
|
||||
We want then to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
|
||||
However, there are two problems with computing the derivative of \\(J(\theta)\\):
|
||||
1. We can't calculate the true gradient of the objective function since it requires calculating the probability of each possible trajectory, which is computationally super expensive.
|
||||
So we want to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
|
||||
|
||||
2. We have another problem that I detail in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
|
||||
2. We have another problem that I explain in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called the Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
|
||||
|
||||
@@ -90,7 +90,7 @@ Fortunately we're going to use a solution called the Policy Gradient Theorem tha
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
|
||||
|
||||
If you want to understand how we derivate this formula that we will use to approximate the gradient, check the next (optional) section.
|
||||
If you want to understand how we derive this formula for approximating the gradient, check out the next (optional) section.
|
||||
|
||||
## The Reinforce algorithm (Monte Carlo Reinforce)
|
||||
|
||||
@@ -106,12 +106,12 @@ In a loop:
|
||||
|
||||
- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
|
||||
|
||||
The interpretation we can make is this one:
|
||||
We can interpret this update as follows:
|
||||
- \\(\nabla_\theta \log \pi_\theta(a_t|s_t)\\) is the direction of **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
|
||||
This tells us **how we should change the weights of the policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
|
||||
- \\(R(\tau)\\): is the scoring function:
|
||||
- If the return is high, it will **push up the probabilities** of the (state, action) combinations.
|
||||
- Else, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
|
||||
- Otherwise, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
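
This interpretation can be checked on a tiny example. The sketch below assumes a policy that is just a softmax over two logits, with no neural network and no state, and applies one update of the form theta <- theta + alpha * R * grad log pi(a). The names `softmax` and `reinforce_step` and all numbers are hypothetical, and a negative return stands in for "low" so that the push-down effect is visible without a baseline.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, ret, alpha):
    """One update: theta <- theta + alpha * R * grad log pi(action)."""
    probs = softmax(logits)
    # For a softmax over logits: d log pi(a) / d logit_k = 1[k == a] - pi(k).
    grad_log_pi = [(1.0 if k == action else 0.0) - p
                   for k, p in enumerate(probs)]
    return [th + alpha * ret * g for th, g in zip(logits, grad_log_pi)]


logits = [0.0, 0.0]  # both actions start at probability 0.5
p_before = softmax(logits)[1]
p_high = softmax(reinforce_step(logits, action=1, ret=5.0, alpha=0.1))[1]
p_low = softmax(reinforce_step(logits, action=1, ret=-5.0, alpha=0.1))[1]

print(f"before: {p_before:.3f}, after high return: {p_high:.3f}, "
      f"after low return: {p_low:.3f}")
```

The gradient of the log probability gives the direction, and the return only scales it: the same action becomes more likely after a high return and less likely after a negative one.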
|
||||
|
||||
|
||||
We can also **collect multiple episodes (trajectories)** to estimate the gradient:
|
||||
|
||||
@@ -10,12 +10,12 @@ For instance, in a soccer game (where you're going to train the agents in two un
|
||||
|
||||
## Value-based, Policy-based, and Actor-critic methods
|
||||
|
||||
We studied in the first unit, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi^{*}\\).
|
||||
In the first unit, we saw two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
|
||||
|
||||
- In *value-based methods*, we learn a value function.
|
||||
- The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
|
||||
- Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
|
||||
- We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
|
||||
- We have a policy, but it's implicit since it **is generated directly from the value function**. For instance, in Q-Learning, we used an (epsilon-)greedy policy.
|
||||
|
||||
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
|
||||
- The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
|
||||
@@ -25,10 +25,10 @@ We studied in the first unit, that we had two methods to find (most of the time
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
|
||||
|
||||
- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.
|
||||
- Next time, we'll study the *actor-critic* method, which is a combination of value-based and policy-based methods.
|
||||
|
||||
Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
|
||||
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.
|
||||
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find the value \\(\theta\\) that maximizes this objective function**.
|
||||
|
||||
## The difference between policy-based and policy-gradient methods
|
||||
|
||||
|
||||