Update advantages-disadvantages and policy gradient

This commit is contained in:
simoninithomas
2023-01-02 22:23:27 +01:00
parent 88fded6cf3
commit e1cf375c36
2 changed files with 43 additions and 39 deletions


@@ -1,6 +1,6 @@
# The advantages and disadvantages of policy-gradient methods
At this point, you might ask, "But Deep Q-Learning is excellent! Why use policy-gradient methods?" To answer this question, let's study the **advantages and disadvantages of policy-gradient methods**.
## Advantages
@@ -34,7 +34,7 @@ The problem is that the **two rose cases are aliased states because the agent pe
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster2.jpg" alt="Hamster 1"/>
</figure>
Under a deterministic policy, the policy will either always move right or always move left when in a red state. **Either case will cause our agent to get stuck and never suck the dust**.
Under a value-based Reinforcement Learning algorithm, we learn a **quasi-deterministic policy** (epsilon-greedy strategy). Consequently, our agent can **spend a lot of time before finding the dust**.
@@ -44,29 +44,31 @@ On the other hand, an optimal stochastic policy **will randomly move left or rig
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster3.jpg" alt="Hamster 1"/>
</figure>
### Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces
Indeed, the problem with Deep Q-learning is that its **predictions assign a score (maximum expected future reward) to each possible action**, at each time step, given the current state.
But what if we have an infinite possibility of actions?
For instance, with a self-driving car, at each state, you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). **We'll need to output a Q-value for each possible action**! And **taking the max action of a continuous output is an optimization problem itself**!
Instead, with policy-gradient methods, we output a **probability distribution over actions.**
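A minimal sketch of such a parameterized stochastic policy, assuming a PyTorch setup with made-up layer sizes (4 state dimensions, as in CartPole-v1, and 2 actions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes for illustration: 4 state dimensions, 2 actions
state_size, n_actions = 4, 2

policy = nn.Sequential(
    nn.Linear(state_size, 16),
    nn.ReLU(),
    nn.Linear(16, n_actions),
    nn.Softmax(dim=-1),  # turns raw scores into a probability distribution over actions
)

state = torch.rand(state_size)
probs = policy(state)                             # action preferences, summing to 1
action = torch.multinomial(probs, num_samples=1)  # sample an action from the distribution
print(probs, action)
```

Instead of picking the max-scoring action, we sample from the distribution, which is exactly what enables the stochastic behavior discussed above.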
### Policy-gradient methods have better convergence properties
In value-based methods, we use an aggressive operator to **change the value function: we take the maximum over Q-estimates**.
Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
For instance, if during training the best action was left (with a Q-value of 0.22), and the training step after it's right (since the right Q-value becomes 0.23), we dramatically changed the policy: now the policy will take right most of the time instead of left.
On the other hand, in policy-gradient methods, stochastic policy action preferences (the probability of taking an action) **change smoothly over time**.
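A small numeric sketch of this contrast, using the made-up Q-values from the example above: the greedy action flips completely for a 0.01 change in a Q-estimate, while softmax action preferences barely move:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up Q-values before and after one training step: [left, right]
q_before = np.array([0.22, 0.21])
q_after = np.array([0.22, 0.23])

# Value-based (greedy): the selected action flips completely
best_before = int(np.argmax(q_before))  # 0 -> left
best_after = int(np.argmax(q_after))    # 1 -> right

# Softmax action preferences: probabilities shift only slightly
print(softmax(q_before))  # roughly [0.5025, 0.4975]
print(softmax(q_after))   # roughly [0.4975, 0.5025]
```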
## Disadvantages
Naturally, policy-gradient methods also have some disadvantages:
- **Policy-gradient methods often converge to a local maximum instead of the global optimum.**
- Policy-gradient methods go slower, **step by step: they can take longer to train (inefficient).**
- Policy-gradient methods can have high variance. We'll see in the Actor-Critic unit why this happens and how we can solve this problem.
👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).


@@ -2,9 +2,9 @@
## Getting the big picture
We just learned that policy-gradient methods aim to find parameters \\(\theta\\) that **maximize the expected return**.
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
If we take the example of CartPole-v1:
- As input, we have a state.
@@ -15,21 +15,21 @@ If we take the example of CartPole-v1:
Our goal with policy-gradient is to **control the probability distribution of actions** by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.**
Each time the agent interacts with the environment, we tweak the parameters such that good actions will be more likely to be sampled in the future.
But **how are we going to optimize the weights using the expected return**?
The idea is that we're going to **let the agent interact during an episode**. And if we win the episode, we consider that each action taken was good and must be sampled more in the future, since they led to the win.
So for each state-action pair, we want to increase \\(P(a|s)\\): the probability of taking that action at that state. Or decrease it if we lost.
The Policy-gradient algorithm (simplified) looks like this:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg_bigpicture.jpg" alt="Policy Gradient Big Picture"/>
</figure>
Now that we have the big picture, let's dive deeper into policy-gradient methods.
## Diving deeper into policy-gradient methods
We have our stochastic policy \\(\pi\\) which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution of actions**.
@@ -37,26 +37,25 @@ We have our stochastic policy \\(\pi\\) which has a parameter \\(\theta\\). This
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="Policy"/>
</figure>
Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\), given our policy.
**But how do we know if our policy is good?** We need to have a way to measure it. To know that, we define a score/objective function called \\(J(\theta)\\).
### The objective function
The *objective function* gives us the **performance of the agent** given a trajectory (a state-action sequence, without considering the reward, in contrast to an episode), and it outputs the *expected cumulative reward*.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>
Let's detail this formula a little bit more:
- The *expected return* (also called the expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\). That probability depends on \\(\theta\\), since it defines the policy that it uses to select the actions of the trajectory, which has an impact on the states visited.
- \\(J(\theta)\\) : Expected return. We calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta\\) multiplied by the return of that trajectory.
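A toy numeric sketch of this weighted average, with made-up trajectory probabilities and returns:

```python
# Made-up example: three possible trajectories, each with its probability
# under the current policy P(tau; theta) and its return R(tau).
trajectories = [
    {"prob": 0.5, "return": 10.0},
    {"prob": 0.3, "return": 4.0},
    {"prob": 0.2, "return": -2.0},
]

# J(theta) = sum over trajectories of P(tau; theta) * R(tau)
expected_return = sum(t["prob"] * t["return"] for t in trajectories)
print(round(expected_return, 2))  # 0.5*10 + 0.3*4 + 0.2*(-2) = 5.8
```

Changing \\(\theta\\) changes the trajectory probabilities, which is how the policy can shift weight toward high-return trajectories.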
Our objective then is to maximize the expected cumulative rewards by finding \\(\theta \\) that will output the best action probability distributions:
@@ -65,28 +64,31 @@ Our objective then is to maximize the expected cumulative rewards by finding \\(
## Gradient Ascent and the Policy-gradient Theorem
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient ascent**. It's the inverse of *gradient descent*, since it gives the direction of the steepest increase of \\(J(\theta)\\).
Our update step for gradient-ascent is:
\\(\theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta)\\)
We can repeatedly apply this update step in the hope that \\(\theta\\) converges to the value that maximizes \\(J(\theta)\\).
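Here is a minimal sketch of this update rule on a toy objective with a known analytic gradient (everything here is invented for illustration; the real \\(J(\theta)\\) is not this simple):

```python
# Gradient ascent on a toy objective J(theta) = -(theta - 3)^2, maximized at theta = 3.
# Its gradient is dJ/dtheta = -2 * (theta - 3).
theta, alpha = 0.0, 0.1

for _ in range(100):
    grad = -2.0 * (theta - 3.0)   # nabla_theta J(theta)
    theta = theta + alpha * grad  # ascent step: move *along* the gradient

print(round(theta, 4))  # 3.0
```

Gradient descent would subtract the gradient instead; here we add it because we want to climb toward the maximum.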
However, there are two problems with computing the derivative of \\(J(\theta)\\):
1. We can't calculate the true gradient of the objective function, since it would require calculating the probability of every possible trajectory, which is computationally super expensive.
So we want to **calculate a gradient estimation with a sample-based estimate (by collecting some trajectories)**.
2. We have another problem, which I detail in the optional next section: to differentiate this objective function, we need to differentiate the state distribution, also called the Markov Decision Process dynamics. This is attached to the environment: it gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it, because we might not know it.
TODO: add maths of jtheta
TODO: Add markov decision dynamics
Fortunately, we're going to use a solution called the Policy Gradient Theorem that will help us reformulate the objective function into a differentiable function that does not involve differentiating the state distribution.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
## The Reinforce algorithm
The Reinforce algorithm is a policy-gradient algorithm that works like this:
In a loop:
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)
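This loop can be sketched on a made-up single-state environment (the environment, learning rate, and episode length here are all invented for illustration). The policy is a softmax over two logits, so \\(\nabla_\theta log \pi_\theta(a_t|s_t)\\) has the simple closed form used in the comments:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy one-state environment (hypothetical): action 1 gives reward +1, action 0 gives 0.
# For a softmax policy, grad of log pi(a) w.r.t. logit theta_i is (1[a == i] - pi_i).
theta = [0.0, 0.0]  # policy parameters (one logit per action)
alpha = 0.05        # learning rate

for episode in range(500):
    # 1. Use the policy pi_theta to collect an episode (here, 10 steps)
    grads = [0.0, 0.0]
    R = 0.0
    for _ in range(10):
        pi = softmax(theta)
        a = 0 if random.random() < pi[0] else 1
        R += 1.0 if a == 1 else 0.0
        for i in range(2):
            grads[i] += (1.0 if a == i else 0.0) - pi[i]  # sum_t grad log pi(a_t|s_t)
    # 2. Estimate the gradient: g_hat = R(tau) * sum_t grad log pi(a_t|s_t)
    # 3. Update the parameters by gradient ascent
    for i in range(2):
        theta[i] += alpha * R * grads[i]

print(softmax(theta))  # the probability of the rewarded action 1 should be close to 1
```

Episodes with a higher return push up the log probabilities of their actions more strongly, so the policy drifts toward the rewarded action.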
@@ -98,13 +100,13 @@ Loop:
The interpretation we can make is this one:
- \\(\nabla_\theta log \pi_\theta(a_t|s_t)\\) is the direction of the **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
This tells us **how we should change the weights of the policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
- \\(R(\tau)\\) is the scoring function:
- If the return is high, it will **push up the probabilities** of the (state, action) combinations.
- Else, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
We can also **collect multiple episodes (trajectories)** to estimate the gradient:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_multiple.png" alt="Policy Gradient"/>
</figure>